Thanks Josh - yes, I see what you mean, the ambiguity is a problem. For my current use case I want to ignore the key (like TextInputFormat) but I can think of times where I'd want to ignore the value too. I'll use the keys() and values() methods instead.
Dave

On 18 December 2012 17:38, Josh Wills <[email protected]> wrote:

> Hey Dave,
>
> Replies inlined.
>
> On Tue, Dec 18, 2012 at 9:27 AM, Dave Beech <[email protected]> wrote:
>
> > Hi devs,
> >
> > I'm looking at the static factory methods on the From class, which produce
> > Source or TableSource objects to form input to a Crunch job.
> >
> > Couple of questions:
> > - Is there a reason why there are only TableSource versions of the
> > formattedFile methods which can take a custom input format? I'd find a
> > Source version of these which ignore the key quite useful. (I've already
> > knocked up a quick patch, but I just wanted to sound you out before I went
> > ahead and tidied it up properly or created a JIRA for it.)
>
> I think it was the ambiguity about which of the two fields (key or value)
> should be ignored-- with SequenceFiles, it's usually the key that is
> irrelevant, but with Avro files, it's the value that is ignored. My feeling
> was that it's easy to convert to a PCollection<K> or PCollection<V> from
> PTable<K, V> via the keys() and values() methods on PTable, and of course
> you can create your own Sources whenever it's useful.
>
> > - How come FileSourceImpl is abstract but FileTableSourceImpl is not? In my
> > patch mentioned above, I've had to remove abstract from FileSourceImpl, so
> > I'm keen to know if that would break anything.
>
> That's probably just an artifact of an earlier revision-- I don't see where
> it would have to be abstract based on the current impl.
>
> > Cheers,
> > Dave
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
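
For anyone reading the thread later, the key-versus-value ambiguity can be illustrated with a small plain-Java sketch. This is not Crunch code: the class KeyValueProjection and its keys() and values() helpers are hypothetical stand-ins for the PTable methods of the same name, and the sample records are made up. In Crunch itself the approach described above would be to read the TableSource into a PTable and then call keys() or values() on it.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the idea in the thread; none of these names are Crunch API.
class KeyValueProjection {

    // Hypothetical analogue of PTable.keys(): keep each record's key, drop the value.
    static <K, V> List<K> keys(List<Map.Entry<K, V>> table) {
        List<K> out = new ArrayList<>();
        for (Map.Entry<K, V> record : table) {
            out.add(record.getKey());
        }
        return out;
    }

    // Hypothetical analogue of PTable.values(): keep each record's value, drop the key.
    static <K, V> List<V> values(List<Map.Entry<K, V>> table) {
        List<V> out = new ArrayList<>();
        for (Map.Entry<K, V> record : table) {
            out.add(record.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // TextInputFormat-style records: (byte offset, line). Here the key is the part to ignore.
        List<Map.Entry<Long, String>> textRecords = new ArrayList<>();
        textRecords.add(new SimpleImmutableEntry<>(0L, "first line"));
        textRecords.add(new SimpleImmutableEntry<>(11L, "second line"));
        System.out.println(values(textRecords));

        // Avro-style records: the datum sits in the key and the value is an empty placeholder.
        List<Map.Entry<String, Void>> avroRecords = new ArrayList<>();
        avroRecords.add(new SimpleImmutableEntry<>("datum-a", null));
        avroRecords.add(new SimpleImmutableEntry<>("datum-b", null));
        System.out.println(keys(avroRecords));
    }
}
```

The point of the sketch is that which half of the pair is "the data" depends on the input format, so the caller picks the projection explicitly rather than a factory method guessing it.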
