Hey Dave, Replies inlined.
On Tue, Dec 18, 2012 at 9:27 AM, Dave Beech <[email protected]> wrote:

> Hi devs,
>
> I'm looking at the static factory methods on the From class, which produce
> Source or TableSource objects to form input to a Crunch job.
>
> Couple of questions:
> - Is there a reason why there are only TableSource versions of the
> formattedFile methods which can take a custom input format? I'd find a
> Source version of these which ignore the key quite useful. (I've already
> knocked up a quick patch, but I just wanted to sound you out before I went
> ahead and tidied it up properly or created a JIRA for it.)
>

I think it was the ambiguity about which of the two fields (key or value)
should be ignored -- with SequenceFiles, it's usually the key that is
irrelevant, but with Avro files, it's the value that is ignored. My feeling
was that it's easy to convert to a PCollection<K> or PCollection<V> from
PTable<K, V> via the keys() and values() methods on PTable, and of course
you can create your own Sources whenever it's useful.

>
> - How come FileSourceImpl is abstract but FileTableSourceImpl is not? In my
> patch mentioned above, I've had to remove abstract from FileSourceImpl, so
> I'm keen to know if that would break anything.
>

That's probably just an artifact of an earlier revision -- I don't see where
it would have to be abstract based on the current impl.

>
> Cheers,
> Dave
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
