Hey Dave,

Replies inlined.


On Tue, Dec 18, 2012 at 9:27 AM, Dave Beech <[email protected]> wrote:

> Hi devs,
>
> I'm looking at the static factory methods on the From class, which produce
> Source or TableSource objects to form input to a Crunch job.
>
> Couple of questions:
> - Is there a reason why there are only TableSource versions of the
> formattedFile methods which can take a custom input format? I'd find a
> Source version of these which ignore the key quite useful. (I've already
> knocked up a quick patch, but I just wanted to sound you out before I went
> ahead and tidied it up properly or created a JIRA for it.)
>

I think it was the ambiguity about which of the two fields (key or value)
should be ignored: with SequenceFiles, it's usually the key that is
irrelevant, but with Avro files, it's the value that is ignored. My feeling
was that it's easy to convert a PTable<K, V> to a PCollection<K> or
PCollection<V> via the keys() and values() methods on PTable, and of course
you can create your own Sources whenever it's useful.
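
To make the ambiguity concrete, here is a rough plain-Java sketch of the two projections. It does not use the Crunch API: the KeyValueProjection class, its keys() and values() helpers, and the sample records are all made up for illustration, with a List of Map.Entry standing in for a PTable<K, V>.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for a table of key/value records; NOT the Crunch API.
final class KeyValueProjection {

    // Analogue of PTable.keys(): keep each record's key, drop its value.
    static <K, V> List<K> keys(List<Map.Entry<K, V>> table) {
        return table.stream().map(Map.Entry::getKey).collect(Collectors.toList());
    }

    // Analogue of PTable.values(): keep each record's value, drop its key.
    static <K, V> List<V> values(List<Map.Entry<K, V>> table) {
        return table.stream().map(Map.Entry::getValue).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // SequenceFile-style input: the key is uninteresting, the value is the data.
        List<Map.Entry<Long, String>> seqStyle = Arrays.asList(
                new SimpleEntry<Long, String>(1L, "first record"),
                new SimpleEntry<Long, String>(2L, "second record"));

        // Avro-style input: the datum sits in the key, the value is an empty placeholder.
        List<Map.Entry<String, Void>> avroStyle = Arrays.asList(
                new SimpleEntry<String, Void>("first datum", null),
                new SimpleEntry<String, Void>("second datum", null));

        // The useful projection differs per format, so the caller has to pick it.
        System.out.println(values(seqStyle));
        System.out.println(keys(avroStyle));
    }
}
```

A single Source-returning factory would have to hard-code one of these two choices, whereas reading a table and then projecting leaves the choice with the caller.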


>
> - How come FileSourceImpl is abstract but FileTableSourceImpl is not? In my
> patch mentioned above, I've had to remove abstract from FileSourceImpl, so
> I'm keen to know if that would break anything.
>

That's probably just an artifact of an earlier revision. I don't see why
it would need to be abstract based on the current impl.


>
> Cheers,
> Dave
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
