Re: Confusion regarding SeqFileSource

Josh Wills Mon, 03 Dec 2012 14:43:45 -0800

Hey Mike,

Sorry about the missing examples; it's mainly because they're tedious to write
and I've been lazy about it. Here's the skinny.

For the SeqFileSource, we assume that you're only interested in the "value"
portion of the key-value pair for each record in the SequenceFile. The
PType<T> should be for whatever data type you expect to read from that
value, which is probably a class that implements Writable. The easiest way to
do that is:

import static org.apache.crunch.types.writable.Writables.writables;

import org.apache.crunch.io.From;

// This reads the value and ignores the key in each record
PCollection<MyWritable> in = pipeline.read(From.sequenceFile(<path>,
writables(MyWritable.class)));

If you want both the key and the value, you need to read the SequenceFile
as a PTable<K, V>, like so:

PTable<MyKey, MyValue> in = pipeline.read(From.sequenceFile(<path>,
writables(MyKey.class), writables(MyValue.class)));

After you read in the values, you're free to convert them to whatever types
you like using parallelDo and friends. I especially recommend using the
Avro-based PTypeFamily, since it will significantly outperform the Writable
family on jobs that involve complex joins or aggregations.
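
To make the conversion step concrete, here is a minimal sketch of reading the
values and mapping them into an Avro-backed type with parallelDo. It assumes
the SequenceFile values are Hadoop Text objects (standing in for MyWritable)
and that plain strings are a useful target type. The class names
SeqFileValuesToAvro and TextToString, and the input/output paths taken from
the command line, are hypothetical and only there for illustration.

```java
import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.types.avro.Avros;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.io.Text;

public class SeqFileValuesToAvro {

  // Hypothetical converter: turns each Text value into a java.lang.String.
  public static class TextToString extends MapFn<Text, String> {
    @Override
    public String map(Text input) {
      return input.toString();
    }
  }

  public static void main(String[] args) {
    // args[0] = input SequenceFile path, args[1] = output directory (assumed).
    Pipeline pipeline = new MRPipeline(SeqFileValuesToAvro.class);

    // Read only the value of each record, using the Writable type family.
    PCollection<Text> values =
        pipeline.read(From.sequenceFile(args[0], Writables.writables(Text.class)));

    // Convert to an Avro-backed PType so later stages use the Avro family.
    PCollection<String> strings =
        values.parallelDo(new TextToString(), Avros.strings());

    pipeline.writeTextFile(strings, args[1]);
    pipeline.done();
  }
}
```

For your own Writable class, the same shape should apply: swap Text for your
class and return whatever Avro-friendly type you want from map().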

Hope that helps, feel free to send follow-ups.

Josh

On Mon, Dec 3, 2012 at 2:25 PM, Mike Barretta <[email protected]> wrote:

> As there are no examples of using non-text files as input, I'm trying to
> piece together the steps involved in reading in sequence data.
>
> The main piece looks to be the SeqFileSource (as of 0.5 snapshot) which
> takes a path and a PType.  The PType is where my confusion begins.
>
> How does PType relate to InputFormat and OutputFormat? Do I need to
> implement my own PTypes and the associated in/out MapFns?
>
> Thanks,
> Mike
>
>
>

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
