You're welcome, and thanks for the feedback. As part of documenting the From.java class, I should add functions that do the Writables.writables part for you (i.e., you just pass in the Class<K extends Writable> and Class<V extends Writable> arguments), which would make it easier to get rolling. I'll add a JIRA for it.
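Roughly, the helpers could look like the sketch below. It assumes the existing From.sequenceFile(String, PType) and From.sequenceFile(String, PType, PType) methods and Writables.writables(Class). The SeqFileHelpers class name and both overloads are hypothetical and are not part of Crunch as it stands:

```java
import static org.apache.crunch.types.writable.Writables.writables;

import org.apache.crunch.Source;
import org.apache.crunch.TableSource;
import org.apache.crunch.io.From;
import org.apache.hadoop.io.Writable;

// Hypothetical convenience wrappers; the names are illustrative only.
final class SeqFileHelpers {

  private SeqFileHelpers() {}

  // Reads only the value of each SequenceFile record; the key is ignored.
  static <V extends Writable> Source<V> sequenceFile(
      String path, Class<V> valueClass) {
    return From.sequenceFile(path, writables(valueClass));
  }

  // Reads both key and value, suitable for pipeline.read(...) into a PTable<K, V>.
  static <K extends Writable, V extends Writable> TableSource<K, V> sequenceFile(
      String path, Class<K> keyClass, Class<V> valueClass) {
    return From.sequenceFile(path, writables(keyClass), writables(valueClass));
  }
}
```

With something like that in place, the caller would pass MyWritable.class directly instead of wrapping it in writables(...) first.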
J

On Tue, Dec 4, 2012 at 12:53 PM, Mike Barretta <[email protected]> wrote:

> Josh, thank you, that did help. I'd found the From class, but not the
> Writables.writables.
>
>
> On Mon, Dec 3, 2012 at 5:42 PM, Josh Wills <[email protected]> wrote:
>
>> Hey Mike,
>>
>> Sorry about that, it's mainly b/c they're tedious to write and I've been
>> lazy about it. Here's the skinny.
>>
>> For the SeqFileSource, we assume that you're only interested in the
>> "value" portion of the key-value pair for each record in the SequenceFile.
>> The PType<T> should be for whatever data type you expect to read from that
>> value, which is probably a class that implements Writable. The easy way to
>> do it is to do:
>>
>> import static org.apache.crunch.types.writable.Writables.writables;
>>
>> import org.apache.crunch.io.From;
>>
>> // This reads the value and ignore the key in each record
>> PCollection<MyWritable> in = pipeline.read(From.sequenceFile(<path>,
>> writables(MyWritable.class)));
>>
>> If you want both the key and the value, you need to read the SequenceFile
>> as a PTable<K, V>, as:
>>
>> PTable<MyKey, MyValue> in = pipeline.read(From.sequenceFile(<path>,
>> writables(MyKey.class), writables(MyValue.class)));
>>
>> After you read in the values, you're free to convert them to whatever
>> types you like using parallelDo and friends. I especially recommend using
>> the Avro-based PTypeFamily, since it will significantly outperform the
>> Writable family on jobs that involve complex joins or aggregations.
>>
>> Hope that helps, feel free to send follow-ups.
>>
>> Josh
>>
>>
>>
>> On Mon, Dec 3, 2012 at 2:25 PM, Mike Barretta <[email protected]> wrote:
>>
>>> As there are no examples on using non-text files as input, I'm trying to
>>> piece together the steps involved in reading in sequence data.
>>>
>>> The main piece looks to be the SeqFileSource (as of 0.5 snapshot) which
>>> takes a path and a PType. The PType is where my confusion begins.
>>>
>>> How does PType relate to InputFormat and OutputFormat? Do I need to
>>> implement my own PTypes and the associated in/out MapFns?
>>>
>>> Thanks,
>>> Mike
>>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
