Gotcha. Alright, I'll try a true MR pipeline and see if that improves the situation. Thanks!

On Thu, Dec 13, 2012 at 11:12 AM, Josh Wills <[email protected]> wrote:

> Ah-- that is interesting, and almost certainly the reason why we're
> writing JSON instead of binary Avro.
>
> On Thu, Dec 13, 2012 at 11:08 AM, Jonathan Natkins <[email protected]> wrote:
>
>> It's 2.0.0 and 1.7.0. I've actually only been running MemPipelines thus
>> far, to make sure that I've built the job correctly, so it's possible that
>> that's the issue.
>>
>> On Thu, Dec 13, 2012 at 10:56 AM, Josh Wills <[email protected]> wrote:
>>
>>> That surprises me-- Crunch has its own AvroOutputFormat in order to use
>>> the mapreduce.* APIs, but they delegate much of the work to things like
>>> DatumWriters/encoders/etc. from Avro's core libraries.
>>>
>>> Could I get some detail on hadoop/avro version? Is it just 1.0.x and
>>> Avro 1.7.0?
>>>
>>> J
>>>
>>> On Thu, Dec 13, 2012 at 10:35 AM, Jonathan Natkins <[email protected]> wrote:
>>>
>>>> Out of curiosity, is there a way to write output from a Crunch pipeline
>>>> into an Avro-format file? It seems that if you do the
>>>> collection.write(To.avroFile(path)), you end up just writing JSON. It can
>>>> certainly be read into an Avro object, but it seems like it would be more
>>>> efficient to write binary data to the file, so no parsing has to happen.
>>>>
>>>> Have I missed an API, or is this a missing feature?
>>>>
>>>> Thanks,
>>>> Natty
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
