On Tue, Feb 12, 2013 at 11:27 AM, Mike Barretta <[email protected]> wrote:
> Still debugging this on my end, but to answer your questions:
>
> Compression:
> mapred.map.output.compression.codec =
>     org.apache.hadoop.io.compress.DefaultCodec
> mapred.compress.map.output = true
> mapred.output.compress = true
> mapred.output.compression.type = BLOCK

Yeah, that looks suspect-- let's try with mapred.output.compress = false.

> Groovy:
> Rocks! One thing I'd actually like to see/do would be to translate GPars
> parallelization functions (it has functions like map, reduce, filter,
> size, sum, min/max, sort, groupBy, combine:
> http://gpars.org/guide/guide/3.%20Data%20Parallelism.html) into Crunch.

Very cool-- Scala's Collections API was the inspiration for a lot of
Scrunch's APIs; the same sort of principle applies here.

> On Tue, Feb 12, 2013 at 12:30 PM, Josh Wills <[email protected]> wrote:
>
>> Okay-- how about the cluster compression settings?
>>
>> On Tue, Feb 12, 2013 at 9:17 AM, Mike Barretta <[email protected]> wrote:
>>
>>> Josh,
>>>
>>> Did the swap, but got the same result.
>>>
>>> Inside that function() is something like:
>>>
>>> emitter.emit([
>>>     it.id,
>>>     it.name,
>>>     it.value
>>> ].join("\t"))
>>>
>>> If it isn't obvious, I'm trying to output some HDFS tables containing
>>> serialized objects to TSV files. Log statements at that emit line show
>>> that list.join is spitting out a clear-text string.
>>>
>>> Thanks,
>>> Mike
>>>
>>> On Tue, Feb 12, 2013 at 11:46 AM, Josh Wills <[email protected]> wrote:
>>>
>>>> I haven't seen that one before-- I'm assuming there's some code in the
>>>> function() (or in the assembler?) to force the obj (which is a Thrift
>>>> record at some point?) to be a string before it gets emitted.
>>>>
>>>> To make sure it's not a Crunch bug, would you mind writing:
>>>>
>>>> crunchPipeline.write(collection, To.textFile("$outputPath/$outputDir"))
>>>>
>>>> in place of writeTextFile? We do some additional checks in
>>>> writeTextFile, and I want to be sure we didn't screw something up.
>>>>
>>>> I'm also curious about your Hadoop cluster version and the compression
>>>> settings on your job/cluster (see
>>>> https://ccp.cloudera.com/display/CDHDOC/Snappy+Installation for
>>>> references to the parameters). Also, how do you like Groovy as a
>>>> language for writing Crunch pipelines? I haven't used it since the
>>>> Grails days, but I have some friends at SAS who love it.
>>>>
>>>> Josh
>>>>
>>>> On Tue, Feb 12, 2013 at 8:33 AM, Mike Barretta <[email protected]> wrote:
>>>>
>>>>> I'm running some simple parallelDos which emit Strings. When I write
>>>>> the resulting PCollection out using pipeline.writeTextFile(), I see
>>>>> garbled garbage like:
>>>>>
>>>>> 7?%?Ȳx?B?_L?v(ԭ?,??%?o;;??b-s?aaPXI???O??E;u?%k?????Z7??oD?r???e̼rX??/????)??Ƥ?r3l?R-}?+?!*??@!??Q?6?N??=????????v*B?=H??!0?ve??b?d?uZ7??4?H?i??uw??‹)Pxy
>>>>> ?n%?kۣ???v??xaI?wæ??v^?2i?<93\?G?“???N?
>>>>> ??}?/?EG??mK??*?9;vG??Sb?_L??XD?U?M?ݤo?U??c???qwa?q?ԫ.?9??(????H?o?3i|?7Į????B??n?%?\?uxw??Μ???̢??)-?S?su??Ҁ:?????ݹ??#)??V?7?!???????R?>???EZ}v??8ɿ?????ަ%?~?W?pi?|}?/d#??nr?\a?FUh?Yߠ?|sf%M
>>>>> v)S??. 4$???
>>>>>
>>>>> 3T???^?*?#I????bҀࡑ???x??%?f?Ў??U???h??,~H?T=O
>>>>>
>>>>> ??z]JWt?q?B?e2?
>>>>>
>>>>> The code (Groovy - function() is a passed-in closure that does the
>>>>> emitter.emit()) looks like:
>>>>>
>>>>> collection.parallelDo(this.class.name + ":" + table,
>>>>>         new DoFn<Pair<ColumnKey, ColumnDataArrayWritable>, String>() {
>>>>>     @Override
>>>>>     void process(Pair<ColumnKey, ColumnDataArrayWritable> input,
>>>>>             Emitter<String> emitter) {
>>>>>         input.second().toArray().each {
>>>>>             def obj = assembler.assemble([PetalUtils.toThrift(input.first(), it)])
>>>>>             function(obj, emitter)
>>>>>         }
>>>>>     }
>>>>> }, Writables.strings())
>>>>> crunchPipeline.writeTextFile(collection, "$outputPath/$outputDir")
>>>>>
>>>>> It's worth noting I saw the same output when running a plain word count.
>>>>>
>>>>> Is this something that's my fault?
>>>>> Or the cluster, cluster compression, etc.?

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
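
[Editor's note: the suggestion above boils down to turning off job output compression while debugging. A minimal sketch of that override, using the deprecated mapred-era property names quoted in this thread (a hypothetical fragment for the job configuration or mapred-site.xml, not a verified fix):]

```xml
<!-- Sketch: disable compression of the job's final output so that
     writeTextFile produces readable text. Map-output compression
     (mapred.compress.map.output) only affects intermediate shuffle
     data and can stay on. -->
<property>
  <name>mapred.output.compress</name>
  <value>false</value>
</property>
```

The same override can typically be passed per job on the command line via GenericOptionsParser, e.g. `-Dmapred.output.compress=false`.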
