Okay-- how about the cluster compression settings?
On Tue, Feb 12, 2013 at 9:17 AM, Mike Barretta <[email protected]> wrote:

> Josh,
>
> Did the swap, but got the same result.
>
> Inside that function() is something like:
>
>     emitter.emit([
>         it.id,
>         it.name,
>         it.value
>     ].join("\t"))
>
> If it isn't obvious, I'm trying to output some HDFS tables containing
> serialized objects to TSV files. Log statements at that emit line show
> that list.join spits out a clear-text string.
>
> Thanks,
> Mike
>
>
> On Tue, Feb 12, 2013 at 11:46 AM, Josh Wills <[email protected]> wrote:
>
>> I haven't seen that one before-- I'm assuming there's some code in the
>> function() (or in the assembler?) to force the obj (which is a Thrift
>> record at some point?) to be a string before it gets emitted.
>>
>> To make sure it's not a Crunch bug, would you mind writing:
>>
>>     crunchPipeline.write(collection, To.textFile("$outputPath/$outputDir"))
>>
>> in place of writeTextFile? We do some additional checks in writeTextFile
>> and I want to be sure we didn't screw something up.
>>
>> I'm curious about the Hadoop cluster version and the compression
>> settings on your job/cluster (see
>> https://ccp.cloudera.com/display/CDHDOC/Snappy+Installation for
>> references to the parameters) as well. Also, how do you like Groovy as a
>> language for writing Crunch pipelines? I haven't used it since the Grails
>> days, but I have some friends at SAS who love it.
>>
>> Josh
>>
>>
>> On Tue, Feb 12, 2013 at 8:33 AM, Mike Barretta <[email protected]> wrote:
>>
>>> I'm running some simple parallelDos which emit Strings. When I write
>>> the resulting PCollection out using pipeline.writeTextFile(), I see
>>> garbled garbage like:
>>>
>>> 7?%?Ȳx?B?_L?v(ԭ?,??%?o;;??b-s?aaPXI???O??E;u?%k?????Z7??oD?r???e̼rX??/????)??Ƥ?r3l?R-}?+?!*??@!??Q?6?N??=????????v*B?=H??!0?ve??b?d?uZ7??4?H?i??uw??‹)Pxy
>>> ?n%?kۣ???v??xaI?wæ??v^?2i?<93\?G?“???N?
>>> ??}?/?EG??mK??*?9;vG??Sb?_L??XD?U?M?ݤo?U??c???qwa?q?ԫ.?9??(????H?o?3i|?7Į????B??n?%?\?uxw??Μ???̢??)-?S?su??Ҁ:?????ݹ??#)??V?7?!???????R?>???EZ}v??8ɿ?????ަ%?~?W?pi?|}?/d#??nr?\a?FUh?Yߠ?|sf%M
>>> v)S??. 4$???
>>> 3T???^?*?#I????bҀࡑ???x??%?f?Ў??U???h??,~H?T=O
>>> ??z]JWt?q?B?e2?
>>>
>>> The code (Groovy - function() is a passed-in closure that does the
>>> emitter.emit()) looks like:
>>>
>>>     collection.parallelDo(this.class.name + ":" + table,
>>>             new DoFn<Pair<ColumnKey, ColumnDataArrayWritable>, String>() {
>>>         @Override
>>>         void process(Pair<ColumnKey, ColumnDataArrayWritable> input,
>>>                      Emitter<String> emitter) {
>>>             input.second().toArray().each {
>>>                 def obj = assembler.assemble([PetalUtils.toThrift(input.first(), it)])
>>>                 function(obj, emitter)
>>>             }
>>>         }
>>>     }, Writables.strings())
>>>     crunchPipeline.writeTextFile(collection, "$outputPath/$outputDir")
>>>
>>> It's worth noting I saw the same output when running plain word count.
>>>
>>> Is this something that's my fault? Or the cluster, cluster compression,
>>> etc.?
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>


--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
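[Editorial note: since the same garbage appears even for plain word count, one plausible culprit consistent with Josh's question is cluster-wide output compression. A hedged sketch of the MRv1-era mapred-site.xml properties the linked Cloudera Snappy page documents -- the values shown are illustrative assumptions, not this cluster's actual settings:]

```xml
<!-- mapred-site.xml (or per-job Configuration overrides).
     If final output compression is on cluster-wide, writeTextFile's
     TSV output will be Snappy-compressed bytes, not clear text. -->
<property>
  <name>mapred.output.compress</name>
  <value>false</value> <!-- false => plain-text job output -->
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<!-- Compressing intermediate map output is harmless here; it never
     reaches the final HDFS files. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
```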
