Okay-- how about the cluster compression settings?
On Tue, Feb 12, 2013 at 9:17 AM, Mike Barretta <[email protected]> wrote:

> Josh,
>
> Did the swap, but got the same result.
>
> Inside that function() is something like:
>
>     emitter.emit([
>         it.id,
>         it.name,
>         it.value
>     ].join("\t"))
>
> If it isn't obvious, I'm trying to output some HDFS tables containing
> serialized objects to TSV files. Log statements at that emit line show
> that list.join spits out a clear-text string.
>
> Thanks,
> Mike
>
>
> On Tue, Feb 12, 2013 at 11:46 AM, Josh Wills <[email protected]> wrote:
>
>> I haven't seen that one before-- I'm assuming there's some code in the
>> function() (or in the assembler?) to force the obj (which is a Thrift
>> record at some point?) to be a string before it gets emitted.
>>
>> To make sure it's not a Crunch bug, would you mind writing:
>>
>>     crunchPipeline.write(collection, To.textFile("$outputPath/$outputDir"))
>>
>> in place of writeTextFile? We do some additional checks in writeTextFile
>> and I want to be sure we didn't screw something up.
>>
>> I'm curious about the Hadoop cluster version and the compression
>> settings on your job/cluster (see
>> https://ccp.cloudera.com/display/CDHDOC/Snappy+Installation for
>> references to the parameters) as well. Also, how do you like Groovy as a
>> language for writing Crunch pipelines? I haven't used it since the Grails
>> days, but I have some friends at SAS who love it.
>>
>> Josh
>>
>>
>> On Tue, Feb 12, 2013 at 8:33 AM, Mike Barretta <[email protected]> wrote:
>>
>>> I'm running some simple parallelDos which emit Strings. When I write
>>> the resulting PCollection out using pipeline.writeTextFile(), I see
>>> garbled garbage like:
>>>
>>> 7?%?Ȳx?B?_L?v(ԭ?,??%?o;;??b-s?aaPXI???O??E;u?%k?????Z7??oD?r???e̼rX??/????)??Ƥ?r3l?R-}?+?!*??@!??Q?6?N??=????????v*B?=H??!0?ve??b?d?uZ7??4?H?i??uw??‹)Pxy
>>> ?n%?kۣ???v??xaI?wæ??v^?2i?<93\?G?“???N?
>>> ??}?/?EG??mK??*?9;vG??Sb?_L??XD?U?M?ݤo?U??c???qwa?q?ԫ.?9??(????H?o?3i|?7Į????B??n?%?\?uxw??Μ???̢??)-?S?su??Ҁ:?????ݹ??#)??V?7?!???????R?>???EZ}v??8ɿ?????ަ%?~?W?pi?|}?/d#??nr?\a?FUh?Yߠ?|sf%M
>>> v)S??. 4$???
>>> 3T???^?*?#I????bҀࡑ???x??%?f?Ў??U???h??,~H?T=O
>>> ??z]JWt?q?B?e2?
>>>
>>> The code (Groovy - function() is a passed-in closure that does the
>>> emitter.emit()) looks like:
>>>
>>>     collection.parallelDo(this.class.name + ":" + table,
>>>             new DoFn<Pair<ColumnKey, ColumnDataArrayWritable>, String>() {
>>>         @Override
>>>         void process(Pair<ColumnKey, ColumnDataArrayWritable> input,
>>>                      Emitter<String> emitter) {
>>>             input.second().toArray().each {
>>>                 def obj = assembler.assemble([PetalUtils.toThrift(input.first(), it)])
>>>                 function(obj, emitter)
>>>             }
>>>         }
>>>     }, Writables.strings())
>>>     crunchPipeline.writeTextFile(collection, "$outputPath/$outputDir")
>>>
>>> It's worth noting I saw the same output when running plain word count.
>>>
>>> Is this something that's my fault? Or the cluster, cluster compression,
>>> etc.?
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>


--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
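[Editorial note: since the same garbage appears even for plain word count, one plausible culprit consistent with Josh's question is cluster-wide output compression. A hedged sketch of the MRv1-era mapred-site.xml properties the linked Cloudera Snappy page documents -- the values shown are illustrative assumptions, not this cluster's actual settings:]

```xml
<!-- mapred-site.xml (or per-job Configuration overrides).
     If final output compression is on cluster-wide, writeTextFile's
     TSV output will be Snappy-compressed bytes, not clear text. -->
<property>
  <name>mapred.output.compress</name>
  <value>false</value> <!-- false => plain-text job output -->
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<!-- Compressing intermediate map output is harmless here; it never
     reaches the final HDFS files. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
```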
