Josh,
Did the swap, but got the same result.
Inside that function() is something like:
emitter.emit([
    it.id,
    it.name,
    it.value
].join("\t"))
If it isn't obvious, I'm trying to output some HDFS tables containing
serialized objects to TSV files. Log statements at that emit line show
that join() is spitting out a clear-text string.
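For what it's worth, the join itself can't be where things go wrong; here's a minimal Java sketch (with made-up stand-in values, since the real fields come from Thrift objects) of what that closure produces:

```java
import java.util.List;

public class TsvLine {
    // Stand-in for the Groovy [it.id, it.name, it.value].join("\t");
    // the field values used below are made up for illustration.
    static String toTsv(List<String> fields) {
        return String.join("\t", fields);
    }

    public static void main(String[] args) {
        // The joined line is plain text, so any garbling has to happen
        // downstream of the emit (serialization, compression, etc.).
        System.out.println(toTsv(List.of("42", "widget", "3.14")));
    }
}
```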
Thanks,
Mike
On Tue, Feb 12, 2013 at 11:46 AM, Josh Wills <[email protected]> wrote:
> I haven't seen that one before-- I'm assuming there's some code in the
> function() (or in the assembler?) to force the obj (which is a Thrift
> record at some point?) to be a string before it gets emitted.
>
> To make sure it's not a Crunch bug, would you mind writing:
>
> crunchPipeline.write(collection, To.textFile("$outputPath/$outputDir"))
>
> in place of writeTextFile? We do some additional checks in writeTextFile
> and I want to be sure we didn't screw something up.
>
> I'm curious about the Hadoop cluster version and the compression settings
> on your job/cluster (see
> https://ccp.cloudera.com/display/CDHDOC/Snappy+Installation for
> references to the parameters) as well. Also, how do you like Groovy as a
> language for writing Crunch pipelines? I haven't used it since the Grails
> days, but I have some friends at SAS who love it.
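> For reference, the usual MR1 knobs that would make final output look like
> binary garbage are the job-output compression settings; a hypothetical
> mapred-site.xml fragment (your actual values may differ):
>
> ```xml
> <!-- If output compression is on, writeTextFile produces compressed
>      part files that look garbled when cat'ed directly. -->
> <property>
>   <name>mapred.output.compress</name>
>   <value>true</value>
> </property>
> <property>
>   <name>mapred.output.compression.codec</name>
>   <value>org.apache.hadoop.io.compress.SnappyCodec</value>
> </property>
> ```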
>
> Josh
>
>
> On Tue, Feb 12, 2013 at 8:33 AM, Mike Barretta <[email protected]> wrote:
>
>> I'm running some simple parallelDos that emit Strings. When I write the
>> resulting PCollection out using pipeline.writeTextFile(), I see garbled
>> output like:
>>
>> 7?%?Ȳx?B?_L?v(ԭ?,??%?o;;??b-s?aaPXI???O??E;u?%k?????Z7??oD?r???e̼rX??/????)??Ƥ?r3l?R-}?+?!*??@!??Q?6?N??=????????v*B?=H??!0?ve??b?d?uZ7??4?H?i??uw??‹)Pxy
>> ?n%?kۣ???v??xaI?wæ??v^?2i?<93\?G?“???N?
>> ??}?/?EG??mK??*?9;vG??Sb?_L??XD?U?M?ݤo?U??c???qwa?q?ԫ.?9??(????H?o?3i|?7Į????B??n?%?\?uxw??Μ???̢??)-?S?su??Ҁ:?????ݹ??#)??V?7?!???????R?>???EZ}v??8ɿ?????ަ%?~?W?pi?|}?/d#??nr?\a?FUh?Yߠ?|sf%M
>> v)S??. 4$???
>>
>> 3T???^?*?#I????bҀࡑ???x??%?f?Ў??U???h??,~H?T=O
>>
>> ??z]JWt?q?B?e2?
>>
>> The code (Groovy - function() is a passed-in closure that does the
>> emitter.emit()) looks like:
>>
>> collection.parallelDo(this.class.name + ":" + table,
>>         new DoFn<Pair<ColumnKey, ColumnDataArrayWritable>, String>() {
>>     @Override
>>     void process(Pair<ColumnKey, ColumnDataArrayWritable> input,
>>                  Emitter<String> emitter) {
>>         input.second().toArray().each {
>>             def obj = assembler.assemble([PetalUtils.toThrift(input.first(), it)])
>>             function(obj, emitter)
>>         }
>>     }
>> }, Writables.strings())
>>
>> crunchPipeline.writeTextFile(collection, "$outputPath/$outputDir")
>>
>> It's worth noting that I saw the same output when running a plain word
>> count.
>>
>> Is this something that's my fault? Or the cluster, cluster compression
>> settings, etc.?
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>