Hey Mike, I can't replicate this problem using the MultipleOutputIT (which I think we added as a test for this problem). Which version of Crunch and Hadoop are you using? The 0.5.0-incubating release should be up on the Maven repos if you want to try that out.
J

On Wed, Feb 20, 2013 at 6:43 AM, Josh Wills <[email protected]> wrote:
> Hey Mike,
>
> The code looks right to me. Let me whip up a test and see if I can
> replicate it easily -- is there anything funky beyond what's in your
> snippet that I should be aware of?
>
> J
>
> On Wed, Feb 20, 2013 at 6:02 AM, Mike Barretta <[email protected]> wrote:
>
>> I have a number of "tables" in HDFS, represented as folders containing
>> SequenceFiles of serialized objects. I'm trying to write a tool that will
>> reassemble these objects and output each of the tables into its own CSV
>> file.
>>
>> The wrinkle is that some of the "tables" hold objects with a list of
>> related child objects. Those related objects should get chopped into
>> their own table.
>>
>> Here is essentially what my loop looks like (in Groovy):
>>
>> //loop through each top-level table
>> paths.each { path ->
>>     def source = From.sequenceFile(new Path(path),
>>             Writables.writables(ColumnKey.class),
>>             Writables.writables(ColumnDataArrayWritable.class)
>>     )
>>
>>     //read it in
>>     def data = crunchPipeline.read(source)
>>
>>     //write it out
>>     crunchPipeline.write(
>>             data.parallelDo(new MyDoFn(path), Writables.strings()),
>>             To.textFile("$path/csv")
>>     )
>>
>>     //handle children using same PTable as parent
>>     if (path == TABLE_MESSAGE_DATA) {
>>         messageChildPaths.each { childPath ->
>>             crunchPipeline.write(
>>                     data.parallelDo(new MyDoFn(childPath), Writables.strings()),
>>                     To.textFile("$childPath/csv")
>>             )
>>         }
>>     }
>> }
>>
>> The parent and child jobs generally get grouped into a single map job,
>> but most of the time only some of the child tables get included -- that
>> is, sometimes a child table does not get output. There doesn't seem to
>> be a pattern: sometimes all of them get included, sometimes only 1 or 2.
>>
>> Am I missing something? Is there a way to specify which jobs should be
>> combined?
>>
>> Thanks,
>> Mike

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
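
[Editor's note: the following is not part of the original thread. One workaround sometimes used for this kind of job-planning surprise is to force each loop iteration's outputs to materialize before the next one is planned, by calling the pipeline's run() method at the end of the loop body. This is a hedged sketch only, reusing crunchPipeline, paths, messageChildPaths, TABLE_MESSAGE_DATA, ColumnKey, ColumnDataArrayWritable, and MyDoFn exactly as they appear in Mike's snippet; whether it avoids the missing-output behavior he describes depends on the underlying bug, and the 0.5.0-incubating upgrade Josh suggests is the more direct fix if the bug was resolved there.]

```groovy
// Sketch: same loop as Mike's, but with an explicit run() per iteration so
// each table's writes execute before the next table is read and planned.
// All identifiers below are assumed from the original snippet.
paths.each { path ->
    def source = From.sequenceFile(new Path(path),
            Writables.writables(ColumnKey.class),
            Writables.writables(ColumnDataArrayWritable.class)
    )

    def data = crunchPipeline.read(source)

    crunchPipeline.write(
            data.parallelDo(new MyDoFn(path), Writables.strings()),
            To.textFile("$path/csv")
    )

    if (path == TABLE_MESSAGE_DATA) {
        messageChildPaths.each { childPath ->
            crunchPipeline.write(
                    data.parallelDo(new MyDoFn(childPath), Writables.strings()),
                    To.textFile("$childPath/csv")
            )
        }
    }

    // Execute everything written so far. This gives up the cross-iteration
    // job merging the planner would otherwise do, trading efficiency for
    // predictable per-table output.
    crunchPipeline.run()
}
crunchPipeline.done()
```

The trade-off is that the planner can no longer combine work across iterations into shared MapReduce jobs, so total job count (and runtime) may go up; it is only worth doing if the merged plan is actually dropping outputs.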
