Just once. Each of the parallelDo calls happens within the run() of my Tool, and I kick it off with pipeline.done() rather than pipeline.run(). Is there any difference there?
On Wed, Feb 20, 2013 at 3:25 PM, Josh Wills <[email protected]> wrote:
> Ah, okay. I just got on a train, so I'll have to do a bit of local
> debugging.
>
> Curious: are you explicitly calling run() between each of these jobs, or
> just once after they've all been defined?
>
>
> On Wednesday, February 20, 2013, Mike Barretta <[email protected]> wrote:
> > okay, well, things turned for the worse quickly :)
> > Following the same output above, the following jobs were created:
> > 13/02/20 19:25:26 INFO exec.CrunchJob: Running job "com.digitalreasoning.petal.extract.SynthesysKBExtractor: SeqFile(/Synthesys/MessageData)+[[S1+Text(/Synthesys/export/Contexts)]/[S0+Text(/Synthesys/export/MessageData)]/[S2+Text(/Synthesys/export/ContextualElements)]]"
> > 13/02/20 19:25:26 INFO exec.CrunchJob: Job status available at: <snip>
> > 13/02/20 19:25:28 INFO input.FileInputFormat: Total input paths to process : 40
> > 13/02/20 19:25:29 INFO exec.CrunchJob: Running job "com.digitalreasoning.petal.extract.SynthesysKBExtractor: SeqFile(/Synthesys/ElementData)+S5+Text(/Synthesys/export/ElementData)"
> > 13/02/20 19:25:29 INFO exec.CrunchJob: Job status available at: <snip>
> > 13/02/20 19:25:32 INFO input.FileInputFormat: Total input paths to process : 40
> > 13/02/20 19:25:32 INFO exec.CrunchJob: Running job "com.digitalreasoning.petal.extract.SynthesysKBExtractor: SeqFile(/Synthesys/RelationshipData)+S3+Text(/Synthesys/export/RelationshipData)"
> > notice that the first (MessageData) shows all three output paths while
> > the last (RelationshipData) only shows one. This is despite the previous
> > log messages showing:
> > 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [RelationshipData]
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/export/RelationshipData
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/export/RelationshipStructures
> > *forgive the mismatched paths between this email and my previous - am
> > shorting for brevity, and trying to convey the difference between input
> > and export paths
> >
> > On Wed, Feb 20, 2013 at 2:30 PM, Mike Barretta <[email protected]> wrote:
> >
> > Was using a very early 0.5.0-incubating build, with hadoop 0.20.2, but
> > just did a fresh git pull and now with 0.6.0-incubating, things look
> > better (MessageData and RelationshipData are my parents with children):
> > 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [MessageData]
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/MessageData
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/Contexts
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ContextualElements
> > 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [RelationshipData]
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipData
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipStructures
> > 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ElementData]
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ElementData
> > 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ConceptData]
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ConceptData
> > I'll try a few more times and let you know if anything funky happens.
> > Thanks, as always, for your prompt responses,
> > Mike
> >
> > On Wed, Feb 20, 2013 at 1:06 PM, Josh Wills <[email protected]> wrote:
> >
> > Hey Mike,
> > I can't replicate this problem using the MultipleOutputIT (which I think
> > we added as a test for this problem.) Which version of Crunch and Hadoop
> > are you using? The 0.5.0-incubating release should be up on the maven
> > repos if you want to try that out.
> > J
> >
> > On Wed, Feb 20, 2013 at 6:43 AM, Josh Wills <[email protected]> wrote:
> >
> > Hey Mike,
> > The code looks right to me. Let me whip up a test and see if I can
> > replicate it easily-- is there anything funky beyond what's in your
> > snippet that I should be aware of?
> > J
> >
> > On Wed, Feb 20, 2013 at 6:02 AM, Mike Barretta <[email protected]> wrote:
> >
> > I have a number of "tables" in HDFS, represented as folders containing
> > SequenceFiles of serialized objects. I'm trying to write a tool that will
> > reassemble these objects and output each of the tables into its own CSV
> > file.
> > The wrinkle is that some of the "tables" hold objects with a list of
> > related child objects. Those related should get chopped into their own
> > table.
> > Here is essentially what my loop looks like (in Groovy):
> > //loop through each top-level table
> > paths.each { path ->
> >     def source = From.sequenceFile(new Path(path),
> >         Writables.writables(ColumnKey.class),
> >         Writables.writables(ColumnDataArrayWritable.class)
> >     )
> >     //read it in
> >     def data = crunchPipeline.read(source)
> >     //write it out
> >     crunchPipeline.write(
> >         data.parallelDo(new MyDoFn(path), Writables.strings()),
> >         To.textFile("$path/csv")
> >     )
> >     //handle children using same PTable as pare
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
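
For readers following the thread, here is a rough, dependency-free sketch of the reshaping the original message describes: parent objects that carry a list of child objects are written as rows of one table, and the children are written as rows of a second table. It deliberately uses plain Java collections instead of Crunch, so it says nothing about how the job planner groups outputs. The names TableSplitter, Parent and Child, the two-field layout, and the choice to repeat the parent id in each child row are all assumptions made for illustration; they are not taken from the tool in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: TableSplitter, Parent and Child are hypothetical
// names, not types from Crunch or from the tool discussed in the thread.
final class TableSplitter {

    /** Hypothetical child object that belongs in its own table. */
    static final class Child {
        final String id;
        final String value;

        Child(String id, String value) {
            this.id = id;
            this.value = value;
        }
    }

    /** Hypothetical top-level object that carries a list of children. */
    static final class Parent {
        final String id;
        final String name;
        final List<Child> children;

        Parent(String id, String name, List<Child> children) {
            this.id = id;
            this.name = name;
            this.children = children;
        }
    }

    /** Quotes a field only when it contains a comma, a quote or a newline. */
    static String csvField(String s) {
        if (s.contains(",") || s.contains("\"") || s.contains("\n")) {
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }
        return s;
    }

    /** Joins already-escaped fields into one CSV row. */
    static String csvRow(String... fields) {
        List<String> escaped = new ArrayList<>();
        for (String f : fields) {
            escaped.add(csvField(f));
        }
        return String.join(",", escaped);
    }

    /**
     * Returns a map from table name to CSV rows. Each child row starts with
     * the parent id (an assumed foreign key) so the two tables can be rejoined.
     */
    static Map<String, List<String>> split(String parentTable,
                                           String childTable,
                                           List<Parent> parents) {
        List<String> parentRows = new ArrayList<>();
        List<String> childRows = new ArrayList<>();
        for (Parent p : parents) {
            parentRows.add(csvRow(p.id, p.name));
            for (Child c : p.children) {
                childRows.add(csvRow(p.id, c.id, c.value));
            }
        }
        Map<String, List<String>> tables = new LinkedHashMap<>();
        tables.put(parentTable, parentRows);
        tables.put(childTable, childRows);
        return tables;
    }

    public static void main(String[] args) {
        List<Parent> parents = Arrays.asList(
            new Parent("m1", "hello, world",
                Arrays.asList(new Child("c1", "ctx-a"), new Child("c2", "ctx-b"))),
            new Parent("m2", "plain", new ArrayList<Child>()));
        Map<String, List<String>> tables = split("MessageData", "Contexts", parents);
        for (Map.Entry<String, List<String>> e : tables.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

In a Crunch pipeline the two row lists would presumably correspond to two separate parallelDo outputs derived from the same input collection, each written to its own target; the sketch only shows the row-level transformation.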
