Okay, well, things took a turn for the worse quickly :) Following the same output as above, these jobs were created:
13/02/20 19:25:26 INFO exec.CrunchJob: Running job "com.digitalreasoning.petal.extract.SynthesysKBExtractor: SeqFile(/Synthesys/MessageData)+[[S1+Text(/Synthesys/export/Contexts)]/[S0+Text(/Synthesys/export/MessageData)]/[S2+Text(/Synthesys/export/ContextualElements)]]"
13/02/20 19:25:26 INFO exec.CrunchJob: Job status available at: <snip>
13/02/20 19:25:28 INFO input.FileInputFormat: Total input paths to process : 40
13/02/20 19:25:29 INFO exec.CrunchJob: Running job "com.digitalreasoning.petal.extract.SynthesysKBExtractor: SeqFile(/Synthesys/ElementData)+S5+Text(/Synthesys/export/ElementData)"
13/02/20 19:25:29 INFO exec.CrunchJob: Job status available at: <snip>
13/02/20 19:25:32 INFO input.FileInputFormat: Total input paths to process : 40
13/02/20 19:25:32 INFO exec.CrunchJob: Running job "com.digitalreasoning.petal.extract.SynthesysKBExtractor: SeqFile(/Synthesys/RelationshipData)+S3+Text(/Synthesys/export/RelationshipData)"

Notice that the first job (MessageData) shows all three output paths, while the last (RelationshipData) shows only one. This is despite the earlier log messages showing both of its targets being registered:

13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [RelationshipData]
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/export/RelationshipData
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/export/RelationshipStructures

*Forgive the mismatched paths between this email and my previous one - I'm shortening for brevity, and trying to convey the difference between the input and export paths.
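In the meantime, the brute-force workaround I'm going to try is calling run() after every write, so each output path is executed as its own plan and the planner never gets a chance to fuse (and apparently drop) sibling outputs. A rough sketch against the loop from my original message below - untested, and it costs one MR pass per output:

import org.apache.crunch.io.From
import org.apache.crunch.io.To
import org.apache.crunch.types.writable.Writables
import org.apache.hadoop.fs.Path

//same loop as before, but every write is followed by run(), so nothing
//is left pending for the planner to combine with a later output
paths.each { path ->
    def source = From.sequenceFile(new Path(path),
        Writables.writables(ColumnKey.class),
        Writables.writables(ColumnDataArrayWritable.class)
    )
    def data = crunchPipeline.read(source)

    crunchPipeline.write(
        data.parallelDo(new MyDoFn(path), Writables.strings()),
        To.textFile("$path/csv")
    )
    crunchPipeline.run() //execute the parent write before declaring children

    if (path == TABLE_MESSAGE_DATA) {
        messageChildPaths.each { childPath ->
            crunchPipeline.write(
                data.parallelDo(new MyDoFn(childPath), Writables.strings()),
                To.textFile("$childPath/csv")
            )
            crunchPipeline.run() //one executed plan per child output
        }
    }
}
crunchPipeline.done() //runs anything still pending and cleans up temp files

Obviously that forfeits the job fusion entirely, but correctness first.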
On Wed, Feb 20, 2013 at 2:30 PM, Mike Barretta <[email protected]> wrote:

> Was using a very early 0.5.0-incubating build, with hadoop 0.20.2, but
> just did a fresh git pull, and now with 0.6.0-incubating things look better
> (MessageData and RelationshipData are my parents with children):
>
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [MessageData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/MessageData
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/Contexts
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ContextualElements
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [RelationshipData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipData
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipStructures
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ElementData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ElementData
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ConceptData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ConceptData
>
> I'll try a few more times and let you know if anything funky happens.
>
> Thanks, as always, for your prompt responses,
> Mike
>
> On Wed, Feb 20, 2013 at 1:06 PM, Josh Wills <[email protected]> wrote:
>
>> Hey Mike,
>>
>> I can't replicate this problem using the MultipleOutputIT (which I think
>> we added as a test for this problem). Which version of Crunch and Hadoop
>> are you using? The 0.5.0-incubating release should be up on the Maven repos
>> if you want to try that out.
>>
>> J
>>
>> On Wed, Feb 20, 2013 at 6:43 AM, Josh Wills <[email protected]> wrote:
>>
>>> Hey Mike,
>>>
>>> The code looks right to me. Let me whip up a test and see if I can
>>> replicate it easily -- is there anything funky beyond what's in your
>>> snippet that I should be aware of?
>>>
>>> J
>>>
>>> On Wed, Feb 20, 2013 at 6:02 AM, Mike Barretta <[email protected]> wrote:
>>>
>>>> I have a number of "tables" in HDFS, represented as folders containing
>>>> SequenceFiles of serialized objects. I'm trying to write a tool that will
>>>> reassemble these objects and output each of the tables into its own CSV
>>>> file.
>>>>
>>>> The wrinkle is that some of the "tables" hold objects with a list of
>>>> related child objects. Those related objects should get chopped out into
>>>> their own tables.
>>>>
>>>> Here is essentially what my loop looks like (in Groovy):
>>>>
>>>> //loop through each top-level table
>>>> paths.each { path ->
>>>>     def source = From.sequenceFile(new Path(path),
>>>>         Writables.writables(ColumnKey.class),
>>>>         Writables.writables(ColumnDataArrayWritable.class)
>>>>     )
>>>>
>>>>     //read it in
>>>>     def data = crunchPipeline.read(source)
>>>>
>>>>     //write it out
>>>>     crunchPipeline.write(
>>>>         data.parallelDo(new MyDoFn(path), Writables.strings()),
>>>>         To.textFile("$path/csv")
>>>>     )
>>>>
>>>>     //handle children using the same PTable as the parent
>>>>     if (path == TABLE_MESSAGE_DATA) {
>>>>         messageChildPaths.each { childPath ->
>>>>             crunchPipeline.write(
>>>>                 data.parallelDo(new MyDoFn(childPath), Writables.strings()),
>>>>                 To.textFile("$childPath/csv")
>>>>             )
>>>>         }
>>>>     }
>>>> }
>>>>
>>>> The parent and child jobs generally get grouped into a single map job,
>>>> but most of the time only some of the child tables get included - which
>>>> is to say, sometimes a child table does not get output. There doesn't seem
>>>> to be a pattern: sometimes all of them get included, sometimes only 1 or 2.
>>>>
>>>> Am I missing something? Is there a way to specify which jobs should be
>>>> combined?
>>>>
>>>> Thanks,
>>>> Mike
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
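P.S. For anyone trying to reproduce this outside of our extractor, here is a minimal self-contained sketch of the same multiple-output pattern - one parent PCollection feeding two independent text targets. The paths and the TagFn helper are invented for illustration; this is not the actual MultipleOutputIT code:

import org.apache.crunch.DoFn
import org.apache.crunch.Emitter
import org.apache.crunch.impl.mr.MRPipeline
import org.apache.crunch.io.To
import org.apache.crunch.types.writable.Writables

//prefixes each input line with a tag so the two outputs are distinguishable
class TagFn extends DoFn<String, String> {
    String tag
    TagFn(String tag) { this.tag = tag }
    void process(String input, Emitter<String> emitter) {
        emitter.emit("${tag}\t${input}".toString())
    }
}

def pipeline = new MRPipeline(TagFn.class)
def lines = pipeline.readTextFile("/tmp/multi-out-input") //hypothetical input dir

//two writes derived from the same parent PCollection - the pattern at issue
pipeline.write(lines.parallelDo(new TagFn("a"), Writables.strings()),
    To.textFile("/tmp/multi-out-a"))
pipeline.write(lines.parallelDo(new TagFn("b"), Writables.strings()),
    To.textFile("/tmp/multi-out-b"))

def result = pipeline.done()
assert result.succeeded() //both output dirs should now contain part files

If the bug shows up, one of the two output directories will be missing or empty even though both targets were declared before done() was called.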
