Hey Dave,

The way to force a sequential run would be to call pipeline.run() after you write D to HDFS and before you declare the operations in step 6. What we would really want here is a single MapReduce job that wrote side outputs on the map side to create the dataset in step D, but we don't have support for side outputs in maps yet. Worth filing a JIRA, I think.
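To make that concrete, here's a rough sketch of the ordering (class names, paths, and DoFns like ToAvroFn/ExtractStringsFn are placeholders I made up, not from your code; assumes the standard MRPipeline API):

```java
// Sketch only -- illustrative names, assumes Apache Crunch's MRPipeline API.
Pipeline pipeline = new MRPipeline(MyJob.class, new Configuration());

PCollection<MyRecord> aPrime = pipeline.readTextFile("/data/a")
    .parallelDo(new ToAvroFn(), Avros.records(MyRecord.class));   // step 1
PCollection<MyRecord> bPrime = pipeline.readTextFile("/data/b")
    .parallelDo(new ToAvroFn(), Avros.records(MyRecord.class));   // step 2
PCollection<MyRecord> c = aPrime.union(bPrime);                   // step 3
PCollection<MyRecord> d = c.filter(new MyFilterFn());             // step 4

pipeline.write(d, To.avroFile("/data/d"));                        // step 5

// Calling run() here forces job A to complete before the planner
// sees step 6, so the second job reads D back from HDFS instead of
// recomputing steps 1-4.
pipeline.run();

PCollection<String> e =
    d.parallelDo(new ExtractStringsFn(), Avros.strings());        // 6a
PTable<String, Long> f = e.count();                               // 6b
pipeline.write(
    f.parallelDo(new ToPutsFn(), Writables.writables(Put.class)), // 6c
    new HBaseTarget("my_table"));                                 // 6d
pipeline.done();
```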
Thanks!
Josh

On Tue, Jan 15, 2013 at 3:41 AM, Dave Beech <[email protected]> wrote:
> Hi all,
>
> I've written a Crunch pipeline and have a question about the resulting
> mapreduce jobs. Please see the steps below:
>
> 1.) Load text data A and convert to avro -> A'
> 2.) Load text data B and convert to avro -> B'
> 3.) Union A' and B' -> C
> 4.) Filter C -> D
>
> 5.) Write D to HDFS
>
> 6a.) Use DoFn to extract strings from D -> E
> 6b.) Aggregate E (count strings) -> F
> 6c.) Convert F to HBase puts -> G
> 6d.) Write G to HBase
>
> Running my code generates two mapreduce jobs which run in parallel:
> job A) runs steps 1, 2, 3, 4, 5
> job B) runs steps 1, 2, 3, 4, 6abcd
>
> Without knowing much about the planning algorithm, what I expected to see
> was more like:
> job A) runs steps 1, 2, 3, 4, 5
> job B) runs after A, reads back the data written in step 5 and does steps
> 6abcd.
>
> So the jobs would run sequentially, not in parallel, but in doing so avoid
> reading the full raw input data and performing the conversion/filtering
> logic twice.
>
> Is there a way I should order my pipeline calls, or can I give hints to the
> mapreduce compiler to do the jobs in this way? Does the scale factor have
> any influence on this?
>
> Thanks,
> Dave

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
