Hi all, I've written a Crunch pipeline and have a question about the resulting MapReduce jobs. Please see the steps below:
1) Load text data A and convert to Avro -> A'
2) Load text data B and convert to Avro -> B'
3) Union A' and B' -> C
4) Filter C -> D
5) Write D to HDFS
6a) Use a DoFn to extract strings from D -> E
6b) Aggregate E (count strings) -> F
6c) Convert F to HBase Puts -> G
6d) Write G to HBase

Running my code generates two MapReduce jobs which run in parallel:

job A) runs steps 1, 2, 3, 4, 5
job B) runs steps 1, 2, 3, 4, 6abcd

Without knowing much about the planning algorithm, what I expected to see was more like:

job A) runs steps 1, 2, 3, 4, 5
job B) runs after A, reads back the data written in step 5, and does steps 6abcd

The jobs would then run sequentially rather than in parallel, but would avoid reading the full raw input data and performing the conversion/filtering logic twice.

Is there a way I should order my pipeline calls, or can I give hints to the MapReduce compiler to get the jobs scheduled this way? Does the scale factor have any influence on this?

Thanks,
Dave
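P.S. In case it helps, here is a rough sketch of my pipeline calls. Class names, paths, and record types are simplified placeholders (TextToAvroFn, MyFilterFn, ExtractStringsFn, CountsToPutsFn, and MyRecord stand in for my real implementations), so this is an illustration of the shape of the pipeline rather than the actual code:

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.To;
import org.apache.crunch.types.avro.Avros;

Pipeline pipeline = new MRPipeline(MyDriver.class);

// Steps 1 and 2: load the raw text inputs and convert each to Avro records
PCollection<MyRecord> aPrime = pipeline.readTextFile("/data/a")
    .parallelDo(new TextToAvroFn(), Avros.records(MyRecord.class));
PCollection<MyRecord> bPrime = pipeline.readTextFile("/data/b")
    .parallelDo(new TextToAvroFn(), Avros.records(MyRecord.class));

// Step 3: union the two Avro collections; step 4: filter the result
PCollection<MyRecord> d = aPrime.union(bPrime).filter(new MyFilterFn());

// Step 5: write the filtered records to HDFS as Avro
d.write(To.avroFile("/data/d"));

// Steps 6a/6b: extract strings from the filtered records and count them
PTable<String, Long> counts = d
    .parallelDo(new ExtractStringsFn(), Avros.strings())
    .count();

// Steps 6c/6d: convert the counts to HBase Puts and write them out
// (CountsToPutsFn and the HBase target setup are elided here)
counts.parallelDo(new CountsToPutsFn(), putsType)
    .write(hbaseTarget);

pipeline.done();
```

The planner splits the graph at the write in step 5: both output branches (the HDFS write and the HBase write) depend on D, so it compiles one job per output and runs them in parallel, each recomputing steps 1-4.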
