Hey Dave,

The way to force a sequential run would be to call pipeline.run() after you write D to HDFS and before you declare the operations in step 6. What we would really want here is a single MapReduce job that wrote side outputs on the map side to create the dataset in step D, but we don't have support for side outputs in maps yet. Worth filing a JIRA, I think.
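To make that concrete, here's a rough sketch of the ordering (class names, paths, and DoFns like ToAvroFn/ExtractStringsFn are placeholders I made up, not from your code; assumes the standard MRPipeline API):

```java
// Sketch only -- illustrative names, assumes Apache Crunch's MRPipeline API.
Pipeline pipeline = new MRPipeline(MyJob.class, new Configuration());

PCollection<MyRecord> aPrime = pipeline.readTextFile("/data/a")
    .parallelDo(new ToAvroFn(), Avros.records(MyRecord.class));   // step 1
PCollection<MyRecord> bPrime = pipeline.readTextFile("/data/b")
    .parallelDo(new ToAvroFn(), Avros.records(MyRecord.class));   // step 2
PCollection<MyRecord> c = aPrime.union(bPrime);                   // step 3
PCollection<MyRecord> d = c.filter(new MyFilterFn());             // step 4

pipeline.write(d, To.avroFile("/data/d"));                        // step 5

// Calling run() here forces job A to complete before the planner
// sees step 6, so the second job reads D back from HDFS instead of
// recomputing steps 1-4.
pipeline.run();

PCollection<String> e =
    d.parallelDo(new ExtractStringsFn(), Avros.strings());        // 6a
PTable<String, Long> f = e.count();                               // 6b
pipeline.write(
    f.parallelDo(new ToPutsFn(), Writables.writables(Put.class)), // 6c
    new HBaseTarget("my_table"));                                 // 6d
pipeline.done();
```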
Thanks!
Josh

On Tue, Jan 15, 2013 at 3:41 AM, Dave Beech <[email protected]> wrote:
> Hi all,
>
> I've written a Crunch pipeline and have a question about the resulting
> mapreduce jobs. Please see the steps below:
>
> 1.) Load text data A and convert to avro -> A'
> 2.) Load text data B and convert to avro -> B'
> 3.) Union A' and B' -> C
> 4.) Filter C -> D
>
> 5.) Write D to HDFS
>
> 6a.) Use DoFn to extract strings from D -> E
> 6b.) Aggregate E (count strings) -> F
> 6c.) Convert F to HBase puts -> G
> 6d.) Write G to HBase
>
> Running my code generates two mapreduce jobs which run in parallel:
> job A) runs steps 1, 2, 3, 4, 5
> job B) runs steps 1, 2, 3, 4, 6abcd
>
> Without knowing much about the planning algorithm, what I expected to see
> was more like:
> job A) runs steps 1, 2, 3, 4, 5
> job B) runs after A, reads back the data written in step 5 and does steps
> 6abcd.
>
> So the jobs would run sequentially, not in parallel, but in doing so avoid
> reading the full raw input data and performing the conversion/filtering
> logic twice.
>
> Is there a way I should order my pipeline calls, or can I give hints to the
> mapreduce compiler to do the jobs in this way? Does the scale factor have
> any influence on this?
>
> Thanks,
> Dave

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
