Hi all, I've written a Crunch pipeline and have a question about the resulting MapReduce jobs. Please see the steps below:
1) Load text data A and convert to Avro -> A'
2) Load text data B and convert to Avro -> B'
3) Union A' and B' -> C
4) Filter C -> D
5) Write D to HDFS
6a) Use a DoFn to extract strings from D -> E
6b) Aggregate E (count strings) -> F
6c) Convert F to HBase Puts -> G
6d) Write G to HBase

Running my code generates two MapReduce jobs which run in parallel:

job A) runs steps 1, 2, 3, 4, 5
job B) runs steps 1, 2, 3, 4, 6abcd

Without knowing much about the planning algorithm, what I expected to see was more like:

job A) runs steps 1, 2, 3, 4, 5
job B) runs after A, reads back the data written in step 5, and does steps 6abcd

The jobs would then run sequentially rather than in parallel, but would avoid reading the full raw input data and performing the conversion/filtering logic twice.

Is there a way I should order my pipeline calls, or can I give hints to the MapReduce compiler to get the jobs scheduled this way? Does the scale factor have any influence on this?

Thanks,
Dave
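P.S. In case it helps, here is a rough sketch of my pipeline calls. Class names, paths, and record types are simplified placeholders (TextToAvroFn, MyFilterFn, ExtractStringsFn, CountsToPutsFn, and MyRecord stand in for my real implementations), so this is an illustration of the shape of the pipeline rather than the actual code:

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.To;
import org.apache.crunch.types.avro.Avros;

Pipeline pipeline = new MRPipeline(MyDriver.class);

// Steps 1 and 2: load the raw text inputs and convert each to Avro records
PCollection<MyRecord> aPrime = pipeline.readTextFile("/data/a")
    .parallelDo(new TextToAvroFn(), Avros.records(MyRecord.class));
PCollection<MyRecord> bPrime = pipeline.readTextFile("/data/b")
    .parallelDo(new TextToAvroFn(), Avros.records(MyRecord.class));

// Step 3: union the two Avro collections; step 4: filter the result
PCollection<MyRecord> d = aPrime.union(bPrime).filter(new MyFilterFn());

// Step 5: write the filtered records to HDFS as Avro
d.write(To.avroFile("/data/d"));

// Steps 6a/6b: extract strings from the filtered records and count them
PTable<String, Long> counts = d
    .parallelDo(new ExtractStringsFn(), Avros.strings())
    .count();

// Steps 6c/6d: convert the counts to HBase Puts and write them out
// (CountsToPutsFn and the HBase target setup are elided here)
counts.parallelDo(new CountsToPutsFn(), putsType)
    .write(hbaseTarget);

pipeline.done();
```

The planner splits the graph at the write in step 5: both output branches (the HDFS write and the HBase write) depend on D, so it compiles one job per output and runs them in parallel, each recomputing steps 1-4.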
