Chaining the jobs is a fantastically inefficient solution. If you use Pig or Cascading, the optimizer will glue all of your map functions into a single mapper. The result is something like:
(mapper1 -> mapper2 -> mapper3) => reducer

Here the parentheses indicate that all of the map functions are executed as a single function formed by composing mapper1, mapper2, and mapper3. Writing multiple jobs to do this forces *lots* of unnecessary traffic to your persistent store and lots of unnecessary synchronization. You can do this optimization by hand (a ChainMapper sketch follows the quoted thread below), but using a higher-level language is often better for maintenance.

On Mon, Mar 4, 2013 at 1:52 PM, Russell Jurney <russell.jur...@gmail.com> wrote:

> You can chain MR jobs with Oozie, but I would suggest using Cascading,
> Pig, or Hive. You can do this in a couple of lines of code, I suspect.
> Two map reduce jobs should not pose any kind of challenge with the
> right tools.
>
>
> On Monday, March 4, 2013, Sandy Ryza wrote:
>
>> Hi Aji,
>>
>> Oozie is a mature project for managing MapReduce workflows.
>> http://oozie.apache.org/
>>
>> -Sandy
>>
>>
>> On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody <justin.wo...@gmail.com> wrote:
>>
>>> Aji,
>>>
>>> Why don't you just chain the jobs together?
>>> http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
>>>
>>> Justin
>>>
>>> On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis <aji1...@gmail.com> wrote:
>>> > Russell, thanks for the link.
>>> >
>>> > I am interested in finding a solution (if one is out there) where
>>> > Mapper1 outputs a custom object and Mapper2 can use that as input.
>>> > One way to do this, obviously, is by writing to Accumulo, in my case.
>>> > But is there another solution for this:
>>> >
>>> > List<MyObject> ----> Input to Job
>>> >
>>> > MyObject ---> Input to Mapper1 (process MyObject) ----> Output
>>> > <MyObjectId, MyObject>
>>> >
>>> > <MyObjectId, MyObject> are Input to Mapper2 ... and so on
>>> >
>>> > Ideas?
>>> >
>>> >
>>> > On Mon, Mar 4, 2013 at 10:00 AM, Russell Jurney
>>> > <russell.jur...@gmail.com> wrote:
>>> >>
>>> >> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
>>> >>
>>> >> AccumuloStorage for Pig comes with Accumulo. The easiest way would
>>> >> be to try it.
>>> >>
>>> >> Russell Jurney http://datasyndrome.com
>>> >>
>>> >> On Mar 4, 2013, at 5:30 AM, Aji Janis <aji1...@gmail.com> wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >> I have an MR job design with a flow like this: Mapper1 -> Mapper2 ->
>>> >> Mapper3 -> Reducer1. Mapper1's input is an Accumulo table. M1's
>>> >> output goes to M2, and so on. Finally, the Reducer writes output to
>>> >> Accumulo.
>>> >>
>>> >> Questions:
>>> >>
>>> >> 1) Has anyone tried something like this before? Are there any
>>> >> workflow control APIs (in or outside of Hadoop) that can help me
>>> >> set up the job like this? Or am I limited to using Quartz for this?
>>> >> 2) If both M2 and M3 needed to write some data to the same two
>>> >> tables in Accumulo, is it possible to do so? Are there any good
>>> >> Accumulo MapReduce jobs you can point me to? Blogs/pages that I can
>>> >> use for reference (starting point/best practices)?
>>> >>
>>> >> Thank you in advance for any suggestions!
>>> >>
>>> >> Aji
>>
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
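For concreteness, here is a minimal sketch of the by-hand composition described at the top of this message, using Hadoop's ChainMapper/ChainReducer (the new-API classes in org.apache.hadoop.mapreduce.lib.chain). The stage classes below are trivial placeholders standing in for the Mapper1/Mapper2/Mapper3/Reducer1 discussed in the thread, and the Text types and file paths are assumptions; Aji's job would plug in Accumulo's input/output formats instead.

    // A minimal, compilable sketch: three chained mappers and one reducer,
    // all executing as (mapper1 -> mapper2 -> mapper3) => reducer within
    // a single MapReduce job. Stage classes are placeholders, not the
    // poster's real code.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
    import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainedJobDriver {

      // First stage consumes the job's raw input (byte offset, line of text).
      public static class Mapper1 extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(new Text("stage1"), value); // placeholder transform
        }
      }

      // Later stages pass records through unchanged (identity map/reduce).
      public static class Mapper2 extends Mapper<Text, Text, Text, Text> { }
      public static class Mapper3 extends Mapper<Text, Text, Text, Text> { }
      public static class Reducer1 extends Reducer<Text, Text, Text, Text> { }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "(m1 -> m2 -> m3) => reducer");
        job.setJarByClass(ChainedJobDriver.class);

        // Each addMapper appends a stage; a stage's input types must match
        // the previous stage's output types. All three run inside a single
        // map task, so no intermediate records hit the persistent store
        // between stages.
        ChainMapper.addMapper(job, Mapper1.class, LongWritable.class, Text.class,
            Text.class, Text.class, new Configuration(false));
        ChainMapper.addMapper(job, Mapper2.class, Text.class, Text.class,
            Text.class, Text.class, new Configuration(false));
        ChainMapper.addMapper(job, Mapper3.class, Text.class, Text.class,
            Text.class, Text.class, new Configuration(false));

        // One reducer closes the chain.
        ChainReducer.setReducer(job, Reducer1.class, Text.class, Text.class,
            Text.class, Text.class, new Configuration(false));

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // File paths stand in for Aji's Accumulo input/output formats.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

This still pays the shuffle and job-setup cost once, which is the point: the composition happens inside one map task rather than across three jobs, each writing its intermediate output back to the persistent store.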