Re: Chaining Multiple Map reduce jobs.

2009-04-08 Thread Lukáš Vlček
Hi,
I am by no means a Hadoop expert, but I think you cannot start a Map task
until the previous Reduce has finished. This means that you
probably have to store the Map output to disk first, because (a) it may
not fit into memory, and (b) you would risk data loss if the system crashed.
As for job chaining, you can check the JobControl class (
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html)

Also, you can look at https://issues.apache.org/jira/browse/HADOOP-3702
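To illustrate the chaining pattern Lukáš describes, here is a rough plain-Python stand-in (not the Hadoop API): each job's reduce output becomes the next job's map input, which is why it normally has to pass through storage between jobs.

```python
from itertools import groupby
from operator import itemgetter

def run_job(records, mapper, reducer):
    """One MapReduce round: map every record, shuffle (sort by key), reduce each key group."""
    mapped = [kv for rec in records for kv in mapper(rec)]
    mapped.sort(key=itemgetter(0))  # stand-in for the shuffle/sort phase
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(mapped, key=itemgetter(0))]

# Job 1: word count over lines of text.
def count_mapper(line):
    return [(word, 1) for word in line.split()]

def count_reducer(word, ones):
    return (word, sum(ones))

# Job 2: invert to group words by their count -- depends on job 1's output.
def invert_mapper(pair):
    word, n = pair
    return [(n, word)]

def invert_reducer(n, words):
    return (n, sorted(words))

lines = ["a b a", "b c"]
job1_out = run_job(lines, count_mapper, count_reducer)       # [('a', 2), ('b', 2), ('c', 1)]
job2_out = run_job(job1_out, invert_mapper, invert_reducer)  # [(1, ['c']), (2, ['a', 'b'])]
```

In real Hadoop the `job1_out` list would be files on HDFS, and JobControl's role is only to express the "job 2 runs after job 1" dependency, not to bypass that intermediate storage.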

Regards,
Lukas

On Wed, Apr 8, 2009 at 11:30 PM, asif md asif.d...@gmail.com wrote:

 hi everyone,

 I have to chain multiple MapReduce jobs (actually 2 to 4 jobs); each job
 depends on the output of the preceding one. In the reducer of each job I'm
 doing very little, just grouping by key from the maps. I want to give the
 output of one MapReduce job to the next job without having to go to
 disk. Does anyone have any ideas on how to do this?

 Thanx.




-- 
http://blog.lukas-vlcek.com/


Re: Chaining Multiple Map reduce jobs.

2009-04-08 Thread Nathan Marz
You can also try decreasing the replication factor for the
intermediate files between jobs. This will make writing those files
faster.
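One way to do what Nathan suggests (an assumption about where you would set it, adjust to your setup) is to lower `dfs.replication` in the configuration of the jobs that write intermediate output, trading durability of those files for faster writes:

```xml
<!-- Lower HDFS replication for intermediate job output only;
     leave the cluster-wide default alone for final results. -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```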


On Apr 8, 2009, at 3:14 PM, Lukáš Vlček wrote:






Re: Chaining Multiple Map reduce jobs.

2009-04-08 Thread jason hadoop
Chapter 8 of my book covers this in detail; the alpha chapter should be
available at the Apress web site.
Chain mapping rules!
http://www.apress.com/book/view/1430219424
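The "chain mapping" referred to here lets several mappers run back-to-back inside one job, so records flow mapper-to-mapper in memory instead of through HDFS between jobs. A rough plain-Python stand-in for that composition (not the Hadoop API):

```python
def chain_mappers(*mappers):
    """Compose mappers: each record's outputs feed the next mapper, all in memory."""
    def chained(record):
        records = [record]
        for m in mappers:
            records = [out for r in records for out in m(r)]
        return records
    return chained

# Hypothetical stages: tokenize a line, lowercase each token, drop short tokens.
tokenize = lambda line: line.split()
lower    = lambda tok: [tok.lower()]
longish  = lambda tok: [tok] if len(tok) > 2 else []

pipeline = chain_mappers(tokenize, lower, longish)
result = pipeline("The Cat sat")  # ['the', 'cat', 'sat']
```

This only helps when the extra stages are map-only; anything that needs a fresh grouping by key still requires another shuffle, i.e. another job.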

On Wed, Apr 8, 2009 at 3:30 PM, Nathan Marz nat...@rapleaf.com wrote:


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422