Re: When a Reduce Task starts?

2010-12-20 Thread Harsh J
On Tue, Dec 21, 2010 at 7:23 AM, li ping wrote: > I think the reduce can be started before all of the map finished. > See the configuration item in mapred-site.xml > >   mapred.reduce.slowstart.completed.maps >   0.05 >   Fraction of the number of maps in the job which should be >   complete befor

Re: When a Reduce Task starts?

2010-12-20 Thread real great..
I don't think reduce jobs can start unless all maps are over, because the values are accumulated only at the end of the map stage. Are you using the default scheduler? On Tue, Dec 21, 2010 at 7:23 AM, li ping wrote: > I think the reduce can be started before all of the map finished. > See the configrat

Re: When a Reduce Task starts?

2010-12-20 Thread li ping
I think the reduce can be started before all of the maps have finished. See the configuration item in mapred-site.xml: name mapred.reduce.slowstart.completed.maps, value 0.05, description "Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job." Correct me, if I'm w
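
To make the effect of mapred.reduce.slowstart.completed.maps concrete, here is a minimal Python sketch of how such a threshold could be applied. It assumes the fraction of the map count is rounded up to a whole number of maps. The function names are hypothetical and are not part of Hadoop's API. Reaching the threshold only allows reduce tasks to be scheduled; it says nothing about when reduce() itself is invoked.

```python
import math


def maps_needed_before_reduces(total_maps, slowstart_fraction=0.05):
    """Number of finished maps required before reduces may be scheduled.

    Assumption: the fraction of the map count is rounded up (ceiling).
    """
    if not 0.0 <= slowstart_fraction <= 1.0:
        raise ValueError("slowstart_fraction must be between 0.0 and 1.0")
    return math.ceil(slowstart_fraction * total_maps)


def reduces_may_be_scheduled(finished_maps, total_maps, slowstart_fraction=0.05):
    """True once enough maps have finished for reduces to be scheduled."""
    needed = maps_needed_before_reduces(total_maps, slowstart_fraction)
    return finished_maps >= needed


if __name__ == "__main__":
    # With 10 maps and the 0.05 value from the message, one finished map suffices.
    print(maps_needed_before_reduces(10))
    # With a fraction of 1.0, reduces wait for every map.
    print(reduces_may_be_scheduled(9, 10, slowstart_fraction=1.0))
```

Setting the fraction to 1.0 in this sketch reproduces the behaviour some posters in this thread expected, where no reduce task is scheduled until every map has completed.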

Re: When a Reduce Task starts?

2010-12-20 Thread Harsh J
Hi, On Tue, Dec 21, 2010 at 12:03 AM, Pedro Costa wrote: > 1 - A reduce task should start only when a map task ends ? Yes, reduce() is called only after all map()s have finished. > > -- > Pedro > -- Harsh J www.harshj.com
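
The reason reduce() has to wait for every map can be shown with a toy example in plain Python. This is only an illustrative sketch, not Hadoop code: the shuffle and run_reduce helpers and the sample data are made up for the illustration. Because any map may emit values for any key, reducing over the output of only some maps gives an incomplete result.

```python
from collections import defaultdict


def shuffle(map_outputs):
    """Group the (key, value) pairs emitted by all given maps by key."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return dict(grouped)


def run_reduce(grouped, reduce_fn):
    """Call reduce_fn once per key, with every value collected for that key."""
    return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}


def count(key, values):
    """Example reduce function: sum the values for one key."""
    return sum(values)


# Hypothetical output of three map tasks (a word count over three splits).
map_outputs = [
    [("a", 1), ("b", 1)],
    [("a", 1)],
    [("b", 1), ("a", 1)],
]

# Reducing after only two maps have finished misses values from the third.
partial = run_reduce(shuffle(map_outputs[:2]), count)
# Reducing after all maps have finished sees every value for every key.
complete = run_reduce(shuffle(map_outputs), count)

if __name__ == "__main__":
    print(partial)
    print(complete)
```

In this sketch the partial run undercounts both keys, which is why fetching map output can begin early while the reduce function itself cannot.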

How to record the bad records encountered by hadoop

2010-12-20 Thread felix gao
All, Not sure if this is the right mailing list for this question. I am using Pig to do some data analysis and I am wondering if there is a way to tell Hadoop, when it encounters a bad log file, either due to decompression failures or whatever caused the job to die, to record the line and if possible th

Re: Reduce Task Priority / Scheduler

2010-12-20 Thread Allen Wittenauer
This makes sense until you realize: a) It won't scale. b) Machines fail. On Dec 20, 2010, at 5:26 AM, Martin Becker wrote: > I wrote a little bit much, so I put a summary up front. Sorry about that. > > Summary: > 1) Is there any point in time, where on

When a Reduce Task starts?

2010-12-20 Thread Pedro Costa
1 - A reduce task should start only when a map task ends? -- Pedro

Re: Reduce Task Priority / Scheduler

2010-12-20 Thread Martin Becker
I wrote a little bit much, so I put a summary up front. Sorry about that. Summary: 1) Is there any point in time, where one single instance of Hadoop has access to all keys that are to be distributed to the nodes together with corresponding data? Or maybe at least nodes could have Task priorities,

Re: Reduce Task Priority / Scheduler

2010-12-20 Thread Harsh J
The JobTracker wouldn't know what your data is going to be when it is assigning the Reduce Tasks. If you really do need ordering among your reducers, you should implement a locking mechanism (making sure the dormant reduce tasks stay alive by sending out some status reports). Although, how is

Re: Reduce Task Priority / Scheduler

2010-12-20 Thread Martin Becker
I just reread my first post. Maybe I was not clear enough: It is only important to me that the Reduce tasks _start_ in a specified order based on their key. That is the only additional constraint I need. On Mon, Dec 20, 2010 at 9:51 AM, Martin Becker <_martinbec...@web.de> wrote: > As far as I und

Re: Passing messages

2010-12-20 Thread Martin Becker
Thank you for your suggestions. In this context I heard about ZooKeeper a few times. It seems to be the easiest and most failsafe solution as of yet. Another solution mentioned was using some sort of communication through the file system, which is probably slow and is quite subtle. Of course, I wou

Re: Reduce Task Priority / Scheduler

2010-12-20 Thread Martin Becker
As far as I understood, MapReduce waits for all Mappers to finish before it starts running Reduce tasks. Am I mistaken here? If I am not, then I do not see any more synchrony being introduced than there already is (no locks required). Of course I am not aware of all the internals, but MapReduce