Hi,

AFAIK no. I'm not sure how much of a task it is to write a HOD-like scheduler, or if it's even feasible given the new architecture of a single managing JobTracker (JT) talking directly to the TaskTrackers (TT). Probably someone more familiar with the scheduler architecture can help you better.

What I was trying to suggest with serialization was: write the initial mapper data to a known location, and instead of streaming from the split, ignore it and read from there. Sorry for the delayed response,
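[Editor's note: the suggestion above — persist the first round's mapper data to a known location and have later iterations read from there instead of from the split — can be sketched as follows. This is a minimal, self-contained illustration with no Hadoop dependencies; the class name, `CACHE_DIR`, and the per-task file naming are assumptions for illustration, not Hadoop API.]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Sketch of the "serialize once, re-read later" idea: on the first
// iteration each map task writes its records to a well-known per-task
// location; subsequent iterations ignore the split handed to them and
// re-read that cached file, so the task keeps working on the same chunk.
public class LocalizedRecords {
    static final Path CACHE_DIR =
            Paths.get(System.getProperty("java.io.tmpdir"), "iter-cache");

    // First iteration: persist the records this map task received.
    static void writeLocal(String taskId, List<String> records) throws IOException {
        Files.createDirectories(CACHE_DIR);
        Files.write(CACHE_DIR.resolve(taskId + ".records"), records);
    }

    // Later iterations: read the cached copy instead of the supplied split.
    static List<String> readLocal(String taskId) throws IOException {
        return Files.readAllLines(CACHE_DIR.resolve(taskId + ".records"));
    }
}
```

In a real job the write would go through a custom RecordReader/MapRunner pair, as discussed later in this thread, and the cache path would need to survive task re-execution on the same node.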
Amogh

On 2/4/10 2:01 PM, "Raghava Mutharaju" <m.vijayaragh...@gmail.com> wrote:

Hi,

So is it not possible to avoid redistribution in this case? If that is the case, can a custom scheduler be written -- would that be an easy task?

Regards,
Raghava.

On Thu, Feb 4, 2010 at 2:52 AM, Amogh Vasekar <am...@yahoo-inc.com> wrote:

Hi,

>> Will there be a re-assignment of Map & Reduce nodes by the Master?
In general, using the available schedulers, I believe so. If it weren't so, and I submitted a job 2 needing a different/additional set of inputs, the data-locality considerations would be somewhat hampered, right? When we had HOD, this was certainly possible.

Amogh

On 2/4/10 1:06 AM, "Raghava Mutharaju" <m.vijayaragh...@gmail.com> wrote:

Hi Amogh,

Thank you for the reply.

>>> What you need, I believe, is "just run on whatever map has".
You got that right :). An example of a sequential program would be bubble sort, which needs several iterations for the end result, and in each iteration it needs to work on the previous output (a partially sorted list) rather than the initial input. In my case, the same thing should happen.

>>> If you are using an exclusive private cluster, you can probably localize
>>> <k,v> from the first iteration and use dummy input data (to ensure the same
>>> number of mapper tasks as the first round, and use custom classes of
>>> MapRunner, RecordReader to not read data from the supplied input)
Yes, it would be a local cluster, the one in my university. If we set the number of map tasks, would it not be honored in each iteration? As mentioned in the documentation, I think I need to use JobClient to control the number of iterations.

>>> But how can you ensure that you get the same nodes always to run your map
>>> reduce job on a shared cluster?
while (!done) { JobClient.runJob(jobConf); <<Do something to check termination condition>> }

If I write something like that in the code, would the Map node not run on the same data chunk it has each time? Will there be a re-assignment of Map & Reduce nodes by the Master?

Regards,
Raghava.

On Wed, Feb 3, 2010 at 9:59 AM, Amogh Vasekar <am...@yahoo-inc.com> wrote:

Hi,

If each of your sequential iterations is a map+reduce job, then no. The lifetime of a split is confined to a single map reduce job. The split is actually a reference to the data, which is used to schedule the job as close as possible to the data. The record reader then uses the same object to pass the <k,v> pairs in the split. What you need, I believe, is "just run on whatever map has". If you are using an exclusive private cluster, you can probably localize the <k,v> from the first iteration and use dummy input data (to ensure the same number of mapper tasks as the first round, and use custom classes of MapRunner, RecordReader to not read data from the supplied input). But how can you ensure that you always get the same nodes to run your map reduce job on a shared cluster? Please correct me if I misunderstood your question.

Amogh

On 2/3/10 11:34 AM, "Raghava Mutharaju" <m.vijayaragh...@gmail.com> wrote:

Hi all,

I want to run a map reduce task repeatedly in order to achieve the desired result. Is it possible that, at the beginning of each iteration, the data set is not distributed (divided into chunks and distributed) again and again, i.e., once the distribution occurs for the first time, map nodes should work on the same chunk in every iteration? Can this be done? I only have brief experience with MapReduce, and I think that the input data set is redistributed every time.

Thank you.

Regards,
Raghava.
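[Editor's note: the driver loop discussed in this thread -- run a job, check a termination condition, repeat on the previous output -- can be illustrated with a self-contained sketch. There are no Hadoop dependencies here; each "job" is one bubble-sort pass (Raghava's own example of an iterative computation), and `runOnePass`, `sortIteratively`, and the `int[]` data set are illustrative assumptions, not the actual JobClient API.]

```java
// Sketch of the while (!done) driver pattern: each iteration consumes the
// previous iteration's output, and the loop terminates when an iteration
// reports no further change -- the check that would follow
// JobClient.runJob(jobConf) in a real Hadoop driver.
public class IterativeDriver {
    // One "map reduce job": a single bubble-sort pass over the data.
    // Returns true if it changed anything, i.e., another iteration is needed.
    static boolean runOnePass(int[] data) {
        boolean swapped = false;
        for (int i = 0; i + 1 < data.length; i++) {
            if (data[i] > data[i + 1]) {
                int tmp = data[i];
                data[i] = data[i + 1];
                data[i + 1] = tmp;
                swapped = true;
            }
        }
        return swapped;
    }

    static int[] sortIteratively(int[] data) {
        boolean done = false;
        while (!done) {               // same shape as the loop in the thread
            done = !runOnePass(data); // <<check termination condition>>
        }
        return data;
    }
}
```

In a real Hadoop driver, each pass would be a full job submission, with the previous job's output path fed in as the next job's input path, which is exactly why the split (and hence the data placement) is recomputed on every iteration.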