Oh. Thanks a lot, Harsh.

On Sun, Jul 8, 2012 at 11:38 PM, Harsh J <ha...@cloudera.com> wrote:
> Pavan,
>
> This is covered in the MR tutorial doc:
> http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#Task+Logs
>
> On Mon, Jul 9, 2012 at 8:26 AM, Pavan Kulkarni <pavan.babu...@gmail.com> wrote:
> > I too had similar problems.
> > I guess we should also set the debug mode for
> > that specific class in the log4j.properties file. Isn't it?
> >
> > And I didn't quite get what you mean by the task's userlogs.
> > Where are these logs located? In the logs directory I only see
> > logs for all the daemons. Thanks
> >
> > On Sun, Jul 8, 2012 at 6:27 PM, Grandl Robert <rgra...@yahoo.com> wrote:
> >>
> >> I see. I was looking into the tasktracker log :).
> >>
> >> Thanks a lot,
> >> Robert
> >>
> >> ________________________________
> >> From: Harsh J <ha...@cloudera.com>
> >> To: Grandl Robert <rgra...@yahoo.com>; mapreduce-user <mapreduce-user@hadoop.apache.org>
> >> Sent: Sunday, July 8, 2012 9:16 PM
> >>
> >> Subject: Re: Basic question on how reducer works
> >>
> >> The changes should appear in your Task's userlogs (not the TaskTracker
> >> logs). Have you deployed your changed code properly (i.e. do you
> >> generate a new tarball, or perhaps use the MRMiniCluster to do this)?
> >>
> >> On Mon, Jul 9, 2012 at 4:57 AM, Grandl Robert <rgra...@yahoo.com> wrote:
> >> > Hi Harsh,
> >> >
> >> > Your comments were extremely helpful.
> >> >
> >> > Still I am wondering why, if I add LOG.info entries into MapTask.java
> >> > or ReduceTask.java in most of the functions (including
> >> > Old/NewOutputCollector), the logs are not shown. This makes it hard
> >> > for me to track which functions are called and which are not, even
> >> > more so in ReduceTask.java.
> >> >
> >> > Do you have any ideas?
> >> >
> >> > Thanks a lot for your answer,
> >> > Robert
> >> >
> >> > ________________________________
> >> > From: Harsh J <ha...@cloudera.com>
> >> > To: mapreduce-user@hadoop.apache.org; Grandl Robert <rgra...@yahoo.com>
> >> > Sent: Sunday, July 8, 2012 1:34 AM
> >> >
> >> > Subject: Re: Basic question on how reducer works
> >> >
> >> > Hi Robert,
> >> >
> >> > Inline. (The answer is specific to Hadoop 1.x since you asked about
> >> > that alone; certain things may vary for Hadoop 2.x.)
> >> >
> >> > On Sun, Jul 8, 2012 at 7:07 AM, Grandl Robert <rgra...@yahoo.com> wrote:
> >> >> Hi,
> >> >>
> >> >> I have some questions related to basic functionality in Hadoop.
> >> >>
> >> >> 1. When a Mapper processes the intermediate output data, how does it
> >> >> know how many partitions to make (i.e. how many reducers there will
> >> >> be), and how much data should go into each partition for each reducer?
> >> >
> >> > The number of reducers is non-dynamic and user-specified; it is set
> >> > in the job configuration. Hence the Partitioner knows the value it
> >> > needs to use for its numPartitions (== numReduces for the job).
> >> >
> >> > For this one in the 1.x code, look at MapTask.java, in the
> >> > constructors of the internal classes OldOutputCollector (Stable API)
> >> > and NewOutputCollector (New API).
> >> >
> >> > The data estimated to be going into a partition, for limit/scheduling
> >> > checks, is currently a naive computation, done by summing up the
> >> > estimated output sizes of each map. See
> >> > ResourceEstimator#getEstimatedReduceInputSize for the overall
> >> > estimation across maps, and Task#calculateOutputSize for the per-map
> >> > estimation code.
> >> >
> >> >> 2. When the JobTracker assigns a task to a reducer, does it also
> >> >> specify the locations of the intermediate output data to retrieve?
> >> >> But how does a reducer know which portion of the intermediate
> >> >> output it has to retrieve from each remote location?
> >> >
> >> > The JT does not send the location information when a reduce is
> >> > scheduled. When the reducers begin their shuffle phase, they query
> >> > the TaskTracker for the map completion events, via the
> >> > TaskTracker#getMapCompletionEvents protocol call. The TaskTracker
> >> > itself calls the JobTracker#getTaskCompletionEvents protocol call to
> >> > get this info underneath. The returned structure carries the host
> >> > that completed the map successfully, which the Reduce's copier
> >> > relies on to fetch the data from the right host's TT.
> >> >
> >> > The reduce merely asks each TT for the data assigned to it for the
> >> > specific completed maps. Note that a reduce task's ID is also its
> >> > partition ID, so it merely has to ask for the data for its own task
> >> > ID, and the TT serves, over HTTP, the right parts of the
> >> > intermediate data to it.
> >> >
> >> > Feel free to ping back if you need some more clarification! :)
> >> >
> >> > --
> >> > Harsh J
> >>
> >> --
> >> Harsh J
> >
> > --
> > --With Regards
> > Pavan Kulkarni
>
> --
> Harsh J

--
--With Regards
Pavan Kulkarni
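Harsh's first answer (the Partitioner already knows numPartitions == the configured number of reduces) can be sketched in a few lines. This is a toy standalone version, not Hadoop code: the class name PartitionSketch and the sample key are invented for illustration, but the formula mirrors the one used by Hadoop 1.x's default HashPartitioner.

```java
// Toy sketch of map-side partitioning: numPartitions equals the number of
// reduce tasks the user configured for the job, so every mapper computes
// the same partition for the same key without any coordination.
public class PartitionSketch {

    // Mirrors Hadoop's default HashPartitioner: mask off the sign bit of
    // the key's hash, then take the modulo over the reduce count.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduces = 4; // user-specified, e.g. via JobConf#setNumReduceTasks(4)
        // All values for the key "hello" land at the same reducer.
        System.out.println(getPartition("hello", numReduces)); // prints 2
    }
}
```

Because the reduce count is fixed in the job configuration before any map runs, no partition can ever point at a reducer that does not exist.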
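The second answer's key point (a reduce task's ID is also its partition ID, and the TT serves map output over HTTP) can be illustrated with a sketch of the fetch URL a reduce copier might build. The host, port, and IDs below are made-up example values, and the exact query string is an assumption modeled loosely on the 1.x TaskTracker's map-output servlet; treat it as illustrative, not authoritative.

```java
// Illustrative sketch: a reduce copier asking a TaskTracker, over HTTP,
// for its own slice of a completed map's output. The reduce identifies
// the slice it wants simply by its own partition (== task ID) number.
public class ShuffleUrlSketch {

    static String mapOutputUrl(String ttHost, int httpPort, String jobId,
                               String mapAttemptId, int reducePartition) {
        // The parameter names here are assumptions for illustration; the
        // idea is that the reduce names the completed map attempt it wants
        // and its own partition number, and the TT serves just that part.
        return "http://" + ttHost + ":" + httpPort
                + "/mapOutput?job=" + jobId
                + "&map=" + mapAttemptId
                + "&reduce=" + reducePartition;
    }

    public static void main(String[] args) {
        // Reduce #2 fetching the output of map attempt #3 (example IDs).
        System.out.println(mapOutputUrl("tt-node-1.example.com", 50060,
                "job_201207080001_0042",
                "attempt_201207080001_0042_m_000003_0", 2));
    }
}
```

This is why the JT never has to push location data to the reduces: each reduce learns the successful hosts from the completion events and already knows which partition number to ask for.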