In MR1, the tasktracker serves the mapper files (so that tasks don't have
to stick around taking up resources).  In MR2, the shuffle service, which
lives inside the nodemanager, serves them.
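
For reference, here is a minimal yarn-site.xml sketch of how that shuffle
service gets wired into the nodemanager as an auxiliary service (property
names as in stock Hadoop 2.x releases; some earlier 2.x alphas used
mapreduce.shuffle rather than mapreduce_shuffle, so verify against your
version):

    <!-- Register the MR2 shuffle handler as a nodemanager auxiliary
         service (a sketch; verify names against your release). -->
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>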

-Sandy


On Thu, May 23, 2013 at 10:22 AM, John Lilley <john.lil...@redpoint.net> wrote:

>  Ling,
>
> Thanks for the response!  I could use more clarification on item 1.
> Specifically:
>
> - mapred.reduce.parallel.copies limits the number of outbound connections
> for a reducer, but not the inbound connections for a mapper.  Does
> tasktracker.http.threads limit the number of simultaneous inbound
> connections for a mapper, or only the size of the thread pool servicing
> the connections?  (i.e., is it one thread per inbound connection?)
>
> - Who actually creates the listen port for serving up the mapper files?
> The mapper task?  Or something more persistent in MapReduce?
>
> Thanks,
>
> John
>
>
> *From:* erlv5...@gmail.com [mailto:erlv5...@gmail.com] *On Behalf Of *Kun
> Ling
> *Sent:* Wednesday, May 22, 2013 7:50 PM
> *To:* user
> *Subject:* Re: Shuffle phase replication factor
>
> Hi John,
>
>
>    1. On the limit for the number of simultaneous connections: you can
> configure this using the mapred.reduce.parallel.copies flag; the default
> is 5.
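>
> In mapred-site.xml, that looks like this (a sketch, showing the shipped
> default):
>
>     <property>
>       <name>mapred.reduce.parallel.copies</name>
>       <value>5</value>  <!-- parallel fetch threads per reducer -->
>     </property>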
>
>    2. As for whether aggressive disconnection has implications: only
> minor ones.  Normally, each reducer connects to each mapper task and asks
> for its partitions of the map output file.  Because each reducer opens
> only about 5 simultaneous connections to fetch map output, a large MR
> cluster with 1000 nodes running a huge MR job with 1000 mappers and 1000
> reducers still sees only about 5 inbound connections per node, so the
> impact is small.
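>
> As a back-of-envelope check of that figure (a sketch; it assumes one
> reducer per node and evenly spread fetch targets, which the thread does
> not state):
>
>     // Rough estimate of per-node inbound shuffle connections for the
>     // 1000-node example above.  Assumptions (illustrative only): one
>     // reducer per node, fetches spread uniformly across mapper nodes.
>     public class ShuffleFanoutEstimate {
>         public static void main(String[] args) {
>             int nodes = 1000;
>             int reducers = 1000;
>             int parallelCopies = 5;  // mapred.reduce.parallel.copies default
>
>             // Simultaneous fetches in flight across the whole cluster.
>             int totalFetches = reducers * parallelCopies;           // 5000
>
>             // Spread evenly over the nodes serving map output.
>             double inboundPerNode = (double) totalFetches / nodes;  // ~5.0
>
>             System.out.printf("~%.1f inbound connections per node%n",
>                     inboundPerNode);
>         }
>     }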
>
>   3. As for what happens to pending/failing connections, the short answer
> is: the reducer just tries to reconnect.  There is a List<> that maintains
> all the mapper outputs that still need to be copied, and an element is
> removed only when its map output has been copied successfully.  A loop
> runs forever, looking into the list and fetching the corresponding map
> output.
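>
> A simplified sketch of that loop (the class and method names below are
> illustrative stand-ins, not the actual Hadoop 1.0.4 ReduceTask.java code):
>
>     import java.io.IOException;
>     import java.util.LinkedList;
>     import java.util.List;
>
>     // Sketch of the copier retry pattern: an output leaves the list
>     // only after a successful copy; failed copies are re-queued.
>     public class CopyLoopSketch {
>         static class MapOutputLocation {
>             final String host;
>             MapOutputLocation(String host) { this.host = host; }
>         }
>
>         // Hypothetical fetch: copy one map output, throwing on failure.
>         static void fetch(MapOutputLocation loc) throws IOException {
>             // ... open an HTTP connection to loc.host and stream the
>             // reducer's partition of the map output ...
>         }
>
>         public static void main(String[] args) {
>             List<MapOutputLocation> pending = new LinkedList<>();
>             pending.add(new MapOutputLocation("node-1"));
>             pending.add(new MapOutputLocation("node-2"));
>
>             while (!pending.isEmpty()) {
>                 MapOutputLocation loc = pending.remove(0);
>                 try {
>                     fetch(loc);        // success: gone from the list
>                 } catch (IOException e) {
>                     pending.add(loc);  // failure: re-queue and retry later
>                 }
>             }
>         }
>     }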
>
>   All the above answers are based on the Hadoop 1.0.4 source code,
> especially the ReduceTask.java file.
>
> yours,
>
> Ling Kun
>
>
> On Wed, May 22, 2013 at 10:57 PM, John Lilley <john.lil...@redpoint.net>
> wrote:
>
> Ummmm, is that also the limit for the number of simultaneous connections?
> In general, one does not need a 1:1 map between threads and connections.
>
> If this is the connection limit, does it imply that the client or server
> side aggressively disconnects after a transfer?
>
> What happens to the pending/failing connection attempts that exceed the
> limit?
>
> Thanks!
>
> john
>
>
> *From:* Rahul Bhattacharjee [mailto:rahul.rec....@gmail.com]
> *Sent:* Wednesday, May 22, 2013 8:52 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Shuffle phase replication factor
>
>
> There is a configuration property to control the number of threads that
> serve copies:
>
>     tasktracker.http.threads=40
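>
> In mapred-site.xml form, that is (a sketch, at the shipped default of 40):
>
>     <property>
>       <name>tasktracker.http.threads</name>
>       <value>40</value>  <!-- HTTP threads serving map output, per tasktracker -->
>     </property>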
>
> Thanks,
> Rahul
>
> On Wed, May 22, 2013 at 8:16 PM, John Lilley <john.lil...@redpoint.net>
> wrote:
>
> This brings up another nagging question I’ve had for some time.  Between
> HDFS and shuffle, there seems to be the potential for “every node
> connecting to every other node” via TCP.  Are there explicit mechanisms in
> place to manage or limit simultaneous connections?  Is the protocol simply
> robust enough to allow the server side to disconnect at any time to free
> up slots, with the client side retrying the request?
>
> Thanks
>
> john
>
>
> *From:* Shahab Yunus [mailto:shahab.yu...@gmail.com]
> *Sent:* Wednesday, May 22, 2013 8:38 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Shuffle phase replication factor
>
>
> As mentioned by Bertrand, Hadoop: The Definitive Guide is, well... a
> really definitive :) place to start.  It is pretty thorough for starters,
> and once you have gone through it, the code will start making more sense
> too.
>
> Regards,
>
> Shahab
>
>
> On Wed, May 22, 2013 at 10:33 AM, John Lilley <john.lil...@redpoint.net>
> wrote:
>
> Oh I see.  Does this mean there is another service and TCP listen port for
> this purpose?
>
> Thanks for your indulgence… I would really like to read more about this
> without bothering the group, but I'm not sure where to start learning
> these internals other than the code.
>
> john
>
>
> *From:* Kai Voigt [mailto:k...@123.org]
> *Sent:* Tuesday, May 21, 2013 12:59 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Shuffle phase replication factor
>
>
> The map output doesn't get written to HDFS.  The map task writes its
> output to its local disk, and the reduce tasks pull the data through HTTP
> for further processing.
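>
> (In MR1, the local directories that hold this intermediate output are set
> by mapred.local.dir; a sketch, with illustrative paths:)
>
>     <property>
>       <name>mapred.local.dir</name>
>       <!-- Comma-separated list of local dirs; these paths are examples. -->
>       <value>/disk1/mapred/local,/disk2/mapred/local</value>
>     </property>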
>
> On 21.05.2013 at 19:57, John Lilley <john.lil...@redpoint.net> wrote:
>
> When MapReduce enters “shuffle” to partition the tuples, I am assuming
> that it writes intermediate data to HDFS.  What replication factor is used
> for those temporary files?
>
> john
>
>
> --
> Kai Voigt
> k...@123.org
>
> --
> http://www.lingcc.com
>
