[ https://issues.apache.org/jira/browse/TEZ-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041022#comment-16041022 ]
Jonathan Eagles commented on TEZ-3754: -------------------------------------- Agree about the intention of the jira. However, can you comment on the _*mapreduce.shuffle.listen.queue.size*_ setting for the cluster? The default is _*128*_, but I have found that this shows too many connection timeouts. Instead, I have found the following setting much more appropriate, and in line with what the namenode ipc is usually configured to handle its connections. {code} <property> <name>mapreduce.shuffle.listen.queue.size</name> <value>4096</value> </property> {code} > With large ordered fetches, stalling shuffle could lead to slowdown of job > -------------------------------------------------------------------------- > > Key: TEZ-3754 > URL: https://issues.apache.org/jira/browse/TEZ-3754 > Project: Apache Tez > Issue Type: Bug > Reporter: Rajesh Balamohan > Assignee: Rajesh Balamohan > Attachments: TEZ-3754.1.patch > > > When large ordered fetches are getting downloaded, depending on the memory > availability in merger, it is possible to stall the shuffle ({{Type::WAIT}}). > This happens after establishing the connection,validation at server side etc. > However, depending on memory availability allocator could return > {{Type::WAIT}} to retry fetching later. > In corner cases, retries happen very frequently and this creates pressure on > the server side. Server side populates the headers and starts sending the map > output. This effort is wasted in the client side as data is not downloaded. > When multiple nodes have similar issue, following connections timesout (180 > seconds), which turns out to be > very expensive. > {noformat} > 2017-06-06 04:27:49,574 [INFO] [fetcher {Map_1} #44] > |orderedgrouped.FetcherOrderedGrouped|: fetcher#44 - MergerManager returned > Status.WAIT ... > 2017-06-06 04:27:49,575 [INFO] [fetcher {Map_1} #31] > |shuffle.HttpConnection|: for > url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008 > sent hash and receievd reply 0 ms > 2017-06-06 04:27:49,575 [INFO] [fetcher {Map_1} #31] > |orderedgrouped.FetcherOrderedGrouped|: fetcher#31 - MergerManager returned > Status.WAIT ... > 2017-06-06 04:27:49,576 [INFO] [fetcher {Map_1} #43] > |shuffle.HttpConnection|: for > url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007 > sent hash and receievd reply 0 ms > 2017-06-06 04:27:49,576 [INFO] [fetcher {Map_1} #43] > |orderedgrouped.FetcherOrderedGrouped|: fetcher#43 - MergerManager returned > Status.WAIT ... > 2017-06-06 04:27:49,578 [INFO] [fetcher {Map_1} #32] > |shuffle.HttpConnection|: for > url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008 > sent hash and receievd reply 0 ms > 2017-06-06 04:27:49,578 [INFO] [fetcher {Map_1} #32] > |orderedgrouped.FetcherOrderedGrouped|: fetcher#32 - MergerManager returned > Status.WAIT ... > 2017-06-06 04:27:49,579 [INFO] [fetcher {Map_1} #42] > |shuffle.HttpConnection|: for > url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007 > sent hash and receievd reply 0 ms > 2017-06-06 04:27:49,579 [INFO] [fetcher {Map_1} #42] > |orderedgrouped.FetcherOrderedGrouped|: fetcher#42 - MergerManager returned > Status.WAIT ... > 2017-06-06 04:27:49,580 [INFO] [fetcher {Map_1} #33] > |shuffle.HttpConnection|: for > url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008 > sent hash and receievd reply 0 ms > 2017-06-06 04:27:49,580 [INFO] [fetcher {Map_1} #33] > |orderedgrouped.FetcherOrderedGrouped|: fetcher#33 - MergerManager returned > Status.WAIT ... > 2017-06-06 04:27:49,581 [INFO] [fetcher {Map_1} #41] > |shuffle.HttpConnection|: for > url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007 > sent hash and receievd reply 0 ms > 2017-06-06 04:27:49,581 [INFO] [fetcher {Map_1} #41] > |orderedgrouped.FetcherOrderedGrouped|: fetcher#41 - MergerManager returned > Status.WAIT ... > 2017-06-06 04:27:49,583 [INFO] [fetcher {Map_1} #34] > |shuffle.HttpConnection|: for > url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008 > sent hash and receievd reply 0 ms > 2017-06-06 04:27:49,583 [INFO] [fetcher {Map_1} #34] > |orderedgrouped.FetcherOrderedGrouped|: fetcher#34 - MergerManager returned > Status.WAIT ... > 2017-06-06 04:27:49,584 [INFO] [fetcher {Map_1} #40] > |shuffle.HttpConnection|: for > url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007 > sent hash and receievd reply 0 ms > 2017-06-06 04:27:49,584 [INFO] [fetcher {Map_1} #40] > |orderedgrouped.FetcherOrderedGrouped|: fetcher#40 - MergerManager returned > Status.WAIT ... > 2017-06-06 04:27:49,586 [INFO] [fetcher {Map_1} #35] > |shuffle.HttpConnection|: for > url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008 > sent hash and receievd reply 0 ms > 2017-06-06 04:27:49,586 [INFO] [fetcher {Map_1} #35] > |orderedgrouped.FetcherOrderedGrouped|: fetcher#35 - MergerManager returned > Status.WAIT ... > 2017-06-06 04:27:49,587 [INFO] [fetcher {Map_1} #39] > |shuffle.HttpConnection|: for > url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007 > sent hash and receievd reply 0 ms > 2017-06-06 04:27:49,587 [INFO] [fetcher {Map_1} #39] > |orderedgrouped.FetcherOrderedGrouped|: fetcher#39 - MergerManager returned > Status.WAIT ... > 2017-06-06 04:27:49,588 [INFO] [fetcher {Map_1} #36] > |shuffle.HttpConnection|: for > url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008 > sent hash and receievd reply 0 ms > 2017-06-06 04:27:49,588 [INFO] [fetcher {Map_1} #36] > |orderedgrouped.FetcherOrderedGrouped|: fetcher#36 - MergerManager returned > Status.WAIT ... > 2017-06-06 04:27:49,589 [INFO] [fetcher {Map_1} #38] > |shuffle.HttpConnection|: for > url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007 > sent hash and receievd reply 0 ms > 2017-06-06 04:27:49,589 [INFO] [fetcher {Map_1} #38] > |orderedgrouped.FetcherOrderedGrouped|: fetcher#38 - MergerManager returned > Status.WAIT ... > 2017-06-06 04:27:49,590 [INFO] [fetcher {Map_1} #37] > |shuffle.HttpConnection|: for > url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008 > sent hash and receievd reply 0 ms > {noformat} > In such cases, it would be good to wait until {{usedMemory}} gets released > during "next" iteration as opposed to retrying immediately. -- This message was sent by Atlassian JIRA (v6.3.15#6346)