[ 
https://issues.apache.org/jira/browse/TEZ-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080445#comment-16080445
 ] 

Jonathan Eagles commented on TEZ-3754:
--------------------------------------

[~jlowe], do you why mapreduce explicitly turned this off? I don't have a 
history only that initial behavior of mapreduce was like the intent of this 
jira and that they turned this off. My memory of this is fuzzy, so I could be 
totally wrong.

> With large ordered fetches, stalling shuffle could lead to slowdown of job
> --------------------------------------------------------------------------
>
>                 Key: TEZ-3754
>                 URL: https://issues.apache.org/jira/browse/TEZ-3754
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-3754.1.patch
>
>
> When large ordered fetches are getting downloaded, depending on the memory 
> availability in merger, it is possible to stall the shuffle ({{Type::WAIT}}). 
> This happens after establishing the connection,validation at server side etc. 
> However, depending on memory availability allocator could return 
> {{Type::WAIT}} to retry fetching later. 
> In corner cases, retries happen very frequently and this creates pressure on 
> the server side. Server side populates the headers and starts sending the map 
> output. This effort is wasted in the client side as data is not downloaded. 
> When multiple nodes have similar issue, following connections timesout (180 
> seconds), which turns out to be
> very expensive.
> {noformat}
> 2017-06-06 04:27:49,574 [INFO] [fetcher {Map_1} #44] 
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#44 - MergerManager returned 
> Status.WAIT ...
> 2017-06-06 04:27:49,575 [INFO] [fetcher {Map_1} #31] 
> |shuffle.HttpConnection|: for 
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008
>  sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,575 [INFO] [fetcher {Map_1} #31] 
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#31 - MergerManager returned 
> Status.WAIT ...
> 2017-06-06 04:27:49,576 [INFO] [fetcher {Map_1} #43] 
> |shuffle.HttpConnection|: for 
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007
>  sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,576 [INFO] [fetcher {Map_1} #43] 
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#43 - MergerManager returned 
> Status.WAIT ...
> 2017-06-06 04:27:49,578 [INFO] [fetcher {Map_1} #32] 
> |shuffle.HttpConnection|: for 
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008
>  sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,578 [INFO] [fetcher {Map_1} #32] 
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#32 - MergerManager returned 
> Status.WAIT ...
> 2017-06-06 04:27:49,579 [INFO] [fetcher {Map_1} #42] 
> |shuffle.HttpConnection|: for 
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007
>  sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,579 [INFO] [fetcher {Map_1} #42] 
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#42 - MergerManager returned 
> Status.WAIT ...
> 2017-06-06 04:27:49,580 [INFO] [fetcher {Map_1} #33] 
> |shuffle.HttpConnection|: for 
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008
>  sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,580 [INFO] [fetcher {Map_1} #33] 
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#33 - MergerManager returned 
> Status.WAIT ...
> 2017-06-06 04:27:49,581 [INFO] [fetcher {Map_1} #41] 
> |shuffle.HttpConnection|: for 
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007
>  sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,581 [INFO] [fetcher {Map_1} #41] 
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#41 - MergerManager returned 
> Status.WAIT ...
> 2017-06-06 04:27:49,583 [INFO] [fetcher {Map_1} #34] 
> |shuffle.HttpConnection|: for 
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008
>  sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,583 [INFO] [fetcher {Map_1} #34] 
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#34 - MergerManager returned 
> Status.WAIT ...
> 2017-06-06 04:27:49,584 [INFO] [fetcher {Map_1} #40] 
> |shuffle.HttpConnection|: for 
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007
>  sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,584 [INFO] [fetcher {Map_1} #40] 
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#40 - MergerManager returned 
> Status.WAIT ...
> 2017-06-06 04:27:49,586 [INFO] [fetcher {Map_1} #35] 
> |shuffle.HttpConnection|: for 
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008
>  sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,586 [INFO] [fetcher {Map_1} #35] 
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#35 - MergerManager returned 
> Status.WAIT ...
> 2017-06-06 04:27:49,587 [INFO] [fetcher {Map_1} #39] 
> |shuffle.HttpConnection|: for 
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007
>  sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,587 [INFO] [fetcher {Map_1} #39] 
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#39 - MergerManager returned 
> Status.WAIT ...
> 2017-06-06 04:27:49,588 [INFO] [fetcher {Map_1} #36] 
> |shuffle.HttpConnection|: for 
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008
>  sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,588 [INFO] [fetcher {Map_1} #36] 
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#36 - MergerManager returned 
> Status.WAIT ...
> 2017-06-06 04:27:49,589 [INFO] [fetcher {Map_1} #38] 
> |shuffle.HttpConnection|: for 
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007
>  sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,589 [INFO] [fetcher {Map_1} #38] 
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#38 - MergerManager returned 
> Status.WAIT ...
> 2017-06-06 04:27:49,590 [INFO] [fetcher {Map_1} #37] 
> |shuffle.HttpConnection|: for 
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008
>  sent hash and receievd reply 0 ms
> {noformat}
> In such cases, it would be good to wait until {{usedMemory}} gets released 
> during "next" iteration as opposed to retrying immediately.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to