[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511336#comment-15511336
 ] 

Haibo Chen commented on MAPREDUCE-6728:
---------------------------------------

Thanks for your reviews, [~templedf]! I have updated the patch to incorporate 
most of your suggestions.

bq. Instead of reusing the INITIAL_DELAY, you should define a retry delay 
instead. You might also want to consider some kind of backoff. The most correct 
approach would be to define the delay in the ShuffleHandler and pass it back in 
the Retry-after header.
Agreed. This is added in the new patch.
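
For illustration only, here is a minimal sketch of how a fetcher could combine a 
server-supplied Retry-After value with a capped exponential backoff. The class 
name, method names, and the 60-second cap are assumptions made for this example, 
not code from the patch. It also assumes the header uses the delta-seconds form 
and ignores the HTTP-date form.

```java
// Hypothetical sketch; none of these names come from the actual patch.
final class RetryDelaySketch {
    /** Upper bound on any delay, so a bad header or many attempts cannot stall a fetcher. */
    static final long MAX_DELAY_MS = 60_000L;

    /**
     * Parses a Retry-After header given in delta-seconds form (for example "30").
     * Returns fallbackMs when the header is missing, malformed, or negative.
     */
    static long parseRetryAfterMillis(String headerValue, long fallbackMs) {
        if (headerValue == null) {
            return fallbackMs;
        }
        try {
            long seconds = Long.parseLong(headerValue.trim());
            if (seconds < 0) {
                return fallbackMs;
            }
            // Clamp before multiplying so a huge value cannot overflow.
            return Math.min(seconds, MAX_DELAY_MS / 1000L) * 1000L;
        } catch (NumberFormatException e) {
            // The HTTP-date form of Retry-After is not handled in this sketch.
            return fallbackMs;
        }
    }

    /**
     * Exponential backoff: baseMs doubled once per previous attempt, capped at
     * MAX_DELAY_MS. The first attempt is attempt 0; baseMs is assumed positive.
     */
    static long backoffMillis(long baseMs, int attempt) {
        long delay = baseMs;
        for (int i = 0; i < attempt && delay < MAX_DELAY_MS; i++) {
            delay *= 2;
        }
        return Math.min(delay, MAX_DELAY_MS);
    }
}
```

A caller could prefer the header value when the server sends one and fall back 
to backoffMillis otherwise, which keeps the server in control of the delay 
while still bounding the wait.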

bq. I don't love defining an inner exception, but it appears to be the best 
option. Can we call it something like TryAgainLaterException so that it's 
really clear what it means? Should it be static? It should probably be private.
Renamed the exception and made it private as you suggested. I kept the static 
modifier, since the exception class has no association with an instance of 
Fetcher.
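
To make the static-versus-inner distinction concrete, here is a small sketch. 
FetcherSketch, the delayMs field, and delayAfterRejection are hypothetical names 
used only for illustration; the real exception in the patch may carry different 
fields. The point is that a private static nested class can be created from a 
static context because it holds no reference to an enclosing instance.

```java
// Hypothetical stand-in for Fetcher; only the nesting and modifiers matter here.
final class FetcherSketch {
    /**
     * Static nested: holds no hidden reference to a FetcherSketch instance.
     * Private: visible only inside FetcherSketch.
     */
    private static class TryAgainLaterException extends java.io.IOException {
        private static final long serialVersionUID = 1L;
        final long delayMs; // assumed payload: how long the caller should wait

        TryAgainLaterException(long delayMs, String host) {
            super("Too many connections on " + host + ", retry in " + delayMs + " ms");
            this.delayMs = delayMs;
        }
    }

    /**
     * Simulates a rejected connection and returns the suggested delay.
     * This compiles in a static method only because the exception class is static.
     */
    static long delayAfterRejection(String host, long suggestedDelayMs) {
        try {
            throw new TryAgainLaterException(suggestedDelayMs, host);
        } catch (TryAgainLaterException e) {
            return e.delayMs;
        }
    }
}
```

Dropping the static modifier would make the class an inner class, and the 
`new TryAgainLaterException(...)` call in the static method would no longer 
compile without an enclosing instance.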

bq. Is there a clever way to not duplicate the code to put back the remaining 
attempts? It appears in both catch clauses.
I don't see a clean way to get rid of the duplicate. 

bq. In the TestFetcher test, watching the calls to hostFailed and copyFailed 
seems brittle. Maybe instead watch the ioErrs counter?
Good point. The new patch now verifies failure counts.



> Give fetchers hint when ShuffleHandler rejects a shuffling connection
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6728
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6728
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv2
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>         Attachments: mapreduce6728.001.patch, mapreduce6728.002.patch, 
> mapreduce6728.prelim.patch
>
>
> If the number of open shuffle connections to a node exceeds the maximum, 
> ShuffleHandler closes the connection immediately without giving fetchers any 
> hint of the reason, which causes fetchers to fail with exceptions such as 
> java.net.SocketException: Unexpected end of file from server
>       at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:772)
>       at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
>       at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:769)
>       at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
>       at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
>       at 
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:430)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:395)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:266)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:323)
>       at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
> OR 
> java.net.SocketException: Connection reset
>       at java.net.SocketInputStream.read(SocketInputStream.java:196)
>       at java.net.SocketInputStream.read(SocketInputStream.java:122)
>       at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>       at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>       at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>       at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
>       at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
>       at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:769)
>       at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
>       at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
>       at 
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:430)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:395)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:266)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java
> Such failures are counted as fetcher failures.


