[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16538498#comment-16538498 ]
Steve Loughran commented on HADOOP-15541:
-----------------------------------------

Looks good. One issue: should we always force-close on a read failure, rather than treat SocketTimeoutException as special? I guess there are some potential failure modes (e.g. the source was deleted during the read) which could trigger IOEs during the GET (maybe? Do we test this with a large enough file to be sure there's no caching going on? If not, I could imagine adding it to the huge files test...). If we say "every IOE -> forced abort", then it's a simpler path on read. What you have here, though, is the core fix: on socket errors, don't try to recycle things.

What do you think? If you want this one as is, you've got my +1. I'm just wondering whether a separate catch for SocketTimeoutException is needed.

> AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-15541
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15541
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.9.1, 2.8.4, 3.0.2, 3.1.1
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>            Priority: Major
>      Attachments: HADOOP-15541.001.patch
>
>
> I've gotten a few reports of read timeouts not being handled properly in some Impala workloads. What happens is the following sequence of events (credit to Sailesh Mukil for figuring this out):
> * S3AInputStream.read() gets a SocketTimeoutException when it calls wrappedStream.read()
> * This is handled by onReadFailure -> reopen -> closeStream. When we try to drain the stream, SdkFilterInputStream.read() in the AWS SDK fails because of checkLength. The underlying Apache Commons stream returns -1 in the case of a timeout, just as it does at EOF.
> * The SDK assumes the -1 signifies an EOF, so it assumes the bytes read must equal the expected bytes; because they don't (it's a timeout, not an EOF), it throws an SdkClientException.
>
> This is tricky to test for without a ton of mocking of AWS SDK internals, because you have to get into this conflicting state where the SDK has read only a subset of the expected bytes and then gets a -1.
>
> closeStream will abort the stream in the event of an IOException when draining. We could simply also abort in the event of an SdkClientException. I'm testing that this results in correct functionality in the workloads that seem to hit these timeouts a lot, and all the s3a tests continue to work with that change. I'm going to open an issue on the AWS SDK GitHub as well, but I'm not sure what the ideal outcome would be, unless there's a good way to distinguish between a stream that has timed out and a stream that has read all the data, without huge rewrites.
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
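[Editor's note] The failure mode described in the bullets above can be sketched in isolation. This is a minimal simulation, not the real SDK code: `TimedOutStream` and `LengthCheckingStream` are hypothetical stand-ins for the Apache Commons HTTP stream and the SDK's SdkFilterInputStream/checkLength behaviour, and a plain IOException stands in for SdkClientException. It shows why a drain over a timed-out stream blows up: the -1 from the timeout is indistinguishable from a genuine EOF, so the length check fires.

```java
import java.io.IOException;
import java.io.InputStream;

public class DrainSketch {

    // Stand-in for the underlying HTTP stream: serves a few bytes, then the
    // socket times out and read() returns -1 -- the same value as at EOF.
    static class TimedOutStream extends InputStream {
        private int served = 0;
        private final int available = 5; // bytes delivered before the timeout

        @Override
        public int read() {
            if (served < available) {
                served++;
                return 0;
            }
            return -1; // timeout surfaces as -1, indistinguishable from EOF
        }
    }

    // Stand-in for the SDK's length check: on seeing -1, verify that the
    // bytes read match the expected Content-Length; if not, throw (the real
    // SDK throws SdkClientException at this point).
    static class LengthCheckingStream {
        private final InputStream in;
        private final long expected;
        private long bytesRead = 0;

        LengthCheckingStream(InputStream in, long expected) {
            this.in = in;
            this.expected = expected;
        }

        int read() throws IOException {
            int b = in.read();
            if (b == -1) {
                if (bytesRead != expected) {
                    throw new IOException("Data read (" + bytesRead
                        + ") != expected (" + expected + ")");
                }
            } else {
                bytesRead++;
            }
            return b;
        }
    }

    public static void main(String[] args) {
        // The GET promised 10 bytes, but the socket timed out after 5.
        LengthCheckingStream s =
            new LengthCheckingStream(new TimedOutStream(), 10);
        try {
            while (s.read() != -1) {
                // draining, as closeStream does before recycling the connection
            }
            System.out.println("drained cleanly");
        } catch (IOException e) {
            // This is the conflict described above: the drain itself fails,
            // so closeStream must abort the connection rather than recycle it.
            System.out.println("drain failed: " + e.getMessage());
        }
    }
}
```

Under these assumptions, catching the exception during the drain and aborting (as the patch does for SdkClientException) is the only safe path, since the wrapper cannot tell a timeout's -1 from a real EOF.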