[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran updated HADOOP-15541:
------------------------------------
       Resolution: Fixed
    Fix Version/s: 3.1.1
           Status: Resolved  (was: Patch Available)

> AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-15541
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15541
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.9.1, 2.8.4, 3.0.2, 3.1.1
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>            Priority: Major
>             Fix For: 3.1.1
>
>         Attachments: HADOOP-15541.001.patch
>
>
> I've gotten a few reports of read timeouts not being handled properly in some
> Impala workloads. What happens is the following sequence of events (credit to
> Sailesh Mukil for figuring this out):
> * S3AInputStream.read() gets a SocketTimeoutException when it calls
> wrappedStream.read().
> * This is handled by onReadFailure -> reopen -> closeStream. When we try to
> drain the stream, SdkFilterInputStream.read() in the AWS SDK fails because of
> checkLength. The underlying Apache Commons stream returns -1 both on a
> timeout and on EOF.
> * The SDK assumes the -1 signifies an EOF, so it assumes the bytes read must
> equal the expected bytes, and because they don't (it's a timeout, not an
> EOF) it throws an SdkClientException.
> This is tricky to test for without a ton of mocking of AWS SDK internals,
> because you have to get into this conflicting state where the SDK has only
> read a subset of the expected bytes and gets a -1.
> closeStream will abort the stream in the event of an IOException when
> draining. We could simply also abort in the event of an SdkClientException.
> I'm testing that this results in correct functionality in the workloads that
> seem to hit these timeouts a lot, but all the s3a tests continue to work with
> that change.
> I'm going to open an issue with the AWS SDK on GitHub as well, but I'm not
> sure what the ideal outcome would be unless there's a good way, short of
> huge rewrites, to distinguish between a stream that has timed out and a
> stream that has read all the data.
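
For readers of the archive, the "also abort on SdkClientException" idea described in the issue can be sketched as below. This is a minimal, self-contained illustration and not the attached patch: `FakeSdkClientException` and `AbortableStream` are hypothetical stand-ins for the SDK's unchecked `SdkClientException` and for an abortable S3 object stream, and the length check in `read()` only imitates the behaviour the report attributes to `checkLength`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class DrainSketch {

    /** Hypothetical stand-in for the SDK's unchecked SdkClientException. */
    static class FakeSdkClientException extends RuntimeException {
        FakeSdkClientException(String msg) { super(msg); }
    }

    /** Hypothetical stand-in for an abortable, length-checked object stream. */
    static class AbortableStream extends InputStream {
        private final InputStream in;
        private final long expected;   // bytes the caller was promised
        private long seen;             // bytes actually delivered so far
        boolean aborted;
        boolean closed;

        AbortableStream(InputStream in, long expected) {
            this.in = in;
            this.expected = expected;
        }

        @Override
        public int read() throws IOException {
            int b = in.read();
            if (b >= 0) {
                seen++;
                return b;
            }
            // A -1 is treated as EOF, so a short read looks like corruption.
            if (seen != expected) {
                throw new FakeSdkClientException(
                    "read " + seen + " bytes, expected " + expected);
            }
            return -1;
        }

        void abort() { aborted = true; }

        @Override
        public void close() { closed = true; }
    }

    /** Drains and closes the stream; aborts it if draining fails. Returns true on abort. */
    static boolean closeStream(AbortableStream stream) {
        boolean shouldAbort = false;
        try {
            while (stream.read() >= 0) {
                // discard remaining bytes so the connection could be reused
            }
            stream.close();
        } catch (IOException | FakeSdkClientException e) {
            // Previously only IOException led to an abort; the unchecked
            // SDK exception is now handled the same way.
            shouldAbort = true;
        }
        if (shouldAbort) {
            stream.abort();
        }
        return shouldAbort;
    }

    public static void main(String[] args) {
        AbortableStream complete =
            new AbortableStream(new ByteArrayInputStream(new byte[4]), 4);
        System.out.println("complete: aborted=" + closeStream(complete));

        // Only 2 of the 4 expected bytes arrive before the -1, as after a timeout.
        AbortableStream truncated =
            new AbortableStream(new ByteArrayInputStream(new byte[2]), 4);
        System.out.println("truncated: aborted=" + closeStream(truncated));
    }
}
```

In this sketch the truncated stream is aborted rather than letting the unchecked exception escape from the close path, which is the behaviour the issue proposes for the timeout case.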