Hi,

The current Flink 1.6.0 version uses the Presto Hive S3 connector 0.185 [1], which has this option: S3_MAX_CLIENT_RETRIES = "presto.s3.max-client-retries";
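For example, the corresponding flink-conf.yaml entry could look like this (the retry count of 10 is an arbitrary illustration, not a recommended value):

```yaml
# flink-conf.yaml — flink-s3-fs-presto forwards "s3.*" keys
# to the shaded Presto S3 filesystem as "presto.s3.*"
s3.max-client-retries: 10
```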
If you add "s3.max-client-retries" to the Flink configuration, flink-s3-fs-presto [2] should automatically prefix it and configure PrestoS3FileSystem correctly.

Cheers,
Andrey

[1] https://github.com/prestodb/presto/blob/0.185/presto-hive/src/main/java/com/facebook/presto/hive/PrestoS3FileSystem.java
[2] https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/aws.html#shaded-hadooppresto-s3-file-systems-recommended

> On 29 Aug 2018, at 08:49, vino yang <yanghua1...@gmail.com> wrote:
>
> Hi Tony,
>
> Maybe you can consider looking at the doc information for this class; this class comes from flink-s3-fs-presto. [1]
>
> [1]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/api/java/org/apache/hadoop/conf/Configuration.html
>
> Thanks, vino.
>
> Tony Wei <tony19920...@gmail.com> wrote on Wednesday, 29 August 2018 at 2:18 PM:
>
> Hi Vino,
>
> I thought this config was for the AWS S3 client, but this client is inside flink-s3-fs-presto, so I guessed I should find a way to pass this config to that library.
>
> Best,
> Tony Wei
>
> 2018-08-29 14:13 GMT+08:00 vino yang <yanghua1...@gmail.com>:
>
> Hi Tony,
>
> Sorry, I just saw the timeout. I thought they were similar because they both happened on AWS S3. Regarding this setting, isn't "s3.max-client-retries: xxx" set for the client?
>
> Thanks, vino.
>
> Tony Wei <tony19920...@gmail.com> wrote on Wednesday, 29 August 2018 at 1:17 PM:
>
> Hi Vino,
>
> Thanks for your quick reply, but I think these two questions are different.
> The checkpoint in that question finally finished, but my checkpoint failed due to an S3 client timeout. You can see from my screenshot that the checkpoint failed within a short time.
>
> Regarding the configuration, do you mean passing the configuration as the program's input arguments? I don't think that will work; at the least, I would need a way to pass it to the S3 filesystem builder in my program. However, I will try passing it via flink-conf.yaml, because I used that file to configure the global settings for the S3 filesystem, and I thought there might be a simple way to support this setting like the other s3.xxx configs.
>
> Thanks very much for your answer and help.
>
> Best,
> Tony Wei
>
> 2018-08-29 11:51 GMT+08:00 vino yang <yanghua1...@gmail.com>:
>
> Hi Tony,
>
> A while ago, I answered a similar question. [1]
>
> You can try to increase this value appropriately. You can't put this configuration in flink-conf.yaml; you can put it in the submit command of the job [2], or in the configuration file you specify.
>
> [1]: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Why-checkpoint-took-so-long-td22364.html#a22375
> [2]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/cli.html
>
> Thanks, vino.
>
> Tony Wei <tony19920...@gmail.com> wrote on Wednesday, 29 August 2018 at 11:36 AM:
>
> Hi,
>
> I met a checkpoint failure caused by an S3 exception:
>
> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
> (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: B8BE8978D3EFF3F5), S3 Extended Request ID:
> ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=
>
> The full stack trace and a screenshot are provided in the attachment.
>
> My settings for the Flink cluster and job:
> - Flink version 1.4.0
> - standalone mode
> - 4 slots per TM
> - Presto S3 filesystem
> - RocksDB state backend
> - local SSD
> - incremental checkpointing enabled
>
> There was no weird message besides the exception in the log file, and no high GC ratio during the checkpoint procedure. Still, 3 of the 4 parts uploaded successfully on that TM. I didn't find anything that could be related to this failure. Did anyone meet this problem before?
>
> Besides, I also found an issue in another AWS SDK [1] that mentioned this S3 exception as well. One reply said you can passively avoid the problem by raising the max client retries config, so I looked for that config in Presto [2]. Can I just add "s3.max-client-retries: xxx" to flink-conf.yaml to configure it? If not, how should I override the default value of this configuration? Thanks in advance.
>
> Best,
> Tony Wei
>
> [1] https://github.com/aws/aws-sdk-php/issues/885
> [2] https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/HiveS3Config.java#L218
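As a rough illustration of the prefixing behaviour Andrey describes at the top of the thread, the key forwarding can be sketched as follows. PrefixMirror is a hypothetical stand-in for the sake of the example, not Flink's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: mirror un-prefixed "s3.*" keys from the Flink
// configuration onto the Presto connector's "presto.s3.*" namespace,
// in the spirit of what flink-s3-fs-presto does internally.
public class PrefixMirror {
    static final String PREFIX = "presto.";

    static Map<String, String> mirror(Map<String, String> flinkConf) {
        Map<String, String> prestoConf = new HashMap<>();
        for (Map.Entry<String, String> e : flinkConf.entrySet()) {
            if (e.getKey().startsWith("s3.")) {
                // e.g. "s3.max-client-retries" -> "presto.s3.max-client-retries"
                prestoConf.put(PREFIX + e.getKey(), e.getValue());
            }
        }
        return prestoConf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("s3.max-client-retries", "10");
        System.out.println(mirror(conf));
    }
}
```

Under this sketch, a user only ever writes the short `s3.*` key in flink-conf.yaml, and the filesystem factory hands the prefixed variant to PrestoS3FileSystem.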