Hi Tony,

A while ago, I answered a similar question [1].
You can try increasing this value appropriately. You can't put this
configuration in flink-conf.yaml; instead, you can pass it in the job's
submit command [2] or put it in the configuration file you specify.

[1]: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Why-checkpoint-took-so-long-td22364.html#a22375
[2]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/cli.html

Thanks, vino.

Tony Wei <tony19920...@gmail.com> wrote on Wed, Aug 29, 2018 at 11:36 AM:

> Hi,
>
> I ran into a checkpoint failure caused by an S3 exception.
>
>> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>> Your socket connection to the server was not read from or written to within
>> the timeout period. Idle connections will be closed. (Service: Amazon S3;
>> Status Code: 400; Error Code: RequestTimeout; Request ID:
>> B8BE8978D3EFF3F5), S3 Extended Request ID:
>> ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=
>
> The full stack trace and a screenshot are provided in the attachment.
>
> My settings for the Flink cluster and job:
>
> - Flink version 1.4.0
> - standalone mode
> - 4 slots for each TM
> - presto s3 filesystem
> - rocksdb statebackend
> - local ssd
> - incremental checkpoints enabled
>
> There were no unusual messages in the log file besides the exception, and
> no high GC ratio during the checkpoint procedure. 3 of 4 parts were still
> uploaded successfully on that TM. I didn't find anything that seemed
> related to this failure. Has anyone run into this problem before?
>
> I also found an issue in another AWS SDK [1] that mentions this S3
> exception. One reply said you can passively avoid the problem by raising
> the max client retries config, and I found that config in Presto [2].
> Can I just add s3.max-client-retries: xxx to flink-conf.yaml to configure
> it? If not, how should I override the default value of this
> configuration? Thanks in advance.
>
> Best,
> Tony Wei
>
> [1] https://github.com/aws/aws-sdk-php/issues/885
> [2]
> https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/HiveS3Config.java#L218