Hi,

The current Flink 1.6.0 version uses the Presto Hive S3 connector 0.185 [1], which has this option: S3_MAX_CLIENT_RETRIES = "presto.s3.max-client-retries";
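For example, the corresponding flink-conf.yaml entry could look like this (the retry count of 10 is an arbitrary illustration, not a recommended value):

```yaml
# flink-conf.yaml — flink-s3-fs-presto forwards "s3.*" keys
# to the shaded Presto S3 filesystem as "presto.s3.*"
s3.max-client-retries: 10
```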
If you add "s3.max-client-retries" to the Flink configuration, flink-s3-fs-presto [2] should automatically prefix it and configure PrestoS3FileSystem correctly.

Cheers,
Andrey

[1] https://github.com/prestodb/presto/blob/0.185/presto-hive/src/main/java/com/facebook/presto/hive/PrestoS3FileSystem.java
[2] https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/aws.html#shaded-hadooppresto-s3-file-systems-recommended

> On 29 Aug 2018, at 08:49, vino yang <yanghua1...@gmail.com> wrote:
>
> Hi Tony,
>
> Maybe you can consider looking at the doc information for this class; this class comes from flink-s3-fs-presto. [1]
>
> [1]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/api/java/org/apache/hadoop/conf/Configuration.html
>
> Thanks, vino.
>
> Tony Wei <tony19920...@gmail.com> wrote on Wednesday, 29 August 2018 at 2:18 PM:
>
> Hi Vino,
>
> I thought this config was for the AWS S3 client, but this client is inside flink-s3-fs-presto, so I guessed I should find a way to pass this config to that library.
>
> Best,
> Tony Wei
>
> 2018-08-29 14:13 GMT+08:00 vino yang <yanghua1...@gmail.com>:
>
> Hi Tony,
>
> Sorry, I just saw the timeout. I thought they were similar because they both happened on AWS S3. Regarding this setting, isn't "s3.max-client-retries: xxx" set for the client?
>
> Thanks, vino.
>
> Tony Wei <tony19920...@gmail.com> wrote on Wednesday, 29 August 2018 at 1:17 PM:
>
> Hi Vino,
>
> Thanks for your quick reply, but I think these two questions are different.
> The checkpoint in that question finally finished, but my checkpoint failed due to an S3 client timeout. You can see from my screenshot that the checkpoint failed within a short time.
>
> Regarding the configuration, do you mean passing the configuration as the program's input arguments? I don't think that will work; at the least, I would need a way to pass it to the S3 filesystem builder in my program. However, I will try passing it via flink-conf.yaml, because I used that file to configure the global settings for the S3 filesystem, and I thought there might be a simple way to support this setting like the other s3.xxx configs.
>
> Thanks very much for your answer and help.
>
> Best,
> Tony Wei
>
> 2018-08-29 11:51 GMT+08:00 vino yang <yanghua1...@gmail.com>:
>
> Hi Tony,
>
> A while ago, I answered a similar question. [1]
>
> You can try to increase this value appropriately. You can't put this configuration in flink-conf.yaml; you can put it in the submit command of the job [2], or in the configuration file you specify.
>
> [1]: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Why-checkpoint-took-so-long-td22364.html#a22375
> [2]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/cli.html
>
> Thanks, vino.
>
> Tony Wei <tony19920...@gmail.com> wrote on Wednesday, 29 August 2018 at 11:36 AM:
>
> Hi,
>
> I met a checkpoint failure caused by an S3 exception:
>
> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
> (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: B8BE8978D3EFF3F5), S3 Extended Request ID:
> ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=
>
> The full stack trace and a screenshot are provided in the attachment.
>
> My settings for the Flink cluster and job:
> - Flink version 1.4.0
> - standalone mode
> - 4 slots per TM
> - Presto S3 filesystem
> - RocksDB state backend
> - local SSD
> - incremental checkpointing enabled
>
> There was no weird message besides the exception in the log file, and no high GC ratio during the checkpoint procedure. Still, 3 of the 4 parts uploaded successfully on that TM. I didn't find anything that could be related to this failure. Did anyone meet this problem before?
>
> Besides, I also found an issue in another AWS SDK [1] that mentioned this S3 exception as well. One reply said you can passively avoid the problem by raising the max client retries config, so I looked for that config in Presto [2]. Can I just add "s3.max-client-retries: xxx" to flink-conf.yaml to configure it? If not, how should I override the default value of this configuration? Thanks in advance.
>
> Best,
> Tony Wei
>
> [1] https://github.com/aws/aws-sdk-php/issues/885
> [2] https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/HiveS3Config.java#L218
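As a rough illustration of the prefixing behaviour Andrey describes at the top of the thread, the key forwarding can be sketched as follows. PrefixMirror is a hypothetical stand-in for the sake of the example, not Flink's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: mirror un-prefixed "s3.*" keys from the Flink
// configuration onto the Presto connector's "presto.s3.*" namespace,
// in the spirit of what flink-s3-fs-presto does internally.
public class PrefixMirror {
    static final String PREFIX = "presto.";

    static Map<String, String> mirror(Map<String, String> flinkConf) {
        Map<String, String> prestoConf = new HashMap<>();
        for (Map.Entry<String, String> e : flinkConf.entrySet()) {
            if (e.getKey().startsWith("s3.")) {
                // e.g. "s3.max-client-retries" -> "presto.s3.max-client-retries"
                prestoConf.put(PREFIX + e.getKey(), e.getValue());
            }
        }
        return prestoConf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("s3.max-client-retries", "10");
        System.out.println(mirror(conf));
    }
}
```

Under this sketch, a user only ever writes the short `s3.*` key in flink-conf.yaml, and the filesystem factory hands the prefixed variant to PrestoS3FileSystem.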