Hi Andrey,

Cool! I will add it to my flink-conf.yaml. However, I'm still wondering if anyone is familiar with this problem or has any idea how to find the root cause. Thanks.
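For readers following along, the setting under discussion would go into flink-conf.yaml roughly like this. This is only a sketch: the retry count of 10 is an illustrative value, not a recommendation from the thread.

```yaml
# flink-conf.yaml (sketch)
# Per Andrey's explanation, flink-s3-fs-presto should pick up "s3."-prefixed
# keys and forward them to the shaded PrestoS3FileSystem as
# "presto.s3.max-client-retries". The value 10 here is illustrative only.
s3.max-client-retries: 10
```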
Best,
Tony Wei

2018-08-29 16:20 GMT+08:00 Andrey Zagrebin <and...@data-artisans.com>:

> Hi,
>
> The current Flink 1.6.0 version uses the Presto Hive S3 connector 0.185 [1], which has this option:
>
>     S3_MAX_CLIENT_RETRIES = "presto.s3.max-client-retries";
>
> If you add "s3.max-client-retries" to the Flink conf, flink-s3-fs-presto [2] should automatically prefix it and configure PrestoS3FileSystem correctly.
>
> Cheers,
> Andrey
>
> [1] https://github.com/prestodb/presto/blob/0.185/presto-hive/src/main/java/com/facebook/presto/hive/PrestoS3FileSystem.java
> [2] https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/aws.html#shaded-hadooppresto-s3-file-systems-recommended
>
> On 29 Aug 2018, at 08:49, vino yang <yanghua1...@gmail.com> wrote:
>
> Hi Tony,
>
> Maybe you can consider looking at the doc information for this class; this class comes from flink-s3-fs-presto. [1]
>
> [1]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/api/java/org/apache/hadoop/conf/Configuration.html
>
> Thanks, vino.
>
> Tony Wei <tony19920...@gmail.com> wrote on Wed, 29 Aug 2018 at 14:18:
>
>> Hi Vino,
>>
>> I thought this config was for the AWS S3 client, but this client is inside
>> flink-s3-fs-presto. So, I guessed I should find a way to pass this config to that library.
>>
>> Best,
>> Tony Wei
>>
>> 2018-08-29 14:13 GMT+08:00 vino yang <yanghua1...@gmail.com>:
>>
>>> Hi Tony,
>>>
>>> Sorry, I just saw the timeout. I thought they were similar because they
>>> both happened on AWS S3.
>>> Regarding this setting, isn't "s3.max-client-retries: xxx" set for the client?
>>>
>>> Thanks, vino.
>>>
>>> Tony Wei <tony19920...@gmail.com> wrote on Wed, 29 Aug 2018 at 13:17:
>>>
>>>> Hi Vino,
>>>>
>>>> Thanks for your quick reply, but I think these two questions are
>>>> different. The checkpoint in that question finally finished, but my
>>>> checkpoint failed due to an S3 client timeout. You can see from my
>>>> screenshot that the checkpoint failed in a short time.
>>>>
>>>> As for the configuration, do you mean passing it as the program's input
>>>> arguments? I don't think that will work; at the least, I would need to find
>>>> a way to pass it to the S3 filesystem builder in my program. However, I
>>>> will ask for help passing it via flink-conf.yaml, because I used that to
>>>> configure the global settings for the S3 filesystem, and I thought there
>>>> might be a simple way to support this setting like the other s3.xxx configs.
>>>>
>>>> I very much appreciate your answer and help.
>>>>
>>>> Best,
>>>> Tony Wei
>>>>
>>>> 2018-08-29 11:51 GMT+08:00 vino yang <yanghua1...@gmail.com>:
>>>>
>>>>> Hi Tony,
>>>>>
>>>>> A while ago, I answered a similar question. [1]
>>>>>
>>>>> You can try to increase this value appropriately. You can't put this
>>>>> configuration in flink-conf.yaml; you can put it in the job's submit
>>>>> command [2], or in the configuration file you specify.
>>>>>
>>>>> [1]: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Why-checkpoint-took-so-long-td22364.html#a22375
>>>>> [2]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/cli.html
>>>>>
>>>>> Thanks, vino.
>>>>>
>>>>> Tony Wei <tony19920...@gmail.com> wrote on Wed, 29 Aug 2018 at 11:36:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I met a checkpoint failure problem caused by an S3 exception:
>>>>>>
>>>>>>> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>>>>>>> Your socket connection to the server was not read from or written to
>>>>>>> within the timeout period. Idle connections will be closed. (Service:
>>>>>>> Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID:
>>>>>>> B8BE8978D3EFF3F5), S3 Extended Request ID:
>>>>>>> ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=
>>>>>>
>>>>>> The full stack trace and screenshot are provided in the attachment.
>>>>>>
>>>>>> My settings for the Flink cluster and job:
>>>>>>
>>>>>> - Flink version 1.4.0
>>>>>> - standalone mode
>>>>>> - 4 slots for each TM
>>>>>> - Presto S3 filesystem
>>>>>> - RocksDB state backend
>>>>>> - local SSD
>>>>>> - incremental checkpoints enabled
>>>>>>
>>>>>> There is no weird message besides the exception in the log file, and no
>>>>>> high ratio of GC during the checkpoint procedure. Also, 3 of the 4 parts
>>>>>> still uploaded successfully on that TM. I didn't find anything that would
>>>>>> relate to this failure. Did anyone meet this problem before?
>>>>>>
>>>>>> Besides, I also found an issue in another AWS SDK [1] that mentioned this
>>>>>> S3 exception as well. One reply said you can passively avoid the problem
>>>>>> by raising the max client retries config, and I found that config in
>>>>>> Presto [2]. Can I just add s3.max-client-retries: xxx in flink-conf.yaml
>>>>>> to configure it? If not, what should I do to override the default value
>>>>>> of this configuration? Thanks in advance.
>>>>>>
>>>>>> Best,
>>>>>> Tony Wei
>>>>>>
>>>>>> [1] https://github.com/aws/aws-sdk-php/issues/885
>>>>>> [2] https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/HiveS3Config.java#L218
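To make Andrey's point about prefixing concrete, here is a toy Python sketch of the key-mirroring idea: take every "s3."-prefixed key from the Flink configuration and re-expose it under Presto's "presto.s3." namespace. This is not Flink's actual implementation; the function name and prefix constant are made up for illustration only.

```python
# Toy sketch (NOT flink-s3-fs-presto's real code) of the idea Andrey describes:
# Flink-level "s3.*" keys are forwarded to the shaded Presto filesystem as
# "presto.s3.*" keys, so users never have to know the shaded key names.

PRESTO_PREFIX = "presto."  # assumption: illustrative constant, not Flink's


def mirror_s3_keys(flink_conf: dict) -> dict:
    """Map Flink 's3.*' keys onto 'presto.s3.*' keys; ignore everything else."""
    presto_conf = {}
    for key, value in flink_conf.items():
        if key.startswith("s3."):
            presto_conf[PRESTO_PREFIX + key] = value
    return presto_conf


if __name__ == "__main__":
    conf = {"s3.max-client-retries": "10", "state.backend": "rocksdb"}
    print(mirror_s3_keys(conf))  # → {'presto.s3.max-client-retries': '10'}
```

The design point is that only filesystem-scoped keys are forwarded; unrelated Flink options (like the state backend) stay out of the Presto client's configuration.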