Hi Vino,

I thought this config was for the aws s3 client, but this client is inside
flink-s3-fs-presto. So I guess I need to find a way to pass this config to
that library.
Best,
Tony Wei

2018-08-29 14:13 GMT+08:00 vino yang <yanghua1...@gmail.com>:

> Hi Tony,
>
> Sorry, I just saw the timeout; I thought they were similar because they
> both happened on aws s3.
> Regarding this setting, isn't "s3.max-client-retries: xxx" set for the
> client?
>
> Thanks, vino.
>
> Tony Wei <tony19920...@gmail.com> wrote on Wed, Aug 29, 2018 at 1:17 PM:
>
>> Hi Vino,
>>
>> Thanks for your quick reply, but I think these two questions are
>> different. The checkpoint in that question eventually finished, but my
>> checkpoint failed due to an s3 client timeout. You can see from my
>> screenshot that the checkpoint failed within a short time.
>>
>> Regarding the configuration, do you mean passing it as the program's
>> input arguments? I don't think that will work; at the very least I would
>> need to find a way to pass it to the s3 filesystem builder in my program.
>> However, I will ask for help passing it via flink-conf.yaml, because I
>> used that file to set the global configuration for the s3 filesystem,
>> and I thought there might be a simple way to support this setting like
>> the other s3.xxx configs.
>>
>> Very much appreciated for your answer and help.
>>
>> Best,
>> Tony Wei
>>
>> 2018-08-29 11:51 GMT+08:00 vino yang <yanghua1...@gmail.com>:
>>
>>> Hi Tony,
>>>
>>> A while ago, I answered a similar question. [1]
>>>
>>> You can try to increase this value appropriately. You can't put this
>>> configuration in flink-conf.yaml; you can put it in the submit command
>>> of the job [2], or in the configuration file you specify.
>>>
>>> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Why-checkpoint-took-so-long-td22364.html#a22375
>>> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/cli.html
>>>
>>> Thanks, vino.
>>>
>>> Tony Wei <tony19920...@gmail.com> wrote on Wed, Aug 29, 2018 at 11:36 AM:
>>>
>>>> Hi,
>>>>
>>>> I met a checkpoint failure problem caused by an s3 exception.
>>>>
>>>>> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>>>>> Your socket connection to the server was not read from or written to
>>>>> within the timeout period. Idle connections will be closed.
>>>>> (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout;
>>>>> Request ID: B8BE8978D3EFF3F5), S3 Extended Request ID:
>>>>> ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=
>>>>
>>>> The full stack trace and screenshot are provided in the attachment.
>>>>
>>>> My settings for the flink cluster and job:
>>>>
>>>> - flink version 1.4.0
>>>> - standalone mode
>>>> - 4 slots per TM
>>>> - presto s3 filesystem
>>>> - rocksdb state backend
>>>> - local ssd
>>>> - incremental checkpointing enabled
>>>>
>>>> There was no unusual message besides the exception in the log file, and
>>>> no high GC ratio during the checkpoint procedure. Moreover, 3 of the 4
>>>> parts still uploaded successfully on that TM. I couldn't find anything
>>>> that would relate to this failure. Has anyone met this problem before?
>>>>
>>>> Besides, I also found an issue in another aws sdk [1] that mentioned
>>>> this s3 exception as well. One reply said you can passively avoid the
>>>> problem by raising the max client retries config, and I found that
>>>> config in presto [2]. Can I just add "s3.max-client-retries: xxx" to
>>>> flink-conf.yaml to configure it? If not, how should I overwrite the
>>>> default value of this configuration? Thanks in advance.
>>>>
>>>> Best,
>>>> Tony Wei
>>>>
>>>> [1] https://github.com/aws/aws-sdk-php/issues/885
>>>> [2] https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/HiveS3Config.java#L218
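
[Editor's note] For readers landing on this thread: in Flink releases newer than the 1.4.0 discussed here, flink-s3-fs-presto is documented to mirror `s3.`-prefixed keys from flink-conf.yaml into the underlying Presto Hive S3 configuration. Whether this forwarding covers `max-client-retries` in 1.4.0 is exactly what the thread is asking, so the fragment below is only a hypothetical sketch under that assumption; the retry count is an arbitrary example and should be verified against your Flink version.

```yaml
# flink-conf.yaml -- a hypothetical sketch, NOT verified against Flink 1.4.0.
# Assumption: flink-s3-fs-presto forwards keys under the "s3." prefix to
# presto's hive.s3.* options, so this would map to hive.s3.max-client-retries.
s3.max-client-retries: 10
```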