Hi Tony,

Sorry, I just saw the timeout. I thought they were similar because they both happened on AWS S3. Regarding this setting, isn't "s3.max-client-retries: xxx" set for the client?
Thanks, vino.

Tony Wei <tony19920...@gmail.com> wrote on Wed, Aug 29, 2018 at 1:17 PM:

> Hi Vino,
>
> Thanks for your quick reply, but I think these two questions are
> different. The checkpoint in that question finally finished, whereas my
> checkpoint failed due to an S3 client timeout. You can see from my
> screenshot that the checkpoint failed within a short time.
>
> Regarding the configuration, do you mean passing it as the program's
> input arguments? I don't think that will work; at the very least I
> would need a way to pass it to the S3 filesystem builder in my program.
> Instead, I will look for a way to pass it via flink-conf.yaml, because
> that is where I configured the global settings for the S3 filesystem,
> and I thought there might be a simple way to support this setting like
> the other s3.xxx configs.
>
> Thank you very much for your answer and help.
>
> Best,
> Tony Wei
>
> 2018-08-29 11:51 GMT+08:00 vino yang <yanghua1...@gmail.com>:
>
>> Hi Tony,
>>
>> A while ago, I answered a similar question. [1]
>>
>> You can try to increase this value appropriately. You can't put this
>> configuration in flink-conf.yaml, but you can put it in the job's
>> submit command [2], or in a configuration file that you specify.
>>
>> [1]: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Why-checkpoint-took-so-long-td22364.html#a22375
>> [2]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/cli.html
>>
>> Thanks, vino.
>>
>> Tony Wei <tony19920...@gmail.com> wrote on Wed, Aug 29, 2018 at 11:36 AM:
>>
>>> Hi,
>>>
>>> I ran into a checkpoint failure caused by an S3 exception:
>>>
>>>> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>>>> Your socket connection to the server was not read from or written to
>>>> within the timeout period. Idle connections will be closed.
>>>> (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout;
>>>> Request ID: B8BE8978D3EFF3F5), S3 Extended Request ID:
>>>> ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=
>>>
>>> The full stack trace and a screenshot are provided in the attachment.
>>>
>>> My setup for the Flink cluster and job:
>>>
>>> - Flink version 1.4.0
>>> - standalone mode
>>> - 4 slots per TM
>>> - Presto S3 filesystem
>>> - RocksDB state backend
>>> - local SSD
>>> - incremental checkpoints enabled
>>>
>>> There is no unusual message besides the exception in the log file,
>>> and no high GC ratio during the checkpoint procedure. Also, 3 of the
>>> 4 parts still uploaded successfully on that TM. I couldn't find
>>> anything that would be related to this failure. Has anyone run into
>>> this problem before?
>>>
>>> Besides, I also found an issue in another AWS SDK [1] that mentions
>>> this S3 exception. One reply there said you can passively avoid the
>>> problem by raising the client's max-retries config, and I found that
>>> config in Presto [2]. Can I just add s3.max-client-retries: xxx to
>>> flink-conf.yaml to configure it? If not, how should I override the
>>> default value of this configuration? Thanks in advance.
>>>
>>> Best,
>>> Tony Wei
>>>
>>> [1] https://github.com/aws/aws-sdk-php/issues/885
>>> [2] https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/HiveS3Config.java#L218
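[Editor's note for readers landing on this thread: in current Flink releases, the flink-s3-fs-presto filesystem mirrors keys in flink-conf.yaml that start with `s3.` to Presto's `hive.s3.*` Hive config, which is the behavior Tony is asking about. Whether Flink 1.4.0 forwarded this particular key should be verified against that version's flink-s3-fs-presto code. A sketch under that assumption, with purely illustrative values:]

```yaml
# flink-conf.yaml -- sketch, assuming flink-s3-fs-presto forwards
# "s3.*" keys to Presto's "hive.s3.*" configuration. The values below
# are illustrations, not recommendations; verify the forwarding
# behavior against the Flink version in use.

# Mirrors Presto's hive.s3.max-client-retries (client-level retries):
s3.max-client-retries: 10

# Related Presto S3 client timeouts from HiveS3Config that govern the
# same idle-connection / RequestTimeout behavior:
s3.socket-timeout: 2m
s3.connect-timeout: 2m
```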