Hi Tony,

Sorry, I just saw the timeout. I thought they were similar because they both happened on AWS S3. Regarding this setting, isn't "s3.max-client-retries: xxx" set for the client?
Thanks, vino.

Tony Wei <tony19920...@gmail.com> wrote on Wed, Aug 29, 2018 at 1:17 PM:

> Hi Vino,
>
> Thanks for your quick reply, but I think these two questions are
> different. The checkpoint in that question finally finished, whereas my
> checkpoint failed due to an S3 client timeout. You can see from my
> screenshot that the checkpoint failed within a short time.
>
> Regarding the configuration, do you mean passing it as the program's
> input arguments? I don't think that will work; at the very least I
> would need a way to pass it to the S3 filesystem builder in my program.
> Instead, I will look for a way to pass it via flink-conf.yaml, because
> that is where I configured the global settings for the S3 filesystem,
> and I thought there might be a simple way to support this setting like
> the other s3.xxx configs.
>
> Thank you very much for your answer and help.
>
> Best,
> Tony Wei
>
> 2018-08-29 11:51 GMT+08:00 vino yang <yanghua1...@gmail.com>:
>
>> Hi Tony,
>>
>> A while ago, I answered a similar question. [1]
>>
>> You can try to increase this value appropriately. You can't put this
>> configuration in flink-conf.yaml, but you can put it in the job's
>> submit command [2], or in a configuration file that you specify.
>>
>> [1]: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Why-checkpoint-took-so-long-td22364.html#a22375
>> [2]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/cli.html
>>
>> Thanks, vino.
>>
>> Tony Wei <tony19920...@gmail.com> wrote on Wed, Aug 29, 2018 at 11:36 AM:
>>
>>> Hi,
>>>
>>> I ran into a checkpoint failure caused by an S3 exception:
>>>
>>>> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>>>> Your socket connection to the server was not read from or written to
>>>> within the timeout period. Idle connections will be closed.
>>>> (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout;
>>>> Request ID: B8BE8978D3EFF3F5), S3 Extended Request ID:
>>>> ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=
>>>
>>> The full stack trace and a screenshot are provided in the attachment.
>>>
>>> My setup for the Flink cluster and job:
>>>
>>> - Flink version 1.4.0
>>> - standalone mode
>>> - 4 slots per TM
>>> - Presto S3 filesystem
>>> - RocksDB state backend
>>> - local SSD
>>> - incremental checkpoints enabled
>>>
>>> There is no unusual message besides the exception in the log file,
>>> and no high GC ratio during the checkpoint procedure. Also, 3 of the
>>> 4 parts still uploaded successfully on that TM. I couldn't find
>>> anything that would be related to this failure. Has anyone run into
>>> this problem before?
>>>
>>> Besides, I also found an issue in another AWS SDK [1] that mentions
>>> this S3 exception. One reply there said you can passively avoid the
>>> problem by raising the client's max-retries config, and I found that
>>> config in Presto [2]. Can I just add s3.max-client-retries: xxx to
>>> flink-conf.yaml to configure it? If not, how should I override the
>>> default value of this configuration? Thanks in advance.
>>>
>>> Best,
>>> Tony Wei
>>>
>>> [1] https://github.com/aws/aws-sdk-php/issues/885
>>> [2] https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/HiveS3Config.java#L218
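[Editor's note for readers landing on this thread: in current Flink releases, the flink-s3-fs-presto filesystem mirrors keys in flink-conf.yaml that start with `s3.` to Presto's `hive.s3.*` Hive config, which is the behavior Tony is asking about. Whether Flink 1.4.0 forwarded this particular key should be verified against that version's flink-s3-fs-presto code. A sketch under that assumption, with purely illustrative values:]

```yaml
# flink-conf.yaml -- sketch, assuming flink-s3-fs-presto forwards
# "s3.*" keys to Presto's "hive.s3.*" configuration. The values below
# are illustrations, not recommendations; verify the forwarding
# behavior against the Flink version in use.

# Mirrors Presto's hive.s3.max-client-retries (client-level retries):
s3.max-client-retries: 10

# Related Presto S3 client timeouts from HiveS3Config that govern the
# same idle-connection / RequestTimeout behavior:
s3.socket-timeout: 2m
s3.connect-timeout: 2m
```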