Hi Vino,

I thought this config was for the aws s3 client, but this client is inside
flink-s3-fs-presto. So I guess I need to find a way to pass this config to
that library.
Best,
Tony Wei

2018-08-29 14:13 GMT+08:00 vino yang <yanghua1...@gmail.com>:

> Hi Tony,
>
> Sorry, I just saw the timeout; I thought they were similar because they
> both happened on aws s3.
> Regarding this setting, isn't "s3.max-client-retries: xxx" set for the
> client?
>
> Thanks, vino.
>
> Tony Wei <tony19920...@gmail.com> wrote on Wed, Aug 29, 2018 at 1:17 PM:
>
>> Hi Vino,
>>
>> Thanks for your quick reply, but I think these two questions are
>> different. The checkpoint in that question eventually finished, but my
>> checkpoint failed due to an s3 client timeout. You can see from my
>> screenshot that the checkpoint failed within a short time.
>>
>> Regarding the configuration, do you mean passing it as the program's
>> input arguments? I don't think that will work; at the very least I would
>> need to find a way to pass it to the s3 filesystem builder in my program.
>> However, I will ask for help passing it via flink-conf.yaml, because I
>> used that file to set the global configuration for the s3 filesystem,
>> and I thought there might be a simple way to support this setting like
>> the other s3.xxx configs.
>>
>> Very much appreciated for your answer and help.
>>
>> Best,
>> Tony Wei
>>
>> 2018-08-29 11:51 GMT+08:00 vino yang <yanghua1...@gmail.com>:
>>
>>> Hi Tony,
>>>
>>> A while ago, I answered a similar question. [1]
>>>
>>> You can try to increase this value appropriately. You can't put this
>>> configuration in flink-conf.yaml; you can put it in the submit command
>>> of the job [2], or in the configuration file you specify.
>>>
>>> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Why-checkpoint-took-so-long-td22364.html#a22375
>>> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/cli.html
>>>
>>> Thanks, vino.
>>>
>>> Tony Wei <tony19920...@gmail.com> wrote on Wed, Aug 29, 2018 at 11:36 AM:
>>>
>>>> Hi,
>>>>
>>>> I met a checkpoint failure problem caused by an s3 exception.
>>>>
>>>>> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>>>>> Your socket connection to the server was not read from or written to
>>>>> within the timeout period. Idle connections will be closed.
>>>>> (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout;
>>>>> Request ID: B8BE8978D3EFF3F5), S3 Extended Request ID:
>>>>> ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=
>>>>
>>>> The full stack trace and screenshot are provided in the attachment.
>>>>
>>>> My settings for the flink cluster and job:
>>>>
>>>> - flink version 1.4.0
>>>> - standalone mode
>>>> - 4 slots per TM
>>>> - presto s3 filesystem
>>>> - rocksdb state backend
>>>> - local ssd
>>>> - incremental checkpointing enabled
>>>>
>>>> There was no unusual message besides the exception in the log file, and
>>>> no high GC ratio during the checkpoint procedure. Moreover, 3 of the 4
>>>> parts still uploaded successfully on that TM. I couldn't find anything
>>>> that would relate to this failure. Has anyone met this problem before?
>>>>
>>>> Besides, I also found an issue in another aws sdk [1] that mentioned
>>>> this s3 exception as well. One reply said you can passively avoid the
>>>> problem by raising the max client retries config, and I found that
>>>> config in presto [2]. Can I just add "s3.max-client-retries: xxx" to
>>>> flink-conf.yaml to configure it? If not, how should I overwrite the
>>>> default value of this configuration? Thanks in advance.
>>>>
>>>> Best,
>>>> Tony Wei
>>>>
>>>> [1] https://github.com/aws/aws-sdk-php/issues/885
>>>> [2] https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/HiveS3Config.java#L218
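
[Editor's note] For readers landing on this thread: in Flink releases newer than the 1.4.0 discussed here, flink-s3-fs-presto is documented to mirror `s3.`-prefixed keys from flink-conf.yaml into the underlying Presto Hive S3 configuration. Whether this forwarding covers `max-client-retries` in 1.4.0 is exactly what the thread is asking, so the fragment below is only a hypothetical sketch under that assumption; the retry count is an arbitrary example and should be verified against your Flink version.

```yaml
# flink-conf.yaml -- a hypothetical sketch, NOT verified against Flink 1.4.0.
# Assumption: flink-s3-fs-presto forwards keys under the "s3." prefix to
# presto's hive.s3.* options, so this would map to hive.s3.max-client-retries.
s3.max-client-retries: 10
```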