Hi Andrey,

Cool! I will add it to my flink-conf.yaml. However, I'm still wondering if anyone is familiar with this problem or has any idea how to find the root cause. Thanks.
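For readers following along, the setting under discussion would go into flink-conf.yaml roughly like this. This is only a sketch: the retry count of 10 is an illustrative value, not a recommendation from the thread.

```yaml
# flink-conf.yaml (sketch)
# Per Andrey's explanation, flink-s3-fs-presto should pick up "s3."-prefixed
# keys and forward them to the shaded PrestoS3FileSystem as
# "presto.s3.max-client-retries". The value 10 here is illustrative only.
s3.max-client-retries: 10
```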
Best,
Tony Wei

2018-08-29 16:20 GMT+08:00 Andrey Zagrebin <and...@data-artisans.com>:

> Hi,
>
> The current Flink 1.6.0 version uses the Presto Hive S3 connector 0.185 [1], which has this option:
>
>     S3_MAX_CLIENT_RETRIES = "presto.s3.max-client-retries";
>
> If you add "s3.max-client-retries" to the Flink conf, flink-s3-fs-presto [2] should automatically prefix it and configure PrestoS3FileSystem correctly.
>
> Cheers,
> Andrey
>
> [1] https://github.com/prestodb/presto/blob/0.185/presto-hive/src/main/java/com/facebook/presto/hive/PrestoS3FileSystem.java
> [2] https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/aws.html#shaded-hadooppresto-s3-file-systems-recommended
>
> On 29 Aug 2018, at 08:49, vino yang <yanghua1...@gmail.com> wrote:
>
> Hi Tony,
>
> Maybe you can consider looking at the doc information for this class; this class comes from flink-s3-fs-presto. [1]
>
> [1]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/api/java/org/apache/hadoop/conf/Configuration.html
>
> Thanks, vino.
>
> Tony Wei <tony19920...@gmail.com> wrote on Wed, 29 Aug 2018 at 14:18:
>
>> Hi Vino,
>>
>> I thought this config was for the AWS S3 client, but this client is inside
>> flink-s3-fs-presto. So, I guessed I should find a way to pass this config to that library.
>>
>> Best,
>> Tony Wei
>>
>> 2018-08-29 14:13 GMT+08:00 vino yang <yanghua1...@gmail.com>:
>>
>>> Hi Tony,
>>>
>>> Sorry, I just saw the timeout. I thought they were similar because they
>>> both happened on AWS S3.
>>> Regarding this setting, isn't "s3.max-client-retries: xxx" set for the client?
>>>
>>> Thanks, vino.
>>>
>>> Tony Wei <tony19920...@gmail.com> wrote on Wed, 29 Aug 2018 at 13:17:
>>>
>>>> Hi Vino,
>>>>
>>>> Thanks for your quick reply, but I think these two questions are
>>>> different. The checkpoint in that question finally finished, but my
>>>> checkpoint failed due to an S3 client timeout. You can see from my
>>>> screenshot that the checkpoint failed in a short time.
>>>>
>>>> As for the configuration, do you mean passing it as the program's input
>>>> arguments? I don't think that will work; at the least, I would need to find
>>>> a way to pass it to the S3 filesystem builder in my program. However, I
>>>> will ask for help passing it via flink-conf.yaml, because I used that to
>>>> configure the global settings for the S3 filesystem, and I thought there
>>>> might be a simple way to support this setting like the other s3.xxx configs.
>>>>
>>>> I very much appreciate your answer and help.
>>>>
>>>> Best,
>>>> Tony Wei
>>>>
>>>> 2018-08-29 11:51 GMT+08:00 vino yang <yanghua1...@gmail.com>:
>>>>
>>>>> Hi Tony,
>>>>>
>>>>> A while ago, I answered a similar question. [1]
>>>>>
>>>>> You can try to increase this value appropriately. You can't put this
>>>>> configuration in flink-conf.yaml; you can put it in the job's submit
>>>>> command [2], or in the configuration file you specify.
>>>>>
>>>>> [1]: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Why-checkpoint-took-so-long-td22364.html#a22375
>>>>> [2]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/cli.html
>>>>>
>>>>> Thanks, vino.
>>>>>
>>>>> Tony Wei <tony19920...@gmail.com> wrote on Wed, 29 Aug 2018 at 11:36:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I met a checkpoint failure problem caused by an S3 exception:
>>>>>>
>>>>>>> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>>>>>>> Your socket connection to the server was not read from or written to
>>>>>>> within the timeout period. Idle connections will be closed. (Service:
>>>>>>> Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID:
>>>>>>> B8BE8978D3EFF3F5), S3 Extended Request ID:
>>>>>>> ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=
>>>>>>
>>>>>> The full stack trace and screenshot are provided in the attachment.
>>>>>>
>>>>>> My settings for the Flink cluster and job:
>>>>>>
>>>>>> - Flink version 1.4.0
>>>>>> - standalone mode
>>>>>> - 4 slots for each TM
>>>>>> - Presto S3 filesystem
>>>>>> - RocksDB state backend
>>>>>> - local SSD
>>>>>> - incremental checkpoints enabled
>>>>>>
>>>>>> There is no weird message besides the exception in the log file, and no
>>>>>> high ratio of GC during the checkpoint procedure. Also, 3 of the 4 parts
>>>>>> still uploaded successfully on that TM. I didn't find anything that would
>>>>>> relate to this failure. Did anyone meet this problem before?
>>>>>>
>>>>>> Besides, I also found an issue in another AWS SDK [1] that mentioned this
>>>>>> S3 exception as well. One reply said you can passively avoid the problem
>>>>>> by raising the max client retries config, and I found that config in
>>>>>> Presto [2]. Can I just add s3.max-client-retries: xxx in flink-conf.yaml
>>>>>> to configure it? If not, what should I do to override the default value
>>>>>> of this configuration? Thanks in advance.
>>>>>>
>>>>>> Best,
>>>>>> Tony Wei
>>>>>>
>>>>>> [1] https://github.com/aws/aws-sdk-php/issues/885
>>>>>> [2] https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/HiveS3Config.java#L218
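To make Andrey's point about prefixing concrete, here is a toy Python sketch of the key-mirroring idea: take every "s3."-prefixed key from the Flink configuration and re-expose it under Presto's "presto.s3." namespace. This is not Flink's actual implementation; the function name and prefix constant are made up for illustration only.

```python
# Toy sketch (NOT flink-s3-fs-presto's real code) of the idea Andrey describes:
# Flink-level "s3.*" keys are forwarded to the shaded Presto filesystem as
# "presto.s3.*" keys, so users never have to know the shaded key names.

PRESTO_PREFIX = "presto."  # assumption: illustrative constant, not Flink's


def mirror_s3_keys(flink_conf: dict) -> dict:
    """Map Flink 's3.*' keys onto 'presto.s3.*' keys; ignore everything else."""
    presto_conf = {}
    for key, value in flink_conf.items():
        if key.startswith("s3."):
            presto_conf[PRESTO_PREFIX + key] = value
    return presto_conf


if __name__ == "__main__":
    conf = {"s3.max-client-retries": "10", "state.backend": "rocksdb"}
    print(mirror_s3_keys(conf))  # → {'presto.s3.max-client-retries': '10'}
```

The design point is that only filesystem-scoped keys are forwarded; unrelated Flink options (like the state backend) stay out of the Presto client's configuration.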