Re: checkpoint failed due to s3 exception: request timeout

2018-08-29 Thread Tony Wei
Hi Andrey,

Cool! I will add it to my flink-conf.yaml. However, I'm still wondering whether
anyone is familiar with this problem or has any idea how to find the root cause. Thanks.

Best,
Tony Wei

Re: checkpoint failed due to s3 exception: request timeout

2018-08-29 Thread Andrey Zagrebin
Hi,

the current Flink 1.6.0 version uses the Presto Hive S3 connector 0.185 [1], which
has this option:
S3_MAX_CLIENT_RETRIES = "presto.s3.max-client-retries";

If you add "s3.max-client-retries" to the Flink configuration, flink-s3-fs-presto [2]
should automatically prefix it and configure PrestoS3FileSystem correctly.
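
For example, a minimal flink-conf.yaml sketch could look like the following (the credential
keys are the ones described in the S3 deployment docs [2]; the placeholder values and the
retry count of 10 are purely illustrative):

s3.access-key: your-access-key
s3.secret-key: your-secret-key
# picked up by flink-s3-fs-presto and turned into presto.s3.max-client-retries
s3.max-client-retries: 10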

Cheers,
Andrey

[1] https://github.com/prestodb/presto/blob/0.185/presto-hive/src/main/java/com/facebook/presto/hive/PrestoS3FileSystem.java
[2] https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/aws.html#shaded-hadooppresto-s3-file-systems-recommended


Re: checkpoint failed due to s3 exception: request timeout

2018-08-29 Thread vino yang
Hi Tony,

Maybe you can take a look at the documentation for this class; it comes from
flink-s3-fs-presto. [1]

[1]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/api/java/org/apache/hadoop/conf/Configuration.html

Thanks, vino.


Re: checkpoint failed due to s3 exception: request timeout

2018-08-29 Thread Tony Wei
Hi Vino,

I thought this config was for the AWS S3 client, but that client lives inside
flink-s3-fs-presto.
So I guessed I should find a way to pass this config through to that library.
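
Just to illustrate what a retry limit on the plain AWS S3 client looks like at the SDK level
(this is ordinary AWS SDK v1 usage, not the actual wiring inside flink-s3-fs-presto, and the
region and retry count are only illustrative):

import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3RetryExample {
    public static void main(String[] args) {
        // Allow more automatic retries for failed or timed-out requests.
        ClientConfiguration clientConfig = new ClientConfiguration().withMaxErrorRetry(10);
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-east-1")
                .withClientConfiguration(clientConfig)
                .build();
        System.out.println("Created S3 client with maxErrorRetry = 10: " + s3);
    }
}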

Best,
Tony Wei


Re: checkpoint failed due to s3 exception: request timeout

2018-08-29 Thread vino yang
Hi Tony,

Sorry, I only just noticed the timeout; I thought the two issues were similar because they
both happened on AWS S3.
Regarding this setting, isn't "s3.max-client-retries: xxx" set for the client?

Thanks, vino.


Re: checkpoint failed due to s3 exception: request timeout

2018-08-28 Thread Tony Wei
Hi Vino,

Thanks for your quick reply, but I think these two questions are different.
The checkpoint in that question eventually finished, but my checkpoint failed due to an S3
client timeout. You can see from my screenshot that the checkpoint failed within a short time.

Regarding the configuration, do you mean passing it as the program's input arguments? I don't
think that will work; at the very least I would need a way to pass it on to the S3
filesystem builder in my program. Instead, I will ask for help with passing it via
flink-conf.yaml, because that is where I configure the global settings for the S3
filesystem, and I thought there might be a simple way to support this setting
like the other s3.xxx configs.

I very much appreciate your answer and help.

Best,
Tony Wei


Re: checkpoint failed due to s3 exception: request timeout

2018-08-28 Thread vino yang
Hi Tony,

A while ago, I answered a similar question. [1]

You can try increasing this value appropriately. You can't put this
configuration in flink-conf.yaml, but you can put it in the submit command of
the job [2], or in a configuration file that you specify.
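
As a rough sketch of what putting it in the submit command could look like, assuming the value
is passed as a program argument (the jar name is hypothetical, and as Tony notes in his reply
this only makes the value visible to user code, not automatically to the S3 filesystem):

// Submitted e.g. as: ./bin/flink run my-job.jar --s3.max-client-retries 10
import org.apache.flink.api.java.utils.ParameterTool;

public class SubmitArgExample {
    public static void main(String[] args) {
        ParameterTool params = ParameterTool.fromArgs(args);
        // The fallback value of 3 is only illustrative.
        int maxClientRetries = params.getInt("s3.max-client-retries", 3);
        System.out.println("Requested S3 client retries: " + maxClientRetries);
    }
}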

[1]: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Why-checkpoint-took-so-long-td22364.html#a22375
[2]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/cli.html

Thanks, vino.

Tony Wei wrote on Wed, Aug 29, 2018 at 11:36 AM:

> Hi,
>
> I met a checkpoint failure problem caused by an S3 exception.
>
> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>> Your socket connection to the server was not read from or written to within
>> the timeout period. Idle connections will be closed. (Service: Amazon S3;
>> Status Code: 400; Error Code: RequestTimeout; Request ID:
>> B8BE8978D3EFF3F5), S3 Extended Request ID:
>> ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=
>
>
> The full stack trace and screenshot are provided in the attachment.
>
> My settings for the Flink cluster and job:
>
>- flink version 1.4.0
>- standalone mode
>- 4 slots for each TM
>- presto s3 filesystem
>- rocksdb statebackend
>- local ssd
>- enable incremental checkpoint
>
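> As a minimal sketch of how the RocksDB state backend, incremental checkpoints and an S3
> checkpoint path are typically wired together in a job (the checkpoint interval and the
> bucket path below are placeholders):
>
> import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
> public class IncrementalCheckpointSetup {
>     public static void main(String[] args) throws Exception {
>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>         // Take a checkpoint every 60 seconds.
>         env.enableCheckpointing(60_000L);
>         // 'true' enables incremental checkpoints for the RocksDB state backend.
>         env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints", true));
>         // ... define sources, operators and sinks, then call env.execute(...) ...
>     }
> }
>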
> There is no weird message besides the exception in the log file, and no high ratio of GC
> during the checkpoint procedure. Also, 3 of the 4 parts still uploaded successfully on that
> TM. I didn't find anything that could be related to this failure. Did anyone meet this
> problem before?
>
> Besides, I also found an issue for another AWS SDK [1] that mentions this S3 exception as
> well. One reply said you can passively avoid the problem by raising the max client retries
> config, so I found that config in Presto [2]. Can I just add s3.max-client-retries: xxx in
> flink-conf.yaml to configure it? If not, how should I overwrite the default value of this
> configuration? Thanks in advance.
>
> Best,
> Tony Wei
>
> [1] https://github.com/aws/aws-sdk-php/issues/885
> [2] https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/HiveS3Config.java#L218
>