[ 
https://issues.apache.org/jira/browse/FLINK-9061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416254#comment-16416254
 ] 

Jamie Grier edited comment on FLINK-9061 at 3/27/18 9:13 PM:
-------------------------------------------------------------

Yeah, so I completely agree that the response should be a 503 or, better yet, a 
429, but it's not.  I already ran this through the AWS support channels.  The 
response was essentially that this was "internally" a TooBusyException.  Here's 
their full response:
{quote}Based on the information provided, I understand that you are 
experiencing some internal errors (status code 500) from S3, which is impacting 
one of your Flink jobs. From the log dive on your provided request IDs, I 
observe that your PutObject request triggered Internal Error with 
TooBusyException. This happens when a bucket receives more requests than it can 
handle or is allowed [1]. By default, S3 limits 100 PUT/LIST/DELETE requests 
per second or more than 300 GET requests per second. So, if your workload is to 
exceed this limit, you'd need to scale your bucket through partitioning. 
Currently, your key space isn't randomized and all your keys include 
"BUCKET/SERVICE/flink/checkpoints/faa473252e9bf42d07f618923fa22af1/chk-13/". 
Therefore, your bucket isn't being automatically partitioned by S3 and you 
received increased error rates after your requests increased.
{quote}
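
For anyone who wants to see what the "randomized" key space they're describing 
looks like, here's a quick illustration (plain Java, nothing Flink-specific; 
the 4-hex-character hash prefix is just one possible scheme):

{code:java}
// Illustration only (plain Java, not Flink code). Today every key for a job
// shares one prefix, so S3 routes all of them to the same partition.
// Prefixing each key with a stable hash of the full path spreads the leading
// characters out.
public class PrefixDemo {
    public static void main(String[] args) {
        String base = "flink/checkpoints/faa473252e9bf42d07f618923fa22af1";
        for (int chk = 11; chk <= 14; chk++) {
            String key = base + "/chk-" + chk;
            // Stable hash of the full path; 4 hex chars = up to 65536 prefixes.
            String prefix = String.format("%04x", key.hashCode() & 0xffff);
            System.out.println(prefix + "/" + key);
        }
        // The keys now (almost always) start with different characters, so S3
        // can partition the bucket instead of funneling every PUT to one hot
        // partition.
    }
}
{code}

The important property is that the prefix is a pure function of the key, so 
anything that later needs to read the checkpoint back can recompute it.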
 


> S3 checkpoint data not partitioned well -- causes errors and poor performance
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-9061
>                 URL: https://issues.apache.org/jira/browse/FLINK-9061
>             Project: Flink
>          Issue Type: Bug
>          Components: FileSystem, State Backends, Checkpointing
>    Affects Versions: 1.4.2
>            Reporter: Jamie Grier
>            Priority: Critical
>
> I think we need to modify the way we write checkpoints to S3 for high-scale 
> jobs (those with many total tasks).  The issue is that we are writing all the 
> checkpoint data under a common key prefix.  This is the worst-case scenario 
> for S3 performance, since S3 uses the key prefix to partition data.
>  
> In the worst case, checkpoints fail with a 500 status code from S3 and an 
> internal error type of TooBusyException.
>  
> One possible solution would be to add a hook in the Flink filesystem code 
> that allows me to "rewrite" paths (rough sketch below).  For example, say I 
> have the checkpoint directory set to:
>  
> s3://bucket/flink/checkpoints
>  
> I would hook that and rewrite that path to:
>  
> s3://bucket/[HASH]/flink/checkpoints, where HASH is the hash of the original 
> path.
>  
> This would distribute the checkpoint write load around the S3 cluster evenly.
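>  
> A rough sketch of what that hook could look like (hypothetical names, not an 
> existing Flink API; plain java.net.URI just for illustration):
>  
> {code:java}
> // Hypothetical hook (illustrative, not an existing Flink API): rewrite the
> // configured checkpoint path by prepending a hash of the original path.
> // Because the hash is a pure function of the path, every reader and writer
> // derives the same physical location.
> import java.net.URI;
>
> public class HashPrefixRewriter {
>
>     /** s3://bucket/flink/checkpoints -> s3://bucket/<hash>/flink/checkpoints */
>     public static URI rewrite(URI original) {
>         String path = original.getPath();            // "/flink/checkpoints"
>         // Any stable digest works here, as long as it depends only on the
>         // original path.
>         String hash = Integer.toHexString(path.hashCode() & 0x7fffffff);
>         return URI.create(original.getScheme() + "://"
>                 + original.getAuthority() + "/" + hash + path);
>     }
>
>     public static void main(String[] args) {
>         System.out.println(rewrite(URI.create("s3://bucket/flink/checkpoints")));
>     }
> }
> {code}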
>  
> For reference: 
> https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/
>  
> Has anyone else hit this issue?  Any other ideas for solutions?  This is a 
> pretty serious problem for people trying to checkpoint to S3.
>  
> -Jamie
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
