[ 
https://issues.apache.org/jira/browse/FLINK-9450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492587#comment-16492587
 ] 

Stefan Richter edited comment on FLINK-9450 at 5/28/18 11:55 AM:
-----------------------------------------------------------------

Do you have logs for this problem, or thread dumps from a "hanging" TM? Can 
you also find out whether the unreachable S3 leads to any exception in the 
Presto client? You can configure whether a job fails or continues when a 
checkpoint fails, but it is unclear from your description whether the 
checkpoint actually fails or just waits on S3 access under the checkpointing 
lock. It is possible that the job will not continue even with asynchronous 
checkpoints, because the timer service snapshots are not async (yet; this will 
probably change in the next release), so that part of a checkpoint can block.
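
To illustrate the distinction drawn above between a checkpoint that waits on 
storage while holding the checkpointing lock and one that uploads after 
releasing it, here is a minimal Python sketch. It is not Flink code: the lock, 
the "store" event, and the function name are hypothetical stand-ins, and the 
unreachable S3 is simulated by an event that is never set.

```python
import threading
import time


def records_processed_during_snapshot(hold_lock_during_upload, upload_timeout=0.2):
    """Count records handled while a snapshot waits on an unreachable store."""
    checkpoint_lock = threading.Lock()   # stand-in for the checkpointing lock
    store_reachable = threading.Event()  # never set: the store stays unreachable
    stop = threading.Event()
    processed = [0]

    def process_records():
        # Every record is processed under the lock, like a task's main loop.
        while not stop.is_set():
            with checkpoint_lock:
                processed[0] += 1
            time.sleep(0.001)

    worker = threading.Thread(target=process_records)

    if hold_lock_during_upload:
        # Synchronous part: the upload wait happens while the lock is held,
        # so the worker cannot process anything until the wait times out.
        with checkpoint_lock:
            worker.start()
            store_reachable.wait(upload_timeout)
            seen = processed[0]
    else:
        # Asynchronous part: take the lock only briefly (state copy), then
        # wait on the store after releasing it.
        with checkpoint_lock:
            worker.start()
        store_reachable.wait(upload_timeout)
        seen = processed[0]

    stop.set()
    worker.join()
    return seen
```

Under these assumptions, `records_processed_during_snapshot(True)` returns 0 
because the worker never obtains the lock during the wait, while 
`records_processed_during_snapshot(False)` should return a positive count. A 
thread dump of the real TM would show which of the two situations applies.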


was (Author: srichter):
Do you have some logs for this problem, and/or a thread dump from a "hanging" 
TM, and/or can you figure out if the unreachable S3 leads to any exception in 
the Presto client? You can configure if a job fails or continues if a 
checkpoint fails, but it is unclear from your description if the checkpoint 
actually fails or just waits on S3 access under the checkpointing lock. It is 
possible that the job will not continue with asynchronous checkpoints because 
the timer service snapshots are not async (yet, will probably change in the 
next release) and that part of a checkpoint can therefore be blocking.

> Job hangs if S3 access is denied during checkpoints
> ---------------------------------------------------
>
>                 Key: FLINK-9450
>                 URL: https://issues.apache.org/jira/browse/FLINK-9450
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.4.2
>            Reporter: Elias Levy
>            Priority: Major
>
> We have a streaming job that consumes from and writes to Kafka.  The job is 
> configured to checkpoint to S3.  If we deny access to S3 by using iptables on 
> the TM host to deny all outgoing connections to ports 80 and 443, whether 
> using DROP or REJECT, and whether using REJECT with --reject-with tcp-reset or 
> --reject-with icmp-port-unreachable, the job soon stops publishing to Kafka.
> This happens whether or not the Kafka sources have 
> {{setCommitOffsetsOnCheckpoints}} set to true or false.
> The system is configured to use Presto for the S3 file system.  The job has a 
> small amount of state, so it is configured to use {{FsStateBackend}} with 
> asynchronous snapshots.
> If the iptables rules are removed, the job continues to function.
> I would expect the job to either fail or continue running if a checkpoint 
> fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
