[jira] [Updated] (FLINK-13940) S3RecoverableWriter causes job to get stuck in recovery

2020-01-31 Thread Kostas Kloudas (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kostas Kloudas updated FLINK-13940:
---
Fix Version/s: 1.8.2

> S3RecoverableWriter causes job to get stuck in recovery
> ---
>
> Key: FLINK-13940
> URL: https://issues.apache.org/jira/browse/FLINK-13940
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: Jimmy Weibel Rasmussen
>Assignee: Kostas Kloudas
>Priority: Major
> Fix For: 1.8.2, 1.9.1, 1.10.0
>
>
>
> The cleanup of temporary files in S3 introduced by this ticket/PR:
> https://issues.apache.org/jira/browse/FLINK-10963
> is preventing the Flink job from recovering under some circumstances.
>
> This is what seems to be happening:
> When the job tries to recover, it calls initializeState() on all operators,
> which results in the Bucket.restoreInProgressFile method being called.
> This downloads the part_tmp file referenced in the checkpoint we are restoring
> from, and finally it calls fsWriter.cleanupRecoverableState, which deletes the
> part_tmp file in S3.
> Now the open() method is called on all operators. If the open() call fails for
> one of the operators (this can happen if the issue that caused the job to fail
> and restart is still unresolved), the job fails again and tries to restart
> from the same checkpoint as before. This time, however, downloading the
> part_tmp file referenced in the checkpoint fails because it was deleted during
> the previous recovery attempt.
> The bug is critical because it results in data loss.
>
> I discovered the bug because I have a Flink job with a RabbitMQ source and a
> StreamingFileSink that writes to S3 (and therefore uses the
> S3RecoverableWriter).
> Occasionally I have RabbitMQ connection issues that cause the job to fail and
> restart, and sometimes the first few restart attempts fail because RabbitMQ is
> unreachable when Flink tries to reconnect.
>
> This is what I was seeing:
> RabbitMQ goes down.
> The job fails with a RabbitMQ ConsumerCancelledException.
> The job attempts to restart but fails with a RabbitMQ connection exception (x
> number of times).
> RabbitMQ comes back up.
> The job attempts to restart but fails with a FileNotFoundException because a
> _part_tmp file is missing in S3.
>
> The job is unable to restart, and the only option is to cancel and resubmit
> the job (and lose all state).
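
To make the reported sequence easier to follow, here is a minimal, illustrative Java sketch written against the public RecoverableWriter interface. It is not the actual Bucket/StreamingFileSink code; the method and its wiring are invented for illustration, but recover() and cleanupRecoverableState() are the two calls the description refers to.

{code:java}
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.core.fs.RecoverableFsDataOutputStream;
import org.apache.flink.core.fs.RecoverableWriter;

public final class RestoreOrderingSketch {

    // Illustrative only: mirrors the order of operations described above,
    // not the real Bucket.restoreInProgressFile implementation.
    static RecoverableFsDataOutputStream restoreInProgressFile(
            Path targetPath, RecoverableWriter.ResumeRecoverable resumable) throws Exception {

        RecoverableWriter writer =
                FileSystem.get(targetPath.toUri()).createRecoverableWriter();

        // Step 1: resume the in-progress part file from the state stored in the checkpoint.
        RecoverableFsDataOutputStream stream = writer.recover(resumable);

        // Step 2: eagerly clean up the temporary state behind the resumable (FLINK-10963).
        // For the S3 writer this deletes the part_tmp object. If any operator's open()
        // fails afterwards, the job restarts from the SAME checkpoint, recover() is
        // called again with the same resumable, and the restore now fails with a
        // FileNotFoundException -- the situation reported here.
        writer.cleanupRecoverableState(resumable);

        return stream;
    }
}
{code}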
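
For context, this is a minimal sketch of the kind of job described above: a RabbitMQ source feeding a StreamingFileSink that writes to S3. Host, credentials, queue name and bucket path are hypothetical placeholders; the relevant point is that writing to an s3:// path with checkpointing enabled is what brings the S3RecoverableWriter into play.

{code:java}
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.connectors.rabbitmq.RMQSource;
import org.apache.flink.streaming.connectors.rabbitmq.common.RMQConnectionConfig;

public class RabbitToS3Job {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing is what triggers the recoverable-writer restore path on recovery.
        env.enableCheckpointing(60_000L);

        // Hypothetical connection settings.
        RMQConnectionConfig rmqConfig = new RMQConnectionConfig.Builder()
                .setHost("rabbitmq.example.com")
                .setPort(5672)
                .setUserName("flink")
                .setPassword("secret")
                .setVirtualHost("/")
                .build();

        DataStream<String> events = env.addSource(
                new RMQSource<>(rmqConfig, "events-queue", new SimpleStringSchema()));

        // Writing to an s3:// path makes the sink use the S3RecoverableWriter internally.
        StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path("s3://my-bucket/output"),
                        new SimpleStringEncoder<String>("UTF-8"))
                .build();

        events.addSink(sink);
        env.execute("rabbitmq-to-s3");
    }
}
{code}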



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-13940) S3RecoverableWriter causes job to get stuck in recovery

2020-01-28 Thread Hequn Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hequn Cheng updated FLINK-13940:

Fix Version/s: 1.9.1

> S3RecoverableWriter causes job to get stuck in recovery
> ---
>
> Key: FLINK-13940
> URL: https://issues.apache.org/jira/browse/FLINK-13940
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: Jimmy Weibel Rasmussen
>Assignee: Kostas Kloudas
>Priority: Major
> Fix For: 1.9.1, 1.10.0
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-13940) S3RecoverableWriter causes job to get stuck in recovery

2020-01-28 Thread Hequn Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hequn Cheng updated FLINK-13940:

Fix Version/s: (was: 1.9.2)

> S3RecoverableWriter causes job to get stuck in recovery
> ---
>
> Key: FLINK-13940
> URL: https://issues.apache.org/jira/browse/FLINK-13940
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: Jimmy Weibel Rasmussen
>Assignee: Kostas Kloudas
>Priority: Major
> Fix For: 1.10.0
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-13940) S3RecoverableWriter causes job to get stuck in recovery

2019-09-28 Thread Jark Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jark Wu updated FLINK-13940:

Fix Version/s: (was: 1.9.1)
   1.9.2

> S3RecoverableWriter causes job to get stuck in recovery
> ---
>
> Key: FLINK-13940
> URL: https://issues.apache.org/jira/browse/FLINK-13940
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: Jimmy Weibel Rasmussen
>Assignee: Kostas Kloudas
>Priority: Major
> Fix For: 1.10.0, 1.9.2
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-13940) S3RecoverableWriter causes job to get stuck in recovery

2019-09-03 Thread Kostas Kloudas (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kostas Kloudas updated FLINK-13940:
---
Fix Version/s: (was: 1.8.2)

> S3RecoverableWriter causes job to get stuck in recovery
> ---
>
> Key: FLINK-13940
> URL: https://issues.apache.org/jira/browse/FLINK-13940
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: Jimmy Weibel Rasmussen
>Assignee: Kostas Kloudas
>Priority: Major
> Fix For: 1.10.0, 1.9.1
>
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (FLINK-13940) S3RecoverableWriter causes job to get stuck in recovery

2019-09-03 Thread Kostas Kloudas (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kostas Kloudas updated FLINK-13940:
---
Priority: Major  (was: Blocker)

> S3RecoverableWriter causes job to get stuck in recovery
> ---
>
> Key: FLINK-13940
> URL: https://issues.apache.org/jira/browse/FLINK-13940
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: Jimmy Weibel Rasmussen
>Assignee: Kostas Kloudas
>Priority: Major
> Fix For: 1.8.2, 1.10.0, 1.9.1
>
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (FLINK-13940) S3RecoverableWriter causes job to get stuck in recovery

2019-09-02 Thread Jark Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jark Wu updated FLINK-13940:

Fix Version/s: 1.8.2

> S3RecoverableWriter causes job to get stuck in recovery
> ---
>
> Key: FLINK-13940
> URL: https://issues.apache.org/jira/browse/FLINK-13940
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: Jimmy Weibel Rasmussen
>Priority: Blocker
> Fix For: 1.8.2
>
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (FLINK-13940) S3RecoverableWriter causes job to get stuck in recovery

2019-09-02 Thread Jark Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jark Wu updated FLINK-13940:

Fix Version/s: 1.9.1
   1.10.0

> S3RecoverableWriter causes job to get stuck in recovery
> ---
>
> Key: FLINK-13940
> URL: https://issues.apache.org/jira/browse/FLINK-13940
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: Jimmy Weibel Rasmussen
>Priority: Blocker
> Fix For: 1.8.2, 1.10.0, 1.9.1
>
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (FLINK-13940) S3RecoverableWriter causes job to get stuck in recovery

2019-09-02 Thread Kostas Kloudas (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kostas Kloudas updated FLINK-13940:
---
Priority: Blocker  (was: Major)

> S3RecoverableWriter causes job to get stuck in recovery
> ---
>
> Key: FLINK-13940
> URL: https://issues.apache.org/jira/browse/FLINK-13940
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: Jimmy Weibel Rasmussen
>Priority: Blocker
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (FLINK-13940) S3RecoverableWriter causes job to get stuck in recovery

2019-09-02 Thread Jimmy Weibel Rasmussen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Weibel Rasmussen updated FLINK-13940:
---
Description: 
 
 The cleanup of temporary files in S3 introduced by this ticket/PR:
 https://issues.apache.org/jira/browse/FLINK-10963
 is preventing the Flink job from recovering under some circumstances.

 This is what seems to be happening:
 When the job tries to recover, it calls initializeState() on all operators,
 which results in the Bucket.restoreInProgressFile method being called.
 This downloads the part_tmp file referenced in the checkpoint we are restoring
 from, and finally it calls fsWriter.cleanupRecoverableState, which deletes the
 part_tmp file in S3.

 Now the open() method is called on all operators. If the open() call fails for
 one of the operators (this can happen if the issue that caused the job to fail
 and restart is still unresolved), the job fails again and tries to restart from
 the same checkpoint as before. This time, however, downloading the part_tmp
 file referenced in the checkpoint fails because it was deleted during the
 previous recovery attempt.

 The bug is critical because it results in data loss.

 I discovered the bug because I have a Flink job with a RabbitMQ source and a
 StreamingFileSink that writes to S3 (and therefore uses the
 S3RecoverableWriter).
 Occasionally I have RabbitMQ connection issues that cause the job to fail and
 restart, and sometimes the first few restart attempts fail because RabbitMQ is
 unreachable when Flink tries to reconnect.

 This is what I was seeing:
 RabbitMQ goes down.
 The job fails with a RabbitMQ ConsumerCancelledException.
 The job attempts to restart but fails with a RabbitMQ connection exception (x
 number of times).
 RabbitMQ comes back up.
 The job attempts to restart but fails with a FileNotFoundException because a
 _part_tmp file is missing in S3.

 The job is unable to restart, and the only option is to cancel and resubmit
 the job (and lose all state).
  
  
  

  was:
 
  
  
 The cleanup of temporary files in S3 introduced by this ticket/PR:
 https://issues.apache.org/jira/browse/FLINK-10963
 is preventing the Flink job from recovering under some circumstances.

 This is what seems to be happening:
 When the job tries to recover, it calls initializeState() on all operators,
 which results in the Bucket.restoreInProgressFile method being called.
 This downloads the part_tmp file referenced in the checkpoint we are restoring
 from, and finally it calls fsWriter.cleanupRecoverableState, which deletes the
 part_tmp file in S3.

 Now the open() method is called on all operators. If the open() call fails for
 one of the operators (this can happen if the issue that caused the job to fail
 and restart is still unresolved), the job fails again and tries to restart from
 the same checkpoint as before.

 This time, however, downloading the part_tmp file referenced in the checkpoint
 fails because it was deleted during the previous recovery attempt.
 The bug is critical because it results in data loss.

 I discovered the bug because I have a Flink job with a RabbitMQ source and a
 StreamingFileSink that writes to S3 (and therefore uses the
 S3RecoverableWriter).
 Occasionally I have RabbitMQ connection issues that cause the job to fail and
 restart, and sometimes the first few restart attempts fail because RabbitMQ is
 unreachable when Flink tries to reconnect.

 This is what I was seeing:
 RabbitMQ goes down.
 The job fails with a RabbitMQ ConsumerCancelledException.
 The job attempts to restart but fails with a RabbitMQ connection exception (x
 number of times).
 RabbitMQ comes back up.
 The job attempts to restart but fails with a FileNotFoundException because a
 _part_tmp file is missing in S3.

 The job is unable to restart, and the only option is to cancel and resubmit
 the job (and lose all state).
  
  
  


> S3RecoverableWriter causes job to get stuck in recovery
> ---
>
> Key: FLINK-13940
> URL: https://issues.apache.org/jira/browse/FLINK-13940
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: Jimmy Weibel Rasmussen
>Priority: Major
>

[jira] [Updated] (FLINK-13940) S3RecoverableWriter causes job to get stuck in recovery

2019-09-02 Thread Jimmy Weibel Rasmussen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Weibel Rasmussen updated FLINK-13940:
---
Description: 
 
 The cleanup of temporary files in S3 introduced by this ticket/PR:
 https://issues.apache.org/jira/browse/FLINK-10963
 is preventing the Flink job from recovering under some circumstances.

 This is what seems to be happening:
 When the job tries to recover, it calls initializeState() on all operators,
 which results in the Bucket.restoreInProgressFile method being called.
 This downloads the part_tmp file referenced in the checkpoint we are restoring
 from, and finally it calls fsWriter.cleanupRecoverableState, which deletes the
 part_tmp file in S3.
 Now the open() method is called on all operators. If the open() call fails for
 one of the operators (this can happen if the issue that caused the job to fail
 and restart is still unresolved), the job fails again and tries to restart from
 the same checkpoint as before. This time, however, downloading the part_tmp
 file referenced in the checkpoint fails because it was deleted during the
 previous recovery attempt.

 The bug is critical because it results in data loss.

 I discovered the bug because I have a Flink job with a RabbitMQ source and a
 StreamingFileSink that writes to S3 (and therefore uses the
 S3RecoverableWriter).
 Occasionally I have RabbitMQ connection issues that cause the job to fail and
 restart, and sometimes the first few restart attempts fail because RabbitMQ is
 unreachable when Flink tries to reconnect.

 This is what I was seeing:
 RabbitMQ goes down.
 The job fails with a RabbitMQ ConsumerCancelledException.
 The job attempts to restart but fails with a RabbitMQ connection exception (x
 number of times).
 RabbitMQ comes back up.
 The job attempts to restart but fails with a FileNotFoundException because a
 _part_tmp file is missing in S3.

 The job is unable to restart, and the only option is to cancel and resubmit
 the job (and lose all state).
  
  
  

  was:
 
 The cleanup of temporary files in S3 introduced by this ticket/PR:
 https://issues.apache.org/jira/browse/FLINK-10963
 is preventing the Flink job from recovering under some circumstances.

 This is what seems to be happening:
 When the job tries to recover, it calls initializeState() on all operators,
 which results in the Bucket.restoreInProgressFile method being called.
 This downloads the part_tmp file referenced in the checkpoint we are restoring
 from, and finally it calls fsWriter.cleanupRecoverableState, which deletes the
 part_tmp file in S3.

 Now the open() method is called on all operators. If the open() call fails for
 one of the operators (this can happen if the issue that caused the job to fail
 and restart is still unresolved), the job fails again and tries to restart from
 the same checkpoint as before. This time, however, downloading the part_tmp
 file referenced in the checkpoint fails because it was deleted during the
 previous recovery attempt.

 The bug is critical because it results in data loss.

 I discovered the bug because I have a Flink job with a RabbitMQ source and a
 StreamingFileSink that writes to S3 (and therefore uses the
 S3RecoverableWriter).
 Occasionally I have RabbitMQ connection issues that cause the job to fail and
 restart, and sometimes the first few restart attempts fail because RabbitMQ is
 unreachable when Flink tries to reconnect.

 This is what I was seeing:
 RabbitMQ goes down.
 The job fails with a RabbitMQ ConsumerCancelledException.
 The job attempts to restart but fails with a RabbitMQ connection exception (x
 number of times).
 RabbitMQ comes back up.
 The job attempts to restart but fails with a FileNotFoundException because a
 _part_tmp file is missing in S3.

 The job is unable to restart, and the only option is to cancel and resubmit
 the job (and lose all state).
  
  
  


> S3RecoverableWriter causes job to get stuck in recovery
> ---
>
> Key: FLINK-13940
> URL: https://issues.apache.org/jira/browse/FLINK-13940
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: Jimmy Weibel Rasmussen
>Priority: Major
>

[jira] [Updated] (FLINK-13940) S3RecoverableWriter causes job to get stuck in recovery

2019-09-02 Thread Jimmy Weibel Rasmussen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Weibel Rasmussen updated FLINK-13940:
---
Description: 
 
  
  
 The cleanup of temporary files in S3 introduced by this ticket/PR:
 https://issues.apache.org/jira/browse/FLINK-10963
 is preventing the Flink job from recovering under some circumstances.

 This is what seems to be happening:
 When the job tries to recover, it calls initializeState() on all operators,
 which results in the Bucket.restoreInProgressFile method being called.
 This downloads the part_tmp file referenced in the checkpoint we are restoring
 from, and finally it calls fsWriter.cleanupRecoverableState, which deletes the
 part_tmp file in S3.

 Now the open() method is called on all operators. If the open() call fails for
 one of the operators (this can happen if the issue that caused the job to fail
 and restart is still unresolved), the job fails again and tries to restart from
 the same checkpoint as before.

 This time, however, downloading the part_tmp file referenced in the checkpoint
 fails because it was deleted during the previous recovery attempt.
 The bug is critical because it results in data loss.

 I discovered the bug because I have a Flink job with a RabbitMQ source and a
 StreamingFileSink that writes to S3 (and therefore uses the
 S3RecoverableWriter).
 Occasionally I have RabbitMQ connection issues that cause the job to fail and
 restart, and sometimes the first few restart attempts fail because RabbitMQ is
 unreachable when Flink tries to reconnect.

 This is what I was seeing:
 RabbitMQ goes down.
 The job fails with a RabbitMQ ConsumerCancelledException.
 The job attempts to restart but fails with a RabbitMQ connection exception (x
 number of times).
 RabbitMQ comes back up.
 The job attempts to restart but fails with a FileNotFoundException because a
 _part_tmp file is missing in S3.

 The job is unable to restart, and the only option is to cancel and resubmit
 the job (and lose all state).
  
  
  

  was:
 
 
 
The cleanup of temporary files in S3 introduced by this ticket/PR:
https://issues.apache.org/jira/browse/FLINK-10963
is preventing the Flink job from recovering under some circumstances.

This is what seems to be happening:
When the job tries to recover, it calls initializeState() on all operators,
which results in the Bucket.restoreInProgressFile method being called.
This downloads the part_tmp file referenced in the checkpoint we are restoring
from, and finally it calls fsWriter.cleanupRecoverableState, which deletes the
part_tmp file in S3.

Now the open() method is called on all operators. If the open() call fails for
one of the operators (this can happen if the issue that caused the job to fail
and restart is still unresolved), the job fails again and tries to restart from
the same checkpoint as before.

This time, however, downloading the part_tmp file referenced in the checkpoint
fails because it was deleted during the previous recovery attempt.

The bug is critical because it results in data loss.

I discovered the bug because I have a Flink job with a RabbitMQ source and a
StreamingFileSink that writes to S3 (and therefore uses the
S3RecoverableWriter).
Occasionally I have RabbitMQ connection issues that cause the job to fail and
restart, and sometimes the first few restart attempts fail because RabbitMQ is
unreachable when Flink tries to reconnect.

This is what I was seeing:
RabbitMQ goes down.
The job fails with a RabbitMQ ConsumerCancelledException.
The job attempts to restart but fails with a RabbitMQ connection exception (x
number of times).
RabbitMQ comes back up.
The job attempts to restart but fails with a FileNotFoundException because a
_part_tmp file is missing in S3.

The job is unable to restart, and the only option is to cancel and resubmit
the job (and lose all state).
 
 
 


> S3RecoverableWriter causes job to get stuck in recovery
> ---
>
> Key: FLINK-13940
> URL: https://issues.apache.org/jira/browse/FLINK-13940
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: Jimmy Weibel Rasmussen
>Priority: Major
>