Hello Kostas,

Thanks for your time.

I started the job from scratch with the checkpoint interval set to 15
minutes. The first 13 checkpoints completed successfully, and failures
started with the 14th. I waited for about 20 more checkpoints, but all of
them failed. I then cancelled the job and restored it from the last
successful checkpoint, and there were no more issues.

Today I tried again, restoring from yesterday's last successful
checkpoint. This time the same error appeared from the first checkpoint
after the restore. I cancelled and restored once more, and there have been
no issues since (35 more checkpoints so far).

Regarding my job: it has 6 different S3 file source streams unioned
together, which are then connected to a 7th S3 file source used as a
broadcast stream. The sinks are S3 parquet files and Elasticsearch.
Checkpointing is incremental and uses RocksDB.
The broadcast stream is one of the new changes to my job. The previous
version, with 4 of those 6 sources, had been running well for more than a
month without any issue.
The TM/JM logs for today's first run (the failing one) are attached.
The YARN/EMR cluster is dedicated to this job.
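
For reference, the checkpoint setup described above would look roughly
like the flink-conf.yaml fragment below. This is only a sketch, not copied
from the actual job configuration: the S3 path is a hypothetical
placeholder, and the interval key assumes a Flink version that supports
setting it in configuration (otherwise it would be set in code via
enableCheckpointing).

```yaml
# Sketch of the checkpoint settings described in this mail (assumed, not the real config).
state.backend: rocksdb                      # RocksDB state backend
state.backend.incremental: true             # incremental checkpoints
state.checkpoints.dir: s3://my-bucket/flink/checkpoints   # hypothetical path
execution.checkpointing.interval: 15min     # assumes a version that supports this key
```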

I suspect the issue comes from the broadcast stream (as mentioned in the
documentation, broadcast state does not use RocksDB), but I am not sure.

Thanks and regards,
Averell

logs.gz
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/logs.gz>