Hi, 

I'm running a Flink batch job that reads almost 1 TB of data from S3 and
then performs operations on it. A list of filenames is distributed among
the TMs, and each TM reads its subset of files from S3. The job errors
out at the read step with the following error:
java.lang.Exception: TaskManager was lost/killed

Having read similar questions on the mailing list, it seems like this is a
memory issue, with full GC at the TM causing the TM to be lost. 

After enabling memory debugging, these are the stats just before the
failure:
Memory usage stats: [HEAP: 8327/18704/18704 MB, NON HEAP: 79/81/-1 MB (used/committed/max)]
Direct memory stats: Count: 5236, Total Capacity: 17148907, Used Memory: 17148908
Off-heap pool stats: [Code Cache: 25/27/240 MB (used/committed/max)], [Metaspace: 47/48/-1 MB (used/committed/max)], [Compressed Class Space: 5/5/1024 MB (used/committed/max)]
Garbage collector stats: [G1 Young Generation, GC TIME (ms): 16712, GC COUNT: 290], [G1 Old Generation, GC TIME (ms): 689, GC COUNT: 2]

I tried all of the suggested fixes: decreased taskmanager.memory.fraction
to leave more memory for user operations, increased the number of JVMs
(parallelism), and switched to the G1 GC for better GC performance, but
the job still errors out.

I also increased akka.watch.heartbeat.pause, akka.watch.threshold, and
akka.watch.heartbeat.interval to prevent the death-watch timeout during
long GC pauses, but that doesn't help either. I figured that with very
high death-watch values the job would just run slowly and eventually
complete, but it still fails.
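The death-watch settings I raised look roughly like this (again, example values rather than my exact ones):

```yaml
# Tolerate long GC pauses before declaring a TaskManager dead
akka.watch.heartbeat.interval: 30 s
akka.watch.heartbeat.pause: 300 s
akka.watch.threshold: 12
```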

I'm now trying to reduce object creation in my program, but so far that
hasn't helped either.

How can I go about debugging and fixing this problem?

Thank you. 
