Hi, I'm running a Flink batch job that reads almost 1 TB of data from S3 and then performs operations on it. A list of filenames is distributed among the TaskManagers, and each TM reads its subset of the files from S3. The job fails at the read step with the following error:

java.lang.Exception: TaskManager was lost/killed
Having read similar questions on the mailing list, this looks like a memory issue: long full-GC pauses at a TM cause its heartbeats to time out, so the TM is marked as lost. After enabling memory debugging, these are the stats just before the failure:

Memory usage stats: [HEAP: 8327/18704/18704 MB, NON HEAP: 79/81/-1 MB (used/committed/max)]
Direct memory stats: Count: 5236, Total Capacity: 17148907, Used Memory: 17148908
Off-heap pool stats: [Code Cache: 25/27/240 MB (used/committed/max)], [Metaspace: 47/48/-1 MB (used/committed/max)], [Compressed Class Space: 5/5/1024 MB (used/committed/max)]
Garbage collector stats: [G1 Young Generation, GC TIME (ms): 16712, GC COUNT: 290], [G1 Old Generation, GC TIME (ms): 689, GC COUNT: 2]

I tried the commonly suggested fixes:

- decreased taskmanager.memory.fraction to leave more memory for user code,
- increased the number of JVMs (parallelism),
- switched to the G1 GC for better GC performance,

but the job still errors out. I also increased akka.watch.heartbeat.pause, akka.watch.threshold, and akka.watch.heartbeat.interval to prevent the death-watch timeout caused by GC pauses, but this doesn't help either. I figured that with really high death-watch values the program would run very slowly but complete at some point, yet it fails anyway. I'm now trying to reduce object creation in my program, but so far that hasn't helped.

How can I go about debugging and fixing this problem? Thank you.

-- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
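For reference, these are the kinds of flink-conf.yaml entries I have been adjusting. The concrete values below are illustrative examples, not my exact production settings:

```yaml
# Lower the fraction of managed memory so more heap is left for user code
# (legacy batch-mode setting)
taskmanager.memory.fraction: 0.5

# Use G1 for shorter GC pauses
env.java.opts: "-XX:+UseG1GC"

# Relax the Akka death watch so long GC pauses don't get the TM marked as lost
akka.watch.heartbeat.interval: 30 s
akka.watch.heartbeat.pause: 300 s
akka.watch.threshold: 12

# Periodically log TM memory stats (this is how I obtained the stats above)
taskmanager.debug.memory.startLogThread: true
taskmanager.debug.memory.logIntervalMs: 5000
```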