Re: Task Manager was lost/killed due to full GC

2017-10-17 Thread ShB
I just wanted to leave an update about this issue for anyone else who might come across it. The problem was with memory, but it was disk space, not heap/off-heap memory. YARN was killing off my containers as they exceeded the threshold for disk utilization, and this was manifesting as Task

Re: Task Manager was lost/killed due to full GC

2017-10-12 Thread ShB
On further investigation, it seems to me that the I/O exception I posted previously is not the cause of the TM being lost. It's the aftereffect of the TM being shut down and the channel being closed after a record is emitted but before it's processed. Previously, the logs didn't throw up this error and

Re: Task Manager was lost/killed due to full GC

2017-10-12 Thread ShB
Hi Stephan, Apologies, I hit send too soon on the last email. So, while trying to debug this, I ran it multiple times on different instance types (to increase the available RAM) and, while digging into the logs, I found this to be the error in the task manager logs: / java.lang.RuntimeException:

Re: Task Manager was lost/killed due to full GC

2017-10-12 Thread ShB
Hi Stephan, Thanks for your response! Task manager lost/killed has been a recurring problem I've had with Flink for the last few months, as I try to scale to larger and larger amounts of data. I would be very grateful for some help figuring out how I can avoid this. The program is set up

Re: Task Manager was lost/killed due to full GC

2017-09-21 Thread Stephan Ewen
Hi! The garbage collection stats actually look okay, not terribly bad - almost surprised that this seems to cause failures. Can you check whether you find messages in the TM / JM log about heartbeat timeouts, actor systems being "gated" or "quarantined"? Would also be interesting to know how

Re: Task Manager was lost/killed due to full GC

2017-09-19 Thread ShB
Thanks for your response! Recommendation to decrease allotted memory? Which allotted memory should be decreased? I tried decreasing taskmanager.memory.fraction to give more memory to user-managed operations, but that doesn't work beyond a point. I also tried increasing containerized.heap-cutoff-ratio,

Re: Task Manager was lost/killed due to full GC

2017-09-15 Thread Greg Hogan
Late response, but a common reason for disappearing TaskManagers is termination by the Linux out-of-memory killer, with the recommendation to decrease the allotted memory. > On Sep 5, 2017, at 9:09 AM, ShB wrote: > > Hi, > > I'm running a Flink batch job that

Task Manager was lost/killed due to full GC

2017-09-05 Thread ShB
Hi, I'm running a Flink batch job that reads almost 1 TB of data from S3 and then performs operations on it. A list of filenames is distributed among the TMs, and each TM reads its subset of files from S3. This job errors out at the read step due to the following error: