Are your containers on separate nodes? Are you running in Kubernetes? Have you set hard resource limits?
When I’ve run into this issue, it’s been because the JobManager was restarted (I wasn’t running in HA mode). Your node could have been restarted, or the process could have been OOM-killed if the machine was low on memory. Run `docker ps -a` to see whether your containers are restarting or have exited. Exit code 137 probably means that they were OOM-killed. I wouldn’t run the JobManager on the same node as TaskManagers unless you’re using hard resource limits.

Note: if you decide to go the hard resource limit route, know that the limit applies to the container’s total memory usage as accounted by its cgroup, not just the JVM heap. Off-heap allocations and resident pages of mmap’ed files count too (watch out for mmap).

> On Jul 8, 2017, at 1:54 AM, Chesnay Schepler <ches...@apache.org> wrote:
>
> If a TaskManager ran out of memory there should be something in the
> JobManager logs about an unreachable TaskManager.
> That said, there should also be something in the JobManager logs about the
> job disappearing...
>
> Could you set the logging level to DEBUG, run the job again, and provide us
> (or me directly) with the logs?
>
> Regards,
> Chesnay
>
> On 08.07.2017 08:44, G.S.Vijay Raajaa wrote:
>> Hi Chesnay,
>>
>> I am currently using Flink 1.3 in Docker containers. I am not using it
>> in HA mode. I have 3 task managers and one job manager. This happens
>> randomly and not every time. Does it mean the task manager ran out of memory
>> etc? I am using more slots than the available cores, I hope compute is
>> shared in round robin. Any pointers to tuning and HA setup will be greatly
>> appreciated.
>>
>> Regards,
>> Vijay Raajaa GS
>>
>> On Sat, Jul 8, 2017 at 12:04 PM, Chesnay Schepler <ches...@apache.org> wrote:
>> Hello,
>>
>> could you tell us a bit more about your setup? Which Flink version you're
>> using, whether HA is enabled, does this happen every time etc.
>>
>> Regards,
>> Chesnay
>>
>> On 06.07.2017 21:43, G.S.Vijay Raajaa wrote:
>> Hi,
>>
>> I am using Flink Task manager and Job Manager as docker containers.
>> Strangely, I find the jobs to disappear from the web portal after some time.
>> The jobs don't move to the failed state either. Any pointers will be really
>> helpful. Not able to get a clue from the logs.
>>
>> Kindly let me know if I need specific tuning and ways to persist the
>> uploaded jars.
>>
>> Regards,
>> Vijay Raajaa G S
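
P.S. In case it helps, here is a rough sketch of the container check from my reply (restarts, exit code 137, OOM kills). It reads the JSON that `docker inspect` prints and pulls out `RestartCount`, `State.ExitCode` and `State.OOMKilled`. The function names `summarize` and `inspect_containers` are made up for this example, and the container names in the usage note are placeholders, not anything from your setup.

```python
import json
import subprocess


def summarize(inspect_json):
    """Summarize restart and OOM state from `docker inspect` JSON output."""
    summaries = []
    for container in json.loads(inspect_json):
        state = container.get("State", {})
        exit_code = state.get("ExitCode", 0)
        summaries.append({
            # Docker reports names with a leading slash, e.g. "/jobmanager".
            "name": container.get("Name", "").lstrip("/"),
            "restart_count": container.get("RestartCount", 0),
            "exit_code": exit_code,
            "oom_killed": bool(state.get("OOMKilled", False)),
            # 137 = 128 + 9, i.e. the process was terminated by SIGKILL.
            "sigkilled": exit_code == 137,
        })
    return summaries


def inspect_containers(names):
    """Run `docker inspect` on the given containers (needs the Docker CLI)."""
    result = subprocess.run(
        ["docker", "inspect", *names],
        check=True,
        capture_output=True,
        text=True,
    )
    return summarize(result.stdout)
```

You would call it as `inspect_containers(["jobmanager", "taskmanager1"])` with your own container names. A container with `sigkilled` set but `oom_killed` unset was killed by something other than its own memory limit, so it is worth checking the host as well.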