Hi Greg, you were right! After running dmesg I found "Out of memory: Kill process 13574 (java)". This is really strange because the JVM of the TM is very calm. Moreover, there are 7 GB of memory available (out of 32), but somehow the OS decides to start swapping and, when it runs out of swap space, it kills the Flink TM :(
Any idea of what's going on here?

On Wed, May 24, 2017 at 2:32 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> Hi Greg,
> I carefully monitored all TM memory with jstat -gcutil and there's no full
> GC, only .
> The initial situation on the dying TM is:
>
>   S0     S1      E      O      M    CCS   YGC   YGCT  FGC   FGCT    GCT
> 0.00 100.00  33.57  88.74  98.42  97.17   159  2.508    1  0.255  2.763
> 0.00 100.00  90.14  88.80  98.67  97.17   197  2.617    1  0.255  2.873
> 0.00 100.00  27.00  88.82  98.75  97.17   234  2.730    1  0.255  2.986
>
> After about 10 hours of processing it is:
>
> 0.00 100.00  21.74  83.66  98.52  96.94  5519 33.011    1  0.255 33.267
> 0.00 100.00  21.74  83.66  98.52  96.94  5519 33.011    1  0.255 33.267
> 0.00 100.00  21.74  83.66  98.52  96.94  5519 33.011    1  0.255 33.267
>
> So I don't think that OOM could be an option.
>
> However, the cluster is running on ESXi vSphere VMs and we have already
> experienced unexpected job crashes because of ESXi moving a heavily loaded
> VM to another (less loaded) physical machine. I wouldn't be surprised if
> swapping is also handled somehow differently.
> Looking at the Cloudera widgets I see that the crash is usually preceded by
> an intense cpu_iowait period.
> I fear that Flink's unsafe access to memory could be a problem in those
> scenarios. Am I wrong?
>
> Any insight or debugging technique is greatly appreciated.
> Best,
> Flavio
>
>
> On Wed, May 24, 2017 at 2:11 PM, Greg Hogan <c...@greghogan.com> wrote:
>
>> Hi Flavio,
>>
>> Flink handles interrupts so the only silent killer I am aware of is
>> Linux's OOM killer. Are you seeing such a message in dmesg?
>>
>> Greg
>>
>> On Wed, May 24, 2017 at 3:18 AM, Flavio Pompermaier <pomperma...@okkam.it>
>> wrote:
>>
>>> Hi to all,
>>> I'd like to know whether memory swapping could cause a taskmanager
>>> crash.
>>> In my cluster of virtual machines I'm seeing this strange behavior in my
>>> Flink cluster: sometimes, if memory gets swapped, the taskmanager (on that
>>> machine) dies unexpectedly without any log about the error.
>>>
>>> Is that possible or not?
>>>
>>> Best,
>>> Flavio
>>>
>>
>>