Hi Greg, you were right! After running dmesg I found "Out of memory: Kill
process 13574 (java)".
This is really strange because the JVM of the TM is very calm.
Moreover, there are 7 GB of memory available (out of 32) but somehow the OS
decides to start swapping and, when it runs out of available swap memory,
the OS decides to kill the Flink TM :(
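
One thing worth checking: jstat -gcutil only reports the Java heap
generations and metaspace, so memory allocated outside the heap would not
show up there. A minimal sketch for logging the resident and swapped size of
the TM process over time, assuming a Linux /proc filesystem (the function
names are hypothetical, and the pid is whatever the TM currently has):

```python
import re
import time


def parse_proc_status(text):
    """Extract the Vm* fields (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^(Vm\w+):\s+(\d+)\s+kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def read_memory(pid):
    """Read the current memory counters of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_proc_status(status_file.read())


def watch(pid, interval_seconds=60, samples=10):
    """Print resident (VmRSS) and swapped (VmSwap) size a few times."""
    for _ in range(samples):
        mem = read_memory(pid)
        print(time.strftime("%H:%M:%S"),
              "VmRSS kB:", mem.get("VmRSS"),
              "VmSwap kB:", mem.get("VmSwap"))
        time.sleep(interval_seconds)
```

Calling watch(13574) with the TM pid would show whether the resident size
keeps growing while the heap numbers from jstat stay flat.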

Any idea of what's going on here?

On Wed, May 24, 2017 at 2:32 PM, Flavio Pompermaier <pomperma...@okkam.it>
wrote:

> Hi Greg,
> I carefully monitored all TM memory with jstat -gcutil and there's no full
> GC, only young collections.
> The initial situation on the dying TM is:
>
>   S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
>   0.00 100.00  33.57  88.74  98.42  97.17    159    2.508     1    0.255    2.763
>   0.00 100.00  90.14  88.80  98.67  97.17    197    2.617     1    0.255    2.873
>   0.00 100.00  27.00  88.82  98.75  97.17    234    2.730     1    0.255    2.986
>
> After about 10 hours of processing it is:
>
>   0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267
>   0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267
>   0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267
>
> So I don't think that OOM could be an option.
>
> However, the cluster is running on ESXi vSphere VMs and we have already
> experienced unexpected job crashes because of ESXi moving a heavily loaded
> VM to another (less loaded) physical machine. I wouldn't be surprised if
> swapping is also handled somehow differently.
> Looking at Cloudera widgets I see that the crash is usually preceded by an
> intense cpu_iowait period.
> I fear that Flink's unsafe access to memory could be a problem in those
> scenarios. Am I wrong?
>
> Any insight or debugging technique would be greatly appreciated.
> Best,
> Flavio
>
>
> On Wed, May 24, 2017 at 2:11 PM, Greg Hogan <c...@greghogan.com> wrote:
>
>> Hi Flavio,
>>
>> Flink handles interrupts so the only silent killer I am aware of is
>> Linux's OOM killer. Are you seeing such a message in dmesg?
>>
>> Greg
>>
>> On Wed, May 24, 2017 at 3:18 AM, Flavio Pompermaier <pomperma...@okkam.it>
>> wrote:
>>
>>> Hi to all,
>>> I'd like to know whether memory swapping could cause a taskmanager
>>> crash.
>>> In my Flink cluster, which runs on virtual machines, I'm seeing this
>>> strange behavior: sometimes, if memory gets swapped, the taskmanager (on
>>> that machine) dies unexpectedly without any log about the error.
>>>
>>> Is that possible or not?
>>>
>>> Best,
>>> Flavio
>>>
>>
>>
