Re: Flink streaming job's taskmanager process killed by yarn nodemanager because of exceeding 'PHYSICAL' memory limit

2021-04-23 Thread Matthias Pohl
A few more questions: Have you had the chance to monitor/profile the memory usage? Which section of the memory was used excessively? Additionally, could @dhanesh arole's proposal solve your issue? Matthias
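One way to do the kind of native-memory profiling asked about here is JVM Native Memory Tracking. A minimal sketch, assuming access to flink-conf.yaml on the cluster and a JDK with jcmd on the TaskManager hosts; the PID is a placeholder:

  # flink-conf.yaml: enable Native Memory Tracking on the TaskManager JVMs
  env.java.opts.taskmanager: "-XX:NativeMemoryTracking=summary"

  # On the TaskManager host, inspect which memory section grows:
  jcmd <taskmanager-pid> VM.native_memory summary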

Re: Flink streaming job's taskmanager process killed by yarn nodemanager because of exceeding 'PHYSICAL' memory limit

2021-04-23 Thread Matthias Pohl
Thanks for sharing these details. Looking into FLINK-14952 [1] (which introduced this option) and the related mailing list thread [2], it feels like your issue is quite similar to what is described there, even though that issue sounds mostly tied to bounded jobs. But I'm not sure what
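For context, the discussion around FLINK-14952 concerns how bounded blocking shuffle data is materialized: mmap-based spill files count against the container's resident memory, which can trip YARN's physical memory check. A hedged sketch of the related setting, assuming Flink 1.10+ and that this is indeed the option being referred to:

  # flink-conf.yaml: prefer file-based over mmap-based blocking shuffle so
  # spilled shuffle data is not mapped into the TaskManager's address space
  taskmanager.network.blocking-shuffle.type: file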

Re: Flink streaming job's taskmanager process killed by yarn nodemanager because of exceeding 'PHYSICAL' memory limit

2021-04-22 Thread 马阳阳
Hi Matthias, We have “solved” the problem by tuning the join. But I will still try to answer the questions, hoping this will help. * What is the option you're referring to for the bounded shuffle? That might help to understand what streaming mode solution you're looking for.

Re: Flink streaming job's taskmanager process killed by yarn nodemanager because of exceeding 'PHYSICAL' memory limit

2021-04-22 Thread dhanesh arole
Hi, the questions that @matth...@ververica.com asked are very valid and might provide more leads. But if you haven't already, it's worth trying jemalloc / tcmalloc. We had similar problems with slow growth in TM memory resulting in pods getting OOMed by k8s. After switching to jemalloc,
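A minimal sketch of switching the TaskManagers to jemalloc on a YARN deployment, assuming jemalloc is installed on the NodeManager hosts; the library path below is a placeholder and depends on the distribution:

  # flink-conf.yaml: preload jemalloc into the TaskManager processes on YARN
  containerized.taskmanager.env.LD_PRELOAD: /usr/lib/x86_64-linux-gnu/libjemalloc.so.2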

Re: Flink streaming job's taskmanager process killed by yarn nodemanager because of exceeding 'PHYSICAL' memory limit

2021-04-22 Thread Matthias Pohl
Hi, I have a few questions about your case: * What is the option you're referring to for the bounded shuffle? That might help to understand what streaming mode solution you're looking for. * What does the job graph look like? Are you assuming that it's due to a shuffling operation? Could you

Flink streaming job's taskmanager process killed by yarn nodemanager because of exceeding 'PHYSICAL' memory limit

2021-04-16 Thread 马阳阳
Hi, community, When running a Flink streaming job with a large state size, one TaskManager process was killed by the YARN NodeManager. The following log is from the YARN NodeManager: 2021-04-16 11:51:23,013 WARN
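If the overrun comes from off-heap usage (for example RocksDB state or allocator fragmentation), a common first step is to give the container more non-JVM headroom. A hedged sketch with illustrative values only, not the poster's actual configuration:

  # flink-conf.yaml: illustrative sizing for a TaskManager with large RocksDB state
  taskmanager.memory.process.size: 4096m        # total budget that YARN enforces
  taskmanager.memory.jvm-overhead.max: 1g       # extra headroom for native allocations
  taskmanager.memory.managed.fraction: 0.4      # managed memory, used by RocksDB
  state.backend.rocksdb.memory.managed: true    # keep RocksDB within managed memory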