Hi,

I am running a streaming job (Flink 1.9) on EMR on YARN. The Flink web UI and
the metrics reported via Prometheus both show total memory usage within the
specified task manager memory of 3GB.

The metrics show the numbers below (in MB):
Heap - 577
Non Heap - 241
DirectMemoryUsed - 852

Non-heap rises gradually, starting around 210MB and reaching 241MB when
YARN kills the container. Heap fluctuates between 1.x GB and 0.6GB, and
DirectMemoryUsed is constant at 852MB.
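
For reference, these values are the TaskManager's Status.JVM.Memory.Heap.Used,
Status.JVM.Memory.NonHeap.Used and Status.JVM.Memory.Direct.MemoryUsed metrics,
scraped through the Prometheus reporter, which is configured roughly like this
in flink-conf.yaml (the port range here is just illustrative, not my exact
setting):

metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9249-9260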

Based on the configuration, these are the TM JVM params from the YARN logs:
-Xms1957m -Xmx1957m -XX:MaxDirectMemorySize=1115m

These are the other memory-related params configured in flink-conf:
yarn-cutoff - 270MB
Managed memory - 28MB
Network memory - 819MB
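
Putting the JVM params and these values together (my own arithmetic, so please
correct me if I am reading it wrong):

cutoff + network + managed = 270 + 819 + 28 = 1117 MB (~ the MaxDirectMemorySize of 1115m)
Xmx + MaxDirectMemorySize = 1957 + 1115 = 3072 MB, i.e. the full 3GB container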

The memory values above are from around the time the container is killed by
YARN with: <container-xxx> is running beyond physical memory limits.

Is there anything else that is not reported by Flink in its metrics, or have I
been misinterpreting them? As seen above, the total memory consumed is below
3GB.
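
(Adding up the reported values: 577 + 241 + 852 = 1670 MB, which is well under
the 3GB container limit.)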

The same behavior shows up when I run the job with 2GB, 2.7GB and now 3GB of
task manager memory. My job does have shuffles, as data from one operator is
sent to 4 other operators after filtering.

One more thing: I am running this with 3 YARN containers (2 tasks in each
container), with a total parallelism of 6. As soon as one container fails with
this error, the job restarts. However, within minutes the other 2 containers
also fail with the same error, one by one.

Thanks,
Hemant
