Hi All,

I am currently running a Flink job cluster (v1.8.2) on Kubernetes with 4
task manager pods, each running with a parallelism of 4.

The Flink job reads from a source topic with 96 partitions and applies a
per-element filter. The filter criterion comes from a broadcast topic, and
the latest message on that topic is always used as the criterion. The
filtered elements are then published to a sink topic.

There is no checkpointing or state involved.
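
For context, the topology looks roughly like the sketch below (simplified
for this email; the topic names, group id, and the contains() check are
placeholders, not the actual job code):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.util.Collector;

public class FilterJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        kafkaProps.setProperty("group.id", "filter-job");          // placeholder

        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("source-topic", new SimpleStringSchema(), kafkaProps));

        // broadcast the criteria topic so every parallel filter instance sees the latest message
        DataStream<String> criteria = env.addSource(
                new FlinkKafkaConsumer<>("criteria-topic", new SimpleStringSchema(), kafkaProps))
                .broadcast();

        events.connect(criteria)
                .flatMap(new CoFlatMapFunction<String, String, String>() {
                    private String latestCriterion; // plain field, no Flink state / checkpointing

                    @Override
                    public void flatMap1(String event, Collector<String> out) {
                        // forward the event only if it matches the latest criterion
                        if (latestCriterion == null || event.contains(latestCriterion)) {
                            out.collect(event);
                        }
                    }

                    @Override
                    public void flatMap2(String criterion, Collector<String> out) {
                        latestCriterion = criterion; // always keep the latest broadcast message
                    }
                })
                .addSink(new FlinkKafkaProducer<>("sink-topic", new SimpleStringSchema(), kafkaProps));

        env.execute("filter-job");
    }
}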

Then I started seeing "GC overhead limit exceeded" errors continuously, and
the pods keep restarting.

So I tried to increase the task manager heap size via the container args:

containers:
      - args:
        - task-manager
        - -Djobmanager.rpc.address=service-job-manager
        - -Dtaskmanager.heap.size=4096m
        - -Denv.java.opts.taskmanager="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/oom.bin"

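If I understand the -D dynamic properties correctly, this should be
equivalent to setting the following in flink-conf.yaml (listing it here in
case that assumption is wrong):

taskmanager.heap.size: 4096m
env.java.opts.taskmanager: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/oom.bin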

Three things I noticed:


1. The task manager heap size does not show up correctly in the web UI
(screenshot below).

[image: image.png]

2. I don't see the heap dump file at /dumps/oom.bin in the restarted pod.
Did I set the Java opts incorrectly?

3. I am continuously seeing the log below from all pods, and I am not sure
whether it causes any issue:
{"@timestamp":"2020-04-29T23:39:43.387Z","@version":"1","message":"[Consumer
clientId=consumer-1, groupId=aba774bc] Node 6 was unable to process the
fetch request with (sessionId=2054451921, epoch=474):
FETCH_SESSION_ID_NOT_FOUND.","logger_name":"org.apache.kafka.clients.FetchSessionHandler","thread_name":"pool-6-thread-1","level":"INFO","level_value":20000}

Thanks a lot for any help!

Best,
Eleanore
