Hi Till,
please see the screenshot of the heap dump: https://ibb.co/92Hzrpr
Thanks!
Eleanore

On Fri, Oct 23, 2020 at 9:25 AM Eleanore Jin <eleanore....@gmail.com> wrote:

> Hi Till,
> Thanks a lot for the prompt response, please see the information below.
>
> 1. How much memory is assigned to the JM pod?
> 6g for the container memory limit and 5g for jobmanager.heap.size; I think
> this is the only JM memory configuration available in Flink 1.10.2.
>
> 2. Have you tried newer Flink versions?
> I am actually using Apache Beam, and the latest Flink version it supports
> is 1.10.
>
> 3. What state backend is used?
> FsStateBackend, and the checkpoint size is around 12MB according to the
> checkpoint metrics, so I think the state is not getting inlined.
>
> 4. What is state.checkpoints.num-retained?
> I did not configure this explicitly, so by default only 1 should be
> retained.
>
> 5. Anything suspicious in the JM log?
> There is no Exception or Error; the only thing I see is the log line below
> repeating:
>
> {"@timestamp":"2020-10-23T16:05:20.350Z","@version":"1","message":"Disabling
> threads for Delete operation as thread count 0 is <=
> 1","logger_name":"org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor","thread_name":"jobmanager-future-thread-4","level":"WARN","level_value":30000}
>
> 6. JVM args obtained via jcmd
>
> -Xms5120m -Xmx5120m -XX:MaxGCPauseMillis=20 -XX:-OmitStackTraceInFastThrow
>
> 7. Heap info returned by jcmd <pid> GC.heap_info
>
> It suggests only about 1G of the heap is used:
>
> garbage-first heap   total 5242880K, used 1123073K [0x00000006c0000000, 0x0000000800000000)
>   region size 2048K, 117 young (239616K), 15 survivors (30720K)
> Metaspace       used 108072K, capacity 110544K, committed 110720K, reserved 1146880K
>   class space   used 12963K, capacity 13875K, committed 13952K, reserved 1048576K
>
> 8. top -p <pid>
>
> It suggests the Flink job manager Java process consumes 4.8G of physical memory:
>
>   PID USER  PR  NI    VIRT    RES   SHR S %CPU %MEM    TIME+ COMMAND
>     1 root  20   0 13.356g 4.802g 22676 S  6.0  7.6 37:48.62 java
>
> Thanks a lot!
> Eleanore
>
> On Fri, Oct 23, 2020 at 4:19 AM Till Rohrmann <trohrm...@apache.org>
> wrote:
>
>> Hi Eleanore,
>>
>> how much memory did you assign to the JM pod? Maybe the limit is so high
>> that it takes a bit of time until GC is triggered. Have you tried whether
>> the same problem also occurs with newer Flink versions?
>>
>> The difference between checkpoints enabled and disabled is that the JM
>> needs to do a bit more bookkeeping in order to track the completed
>> checkpoints. If you are using the HeapStateBackend, then all states smaller
>> than state.backend.fs.memory-threshold will get inlined, meaning that they
>> are sent to the JM and stored in the checkpoint meta file. This can
>> increase the memory usage of the JM process. Depending on
>> state.checkpoints.num-retained, this can grow as large as the number of
>> retained checkpoints times the checkpoint size. However, I doubt that this
>> adds up to several GB of additional space.
>>
>> In order to better understand the problem, the debug logs of your JM
>> could be helpful. Also, a heap dump might be able to point us towards the
>> component which is eating up so much memory.
>>
>> Cheers,
>> Till
>>
>> On Thu, Oct 22, 2020 at 4:56 AM Eleanore Jin <eleanore....@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I have a Flink job running version 1.10.2; it simply reads from a Kafka
>>> topic with 96 partitions and writes to another Kafka topic.
>>>
>>> It is running in k8s with 1 JM (not in HA mode) and 12 task managers,
>>> each with 4 slots. The checkpoints persist snapshots to Azure Blob
>>> Storage, with a checkpoint interval of 3 seconds, a 10-second timeout,
>>> and a minimum pause of 1 second.
>>>
>>> I observed that the job manager pod memory usage grows over time; any
>>> hints on why this is the case? The memory usage of the JM is also
>>> significantly higher than when checkpointing is disabled.
>>> [image: image.png]
>>>
>>> Thanks a lot!
>>> Eleanore
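[Editor's note] For reference, below is a minimal sketch of how the checkpoint settings from the original question (3s interval, 10s timeout, 1s minimum pause) and the FsStateBackend file-state threshold that Till mentions fit together in the plain Flink 1.10 DataStream API. This is not the poster's actual code: the job is defined through Apache Beam, where these values would typically come from flink-conf.yaml or the Flink runner's pipeline options, and the class name, wasbs:// URI, and 1024-byte threshold are illustrative assumptions. Note also that with the values reported in the thread (num-retained defaulting to 1 and ~12MB checkpoints), Till's upper bound of "retained checkpoints times checkpoint size" works out to roughly 12MB, nowhere near several GB.

import java.net.URI;

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint cadence described in the thread: 3s interval, 10s timeout, 1s minimum pause.
        env.enableCheckpointing(3_000);
        env.getCheckpointConfig().setCheckpointTimeout(10_000);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(1_000);

        // FsStateBackend with an explicit file-state threshold (bytes). State chunks smaller than
        // this threshold are inlined into the checkpoint metadata handled by the JobManager (the
        // behaviour Till describes); anything larger is written as a file to the checkpoint
        // directory. The wasbs:// URI below is a placeholder, not the poster's actual container.
        URI checkpointDir = URI.create("wasbs://checkpoints@myaccount.blob.core.windows.net/flink");
        env.setStateBackend(new FsStateBackend(checkpointDir, 1024));

        // ... the actual pipeline would be built and executed here (via Beam in the poster's setup).
    }
}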
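[Editor's note] On items 7 and 8 above (roughly 1G of heap used versus ~4.8G resident in top), a quick way to break the numbers down from inside the JVM is the standard java.lang.management MXBeans, as in the diagnostic sketch below. This is plain JDK API, not something the thread itself ran, and the class name is made up. Keep in mind that with -Xms5120m the heap is committed up front, so heap pages that have been touched can stay resident even when "used" is low, and metaspace plus direct buffers also count toward RES but not toward GC.heap_info's "used" figure.

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class JvmMemoryBreakdown {

    private static long mb(long bytes) {
        return bytes / (1024 * 1024);
    }

    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();

        // Heap numbers: roughly what "jcmd <pid> GC.heap_info" reported (~1G used of a 5G heap).
        System.out.printf("heap used/committed:     %d / %d MB%n",
                mb(mem.getHeapMemoryUsage().getUsed()),
                mb(mem.getHeapMemoryUsage().getCommitted()));

        // Non-heap covers metaspace, code cache, etc.
        System.out.printf("non-heap used/committed: %d / %d MB%n",
                mb(mem.getNonHeapMemoryUsage().getUsed()),
                mb(mem.getNonHeapMemoryUsage().getCommitted()));

        // Direct and mapped byte buffers live outside the heap entirely, so they show up in top's
        // RES column but not in the heap usage above -- one candidate for part of the gap.
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%-6s buffers used:     %d MB%n", pool.getName(), mb(pool.getMemoryUsed()));
        }
    }
}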