You can follow the ticket https://issues.apache.org/jira/browse/FLINK-10243 
mentioned in that Stack Overflow question and set this parameter: 

"metrics.latency.granularity": 
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#metrics-latency-granularity
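
For example (a minimal sketch; as far as I understand the docs, "single" keeps 
one latency histogram per operator instead of one per source/operator 
combination, and latency tracking is only active when metrics.latency.interval 
is set to a positive value), you could put this into flink-conf.yaml or the 
FLINK_PROPERTIES block:

    metrics.latency.granularity: single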


You only have 1.688gb for your TaskManager, so I also suggest increasing the 
memory configuration; otherwise the test may still fail. 
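
For example, something like this in the taskmanager's FLINK_PROPERTIES block 
(the exact size is only an illustration; it roughly doubles the default 1728m 
process size, so pick whatever your host can actually spare):

    taskmanager.memory.process.size: 3456m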




> On 12 Aug 2022, at 10:52 PM, Yuriy Kutlunin 
> <yuriy.kutlu...@glowbyteconsulting.com> wrote:
> 
> Hello Yuan,
> 
> I don't override any default settings, docker-compose.yml:
> services:
>   jobmanager:
>     image: flink:1.15.1-java11
>     ports:
>       - "8081:8081"
>     command: jobmanager
>     environment:
>       - |
>         FLINK_PROPERTIES=
>         jobmanager.rpc.address: jobmanager
> 
>   taskmanager:
>     image: flink:1.15.1-java11
>     depends_on:
>       - jobmanager
>     command: taskmanager
>     ports:
>       - "8084:8084"
>     environment:
>       - |
>         FLINK_PROPERTIES=
>         jobmanager.rpc.address: jobmanager
>         taskmanager.numberOfTaskSlots: 2
>         metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
>         env.java.opts: -XX:+HeapDumpOnOutOfMemoryError
> 
> From TaskManager log:
> INFO  [] - Final TaskExecutor Memory configuration:
> INFO  [] -   Total Process Memory:          1.688gb (1811939328 bytes)
> INFO  [] -     Total Flink Memory:          1.250gb (1342177280 bytes)
> INFO  [] -       Total JVM Heap Memory:     512.000mb (536870912 bytes)
> INFO  [] -         Framework:               128.000mb (134217728 bytes)
> INFO  [] -         Task:                    384.000mb (402653184 bytes)
> INFO  [] -       Total Off-heap Memory:     768.000mb (805306368 bytes)
> INFO  [] -         Managed:                 512.000mb (536870912 bytes)
> INFO  [] -         Total JVM Direct Memory: 256.000mb (268435456 bytes)
> INFO  [] -           Framework:             128.000mb (134217728 bytes)
> INFO  [] -           Task:                  0 bytes
> INFO  [] -           Network:               128.000mb (134217728 bytes)
> INFO  [] -     JVM Metaspace:               256.000mb (268435456 bytes)
> INFO  [] -     JVM Overhead:                192.000mb (201326592 bytes)
> 
> I would prefer not to configure memory (at this point), because memory 
> consumption depends on the job structure, so it can always exceed the 
> configured values.
> 
> My next guess is that the problem is not the content of the metrics, but their 
> number, which increases with the number of operators. 
> So the next question is whether there is a way to exclude metric generation at 
> the operator level.
> I found the same question, without a correct answer, on Stack Overflow:
> https://stackoverflow.com/questions/54215245/apache-flink-limit-the-amount-of-metrics-exposed
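> 
> Would overriding the operator scope format help here? E.g. something like the 
> line below (just a guess on my side; as far as I can tell it only shortens the 
> metric identifiers by using operator IDs instead of the long generated operator 
> names, and doesn't reduce their number):
> 
>     metrics.scope.operator: <host>.taskmanager.<tm_id>.<job_name>.<operator_id>.<subtask_index>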
> 
> On Fri, Aug 12, 2022 at 4:05 AM yu'an huang <h.yuan...@gmail.com> wrote:
> Hi Yuriy, 
> 
> How do you set your TaskManager memory? I think 40MB is not significantly high 
> for Flink, and it's normal to see memory usage increase when you have more 
> parallelism or turn on additional metrics. You can try setting larger memory 
> for Flink as explained in the following document.
> 
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/mem_setup/
> 
> Best
> Yuan
> 
> 
> 
>> On 12 Aug 2022, at 12:51 AM, Yuriy Kutlunin 
>> <yuriy.kutlu...@glowbyteconsulting.com> wrote:
>> 
>> Hi all,
>> 
>> I'm running a Flink cluster in Session Mode via docker-compose, as described 
>> in the docs:
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/docker/#session-cluster-yml
>> 
>> After submitting a test job with many intermediate SQL operations (~500 
>> "select * from ..." steps) and metrics turned on (JMX or Prometheus), I got an 
>> OOM (java heap space) during the initialization stage.
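>> 
>> Roughly, the job is built like this (an illustrative sketch, not the exact 
>> statements; source1 and sink1 stand in for the real connectors):
>> 
>>   CREATE TEMPORARY VIEW mapping1 AS SELECT * FROM source1;
>>   CREATE TEMPORARY VIEW mapping2 AS SELECT * FROM mapping1;
>>   -- ... and so on, up to mapping500 ...
>>   INSERT INTO sink1 SELECT * FROM mapping500;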
>> 
>> Turning metrics off allows the job to get to the Running state.
>> Heap consumption also depends on parallelism - the same job succeeds when 
>> submitted with parallelism 1 instead of 2.
>> 
>> There are Task Manager logs for 4 cases:
>> JMX parallelism 1 (succeeded)
>> JMX parallelism 2 (failed)
>> Prometheus parallelism 2 (failed)
>> No metrics parallelism 2 (succeeded)
>> 
>> A post-OOM heap dump (JMX, parallelism 2) shows 2 main consumption points:
>> 1. A big value (40MB) for some task configuration
>> 2. Many instances (~270k) of some heavy (20KB) value in StreamConfig
>> 
>> It seems like all these heavy values are related to the weird task names, 
>> which include all the operations:
>> Received task Source: source1 -> SourceConversion[2001] -> mapping1 -> 
>> SourceConversion[2003] -> mapping2 -> SourceConversion[2005] -> ... -> 
>> mapping500 -> Sink: sink1 (1/1)#0 (1e089cf3b1581ea7c8fb1cd7b159e66b)
>> 
>> Looking for some way to overcome this heap issue.
>> 
>> -- 
>> Best regards,
>> Yuriy Kutlunin
>> <many_operators_parallelism_1_with_jmx.txt><many_operators_parallelism_2_with_jmx.txt><many_operators_parallelism_2_no_jmx.txt><many_operators_parallelism_2_with_prom.txt><heap_total.png><heap_task2_conf.png><heap_many_string_instances.png><heap_task1_conf.png>
> 
> 
> 
> -- 
> Best regards,
> Yuriy Kutlunin
