Maybe you can use jstack or a flame graph to find out where the bottleneck is.
BTW, for generating flame graphs, arthas [1] is a good tool.
[1] https://github.com/alibaba/arthas
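A flame graph is built from many sampled stack traces. The sketch below illustrates the same idea in a rough way. It uses the standard `ThreadMXBean` API to take repeated thread dumps of the JVM it runs in, which is roughly what jstack prints, and counts the most frequent top-of-stack frames of RUNNABLE threads. The class name `StackSampler` and the sampling parameters are hypothetical. It only sees its own JVM, so to inspect a TaskManager you would still point jstack (or arthas) at that process.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Hypothetical "poor man's profiler": repeated thread dumps of the current JVM,
// aggregated by the top stack frame of each runnable thread.
public class StackSampler {

    // Takes `samples` thread dumps, `intervalMillis` apart, and counts how often
    // each "Class.method" appears as the top frame of a RUNNABLE thread.
    public static Map<String, Integer> sample(int samples, long intervalMillis)
            throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            // false, false: skip monitor/synchronizer details, we only need stacks
            for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
                if (info == null || info.getThreadState() != Thread.State.RUNNABLE) {
                    continue;
                }
                StackTraceElement[] stack = info.getStackTrace();
                if (stack.length == 0) {
                    continue;
                }
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                counts.merge(top, 1, Integer::sum);
            }
            Thread.sleep(intervalMillis);
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> counts = sample(20, 50);
        // Print the ten hottest top-of-stack frames, most frequent first.
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

Frames that dominate across many samples are the candidates to look at. A real flame graph keeps the whole stack rather than only the top frame, which is why a dedicated tool gives a much clearer picture.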
Best regards,
Yuxia
From: "Christopher Gustafson"
To: "User"
Sent: Monday, May 30, 2022, 2:29:19 PM
Subject: Large backpressure and slow checkpoints in StateFun
Hi,
I am running some benchmarks using StateFun and have encountered a problem with
backpressure and slow checkpoints that I can't figure out the reason for, and
was hoping that someone might have an idea of what is causing it. My setup is
the following:
I am running the Shopping Cart application from the StateFun playground. The
job is submitted as an uber jar to an existing Flink cluster with 3
TaskManagers and 1 JobManager. The functions are served using the Undertow
example from the documentation, and I am using Kafka ingresses and egresses. My
workload is only 1000 events/s. Everything runs on separate GCP VMs.
The issue is very long checkpoints, which I assume are caused by backpressure
on the ingress, because the function dispatcher operator cannot keep up with the
workload. The only thing that has helped so far is increasing the parallelism of
the job, but it feels like there is still some other bottleneck causing the
issues. I have seen other benchmarks reach much higher throughput than
1000 events/s without using more CPU or memory resources than I am.
Any ideas about possible bottlenecks, or ways to track them down, are greatly appreciated.
Best Regards,
Christopher Gustafson