Re: Large backpressure and slow checkpoints in StateFun

2022-05-30 Thread yuxia
Maybe you can use jstack or a flame graph to analyze where the bottleneck is. 
BTW, for generating flame graphs, arthas[1] is a good tool. 

[1] https://github.com/alibaba/arthas 
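
For example, something roughly like this on a TaskManager host (run as the same 
user as the Flink process; the PID and file names below are only illustrative): 

    # take a thread dump of the TaskManager JVM
    jstack <taskmanager-pid> > taskmanager-threads.txt

    # or attach arthas and sample a flame graph for a while
    java -jar arthas-boot.jar    # then select the TaskManager process
    profiler start
    # ... let the workload run for 30-60 seconds ...
    profiler stop                # writes the flame graph to the arthas output directory

Roughly speaking, if most samples sit in the HTTP client waiting on the remote 
functions, the function side is the bottleneck; if they sit in Flink's own 
operators or the state backend, the Flink side is. 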

Best regards, 
Yuxia 


发件人: "Christopher Gustafson"  
收件人: "User"  
发送时间: 星期一, 2022年 5 月 30日 下午 2:29:19 
主题: Large backpressure and slow checkpoints in StateFun 



Hi, 




I am running some benchmarks using StateFun and have encountered a problem with 
backpressure and slow checkpoints that I can't figure out the reason for, and 
was hoping that someone might have an idea of what is causing it. My setup is 
the following: 



I am running the Shopping Cart application from the StateFun playground. The 
job is submitted as an uber jar to an existing Flink Cluster with 3 
TaskManagers and 1 JobManager. The functions are served using the Undertow 
example from the documentation and I am using Kafka ingresses and egresses. My 
workload is only at 1000 events/s. Everything is run in separate GCP VMs. 




The issue is with very long checkpoints, which I assume is caused by a 
backpressured ingress caused by the function dispatcher operator not being able 
to handle the workload. The only thing that has helped so far is to increase 
the parallelism of the job, but it feels like the still is some other 
bottleneck that is causing the issues. I have seen other benchmarks reaching 
much higher throughput than 1000 events/s, without more CPU or memory resources 
than I am using. 




Any ideas of bottlenecks or ways to figure them out are greatly appreciated. 




Best Regards, 

Christopher Gustafson 



Large backpressure and slow checkpoints in StateFun

2022-05-30 Thread Christopher Gustafson
Hi,


I am running some benchmarks using StateFun and have encountered a problem with 
backpressure and slow checkpoints that I can't figure out the reason for. I was 
hoping someone might have an idea of what is causing it. My setup is the 
following:


I am running the Shopping Cart application from the StateFun playground. The 
job is submitted as an uber jar to an existing Flink Cluster with 3 
TaskManagers and 1 JobManager. The functions are served using the Undertow 
example from the documentation and I am using Kafka ingresses and egresses. My 
workload is only at 1000 events/s. Everything is run in separate GCP VMs.
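
For reference, the module.yaml bindings look roughly like this (hosts, ids and 
topic names are placeholders, and the exact keys may differ slightly between 
StateFun versions):

    kind: io.statefun.endpoints.v2/http
    spec:
      functions: com.example/*
      urlPathTemplate: http://<function-host>:1108/
    ---
    kind: io.statefun.kafka.v1/ingress
    spec:
      id: com.example/my-ingress
      address: <kafka-broker>:9092
      consumerGroupId: shopping-cart
      topics:
        - topic: <input-topic>
          valueType: com.example/MyEvent
          targets:
            - com.example/shopping-cart
    ---
    kind: io.statefun.kafka.v1/egress
    spec:
      id: com.example/my-egress
      address: <kafka-broker>:9092
      deliverySemantic:
        type: at-least-once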


The issue is very long checkpoints, which I assume are caused by a backpressured 
ingress: the function dispatcher operator is not able to keep up with the 
workload. The only thing that has helped so far is increasing the parallelism of 
the job, but it feels like there is still some other bottleneck causing the 
issues. I have seen other benchmarks reach much higher throughput than 
1000 events/s without using more CPU or memory resources than I am.
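
For reference, the kind of settings I have been adjusting (or considering, in 
the case of unaligned checkpoints) in flink-conf.yaml look roughly like this, 
with illustrative values:

    # default operator parallelism for the job
    parallelism.default: 3
    # checkpoint cadence
    execution.checkpointing.interval: 60s
    # unaligned checkpoints let barriers overtake buffered records, so
    # checkpoints can complete even while the pipeline is backpressured
    execution.checkpointing.unaligned: true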


Any ideas about possible bottlenecks, or ways to track them down, would be greatly appreciated.


Best Regards,

Christopher Gustafson