Hi,

I am running some benchmarks using StateFun and have run into a problem with 
backpressure and slow checkpoints whose cause I can't figure out. I was hoping 
that someone might have an idea of what is causing it. My setup is 
the following:


I am running the Shopping Cart application from the StateFun playground. The 
job is submitted as an uber jar to an existing Flink Cluster with 3 
TaskManagers and 1 JobManager. The functions are served using the Undertow 
example from the documentation and I am using Kafka ingresses and egresses. My 
workload is only at 1000 events/s. Everything is run in separate GCP VMs.


The issue is very long checkpoints, which I assume are caused by a 
backpressured ingress, which in turn comes from the function dispatcher operator 
not being able to handle the workload. The only thing that has helped so far is 
increasing the parallelism of the job, but it feels like there is still some 
other bottleneck causing the issues. I have seen other benchmarks reach 
much higher throughput than 1000 events/s without more CPU or memory resources 
than I am using.


Any ideas about possible bottlenecks, or ways to track them down, would be 
greatly appreciated.


Best Regards,

Christopher Gustafson
