Hi everyone,

My team is developing a streaming pipeline for analytics on top of market data. 
The ultimate goal is to be able to handle tens of millions of events per second 
distributed across the cluster by the unique ID of each financial 
instrument. Unfortunately, we are struggling to achieve acceptable 
performance. As far as I can see, Flink forcibly breaks operator chaining 
when it encounters a job graph node with multiple inputs. This severely 
hurts performance, because a network boundary is enforced and every event 
has to be serialised and deserialised on its way across it.

From the pipeline graph perspective, the requirements are:

  *   Read data from multiple Kafka topics, each feeding a different node in 
the graph (see the sketch after this list).
  *   Broadcast a number of dynamic rules to the pipeline.
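
To make the first requirement concrete, here is a stripped-down sketch of 
the shape of our job. The topic names, the bootstrap address, and the 
instrumentId() key extractor are placeholders, and the real functions hold 
state and do actual work:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

KafkaSource<String> trades = KafkaSource.<String>builder()
        .setBootstrapServers("kafka:9092")            // placeholder
        .setTopics("trades")                          // placeholder topic
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();
KafkaSource<String> quotes = KafkaSource.<String>builder()
        .setBootstrapServers("kafka:9092")
        .setTopics("quotes")                          // placeholder topic
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

KeyedStream<String, String> tradeStream = env
        .fromSource(trades, WatermarkStrategy.noWatermarks(), "trades")
        .keyBy(e -> instrumentId(e));                 // key = instrument ID
KeyedStream<String, String> quoteStream = env
        .fromSource(quotes, WatermarkStrategy.noWatermarks(), "quotes")
        .keyBy(e -> instrumentId(e));

// The two-input node below is where Flink breaks the chain: a network
// boundary appears and every event is serialised and deserialised.
tradeStream
        .connect(quoteStream)
        .process(new KeyedCoProcessFunction<String, String, String, String>() {
            @Override
            public void processElement1(String trade, Context ctx,
                                        Collector<String> out) {
                out.collect(trade);  // real logic joins against keyed state
            }
            @Override
            public void processElement2(String quote, Context ctx,
                                        Collector<String> out) {
                out.collect(quote);
            }
        });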

The cleanest way to achieve the first goal is a bunch of 
KeyedCoProcessFunction operations, as in the sketch above. This design 
didn't work for us: the SerDe overhead added by the broken chains was too 
high, so we had to completely flatten the pipeline instead. Unfortunately, I 
can't find any way to solve the second problem. As soon as a broadcast 
stream is introduced into the pipeline (see the second sketch below), 
performance tanks.
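
For reference, the broadcast side looks roughly like the sketch below 
(building on the streams above; ruleStream stands for our low-volume rules 
source, and the descriptor name and the "default" rule key are 
placeholders). The moment the connect() with the broadcast stream enters 
the graph, throughput collapses:

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;

MapStateDescriptor<String, String> rulesDescriptor =
        new MapStateDescriptor<>("rules", Types.STRING, Types.STRING);

BroadcastStream<String> broadcastRules = ruleStream.broadcast(rulesDescriptor);

tradeStream
        .connect(broadcastRules)  // <- chaining is broken here as well
        .process(new KeyedBroadcastProcessFunction<String, String, String, String>() {
            @Override
            public void processElement(String event, ReadOnlyContext ctx,
                                       Collector<String> out) throws Exception {
                // fast path: read the latest rule from read-only broadcast state
                String rule = ctx.getBroadcastState(rulesDescriptor).get("default");
                out.collect(event + "|" + rule);
            }
            @Override
            public void processBroadcastElement(String rule, Context ctx,
                                                Collector<String> out) throws Exception {
                // every parallel instance receives and stores each rule update
                ctx.getBroadcastState(rulesDescriptor).put("default", rule);
            }
        });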

Is there any technique that I could possibly utilise to preserve the chaining?

Kind regards,
Viacheslav
