We're about to start a 9-person-month PoC using Flink Streaming. Before we begin, I'd like to know how low an end-to-end latency I can expect for a single event (from source to sink).
Here is a very high-level description of our Flink design: we need at-least-once semantics, and our main application flow parses a message (< 50 microseconds) from Kafka, does a keyBy on the parsed event (< 1 KB), updates a very small user state on the KeyedStream, and then does another keyBy followed by an operator on that KeyedStream. Each operator is a very simple operation: very little computation and no I/O.

Our requirement is to get close to 1 ms (99th percentile) or lower for end-to-end processing (the timer starts once we receive the message from Kafka). Is this at all realistic if our flow contains two aggregations? If so, what optimizations might we need to get there in terms of cluster configuration (both Flink and hardware)? Our throughput is possibly small enough (40,000 events per second) that we could run on one node, which might eliminate some network latency.

I did read in https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html, under "Exactly Once vs. At Least Once", that a few milliseconds is considered super low latency, and I'm wondering if we can get lower. Any advice or 'war stories' are very welcome.

Thanks,
Hayden Marchant
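For context, the knob that usually matters most for sub-millisecond latency in Flink is the network buffer timeout, which defaults to 100 ms: records sit in output buffers until the buffer fills or the timeout fires. It can be set per job via StreamExecutionEnvironment.setBufferTimeout(...) or cluster-wide. Below is a rough sketch of the latency-related settings we have in mind; the config keys assume a recent Flink version, and the values are illustrative placeholders, not tested recommendations:

```
# flink-conf.yaml -- latency-oriented settings (illustrative values only)

# Flush network buffers immediately instead of waiting for them to fill;
# 0 trades some throughput for latency (default is 100 ms).
execution.buffer-timeout: 0ms

# At-least-once mode skips checkpoint barrier alignment, which is one
# source of added latency under exactly-once.
execution.checkpointing.mode: AT_LEAST_ONCE
execution.checkpointing.interval: 10s

# Enough slots on one TaskManager to run the whole pipeline on a single
# node, so the keyBy shuffles stay on local channels.
taskmanager.numberOfTaskSlots: 4
```

Does this look like the right set of levers to be pulling, or are there others (object reuse, chaining, GC tuning) that matter more at this scale?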