Hi,
I have been trying to benchmark the end-to-end latency of a Flink 1.3.1
application, but I am confused about how much of that time is spent
inside Flink. In my setup, the data source and data sink live on
separate machines, in the following topology:
Machine 1 (data source, via a socket client) -> Machine 2 (Flink) ->
Machine 3 (data sink, via a socket server)
I observed 200-400 milliseconds of end-to-end latency, even though the
execution time of my stream transformations was no more than two
milliseconds, the socket-only networking latency between machines is no
more than one millisecond, and I used ptpd so that the clock offset
between machines was also no more than one millisecond.
Question: What took those hundreds of milliseconds?
Here are the details of my setting and my observation so far:
On Machine 2, I implemented a socket server as a data source for Flink
(by implementing SourceFunction), and I split the incoming stream into
several streams (via SplitStream) for some transformations (implementing
MapFunction and CoFlatMapFunction), whose results were written out to a
socket (using writeToSocket). I used C++11's chrono time library
(through JNI) to take timestamps and compute the elapsed times, and I
have verified that the overhead of timestamping this way is no more than
one millisecond.
I observed that, for four consecutive writes from Machine 1 spaced no
more than 0.3 milliseconds apart, Flink on Machine 2 received the first
write within 0.2 milliseconds, but then took 90 milliseconds to receive
the second write, another 4 milliseconds for the third, and yet another
4 milliseconds for the fourth. It then took more than 70 milliseconds
before Flink started processing my plan's first stream transformation,
and after my last transformation, more than 70 milliseconds passed
before the result was received at Machine 3.
Thank you,
Chao