timing for async requests

Jay Kreps Wed, 04 Apr 2012 08:38:57 -0700

I ran a quick timing test on the long-poll fetches in 0.8
(https://issues.apache.org/jira/browse/KAFKA-48). This measures
end-to-end latency of a send to the broker followed by a read by the
consumer. Previously this number was actually pretty bad for us: we
had a backoff of a few hundred ms in the consumer to avoid
spin-waiting on data arrival, so on average a message waited half of
that backoff before it was read.
Bottom line is that after the long poll patch the end-to-end latency
is 0.93ms. This is not amazing by low-latency messaging standards but
absolutely incredible by the standards of high-throughput log
aggregation systems, and there are a few ways this test is pessimistic
(see below).
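
To make the "half the backoff" arithmetic concrete, here is a minimal
sketch, assuming messages arrive at a uniformly random point within a
backoff window. BackoffWait and its methods are hypothetical helpers
for illustration, not part of Kafka, and the 200ms value is only an
example since the actual backoff was "a few hundred ms".

```java
import java.util.Random;

// Hypothetical helper, not part of Kafka: estimates how long a message waits
// when the consumer sleeps for a fixed backoff between empty polls.
final class BackoffWait {
    // Analytic mean wait, assuming arrivals are uniform within a backoff window.
    static double expectedWaitMs(double backoffMs) {
        return backoffMs / 2.0;
    }

    // Monte Carlo estimate of the same quantity.
    static double simulatedWaitMs(double backoffMs, int trials, long seed) {
        Random rng = new Random(seed);
        double total = 0.0;
        for (int i = 0; i < trials; i++) {
            double arrival = rng.nextDouble() * backoffMs; // arrival offset within the window
            total += backoffMs - arrival;                  // time left until the next poll
        }
        return total / trials;
    }

    public static void main(String[] args) {
        double backoffMs = 200.0; // illustrative value only
        System.out.printf("analytic: %.1f ms, simulated: %.1f ms%n",
                expectedWaitMs(backoffMs), simulatedWaitMs(backoffMs, 1_000_000, 42L));
    }
}
```

With long polling the consumer blocks on the broker instead of
sleeping, so this waiting term drops out of the end-to-end number.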


This is using a flush.interval of 1 to avoid batching writes, because
right now we don't hand out messages to consumers until after they
have been flushed. Once we have replication I don't think we need
this delay, because the durability guarantees will come from
replication rather than disk flush, which is arguably better.

This number is obviously important for people who hope for low-latency
messaging. More importantly this is also very closely related to the
number we will see for the quorum writes with replication. Obviously
it makes a big difference whether a send() with acks > 1 takes 5ms,
50ms, or 500ms to get replicated and acknowledged. This
send, replicate, and acknowledge loop is pretty similar to the fetch
and consume loop I am testing, so we can probably put together a
reasonable simulation of performance for different replication
scenarios with this data.

I think a sub-1ms produce/fetch path means we should definitely be
able to get a sub-10ms replicated send().
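
As a rough illustration of the kind of simulation I mean, here is a
minimal sketch. It assumes each follower's replicate-and-acknowledge
hop costs about the same as one measured produce/fetch round trip, that
followers fetch independently and in parallel, and that a send
completes once a given number of followers have the message. All of
that is assumption, not how the replication code works, and
ReplicatedSendSim, followerAcks, and the sample values in main are
hypothetical placeholders rather than measured data.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical simulation helper, not part of Kafka.
final class ReplicatedSendSim {
    // One simulated send: each follower's hop time is resampled from the measured
    // round trips; the send completes when `followerAcks` followers have the message.
    static double oneSendMs(double[] measuredMs, int followers, int followerAcks, Random rng) {
        if (followerAcks < 1 || followerAcks > followers) {
            throw new IllegalArgumentException("followerAcks must be in [1, followers]");
        }
        double[] hops = new double[followers];
        for (int i = 0; i < followers; i++) {
            hops[i] = measuredMs[rng.nextInt(measuredMs.length)];
        }
        Arrays.sort(hops);
        return hops[followerAcks - 1]; // the followerAcks-th fastest follower
    }

    // Mean simulated send time over many trials.
    static double meanSendMs(double[] measuredMs, int followers, int followerAcks,
                             int trials, long seed) {
        Random rng = new Random(seed);
        double total = 0.0;
        for (int i = 0; i < trials; i++) {
            total += oneSendMs(measuredMs, followers, followerAcks, rng);
        }
        return total / trials;
    }

    public static void main(String[] args) {
        double[] samplesMs = {0.8, 0.9, 0.93, 1.0, 1.4}; // placeholder values, not real data
        System.out.printf("1 of 2 followers: %.3f ms%n", meanSendMs(samplesMs, 2, 1, 100_000, 7L));
        System.out.printf("2 of 2 followers: %.3f ms%n", meanSendMs(samplesMs, 2, 2, 100_000, 7L));
    }
}
```

Feeding in the real latency samples from the test and adding an
estimate for network latency would give a first-cut number for each
replication scenario.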

The test is something like this:
loop {
  start = System.nanoTime
  // synchronous send to the broker
  producer.send(message)
  // long-poll fetch: blocks until the message is available
  iterator.next()
  recordTime(System.nanoTime - start)
}
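
For anyone who wants to play with the shape of this loop without a
broker, here is a self-contained sketch. It assumes an in-process
stand-in: two BlockingQueues and a forwarding thread play the role of
the broker, put() stands in for producer.send(), and take() stands in
for the blocking iterator.next(). RoundTripTimer is a hypothetical
name and none of this is the Kafka API or the actual test harness.

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-in for the timing loop; it does not talk to Kafka.
final class RoundTripTimer {
    // Times `iterations` send-then-receive round trips, in nanoseconds.
    static long[] measure(int iterations) {
        BlockingQueue<byte[]> toBroker = new ArrayBlockingQueue<>(1);
        BlockingQueue<byte[]> toConsumer = new ArrayBlockingQueue<>(1);

        // Stand-in "broker": forwards each message to the consumer side as it arrives.
        Thread broker = new Thread(() -> {
            try {
                while (true) {
                    toConsumer.put(toBroker.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // exit quietly when stopped
            }
        });
        broker.setDaemon(true);
        broker.start();

        long[] samples = new long[iterations];
        byte[] message = new byte[100];
        try {
            for (int i = 0; i < iterations; i++) {
                long start = System.nanoTime();
                toBroker.put(message);  // plays the role of producer.send(message)
                toConsumer.take();      // plays the role of iterator.next(): blocks until data arrives
                samples[i] = System.nanoTime() - start;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while timing", e);
        } finally {
            broker.interrupt(); // stop the stand-in broker thread
        }
        return samples;
    }

    // Mean of the samples, converted from nanoseconds to milliseconds.
    static double meanMs(long[] samplesNs) {
        return Arrays.stream(samplesNs).average().orElse(0.0) / 1_000_000.0;
    }

    public static void main(String[] args) {
        long[] samples = measure(10_000);
        System.out.printf("mean round trip: %.4f ms%n", meanMs(samples));
    }
}
```

The numbers from this stand-in say nothing about Kafka itself, since
there is no socket, no log append, and no flush. It only shows the
measurement pattern: one blocking send, one blocking read, one sample.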

This measurement is just over localhost so there is no actual network
latency, which is optimistic. There are two ways the test is
pessimistic. First, as I mentioned, it includes a disk flush on each
message because of the flush.interval. This may represent much of the
time. Second, the send() call is now synchronous, so the consumer's
read does not start until the producer has received its acknowledgement.

-Jay
