Jay, this is great news! I'm happy to see progress on this front; it will help our use case immensely.
On Apr 4, 2012, at 8:38 AM, Jay Kreps <jay.kr...@gmail.com> wrote:

> I ran a quick timing test on the long-poll fetches in 0.8
> (https://issues.apache.org/jira/browse/KAFKA-48). This measures
> end-to-end latency of a send to the broker followed by a read by the
> consumer. Previously this number was actually pretty bad for us: we
> had a backoff of a few hundred ms in the consumer to avoid
> spin-waiting on data arrival, so on average you waited 50% of that.
> Bottom line is that after the long-poll patch the end-to-end latency
> is 0.93ms. This is not amazing by low-latency messaging standards, but
> it is absolutely incredible by the standards of high-throughput log
> aggregation systems, and there are a few ways this test is pessimistic
> (see below).
>
> This is using a flush.interval of 1 to avoid batching writes, because
> right now we don't hand out messages until post-flush. After
> replication I don't think we need this delay, because the durability
> guarantees will come from replication rather than disk flush, which is
> arguably better.
>
> This number is obviously important for people who hope for low-latency
> messaging. More importantly, it is also very closely related to the
> number we will see for quorum writes with replication. Obviously
> there is a big difference if a send() with acks > 1 takes 5ms, 50ms,
> or 500ms to get replicated and acknowledged. This send, replicate,
> and acknowledge loop is pretty similar to the fetch and consume loop
> I am testing, so we can probably put together a reasonable simulation
> of performance for different replication scenarios with this data.
>
> I think a sub-1ms produce/fetch path means we should definitely be
> able to get a sub-10ms replicated send().
>
> The test is something like:
>
>   loop {
>     start = System.nanoTime
>     producer.send(message)
>     iterator.next()
>     recordTime(System.nanoTime - start)
>   }
>
> This measurement is just over localhost, so there is no actual network
> latency, which is optimistic. There are two ways the test is
> pessimistic. First, as I mentioned, it includes a disk flush on each
> message because of the flush.interval. This may represent much of the
> time. Second, the send() call is now synchronous, so the consumer is
> actually blocked by the producer acknowledgement.
>
> -Jay
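For anyone who wants to experiment with the shape of this test, here is a self-contained sketch of that loop in Java. An in-process BlockingQueue stands in for the produce/fetch path (no broker is involved, so the numbers it prints say nothing about actual Kafka latency), but the timing structure mirrors the loop Jay describes:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RoundTripTimer {

    // Time `iterations` round trips through the stand-in "broker" and
    // return the average latency in milliseconds.
    static double measure(int iterations) throws InterruptedException {
        // Stand-in for the broker; a real test would use a producer and
        // a consumer iterator against a localhost 0.8 broker instead.
        BlockingQueue<byte[]> broker = new ArrayBlockingQueue<>(1);
        byte[] message = new byte[100];
        long totalNanos = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            broker.put(message);   // producer.send(message)
            broker.take();         // iterator.next()
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / (double) iterations / 1_000_000.0;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.printf("avg round trip: %.4f ms%n", measure(10_000));
    }
}
```

Swapping the queue for a real producer and consumer iterator pointed at a localhost broker would reproduce the measurement Jay describes, flush and acknowledgement costs included.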