How to avoid Kafka latency spikes caused by log segment flush

2022-09-02 Thread Jiří Holuša
Hi,


we're experiencing big latency spikes (two orders of magnitude) at the 99th 
percentile in our Kafka deployment. We googled a bit and found that this is a 
pretty well-documented phenomenon: 
https://issues.apache.org/jira/browse/KAFKA-9693


In the ticket, the suggested "solution" is disabling the log flush, but that's 
hardly acceptable if you care about durability.


We've tried tuning log segment sizes, flush intervals, etc., but that only 
delays the log flush and does nothing about the magnitude of the spike. I find 
it hard to believe that all the users of Kafka, the most popular message broker 
in the world, are OK with such latency spikes.
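
For context, these are the kinds of broker settings we've been adjusting; the 
values below are only illustrative, not a recommendation:

# how often the background flusher checks whether any log needs flushing
log.flush.scheduler.interval.ms=2000
# force an fsync after this many messages, or after this much time
log.flush.interval.messages=50000
log.flush.interval.ms=5000
# smaller segments roll more often, which moves the flush around but doesn't shrink it
log.segment.bytes=268435456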


Question

Is there any real solution or workaround to this problem? To be clear, I'm asking 
how to bring the spike down to a minimum.


BTW, I apologize for cross-posting, but I originally asked on StackOverflow 
(https://stackoverflow.com/questions/73555649/how-to-avoid-kafka-latency-spikes-caused-by-log-segment-flush)
and I'm really trying to get help.


Thanks,

Jiri





Re: kafka latency for large message

2019-03-19 Thread Nan Xu
That's very good information from the slides, thanks. Our design uses Kafka for
two purposes: one is as a cache, for which we use a KTable; the second is as a
message delivery mechanism to send data to other systems. Because we care very
much about latency, a KTable backed by a compacted topic suits us very well; if
we had to find another system to do the caching, a big change would be involved.
The approach described in the slides, which breaks the message into smaller
chunks and then reassembles them, seems like a viable solution.
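
For the chunking side, something like this rough sketch is what I have in mind
(the chunk size, topic and header names here are made up, not from the slides):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.UUID;

public class ChunkingSender {
    static final int CHUNK_SIZE = 1024 * 1024;  // 1 MB per chunk (illustrative)

    // Same key => same partition, so chunks stay in order and the consumer can
    // reassemble them by (messageId, chunkIndex, chunkCount).
    public static void send(KafkaProducer<String, byte[]> producer,
                            String topic, String key, byte[] payload) {
        String messageId = UUID.randomUUID().toString();
        int chunkCount = (payload.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        for (int i = 0; i < chunkCount; i++) {
            int from = i * CHUNK_SIZE;
            int to = Math.min(payload.length, from + CHUNK_SIZE);
            ProducerRecord<String, byte[]> record =
                new ProducerRecord<>(topic, key, Arrays.copyOfRange(payload, from, to));
            record.headers().add("messageId", messageId.getBytes(StandardCharsets.UTF_8));
            record.headers().add("chunkIndex", ByteBuffer.allocate(4).putInt(i).array());
            record.headers().add("chunkCount", ByteBuffer.allocate(4).putInt(chunkCount).array());
            producer.send(record);
        }
    }
}

The consumer would buffer chunks per messageId until chunkCount of them have
arrived and then concatenate them back into the original payload (record headers
need Kafka 0.11+).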

Do you know why Kafka doesn't have linear latency for big messages compared to
small ones? For a 2M message I see average latency of less than 10 ms, so for a
30M message (15x the size) I would expect latency under 10 * 20 = 200 ms.

On Mon, Mar 18, 2019 at 3:29 PM Bruce Markey  wrote:

> Hi Nan,
>
> Would you consider other approaches that may actually be a more efficient
> solution for you? There is a slide deck Handle Large Messages In Apache
> Kafka
> <
> https://www.slideshare.net/JiangjieQin/handle-large-messages-in-apache-kafka-58692297
> >.
> For messages this large, one of the approaches suggested is Reference Based
> Messaging where you write your large files to an external data store then
> produce a small Apache Kafka message with a reference for where to find the
> file. This would allow your consumer applications to find the file as
> needed rather than storing all that data in the event log.
>
> --  bjm
>
> On Thu, Mar 14, 2019 at 1:53 PM Xu, Nan  wrote:
>
> > Hi,
> >
> > We are using kafka to send messages and there is less than 1% of
> > message is very big, close to 30M. understanding kafka is not ideal for
> > sending big messages, because the large message rate is very low, we just
> > want let kafka do it anyway. But still want to get a reasonable latency.
> >
> > To test, I just setup up a topic test on a single broker local kafka,
> > with only 1 partition and 1 replica, using the following command
> >
> > ./kafka-producer-perf-test.sh  --topic test --num-records 200
> > --throughput 1 --record-size 3000 --producer.config
> > ../config/producer.properties
> >
> > Producer.config
> >
> > #Max 40M message
> > max.request.size=4000
> > buffer.memory=4000
> >
> > #2M buffer
> > send.buffer.bytes=200
> >
> > 6 records sent, 1.1 records/sec (31.00 MB/sec), 973.0 ms avg latency,
> > 1386.0 max latency.
> > 6 records sent, 1.0 records/sec (28.91 MB/sec), 787.2 ms avg latency,
> > 1313.0 max latency.
> > 5 records sent, 1.0 records/sec (27.92 MB/sec), 582.8 ms avg latency,
> > 643.0 max latency.
> > 6 records sent, 1.1 records/sec (30.16 MB/sec), 685.3 ms avg latency,
> > 1171.0 max latency.
> > 5 records sent, 1.0 records/sec (27.92 MB/sec), 629.4 ms avg latency,
> > 729.0 max latency.
> > 5 records sent, 1.0 records/sec (27.61 MB/sec), 635.6 ms avg latency,
> > 673.0 max latency.
> > 6 records sent, 1.1 records/sec (30.09 MB/sec), 736.2 ms avg latency,
> > 1255.0 max latency.
> > 5 records sent, 1.0 records/sec (27.62 MB/sec), 626.8 ms avg latency,
> > 685.0 max latency.
> > 5 records sent, 1.0 records/sec (28.38 MB/sec), 608.8 ms avg latency,
> > 685.0 max latency.
> >
> >
> > On the broker, I change the
> >
> > socket.send.buffer.bytes=2024000
> > # The receive buffer (SO_RCVBUF) used by the socket server
> > socket.receive.buffer.bytes=2224000
> >
> > and all others are default.
> >
> > I am a little surprised to see about 1 s max latency and average about
> 0.5
> > s. my understanding is kafka is doing the memory mapping for log file and
> > let system flush it. all the write are sequential. So flush should be not
> > affected by message size that much. Batching and network will take
> longer,
> > but those are memory based and local machine. my ssd should be far better
> > than 0.5 second. where the time got consumed? any suggestion?
> >
> > Thanks,
> > Nan
> >
> >
> >
> >
> >
> >
> >
> >
>


Re: kafka latency for large message

2019-03-18 Thread Bruce Markey
Hi Nan,

Would you consider other approaches that may actually be a more efficient
solution for you? There is a slide deck, Handle Large Messages In Apache Kafka
<https://www.slideshare.net/JiangjieQin/handle-large-messages-in-apache-kafka-58692297>.
For messages this large, one of the approaches suggested is Reference Based
Messaging where you write your large files to an external data store then
produce a small Apache Kafka message with a reference for where to find the
file. This would allow your consumer applications to find the file as
needed rather than storing all that data in the event log.
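
A minimal sketch of that pattern, assuming the external store is just a shared
path such as /mnt/blobstore (an object store works the same way; the path, topic
and broker address are placeholders):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;
import java.util.UUID;

public class ReferenceBasedSend {
    public static void main(String[] args) throws Exception {
        // 1. Put the large payload in the external store.
        Path largeFile = Paths.get(args[0]);
        Path stored = Paths.get("/mnt/blobstore", UUID.randomUUID() + ".bin");
        Files.copy(largeFile, stored);

        // 2. The Kafka record is only a small reference to where the payload lives.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("large-refs", stored.toString())).get();
        }
    }
}

Consumers then read the small record and fetch the file only when they actually
need it.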

--  bjm

On Thu, Mar 14, 2019 at 1:53 PM Xu, Nan  wrote:

> Hi,
>
> We are using kafka to send messages and there is less than 1% of
> message is very big, close to 30M. understanding kafka is not ideal for
> sending big messages, because the large message rate is very low, we just
> want let kafka do it anyway. But still want to get a reasonable latency.
>
> To test, I just setup up a topic test on a single broker local kafka,
> with only 1 partition and 1 replica, using the following command
>
> ./kafka-producer-perf-test.sh  --topic test --num-records 200
> --throughput 1 --record-size 3000 --producer.config
> ../config/producer.properties
>
> Producer.config
>
> #Max 40M message
> max.request.size=4000
> buffer.memory=4000
>
> #2M buffer
> send.buffer.bytes=200
>
> 6 records sent, 1.1 records/sec (31.00 MB/sec), 973.0 ms avg latency,
> 1386.0 max latency.
> 6 records sent, 1.0 records/sec (28.91 MB/sec), 787.2 ms avg latency,
> 1313.0 max latency.
> 5 records sent, 1.0 records/sec (27.92 MB/sec), 582.8 ms avg latency,
> 643.0 max latency.
> 6 records sent, 1.1 records/sec (30.16 MB/sec), 685.3 ms avg latency,
> 1171.0 max latency.
> 5 records sent, 1.0 records/sec (27.92 MB/sec), 629.4 ms avg latency,
> 729.0 max latency.
> 5 records sent, 1.0 records/sec (27.61 MB/sec), 635.6 ms avg latency,
> 673.0 max latency.
> 6 records sent, 1.1 records/sec (30.09 MB/sec), 736.2 ms avg latency,
> 1255.0 max latency.
> 5 records sent, 1.0 records/sec (27.62 MB/sec), 626.8 ms avg latency,
> 685.0 max latency.
> 5 records sent, 1.0 records/sec (28.38 MB/sec), 608.8 ms avg latency,
> 685.0 max latency.
>
>
> On the broker, I change the
>
> socket.send.buffer.bytes=2024000
> # The receive buffer (SO_RCVBUF) used by the socket server
> socket.receive.buffer.bytes=2224000
>
> and all others are default.
>
> I am a little surprised to see about 1 s max latency and average about 0.5
> s. my understanding is kafka is doing the memory mapping for log file and
> let system flush it. all the write are sequential. So flush should be not
> affected by message size that much. Batching and network will take longer,
> but those are memory based and local machine. my ssd should be far better
> than 0.5 second. where the time got consumed? any suggestion?
>
> Thanks,
> Nan
>
>
>
>
>
>
>
>


Re: kafka latency for large message

2019-03-18 Thread Mike Trienis
It takes time to send that much data over the network. Why would you expect
a smaller latency?

On Mon, Mar 18, 2019 at 8:05 AM Nan Xu  wrote:

> anyone can give some suggestion? or an explanation why kafka give a big
> latency for large payload.
>
> Thanks,
> Nan
>
> On Thu, Mar 14, 2019 at 3:53 PM Xu, Nan  wrote:
>
> > Hi,
> >
> > We are using kafka to send messages and there is less than 1% of
> > message is very big, close to 30M. understanding kafka is not ideal for
> > sending big messages, because the large message rate is very low, we just
> > want let kafka do it anyway. But still want to get a reasonable latency.
> >
> > To test, I just setup up a topic test on a single broker local kafka,
> > with only 1 partition and 1 replica, using the following command
> >
> > ./kafka-producer-perf-test.sh  --topic test --num-records 200
> > --throughput 1 --record-size 3000 --producer.config
> > ../config/producer.properties
> >
> > Producer.config
> >
> > #Max 40M message
> > max.request.size=4000
> > buffer.memory=4000
> >
> > #2M buffer
> > send.buffer.bytes=200
> >
> > 6 records sent, 1.1 records/sec (31.00 MB/sec), 973.0 ms avg latency,
> > 1386.0 max latency.
> > 6 records sent, 1.0 records/sec (28.91 MB/sec), 787.2 ms avg latency,
> > 1313.0 max latency.
> > 5 records sent, 1.0 records/sec (27.92 MB/sec), 582.8 ms avg latency,
> > 643.0 max latency.
> > 6 records sent, 1.1 records/sec (30.16 MB/sec), 685.3 ms avg latency,
> > 1171.0 max latency.
> > 5 records sent, 1.0 records/sec (27.92 MB/sec), 629.4 ms avg latency,
> > 729.0 max latency.
> > 5 records sent, 1.0 records/sec (27.61 MB/sec), 635.6 ms avg latency,
> > 673.0 max latency.
> > 6 records sent, 1.1 records/sec (30.09 MB/sec), 736.2 ms avg latency,
> > 1255.0 max latency.
> > 5 records sent, 1.0 records/sec (27.62 MB/sec), 626.8 ms avg latency,
> > 685.0 max latency.
> > 5 records sent, 1.0 records/sec (28.38 MB/sec), 608.8 ms avg latency,
> > 685.0 max latency.
> >
> >
> > On the broker, I change the
> >
> > socket.send.buffer.bytes=2024000
> > # The receive buffer (SO_RCVBUF) used by the socket server
> > socket.receive.buffer.bytes=2224000
> >
> > and all others are default.
> >
> > I am a little surprised to see about 1 s max latency and average about
> 0.5
> > s. my understanding is kafka is doing the memory mapping for log file and
> > let system flush it. all the write are sequential. So flush should be not
> > affected by message size that much. Batching and network will take
> longer,
> > but those are memory based and local machine. my ssd should be far better
> > than 0.5 second. where the time got consumed? any suggestion?
> >
> > Thanks,
> > Nan
> >
> >
> >
> >
> >
> >
> >
> >
>


-- 
Thanks, Mike


Re: kafka latency for large message

2019-03-18 Thread Nan Xu
Can anyone give a suggestion, or an explanation of why Kafka shows such high
latency for a large payload?

Thanks,
Nan

On Thu, Mar 14, 2019 at 3:53 PM Xu, Nan  wrote:

> Hi,
>
> We are using kafka to send messages and there is less than 1% of
> message is very big, close to 30M. understanding kafka is not ideal for
> sending big messages, because the large message rate is very low, we just
> want let kafka do it anyway. But still want to get a reasonable latency.
>
> To test, I just setup up a topic test on a single broker local kafka,
> with only 1 partition and 1 replica, using the following command
>
> ./kafka-producer-perf-test.sh  --topic test --num-records 200
> --throughput 1 --record-size 3000 --producer.config
> ../config/producer.properties
>
> Producer.config
>
> #Max 40M message
> max.request.size=4000
> buffer.memory=4000
>
> #2M buffer
> send.buffer.bytes=200
>
> 6 records sent, 1.1 records/sec (31.00 MB/sec), 973.0 ms avg latency,
> 1386.0 max latency.
> 6 records sent, 1.0 records/sec (28.91 MB/sec), 787.2 ms avg latency,
> 1313.0 max latency.
> 5 records sent, 1.0 records/sec (27.92 MB/sec), 582.8 ms avg latency,
> 643.0 max latency.
> 6 records sent, 1.1 records/sec (30.16 MB/sec), 685.3 ms avg latency,
> 1171.0 max latency.
> 5 records sent, 1.0 records/sec (27.92 MB/sec), 629.4 ms avg latency,
> 729.0 max latency.
> 5 records sent, 1.0 records/sec (27.61 MB/sec), 635.6 ms avg latency,
> 673.0 max latency.
> 6 records sent, 1.1 records/sec (30.09 MB/sec), 736.2 ms avg latency,
> 1255.0 max latency.
> 5 records sent, 1.0 records/sec (27.62 MB/sec), 626.8 ms avg latency,
> 685.0 max latency.
> 5 records sent, 1.0 records/sec (28.38 MB/sec), 608.8 ms avg latency,
> 685.0 max latency.
>
>
> On the broker, I change the
>
> socket.send.buffer.bytes=2024000
> # The receive buffer (SO_RCVBUF) used by the socket server
> socket.receive.buffer.bytes=2224000
>
> and all others are default.
>
> I am a little surprised to see about 1 s max latency and average about 0.5
> s. my understanding is kafka is doing the memory mapping for log file and
> let system flush it. all the write are sequential. So flush should be not
> affected by message size that much. Batching and network will take longer,
> but those are memory based and local machine. my ssd should be far better
> than 0.5 second. where the time got consumed? any suggestion?
>
> Thanks,
> Nan
>
>
>
>
>
>
>
>


kafka latency for large message

2019-03-14 Thread Xu, Nan
Hi, 
   
We are using Kafka to send messages, and less than 1% of the messages are very 
big, close to 30M. We understand Kafka is not ideal for sending big messages, but 
because the large-message rate is very low we just want to let Kafka handle them 
anyway, while still getting a reasonable latency.

To test, I set up a topic named test on a single-broker local Kafka, with only 
1 partition and 1 replica, using the following command:

./kafka-producer-perf-test.sh --topic test --num-records 200 --throughput 1 --record-size 30000000 --producer.config ../config/producer.properties

Producer.config

#Max 40M message
max.request.size=40000000
buffer.memory=40000000

#2M buffer
send.buffer.bytes=2000000

6 records sent, 1.1 records/sec (31.00 MB/sec), 973.0 ms avg latency, 1386.0 max latency.
6 records sent, 1.0 records/sec (28.91 MB/sec), 787.2 ms avg latency, 1313.0 max latency.
5 records sent, 1.0 records/sec (27.92 MB/sec), 582.8 ms avg latency, 643.0 max latency.
6 records sent, 1.1 records/sec (30.16 MB/sec), 685.3 ms avg latency, 1171.0 max latency.
5 records sent, 1.0 records/sec (27.92 MB/sec), 629.4 ms avg latency, 729.0 max latency.
5 records sent, 1.0 records/sec (27.61 MB/sec), 635.6 ms avg latency, 673.0 max latency.
6 records sent, 1.1 records/sec (30.09 MB/sec), 736.2 ms avg latency, 1255.0 max latency.
5 records sent, 1.0 records/sec (27.62 MB/sec), 626.8 ms avg latency, 685.0 max latency.
5 records sent, 1.0 records/sec (28.38 MB/sec), 608.8 ms avg latency, 685.0 max latency.


On the broker, I changed:

socket.send.buffer.bytes=2024000
# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=2224000

and all others are default.

I am a little surprised to see about 1 s max latency and an average of about 
0.5 s. My understanding is that Kafka memory-maps the log file and lets the OS 
flush it, and all writes are sequential, so the flush should not be affected by 
message size that much. Batching and the network will take longer, but those are 
memory-based and on the local machine, and my SSD should be far better than 
0.5 seconds. Where is the time being consumed? Any suggestions?
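
The only way I can think of to break this down so far is to dump the producer's
own metrics after the run (time spent sitting in the batching accumulator vs.
waiting on the broker); a rough sketch, with metric names taken from the standard
producer-metrics group (availability may vary by client version):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import java.util.Map;

public class DumpProducerMetrics {
    static void dump(KafkaProducer<?, ?> producer) {
        for (Map.Entry<MetricName, ? extends Metric> e : producer.metrics().entrySet()) {
            String name = e.getKey().name();
            // record-queue-time-avg: time in the accumulator; request-latency-avg: broker round trip
            if (name.equals("record-queue-time-avg") || name.equals("request-latency-avg")
                    || name.equals("batch-size-avg") || name.equals("bufferpool-wait-ratio")) {
                System.out.println(name + " = " + e.getValue().metricValue());
            }
        }
    }
}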

Thanks,
Nan







--
This message, and any attachments, is for the intended recipient(s) only, may 
contain information that is privileged, confidential and/or proprietary and 
subject to important terms and conditions available at 
http://www.bankofamerica.com/emaildisclaimer.   If you are not the intended 
recipient, please delete this message.


Kafka latency optimization

2017-11-03 Thread 陈江枫
Hi, everyone,
My Kafka version is 0.10.2.1.
My service has a really low QPS (1 msg/sec), and our requirement for RTT is
really strict (99.9% < 30 ms).
Currently I've encountered a problem: after Kafka has run for a long time, 15
days or so, performance starts to go down.
2017-10-21 was like:

Time               num of msgs   percentage
cost <= 2ms        0             0.000%
2ms < cost <= 1s   0             0.000%

But recently, it became:

Time               num of msgs   percentage
cost <= 2ms        0             0.000%
2ms < cost <= 1s   0             0.000%
When I check the logs, I don't see a way to find out why a specific message had
a high RTT. If there's any way to optimize (OS tuning, broker config), please
enlighten me.
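
On the client side, the closest thing I can think of is to timestamp each send
and log the round trip in the producer callback, something like this rough
sketch (broker address and topic are placeholders):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.Properties;

public class TimedSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            final long sentAtNanos = System.nanoTime();
            producer.send(new ProducerRecord<>("rtt-test", "key", "value"),
                (RecordMetadata md, Exception e) -> {
                    long micros = (System.nanoTime() - sentAtNanos) / 1000;
                    // per-record ack latency; flag anything over the 30 ms budget
                    System.out.printf("offset=%d ack_latency_us=%d%n",
                        e == null ? md.offset() : -1L, micros);
                });
            producer.flush();
        }
    }
}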


Re: kafka latency

2016-07-26 Thread Stevo Slavić
Hello Chao,

How did you measure latency?

See also
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
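
One crude but common way is to embed the send timestamp in the payload and diff
it on the consumer side; a minimal sketch (topic and broker address are
placeholders, and the producer and consumer clocks need to be in sync):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Collections;
import java.util.Properties;

public class EndToEndLatencyCheck {
    public static void main(String[] args) {
        String broker = "broker:9092";
        String topic = "latency-test";

        Properties p = new Properties();
        p.put("bootstrap.servers", broker);
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Properties c = new Properties();
        c.put("bootstrap.servers", broker);
        c.put("group.id", "latency-check");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(Collections.singletonList(topic));
            consumer.poll(0);  // join the group and get an assignment before producing

            // the payload is just the wall-clock send time in milliseconds
            producer.send(new ProducerRecord<>(topic, Long.toString(System.currentTimeMillis())));
            producer.flush();

            for (ConsumerRecord<String, String> r : consumer.poll(10000)) {
                long latencyMs = System.currentTimeMillis() - Long.parseLong(r.value());
                System.out.println("producer-to-consumer latency: " + latencyMs + " ms");
            }
        }
    }
}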

Kind regards,
Stevo Slavic.

On Tue, Jul 26, 2016 at 9:52 PM, Luo, Chao  wrote:

> Dear Kafka guys,
>
> I measured the latency of kafka system. The producer, kafka servers, and
> consumer were running on different machines (AWS EC2 instances). The
> producer-to-consumer latency is around 250 milliseconds. Is it a normal
> value for kafka system. Can I do better and how? Any comments or suggestion
> are highly appreciated!
>
>
> Best,
> Chao
>


kafka latency

2016-07-26 Thread Luo, Chao
Dear Kafka guys,

I measured the latency of a Kafka system. The producer, Kafka servers, and 
consumer were running on different machines (AWS EC2 instances). The 
producer-to-consumer latency is around 250 milliseconds. Is that a normal value 
for a Kafka system? Can I do better, and how? Any comments or suggestions are 
highly appreciated!


Best,
Chao


Re: Trying to figure out kafka latency issues

2014-12-30 Thread Jay Kreps

Re: Trying to figure out kafka latency issues

2014-12-30 Thread Rajiv Kurian

Re: Trying to figure out kafka latency issues

2014-12-30 Thread Rajiv Kurian

Re: Trying to figure out kafka latency issues

2014-12-30 Thread Jay Kreps

Re: Trying to figure out kafka latency issues

2014-12-30 Thread Rajiv Kurian

Re: Trying to figure out kafka latency issues

2014-12-29 Thread Jay Kreps
Hey Rajiv,

This sounds like a bug. The more info you can help us get, the easier it is to
fix. Things that would help:
1. Can you check if the request log on the servers shows latency spikes (in
which case it is a server problem)? A log4j sketch for enabling it is below.
2. It would be worth also getting the jmx stats on the producer as they
will show things like what percentage of time it is waiting for buffer
space etc.
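
For (1): the request log is separate from the main server log and is controlled
by log4j. Something like the lines below in the broker's config/log4j.properties
(the appender names follow the stock file shipped with Kafka and may differ in
your setup) logs every request with its time breakdown (totalTime, queueTime,
localTime, remoteTime, sendTime):

# config/log4j.properties on the broker -- raise the request logger from WARN to TRACE
log4j.appender.requestAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.requestAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.requestAppender.File=${kafka.logs.dir}/kafka-request.log
log4j.appender.requestAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.requestAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
log4j.logger.kafka.request.logger=TRACE, requestAppender
log4j.additivity.kafka.request.logger=false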

If your test is reasonably stand-alone it would be great to file a JIRA and
attach the test code and the findings you already have so someone can dig
into what is going on.

-Jay

On Sun, Dec 28, 2014 at 7:15 PM, Rajiv Kurian ra...@signalfuse.com wrote:

 Hi all,

 Bumping this up, in case some one has any ideas. I did yet another
 experiment where I create 4 producers and stripe the send requests across
 them in a manner such that any one producer only sees 256 partitions
 instead of the entire 1024. This seems to have helped a bit, and though I
 still see crazy high 99th (25-30 seconds), the median, mean, 75th and 95th
 percentile have all gone down.

 Thanks!

 On Sun, Dec 21, 2014 at 12:27 PM, Thunder Stumpges tstump...@ntent.com
 wrote:

  Ah I thought it was restarting the broker that made things better :)
 
  Yeah I have no experience with the Java client so can't really help
 there.
 
  Good luck!
 
  -Original Message-
  From: Rajiv Kurian [ra...@signalfuse.com]
  Received: Sunday, 21 Dec 2014, 12:25PM
  To: users@kafka.apache.org [users@kafka.apache.org]
  Subject: Re: Trying to figure out kafka latency issues
 
  I'll take a look at the GC profile of the brokers Right now I keep a tab
 on
  the CPU, Messages in, Bytes in, Bytes out, free memory (on the machine
 not
  JVM heap) free disk space on the broker. I'll need to take a look at the
  JVM metrics too. What seemed strange is that going from 8 - 512
 partitions
  increases the latency, but going fro 512- 8 does not decrease it. I have
  to restart the producer (but not the broker) for the end to end latency
 to
  go down That made it seem  that the fault was probably with the producer
  and not the broker. Only restarting the producer made things better. I'll
  do more extensive measurement on the broker.
 
  On Sun, Dec 21, 2014 at 9:08 AM, Thunder Stumpges tstump...@ntent.com
  wrote:
  
   Did you see my response and have you checked the server logs especially
   the GC logs? It still sounds like you are running out of memory on the
   broker. What is your max heap memory and are you thrashing once you
 start
   writing to all those partitions?
  
   You have measured very thoroughly from an external point of view, i
 think
   now you'll have to start measuring the internal metrics. Maybe someone
  else
   will have ideas on what jmx values to watch.
  
   Best,
   Thunder
  
  
   -Original Message-
   From: Rajiv Kurian [ra...@signalfuse.com]
   Received: Saturday, 20 Dec 2014, 10:24PM
   To: users@kafka.apache.org [users@kafka.apache.org]
   Subject: Re: Trying to figure out kafka latency issues
  
   Some more work tells me that the end to end latency numbers vary with
 the
   number of partitions I am writing to. I did an experiment, where based
  on a
   run time flag I would dynamically select how many of the *1024
  partitions*
   I write to. So say I decide I'll write to at most 256 partitions I mod
   whatever partition I would actually write to by 256. Basically the
 number
   of partitions for this topic on the broker remains the same at *1024*
   partitions but the number of partitions my producers write to changes
   dynamically based on a run time flag. So something like this:
  
   int partition = getPartitionForMessage(message);
   int maxPartitionsToWriteTo = maxPartitionsFlag.get();   // This flag
 can
  be
   updated without bringing the application down - just a volatile read of
   some number set externally.
   int moddedPartition = partition % maxPartitionsToWrite.
   // Send a message to this Kafka partition.
  
   Here are some interesting things I've noticed:
  
   i) When I start my client and it *never writes* to more than *8
   partitions *(same
   data rate but fewer partitions) - the end to end *99th latency is
 300-350
   ms*. Quite a bit of this (numbers in my previous emails) is the latency
   from producer - broker and the latency from broker - consumer. Still
   nowhere as poor as the *20 - 30* seconds I was seeing.
  
   ii) When I increase the maximum number of partitions, end to end
 latency
   increases dramatically. At *256 partitions* the end to end *99th
 latency
  is
   still 390 - 418 ms.* Worse than the latency figures for *8 *partitions,
  but
   not by much. When I increase this number to *512 partitions *the end
   to end *99th
   latency *becomes an intolerable *19-24 seconds*. At *1024* partitions
 the
   *99th
   latency is at 25 - 30 seconds*.
   A table of the numbers:
  
   Max number of partitions written to (out of 1024)
  
   End to end latency
  
   8
  
   300 - 350 ms
  
   256

Re: Trying to figure out kafka latency issues

2014-12-29 Thread Rajiv Kurian
Thanks Jay. Will check (1) and (2) and get back to you. The test is not
stand-alone now. It might be a bit of work to extract it to a stand-alone
executable. It might take me a bit of time to get that going.

On Mon, Dec 29, 2014 at 9:45 AM, Jay Kreps j...@confluent.io wrote:

 Hey Rajiv,

 This sounds like a bug. The more info you can help us get the easier to
 fix. Things that would help:
 1. Can you check if the the request log on the servers shows latency spikes
 (in which case it is a server problem)?
 2. It would be worth also getting the jmx stats on the producer as they
 will show things like what percentage of time it is waiting for buffer
 space etc.

 If your test is reasonably stand-alone it would be great to file a JIRA and
 attach the test code and the findings you already have so someone can dig
 into what is going on.

 -Jay

 On Sun, Dec 28, 2014 at 7:15 PM, Rajiv Kurian ra...@signalfuse.com
 wrote:

  Hi all,
 
  Bumping this up, in case some one has any ideas. I did yet another
  experiment where I create 4 producers and stripe the send requests across
  them in a manner such that any one producer only sees 256 partitions
  instead of the entire 1024. This seems to have helped a bit, and though I
  still see crazy high 99th (25-30 seconds), the median, mean, 75th and
 95th
  percentile have all gone down.
 
  Thanks!
 
  On Sun, Dec 21, 2014 at 12:27 PM, Thunder Stumpges tstump...@ntent.com
  wrote:
 
   Ah I thought it was restarting the broker that made things better :)
  
   Yeah I have no experience with the Java client so can't really help
  there.
  
   Good luck!
  
   -Original Message-
   From: Rajiv Kurian [ra...@signalfuse.com]
   Received: Sunday, 21 Dec 2014, 12:25PM
   To: users@kafka.apache.org [users@kafka.apache.org]
   Subject: Re: Trying to figure out kafka latency issues
  
   I'll take a look at the GC profile of the brokers Right now I keep a
 tab
  on
   the CPU, Messages in, Bytes in, Bytes out, free memory (on the machine
  not
   JVM heap) free disk space on the broker. I'll need to take a look at
 the
   JVM metrics too. What seemed strange is that going from 8 - 512
  partitions
   increases the latency, but going fro 512- 8 does not decrease it. I
 have
   to restart the producer (but not the broker) for the end to end latency
  to
   go down That made it seem  that the fault was probably with the
 producer
   and not the broker. Only restarting the producer made things better.
 I'll
   do more extensive measurement on the broker.
  
   On Sun, Dec 21, 2014 at 9:08 AM, Thunder Stumpges tstump...@ntent.com
 
   wrote:
   
Did you see my response and have you checked the server logs
 especially
the GC logs? It still sounds like you are running out of memory on
 the
broker. What is your max heap memory and are you thrashing once you
  start
writing to all those partitions?
   
You have measured very thoroughly from an external point of view, i
  think
now you'll have to start measuring the internal metrics. Maybe
 someone
   else
will have ideas on what jmx values to watch.
   
Best,
Thunder
   
   
-Original Message-
From: Rajiv Kurian [ra...@signalfuse.com]
Received: Saturday, 20 Dec 2014, 10:24PM
To: users@kafka.apache.org [users@kafka.apache.org]
Subject: Re: Trying to figure out kafka latency issues
   
Some more work tells me that the end to end latency numbers vary with
  the
number of partitions I am writing to. I did an experiment, where
 based
   on a
run time flag I would dynamically select how many of the *1024
   partitions*
I write to. So say I decide I'll write to at most 256 partitions I
 mod
whatever partition I would actually write to by 256. Basically the
  number
of partitions for this topic on the broker remains the same at *1024*
partitions but the number of partitions my producers write to changes
dynamically based on a run time flag. So something like this:
   
int partition = getPartitionForMessage(message);
int maxPartitionsToWriteTo = maxPartitionsFlag.get();   // This flag
  can
   be
updated without bringing the application down - just a volatile read
 of
some number set externally.
int moddedPartition = partition % maxPartitionsToWrite.
// Send a message to this Kafka partition.
   
Here are some interesting things I've noticed:
   
i) When I start my client and it *never writes* to more than *8
partitions *(same
data rate but fewer partitions) - the end to end *99th latency is
  300-350
ms*. Quite a bit of this (numbers in my previous emails) is the
 latency
from producer - broker and the latency from broker - consumer.
 Still
nowhere as poor as the *20 - 30* seconds I was seeing.
   
ii) When I increase the maximum number of partitions, end to end
  latency
increases dramatically. At *256 partitions* the end to end *99th
  latency
   is
still 390 - 418 ms.* Worse

Re: Trying to figure out kafka latency issues

2014-12-29 Thread Rajiv Kurian
Hi Jay,

Re (1) - I am not sure how to do this; actually, I am not sure what it means.
Is this the time at which every write/fetch request is received on the broker?
Do I need to enable some specific log level for this to show up? It doesn't
show up in the usual log. Is this information also available via JMX somehow?
Re (2) - Are you saying that I should instrument the "percentage of time
waiting for buffer space" stat myself? If so, how do I do that? Or are these
stats already output to JMX by the Kafka producer code? This seems like it's in
the internals of the Kafka producer client code.

Thanks again!


On Mon, Dec 29, 2014 at 10:22 AM, Rajiv Kurian ra...@signalfuse.com wrote:

 Thanks Jay. Will check (1) and (2) and get back to you. The test is not
 stand-alone now. It might be a bit of work to extract it to a stand-alone
 executable. It might take me a bit of time to get that going.

 On Mon, Dec 29, 2014 at 9:45 AM, Jay Kreps j...@confluent.io wrote:

 Hey Rajiv,

 This sounds like a bug. The more info you can help us get the easier to
 fix. Things that would help:
 1. Can you check if the the request log on the servers shows latency
 spikes
 (in which case it is a server problem)?
 2. It would be worth also getting the jmx stats on the producer as they
 will show things like what percentage of time it is waiting for buffer
 space etc.

 If your test is reasonably stand-alone it would be great to file a JIRA
 and
 attach the test code and the findings you already have so someone can dig
 into what is going on.

 -Jay

 On Sun, Dec 28, 2014 at 7:15 PM, Rajiv Kurian ra...@signalfuse.com
 wrote:

  Hi all,
 
  Bumping this up, in case some one has any ideas. I did yet another
  experiment where I create 4 producers and stripe the send requests
 across
  them in a manner such that any one producer only sees 256 partitions
  instead of the entire 1024. This seems to have helped a bit, and though
 I
  still see crazy high 99th (25-30 seconds), the median, mean, 75th and
 95th
  percentile have all gone down.
 
  Thanks!
 
  On Sun, Dec 21, 2014 at 12:27 PM, Thunder Stumpges tstump...@ntent.com
 
  wrote:
 
   Ah I thought it was restarting the broker that made things better :)
  
   Yeah I have no experience with the Java client so can't really help
  there.
  
   Good luck!
  
   -Original Message-
   From: Rajiv Kurian [ra...@signalfuse.com]
   Received: Sunday, 21 Dec 2014, 12:25PM
   To: users@kafka.apache.org [users@kafka.apache.org]
   Subject: Re: Trying to figure out kafka latency issues
  
   I'll take a look at the GC profile of the brokers Right now I keep a
 tab
  on
   the CPU, Messages in, Bytes in, Bytes out, free memory (on the machine
  not
   JVM heap) free disk space on the broker. I'll need to take a look at
 the
   JVM metrics too. What seemed strange is that going from 8 - 512
  partitions
   increases the latency, but going fro 512- 8 does not decrease it. I
 have
   to restart the producer (but not the broker) for the end to end
 latency
  to
   go down That made it seem  that the fault was probably with the
 producer
   and not the broker. Only restarting the producer made things better.
 I'll
   do more extensive measurement on the broker.
  
   On Sun, Dec 21, 2014 at 9:08 AM, Thunder Stumpges 
 tstump...@ntent.com
   wrote:
   
Did you see my response and have you checked the server logs
 especially
the GC logs? It still sounds like you are running out of memory on
 the
broker. What is your max heap memory and are you thrashing once you
  start
writing to all those partitions?
   
You have measured very thoroughly from an external point of view, i
  think
now you'll have to start measuring the internal metrics. Maybe
 someone
   else
will have ideas on what jmx values to watch.
   
Best,
Thunder
   
   
-Original Message-
From: Rajiv Kurian [ra...@signalfuse.com]
Received: Saturday, 20 Dec 2014, 10:24PM
To: users@kafka.apache.org [users@kafka.apache.org]
Subject: Re: Trying to figure out kafka latency issues
   
Some more work tells me that the end to end latency numbers vary
 with
  the
number of partitions I am writing to. I did an experiment, where
 based
   on a
run time flag I would dynamically select how many of the *1024
   partitions*
I write to. So say I decide I'll write to at most 256 partitions I
 mod
whatever partition I would actually write to by 256. Basically the
  number
of partitions for this topic on the broker remains the same at
 *1024*
partitions but the number of partitions my producers write to
 changes
dynamically based on a run time flag. So something like this:
   
int partition = getPartitionForMessage(message);
int maxPartitionsToWriteTo = maxPartitionsFlag.get();   // This flag
  can
   be
updated without bringing the application down - just a volatile
 read of
some number set externally.
int moddedPartition = partition

Re: Trying to figure out kafka latency issues

2014-12-29 Thread Rajiv Kurian
Never mind about (2). I see these stats are already being output by the Kafka
producer. I've attached a couple of screenshots (I couldn't copy-paste from
jvisualvm). Does anything there strike you as odd? The bufferpool-wait-ratio
sadly shows up as NaN.

I still don't know how to figure out (1).

Thanks!


On Mon, Dec 29, 2014 at 3:02 PM, Rajiv Kurian ra...@signalfuse.com wrote:

 Hi Jay,

 Re (1) - I am not sure how to do this? Actually I am not sure what this
 means. Is this the time every write/fetch request is received on the
 broker? Do I need to enable some specific log level for this to show up? It
 doesn't show up in the usual log. Is this information also available via
 jmx somehow?
 Re (2) - Are you saying that I should instrument the percentage of time
 waiting for buffer space stat myself? If so how do I do this. Or are these
 stats already output to jmx by the kafka producer code. This seems like
 it's in the internals of the kafka producer client code.

 Thanks again!


 On Mon, Dec 29, 2014 at 10:22 AM, Rajiv Kurian ra...@signalfuse.com
 wrote:

 Thanks Jay. Will check (1) and (2) and get back to you. The test is not
 stand-alone now. It might be a bit of work to extract it to a stand-alone
 executable. It might take me a bit of time to get that going.

 On Mon, Dec 29, 2014 at 9:45 AM, Jay Kreps j...@confluent.io wrote:

 Hey Rajiv,

 This sounds like a bug. The more info you can help us get the easier to
 fix. Things that would help:
 1. Can you check if the the request log on the servers shows latency
 spikes
 (in which case it is a server problem)?
 2. It would be worth also getting the jmx stats on the producer as they
 will show things like what percentage of time it is waiting for buffer
 space etc.

 If your test is reasonably stand-alone it would be great to file a JIRA
 and
 attach the test code and the findings you already have so someone can dig
 into what is going on.

 -Jay

 On Sun, Dec 28, 2014 at 7:15 PM, Rajiv Kurian ra...@signalfuse.com
 wrote:

  Hi all,
 
  Bumping this up, in case some one has any ideas. I did yet another
  experiment where I create 4 producers and stripe the send requests
 across
  them in a manner such that any one producer only sees 256 partitions
  instead of the entire 1024. This seems to have helped a bit, and
 though I
  still see crazy high 99th (25-30 seconds), the median, mean, 75th and
 95th
  percentile have all gone down.
 
  Thanks!
 
  On Sun, Dec 21, 2014 at 12:27 PM, Thunder Stumpges 
 tstump...@ntent.com
  wrote:
 
   Ah I thought it was restarting the broker that made things better :)
  
   Yeah I have no experience with the Java client so can't really help
  there.
  
   Good luck!
  
   -Original Message-
   From: Rajiv Kurian [ra...@signalfuse.com]
   Received: Sunday, 21 Dec 2014, 12:25PM
   To: users@kafka.apache.org [users@kafka.apache.org]
   Subject: Re: Trying to figure out kafka latency issues
  
   I'll take a look at the GC profile of the brokers Right now I keep a
 tab
  on
   the CPU, Messages in, Bytes in, Bytes out, free memory (on the
 machine
  not
   JVM heap) free disk space on the broker. I'll need to take a look at
 the
   JVM metrics too. What seemed strange is that going from 8 - 512
  partitions
   increases the latency, but going fro 512- 8 does not decrease it. I
 have
   to restart the producer (but not the broker) for the end to end
 latency
  to
   go down That made it seem  that the fault was probably with the
 producer
   and not the broker. Only restarting the producer made things better.
 I'll
   do more extensive measurement on the broker.
  
   On Sun, Dec 21, 2014 at 9:08 AM, Thunder Stumpges 
 tstump...@ntent.com
   wrote:
   
Did you see my response and have you checked the server logs
 especially
the GC logs? It still sounds like you are running out of memory on
 the
broker. What is your max heap memory and are you thrashing once you
  start
writing to all those partitions?
   
You have measured very thoroughly from an external point of view, i
  think
now you'll have to start measuring the internal metrics. Maybe
 someone
   else
will have ideas on what jmx values to watch.
   
Best,
Thunder
   
   
-Original Message-
From: Rajiv Kurian [ra...@signalfuse.com]
Received: Saturday, 20 Dec 2014, 10:24PM
To: users@kafka.apache.org [users@kafka.apache.org]
Subject: Re: Trying to figure out kafka latency issues
   
Some more work tells me that the end to end latency numbers vary
 with
  the
number of partitions I am writing to. I did an experiment, where
 based
   on a
run time flag I would dynamically select how many of the *1024
   partitions*
I write to. So say I decide I'll write to at most 256 partitions I
 mod
whatever partition I would actually write to by 256. Basically the
  number
of partitions for this topic on the broker remains the same at
 *1024*
partitions but the number of partitions my

Re: Trying to figure out kafka latency issues

2014-12-29 Thread Rajiv Kurian
In case the attachments don't work out, here is an imgur link:
http://imgur.com/NslGpT3,Uw6HFow#0

On Mon, Dec 29, 2014 at 3:13 PM, Rajiv Kurian ra...@signalfuse.com wrote:

 Never mind about (2). I see these stats are already being output by the
 kafka producer. I've attached a couple of screenshots (couldn't copy paste
 from jvisualvm ). Do any of these things strike as odd? The
 bufferpool-wait-ratio sadly shows up as a NaN.

 I still don't know how to figure out (1).

 Thanks!


 On Mon, Dec 29, 2014 at 3:02 PM, Rajiv Kurian ra...@signalfuse.com
 wrote:

 Hi Jay,

 Re (1) - I am not sure how to do this? Actually I am not sure what this
 means. Is this the time every write/fetch request is received on the
 broker? Do I need to enable some specific log level for this to show up? It
 doesn't show up in the usual log. Is this information also available via
 jmx somehow?
 Re (2) - Are you saying that I should instrument the percentage of time
 waiting for buffer space stat myself? If so how do I do this. Or are these
 stats already output to jmx by the kafka producer code. This seems like
 it's in the internals of the kafka producer client code.

 Thanks again!


 On Mon, Dec 29, 2014 at 10:22 AM, Rajiv Kurian ra...@signalfuse.com
 wrote:

 Thanks Jay. Will check (1) and (2) and get back to you. The test is not
 stand-alone now. It might be a bit of work to extract it to a stand-alone
 executable. It might take me a bit of time to get that going.

 On Mon, Dec 29, 2014 at 9:45 AM, Jay Kreps j...@confluent.io wrote:

 Hey Rajiv,

 This sounds like a bug. The more info you can help us get the easier to
 fix. Things that would help:
 1. Can you check if the request log on the servers shows latency
 spikes
 (in which case it is a server problem)?
 2. It would be worth also getting the jmx stats on the producer as they
 will show things like what percentage of time it is waiting for buffer
 space etc.

 If your test is reasonably stand-alone it would be great to file a JIRA
 and
 attach the test code and the findings you already have so someone can
 dig
 into what is going on.

 -Jay

 On Sun, Dec 28, 2014 at 7:15 PM, Rajiv Kurian ra...@signalfuse.com
 wrote:

  Hi all,
 
  Bumping this up, in case some one has any ideas. I did yet another
  experiment where I create 4 producers and stripe the send requests
 across
  them in a manner such that any one producer only sees 256 partitions
  instead of the entire 1024. This seems to have helped a bit, and
 though I
  still see crazy high 99th (25-30 seconds), the median, mean, 75th and
 95th
  percentile have all gone down.
 
  Thanks!
 
  On Sun, Dec 21, 2014 at 12:27 PM, Thunder Stumpges 
 tstump...@ntent.com
  wrote:
 
   Ah I thought it was restarting the broker that made things better :)
  
   Yeah I have no experience with the Java client so can't really help
  there.
  
   Good luck!
  
   -Original Message-
   From: Rajiv Kurian [ra...@signalfuse.com]
   Received: Sunday, 21 Dec 2014, 12:25PM
   To: users@kafka.apache.org [users@kafka.apache.org]
   Subject: Re: Trying to figure out kafka latency issues
  
   I'll take a look at the GC profile of the brokers Right now I keep
 a tab
  on
   the CPU, Messages in, Bytes in, Bytes out, free memory (on the
 machine
  not
   JVM heap) free disk space on the broker. I'll need to take a look
 at the
   JVM metrics too. What seemed strange is that going from 8 - 512
  partitions
   increases the latency, but going fro 512- 8 does not decrease it.
 I have
   to restart the producer (but not the broker) for the end to end
 latency
  to
   go down That made it seem  that the fault was probably with the
 producer
   and not the broker. Only restarting the producer made things
 better. I'll
   do more extensive measurement on the broker.
  
   On Sun, Dec 21, 2014 at 9:08 AM, Thunder Stumpges 
 tstump...@ntent.com
   wrote:
   
Did you see my response and have you checked the server logs
 especially
the GC logs? It still sounds like you are running out of memory
 on the
broker. What is your max heap memory and are you thrashing once
 you
  start
writing to all those partitions?
   
You have measured very thoroughly from an external point of view,
 i
  think
now you'll have to start measuring the internal metrics. Maybe
 someone
   else
will have ideas on what jmx values to watch.
   
Best,
Thunder
   
   
-Original Message-
From: Rajiv Kurian [ra...@signalfuse.com]
Received: Saturday, 20 Dec 2014, 10:24PM
To: users@kafka.apache.org [users@kafka.apache.org]
Subject: Re: Trying to figure out kafka latency issues
   
Some more work tells me that the end to end latency numbers vary
 with
  the
number of partitions I am writing to. I did an experiment, where
 based
   on a
run time flag I would dynamically select how many of the *1024
   partitions*
I write to. So say I decide I'll write to at most 256 partitions
 I mod
whatever

Re: Trying to figure out kafka latency issues

2014-12-29 Thread Jay Kreps
Hey Rajiv,

Yes, if you uncomment the line
  #log4j.logger.kafka.server.KafkaApis=TRACE, requestAppender
in the example log4j.properties file that will enable logging of each
request including the time taken processing the request. This is the first
step for diagnosing latency spikes since this will rule out the server. You
may also want to enable DEBUG or TRACE on the producer and see what is
happening when those spikes occur.
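
For reference, the request-logging block in the stock config/log4j.properties
looks roughly like this (logger names can differ slightly between releases, so
treat it as a sketch; the requestAppender itself is defined further up in that
file):

#log4j.logger.kafka.network.Processor=TRACE, requestAppender
#log4j.logger.kafka.server.KafkaApis=TRACE, requestAppender
log4j.logger.kafka.request.logger=WARN, requestAppender
log4j.additivity.kafka.request.logger=false

With either TRACE line uncommented (and the broker restarted), completed
requests are written to the request log along with how long they took, which is
what you want to line up against the latency spikes.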

The JMX stats show that at least the producer is not exhausting the I/O
thread (it is 92% idle) and doesn't seem to be waiting on memory which is
somewhat surprising to me (the NaN is an artifact of how we compute that
stat--no allocations took place in the time period measured so it is kind
of 0/0).
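
If digging through jvisualvm by hand gets tedious, the same producer metrics
can be dumped programmatically with plain JMX from inside the producer process.
A minimal sketch, assuming the new producer registers its beans under the
kafka.producer domain (which is what the screenshots suggest):

import java.lang.management.ManagementFactory;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ProducerJmxDump {
    // Print every attribute of every MBean registered under kafka.producer
    // (io-wait-ratio, bufferpool-wait-ratio, record-queue-time-avg, ...).
    public static void dump() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        for (ObjectName name : server.queryNames(new ObjectName("kafka.producer:*"), null)) {
            for (MBeanAttributeInfo attr : server.getMBeanInfo(name).getAttributes()) {
                try {
                    System.out.printf("%s %s = %s%n",
                            name, attr.getName(), server.getAttribute(name, attr.getName()));
                } catch (Exception e) {
                    // Some attributes may not be readable; skip them.
                }
            }
        }
    }
}

Logging this every few seconds alongside the end-to-end histogram makes it much
easier to correlate the spikes with producer-side symptoms.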

-Jay

On Mon, Dec 29, 2014 at 4:54 PM, Rajiv Kurian ra...@signalfuse.com wrote:

 In case the attachments don't work out here is an imgur link -
 http://imgur.com/NslGpT3,Uw6HFow#0

 On Mon, Dec 29, 2014 at 3:13 PM, Rajiv Kurian ra...@signalfuse.com
 wrote:

  Never mind about (2). I see these stats are already being output by the
  kafka producer. I've attached a couple of screenshots (couldn't copy
 paste
  from jvisualvm ). Do any of these things strike as odd? The
  bufferpool-wait-ratio sadly shows up as a NaN.
 
  I still don't know how to figure out (1).
 
  Thanks!
 
 
  On Mon, Dec 29, 2014 at 3:02 PM, Rajiv Kurian ra...@signalfuse.com
  wrote:
 
  Hi Jay,
 
  Re (1) - I am not sure how to do this? Actually I am not sure what this
  means. Is this the time every write/fetch request is received on the
  broker? Do I need to enable some specific log level for this to show
 up? It
  doesn't show up in the usual log. Is this information also available via
  jmx somehow?
  Re (2) - Are you saying that I should instrument the percentage of time
  waiting for buffer space stat myself? If so how do I do this. Or are
 these
  stats already output to jmx by the kafka producer code. This seems like
  it's in the internals of the kafka producer client code.
 
  Thanks again!
 
 
  On Mon, Dec 29, 2014 at 10:22 AM, Rajiv Kurian ra...@signalfuse.com
  wrote:
 
  Thanks Jay. Will check (1) and (2) and get back to you. The test is not
  stand-alone now. It might be a bit of work to extract it to a
 stand-alone
  executable. It might take me a bit of time to get that going.
 
  On Mon, Dec 29, 2014 at 9:45 AM, Jay Kreps j...@confluent.io wrote:
 
  Hey Rajiv,
 
  This sounds like a bug. The more info you can help us get the easier
 to
  fix. Things that would help:
  1. Can you check if the request log on the servers shows latency
  spikes
  (in which case it is a server problem)?
  2. It would be worth also getting the jmx stats on the producer as
 they
  will show things like what percentage of time it is waiting for buffer
  space etc.
 
  If your test is reasonably stand-alone it would be great to file a
 JIRA
  and
  attach the test code and the findings you already have so someone can
  dig
  into what is going on.
 
  -Jay
 
  On Sun, Dec 28, 2014 at 7:15 PM, Rajiv Kurian ra...@signalfuse.com
  wrote:
 
   Hi all,
  
   Bumping this up, in case some one has any ideas. I did yet another
   experiment where I create 4 producers and stripe the send requests
  across
   them in a manner such that any one producer only sees 256 partitions
   instead of the entire 1024. This seems to have helped a bit, and
  though I
   still see crazy high 99th (25-30 seconds), the median, mean, 75th
 and
  95th
   percentile have all gone down.
  
   Thanks!
  
   On Sun, Dec 21, 2014 at 12:27 PM, Thunder Stumpges 
  tstump...@ntent.com
   wrote:
  
Ah I thought it was restarting the broker that made things better
 :)
   
Yeah I have no experience with the Java client so can't really
 help
   there.
   
Good luck!
   
-Original Message-
From: Rajiv Kurian [ra...@signalfuse.com]
Received: Sunday, 21 Dec 2014, 12:25PM
To: users@kafka.apache.org [users@kafka.apache.org]
Subject: Re: Trying to figure out kafka latency issues
   
I'll take a look at the GC profile of the brokers Right now I keep
  a tab
   on
the CPU, Messages in, Bytes in, Bytes out, free memory (on the
  machine
   not
JVM heap) free disk space on the broker. I'll need to take a look
  at the
JVM metrics too. What seemed strange is that going from 8 - 512
   partitions
increases the latency, but going fro 512- 8 does not decrease it.
  I have
to restart the producer (but not the broker) for the end to end
  latency
   to
go down That made it seem  that the fault was probably with the
  producer
and not the broker. Only restarting the producer made things
  better. I'll
do more extensive measurement on the broker.
   
On Sun, Dec 21, 2014 at 9:08 AM, Thunder Stumpges 
  tstump...@ntent.com
wrote:

 Did you see my response and have you checked the server logs
  especially
 the GC logs? It still sounds like you are running out of memory

Re: Trying to figure out kafka latency issues

2014-12-29 Thread Rajiv Kurian
 is not stand-alone now. It might be a bit of work to extract it to a
 stand-alone executable. It might take me a bit of time to get that going.
  
   On Mon, Dec 29, 2014 at 9:45 AM, Jay Kreps j...@confluent.io wrote:
  
   Hey Rajiv,
  
   This sounds like a bug. The more info you can help us get the easier
  to
   fix. Things that would help:
   1. Can you check if the request log on the servers shows latency
   spikes
   (in which case it is a server problem)?
   2. It would be worth also getting the jmx stats on the producer as
  they
   will show things like what percentage of time it is waiting for
 buffer
   space etc.
  
   If your test is reasonably stand-alone it would be great to file a
  JIRA
   and
   attach the test code and the findings you already have so someone
 can
   dig
   into what is going on.
  
   -Jay
  
   On Sun, Dec 28, 2014 at 7:15 PM, Rajiv Kurian ra...@signalfuse.com
 
   wrote:
  
Hi all,
   
Bumping this up, in case some one has any ideas. I did yet another
experiment where I create 4 producers and stripe the send requests
   across
them in a manner such that any one producer only sees 256
 partitions
instead of the entire 1024. This seems to have helped a bit, and
   though I
still see crazy high 99th (25-30 seconds), the median, mean, 75th
  and
   95th
percentile have all gone down.
   
Thanks!
   
On Sun, Dec 21, 2014 at 12:27 PM, Thunder Stumpges 
   tstump...@ntent.com
wrote:
   
 Ah I thought it was restarting the broker that made things
 better
  :)

 Yeah I have no experience with the Java client so can't really
  help
there.

 Good luck!

 -Original Message-
 From: Rajiv Kurian [ra...@signalfuse.com]
 Received: Sunday, 21 Dec 2014, 12:25PM
 To: users@kafka.apache.org [users@kafka.apache.org]
 Subject: Re: Trying to figure out kafka latency issues

 I'll take a look at the GC profile of the brokers Right now I
 keep
   a tab
on
 the CPU, Messages in, Bytes in, Bytes out, free memory (on the
   machine
not
 JVM heap) free disk space on the broker. I'll need to take a
 look
   at the
 JVM metrics too. What seemed strange is that going from 8 - 512
partitions
 increases the latency, but going fro 512- 8 does not decrease
 it.
   I have
 to restart the producer (but not the broker) for the end to end
   latency
to
 go down That made it seem  that the fault was probably with the
   producer
 and not the broker. Only restarting the producer made things
   better. I'll
 do more extensive measurement on the broker.

 On Sun, Dec 21, 2014 at 9:08 AM, Thunder Stumpges 
   tstump...@ntent.com
 wrote:
 
  Did you see my response and have you checked the server logs
   especially
  the GC logs? It still sounds like you are running out of
 memory
   on the
  broker. What is your max heap memory and are you thrashing
 once
   you
start
  writing to all those partitions?
 
  You have measured very thoroughly from an external point of
  view,
   i
think
  now you'll have to start measuring the internal metrics. Maybe
   someone
 else
  will have ideas on what jmx values to watch.
 
  Best,
  Thunder
 
 
  -Original Message-
  From: Rajiv Kurian [ra...@signalfuse.com]
  Received: Saturday, 20 Dec 2014, 10:24PM
  To: users@kafka.apache.org [users@kafka.apache.org]
  Subject: Re: Trying to figure out kafka latency issues
 
  Some more work tells me that the end to end latency numbers
 vary
   with
the
  number of partitions I am writing to. I did an experiment,
 where
   based
 on a
  run time flag I would dynamically select how many of the *1024
 partitions*
  I write to. So say I decide I'll write to at most 256
 partitions
   I mod
  whatever partition I would actually write to by 256. Basically
  the
number
  of partitions for this topic on the broker remains the same at
   *1024*
  partitions but the number of partitions my producers write to
   changes
  dynamically based on a run time flag. So something like this:
 
  int partition = getPartitionForMessage(message);
  int maxPartitionsToWriteTo = maxPartitionsFlag.get();   //
 This
   flag
can
 be
  updated without bringing the application down - just a
 volatile
   read of
  some number set externally.
  int moddedPartition = partition % maxPartitionsToWriteTo;
  // Send a message to this Kafka partition.
 
  Here are some interesting things I've noticed:
 
  i) When I start my client and it *never writes* to more than
 *8
  partitions *(same
  data rate but fewer partitions) - the end to end *99th latency
  is
300-350
  ms*. Quite a bit of this (numbers in my previous emails) is
 the
   latency
  from producer - broker and the latency from

Re: Trying to figure out kafka latency issues

2014-12-28 Thread Rajiv Kurian
Hi all,

Bumping this up, in case someone has any ideas. I did yet another
experiment where I create 4 producers and stripe the send requests across
them in a manner such that any one producer only sees 256 partitions
instead of the entire 1024. This seems to have helped a bit, and though I
still see a crazy-high 99th (25-30 seconds), the median, mean, 75th and 95th
percentiles have all gone down.
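
For anyone who wants to reproduce the striping, it is roughly this (a sketch;
the producer array, topic name and the partition argument come from my own
code, not from any Kafka API):

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Partition p always goes through producer p / 256, so each KafkaProducer
// instance only ever sees 256 of the 1024 partitions.
public class StripedSender {
    private final KafkaProducer[] producers;     // e.g. 4 producers with identical config
    private final String topic;
    private final int partitionsPerProducer;

    public StripedSender(KafkaProducer[] producers, String topic, int totalPartitions) {
        this.producers = producers;
        this.topic = topic;
        this.partitionsPerProducer = totalPartitions / producers.length;
    }

    public void send(int partition, byte[] value, Callback callback) {
        KafkaProducer producer = producers[partition / partitionsPerProducer];
        producer.send(new ProducerRecord(topic, partition, null, value), callback);
    }
}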

Thanks!

On Sun, Dec 21, 2014 at 12:27 PM, Thunder Stumpges tstump...@ntent.com
wrote:

 Ah I thought it was restarting the broker that made things better :)

 Yeah I have no experience with the Java client so can't really help there.

 Good luck!

 -Original Message-
 From: Rajiv Kurian [ra...@signalfuse.com]
 Received: Sunday, 21 Dec 2014, 12:25PM
 To: users@kafka.apache.org [users@kafka.apache.org]
 Subject: Re: Trying to figure out kafka latency issues

 I'll take a look at the GC profile of the brokers. Right now I keep a tab on
 the CPU, messages in, bytes in, bytes out, free memory (on the machine, not
 the JVM heap) and free disk space on the broker. I'll need to take a look at
 the JVM metrics too. What seemed strange is that going from 8 -> 512 partitions
 increases the latency, but going from 512 -> 8 does not decrease it. I have
 to restart the producer (but not the broker) for the end-to-end latency to
 go down. That made it seem that the fault was probably with the producer
 and not the broker. Only restarting the producer made things better. I'll
 do more extensive measurement on the broker.

 On Sun, Dec 21, 2014 at 9:08 AM, Thunder Stumpges tstump...@ntent.com
 wrote:
 
  Did you see my response and have you checked the server logs especially
  the GC logs? It still sounds like you are running out of memory on the
  broker. What is your max heap memory and are you thrashing once you start
  writing to all those partitions?
 
  You have measured very thoroughly from an external point of view, i think
  now you'll have to start measuring the internal metrics. Maybe someone
 else
  will have ideas on what jmx values to watch.
 
  Best,
  Thunder
 
 
  -Original Message-
  From: Rajiv Kurian [ra...@signalfuse.com]
  Received: Saturday, 20 Dec 2014, 10:24PM
  To: users@kafka.apache.org [users@kafka.apache.org]
  Subject: Re: Trying to figure out kafka latency issues
 
  Some more work tells me that the end to end latency numbers vary with the
  number of partitions I am writing to. I did an experiment, where based
 on a
  run time flag I would dynamically select how many of the *1024
 partitions*
  I write to. So say I decide I'll write to at most 256 partitions I mod
  whatever partition I would actually write to by 256. Basically the number
  of partitions for this topic on the broker remains the same at *1024*
  partitions but the number of partitions my producers write to changes
  dynamically based on a run time flag. So something like this:
 
  int partition = getPartitionForMessage(message);
  int maxPartitionsToWriteTo = maxPartitionsFlag.get();   // This flag can
 be
  updated without bringing the application down - just a volatile read of
  some number set externally.
  int moddedPartition = partition % maxPartitionsToWriteTo;
  // Send a message to this Kafka partition.
 
  Here are some interesting things I've noticed:
 
  i) When I start my client and it *never writes* to more than *8
  partitions *(same
  data rate but fewer partitions) - the end to end *99th latency is 300-350
  ms*. Quite a bit of this (numbers in my previous emails) is the latency
  from producer - broker and the latency from broker - consumer. Still
  nowhere as poor as the *20 - 30* seconds I was seeing.
 
  ii) When I increase the maximum number of partitions, end to end latency
  increases dramatically. At *256 partitions* the end to end *99th latency
 is
  still 390 - 418 ms.* Worse than the latency figures for *8 *partitions,
 but
  not by much. When I increase this number to *512 partitions *the end
  to end *99th
  latency *becomes an intolerable *19-24 seconds*. At *1024* partitions the
  *99th
  latency is at 25 - 30 seconds*.
  A table of the numbers:
 
  Max number of partitions written to (out of 1024)   End-to-end latency
     8                                                300 - 350 ms
   256                                                390 - 418 ms
   512                                                19 - 24 seconds
  1024                                                25 - 30 seconds
 
 
  iii) Once I make the max number of partitions high enough, reducing it
  doesn't help. For eg: If I go up from *8* to *512 *partitions, the
 latency
  goes up. But while the producer is running if I go down from  *512* to
  *8 *partitions,
  it doesn't reduce the latency numbers. My guess is that the producer is
  creating some state lazily per partition and this state is causing the
  latency. Once this state is created, writing to fewer partitions doesn't
  seem to help. Only a restart of the producer calms things down.
 
  So my current plan is to reduce the number of partitions on the topic,
 but
  there seems to be something deeper going on for the latency numbers

RE: Trying to figure out kafka latency issues

2014-12-21 Thread Thunder Stumpges
Did you see my response, and have you checked the server logs, especially the GC 
logs? It still sounds like you are running out of memory on the broker. What is 
your max heap memory, and are you thrashing once you start writing to all those 
partitions?
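
If GC logging isn't already on, it can usually be enabled with HotSpot flags
along these lines (a sketch; the path is a placeholder and the exact flags
depend on how the broker JVM is launched):

-Xloggc:/path/to/kafka-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime

Long "application stopped" times lining up with the latency spikes would point
straight at GC (or other safepoint) pauses on the broker.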

You have measured very thoroughly from an external point of view; I think now 
you'll have to start measuring the internal metrics. Maybe someone else will 
have ideas on what JMX values to watch.
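
On the JMX side, the broker beans I would start with are the request-time
histograms and the log flush stats; exact ObjectNames depend on the broker
version, but in the 0.8.2-style naming they look roughly like:

kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer
kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce
kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec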

Best,
Thunder


-Original Message-
From: Rajiv Kurian [ra...@signalfuse.com]
Received: Saturday, 20 Dec 2014, 10:24PM
To: users@kafka.apache.org [users@kafka.apache.org]
Subject: Re: Trying to figure out kafka latency issues

Some more work tells me that the end to end latency numbers vary with the
number of partitions I am writing to. I did an experiment, where based on a
run time flag I would dynamically select how many of the *1024 partitions*
I write to. So say I decide I'll write to at most 256 partitions I mod
whatever partition I would actually write to by 256. Basically the number
of partitions for this topic on the broker remains the same at *1024*
partitions but the number of partitions my producers write to changes
dynamically based on a run time flag. So something like this:

int partition = getPartitionForMessage(message);
int maxPartitionsToWriteTo = maxPartitionsFlag.get();   // This flag can be
updated without bringing the application down - just a volatile read of
some number set externally.
int moddedPartition = partition % maxPartitionsToWriteTo;
// Send a message to this Kafka partition.

Here are some interesting things I've noticed:

i) When I start my client and it *never writes* to more than *8
partitions *(same
data rate but fewer partitions) - the end to end *99th latency is 300-350
ms*. Quite a bit of this (numbers in my previous emails) is the latency
from producer - broker and the latency from broker - consumer. Still
nowhere as poor as the *20 - 30* seconds I was seeing.

ii) When I increase the maximum number of partitions, end to end latency
increases dramatically. At *256 partitions* the end to end *99th latency is
still 390 - 418 ms.* Worse than the latency figures for *8 *partitions, but
not by much. When I increase this number to *512 partitions *the end
to end *99th
latency *becomes an intolerable *19-24 seconds*. At *1024* partitions the *99th
latency is at 25 - 30 seconds*.
A table of the numbers:

Max number of partitions written to (out of 1024)   End-to-end latency
   8                                                300 - 350 ms
 256                                                390 - 418 ms
 512                                                19 - 24 seconds
1024                                                25 - 30 seconds


iii) Once I make the max number of partitions high enough, reducing it
doesn't help. For eg: If I go up from *8* to *512 *partitions, the latency
goes up. But while the producer is running if I go down from  *512* to
*8 *partitions,
it doesn't reduce the latency numbers. My guess is that the producer is
creating some state lazily per partition and this state is causing the
latency. Once this state is created, writing to fewer partitions doesn't
seem to help. Only a restart of the producer calms things down.

So my current plan is to reduce the number of partitions on the topic, but
there seems to be something deeper going on for the latency numbers to be
so poor to begin with and then suffer so much more (non linearly) with
additional partitions.

Thanks!

On Sat, Dec 20, 2014 at 6:03 PM, Rajiv Kurian ra...@signalfuse.com wrote:

 I've done some more measurements. I've also started measuring the latency
 between when I ask my producer to send a message and when I get an
 acknowledgement via the callback. Here is my code:

 // This function is called on every producer once every 30 seconds.

 public void addLagMarkers(final Histogram enqueueLag) {

 final int numberOfPartitions = 1024;

 final long timeOfEnqueue = System.currentTimeMillis();

 final Callback callback = new Callback() {

 @Override

 public void onCompletion(RecordMetadata metadata, Exception ex)
 {

 if (metadata != null) {

 // The difference between ack time from broker and
 enqueue time.

 final long timeOfAck = System.currentTimeMillis();

 final long lag = timeOfAck - timeOfEnqueue;

 enqueueLag.update(lag);

 }

 }

 };

 for (int i = 0; i < numberOfPartitions; i++) {

 try {

 byte[] value = LagMarker.serialize(timeOfEnqueue);  // 10
 bytes - short version + long timestamp.

 // This message is later used by the consumers to measure
 lag.

 ProducerRecord record = new ProducerRecord(MY_TOPIC, i,
 null, value);

 kafkaProducer.send(record, callback);

 } catch (Exception e) {

 // We just dropped a lag marker.

 }

 }

 }

 The* 99th* on this lag is about* 350 - 400* ms. It's not stellar

Re: Trying to figure out kafka latency issues

2014-12-21 Thread Rajiv Kurian
I'll take a look at the GC profile of the brokers. Right now I keep a tab on
the CPU, messages in, bytes in, bytes out, free memory (on the machine, not
the JVM heap) and free disk space on the broker. I'll need to take a look at
the JVM metrics too. What seemed strange is that going from 8 -> 512 partitions
increases the latency, but going from 512 -> 8 does not decrease it. I have
to restart the producer (but not the broker) for the end-to-end latency to
go down. That made it seem that the fault was probably with the producer
and not the broker. Only restarting the producer made things better. I'll
do more extensive measurement on the broker.

On Sun, Dec 21, 2014 at 9:08 AM, Thunder Stumpges tstump...@ntent.com
wrote:

 Did you see my response and have you checked the server logs especially
 the GC logs? It still sounds like you are running out of memory on the
 broker. What is your max heap memory and are you thrashing once you start
 writing to all those partitions?

 You have measured very thoroughly from an external point of view, i think
 now you'll have to start measuring the internal metrics. Maybe someone else
 will have ideas on what jmx values to watch.

 Best,
 Thunder


 -Original Message-
 From: Rajiv Kurian [ra...@signalfuse.com]
 Received: Saturday, 20 Dec 2014, 10:24PM
 To: users@kafka.apache.org [users@kafka.apache.org]
 Subject: Re: Trying to figure out kafka latency issues

 Some more work tells me that the end to end latency numbers vary with the
 number of partitions I am writing to. I did an experiment, where based on a
 run time flag I would dynamically select how many of the *1024 partitions*
 I write to. So say I decide I'll write to at most 256 partitions I mod
 whatever partition I would actually write to by 256. Basically the number
 of partitions for this topic on the broker remains the same at *1024*
 partitions but the number of partitions my producers write to changes
 dynamically based on a run time flag. So something like this:

 int partition = getPartitionForMessage(message);
 int maxPartitionsToWriteTo = maxPartitionsFlag.get();   // This flag can be
 updated without bringing the application down - just a volatile read of
 some number set externally.
 int moddedPartition = partition % maxPartitionsToWriteTo;
 // Send a message to this Kafka partition.

 Here are some interesting things I've noticed:

 i) When I start my client and it *never writes* to more than *8
 partitions *(same
 data rate but fewer partitions) - the end to end *99th latency is 300-350
 ms*. Quite a bit of this (numbers in my previous emails) is the latency
 from producer - broker and the latency from broker - consumer. Still
 nowhere as poor as the *20 - 30* seconds I was seeing.

 ii) When I increase the maximum number of partitions, end to end latency
 increases dramatically. At *256 partitions* the end to end *99th latency is
 still 390 - 418 ms.* Worse than the latency figures for *8 *partitions, but
 not by much. When I increase this number to *512 partitions *the end
 to end *99th
 latency *becomes an intolerable *19-24 seconds*. At *1024* partitions the
 *99th
 latency is at 25 - 30 seconds*.
 A table of the numbers:

 Max number of partitions written to (out of 1024)   End-to-end latency
    8                                                300 - 350 ms
  256                                                390 - 418 ms
  512                                                19 - 24 seconds
 1024                                                25 - 30 seconds


 iii) Once I make the max number of partitions high enough, reducing it
 doesn't help. For eg: If I go up from *8* to *512 *partitions, the latency
 goes up. But while the producer is running if I go down from  *512* to
 *8 *partitions,
 it doesn't reduce the latency numbers. My guess is that the producer is
 creating some state lazily per partition and this state is causing the
 latency. Once this state is created, writing to fewer partitions doesn't
 seem to help. Only a restart of the producer calms things down.

 So my current plan is to reduce the number of partitions on the topic, but
 there seems to be something deeper going on for the latency numbers to be
 so poor to begin with and then suffer so much more (non linearly) with
 additional partitions.

 Thanks!

 On Sat, Dec 20, 2014 at 6:03 PM, Rajiv Kurian ra...@signalfuse.com
 wrote:
 
  I've done some more measurements. I've also started measuring the latency
  between when I ask my producer to send a message and when I get an
  acknowledgement via the callback. Here is my code:
 
  // This function is called on every producer once every 30 seconds.
 
  public void addLagMarkers(final Histogram enqueueLag) {
 
  final int numberOfPartitions = 1024;
 
  final long timeOfEnqueue = System.currentTimeMillis();
 
  final Callback callback = new Callback() {
 
  @Override
 
  public void onCompletion(RecordMetadata metadata, Exception
 ex)
  {
 
  if (metadata != null) {
 
  // The difference between ack time from broker and
  enqueue time.
 
  final long timeOfAck

RE: Trying to figure out kafka latency issues

2014-12-21 Thread Thunder Stumpges
Ah I thought it was restarting the broker that made things better :)

Yeah I have no experience with the Java client so can't really help there.

Good luck!

-Original Message-
From: Rajiv Kurian [ra...@signalfuse.com]
Received: Sunday, 21 Dec 2014, 12:25PM
To: users@kafka.apache.org [users@kafka.apache.org]
Subject: Re: Trying to figure out kafka latency issues

    I'll take a look at the GC profile of the brokers. Right now I keep a tab on
    the CPU, messages in, bytes in, bytes out, free memory (on the machine, not
    the JVM heap) and free disk space on the broker. I'll need to take a look at
    the JVM metrics too. What seemed strange is that going from 8 -> 512 partitions
    increases the latency, but going from 512 -> 8 does not decrease it. I have
    to restart the producer (but not the broker) for the end-to-end latency to
    go down. That made it seem that the fault was probably with the producer
    and not the broker. Only restarting the producer made things better. I'll
    do more extensive measurement on the broker.

On Sun, Dec 21, 2014 at 9:08 AM, Thunder Stumpges tstump...@ntent.com
wrote:

 Did you see my response and have you checked the server logs especially
 the GC logs? It still sounds like you are running out of memory on the
 broker. What is your max heap memory and are you thrashing once you start
 writing to all those partitions?

 You have measured very thoroughly from an external point of view, i think
 now you'll have to start measuring the internal metrics. Maybe someone else
 will have ideas on what jmx values to watch.

 Best,
 Thunder


 -Original Message-
 From: Rajiv Kurian [ra...@signalfuse.com]
 Received: Saturday, 20 Dec 2014, 10:24PM
 To: users@kafka.apache.org [users@kafka.apache.org]
 Subject: Re: Trying to figure out kafka latency issues

 Some more work tells me that the end to end latency numbers vary with the
 number of partitions I am writing to. I did an experiment, where based on a
 run time flag I would dynamically select how many of the *1024 partitions*
 I write to. So say I decide I'll write to at most 256 partitions I mod
 whatever partition I would actually write to by 256. Basically the number
 of partitions for this topic on the broker remains the same at *1024*
 partitions but the number of partitions my producers write to changes
 dynamically based on a run time flag. So something like this:

 int partition = getPartitionForMessage(message);
 int maxPartitionsToWriteTo = maxPartitionsFlag.get();   // This flag can be
 updated without bringing the application down - just a volatile read of
 some number set externally.
 int moddedPartition = partition % maxPartitionsToWriteTo;
 // Send a message to this Kafka partition.

 Here are some interesting things I've noticed:

 i) When I start my client and it *never writes* to more than *8
 partitions *(same
 data rate but fewer partitions) - the end to end *99th latency is 300-350
 ms*. Quite a bit of this (numbers in my previous emails) is the latency
 from producer - broker and the latency from broker - consumer. Still
 nowhere as poor as the *20 - 30* seconds I was seeing.

 ii) When I increase the maximum number of partitions, end to end latency
 increases dramatically. At *256 partitions* the end to end *99th latency is
 still 390 - 418 ms.* Worse than the latency figures for *8 *partitions, but
 not by much. When I increase this number to *512 partitions *the end
 to end *99th
 latency *becomes an intolerable *19-24 seconds*. At *1024* partitions the
 *99th
 latency is at 25 - 30 seconds*.
 A table of the numbers:

 Max number of partitions written to (out of 1024)   End-to-end latency
    8                                                300 - 350 ms
  256                                                390 - 418 ms
  512                                                19 - 24 seconds
 1024                                                25 - 30 seconds


 iii) Once I make the max number of partitions high enough, reducing it
 doesn't help. For eg: If I go up from *8* to *512 *partitions, the latency
 goes up. But while the producer is running if I go down from  *512* to
 *8 *partitions,
 it doesn't reduce the latency numbers. My guess is that the producer is
 creating some state lazily per partition and this state is causing the
 latency. Once this state is created, writing to fewer partitions doesn't
 seem to help. Only a restart of the producer calms things down.

 So my current plan is to reduce the number of partitions on the topic, but
 there seems to be something deeper going on for the latency numbers to be
 so poor to begin with and then suffer so much more (non linearly) with
 additional partitions.

 Thanks!

 On Sat, Dec 20, 2014 at 6:03 PM, Rajiv Kurian ra...@signalfuse.com
 wrote:
 
  I've done some more measurements. I've also started measuring the latency
  between when I ask my producer to send a message and when I get an
  acknowledgement via the callback. Here is my code:
 
  // This function is called on every producer once every 30 seconds.
 
  public void addLagMarkers(final Histogram enqueueLag) {
 
  final int numberOfPartitions = 1024;
 
  final long timeOfEnqueue

Trying to figure out kafka latency issues

2014-12-20 Thread Rajiv Kurian
I am trying to replace a Thrift peer-to-peer API with kafka for a
particular workflow. I am finding the 99th percentile latency to be
unacceptable at this time. This entire workload runs in an Amazon VPC. I'd
greatly appreciate it if someone has any insights on why I am seeing such
poor numbers. Here are some details and measurements taken:

i) I have a single topic with 1024 partitions that I am writing to from six
clients using the kafka 0.8.2 beta producer.

ii) I have 3 brokers, each on a c3 2x machine on EC2. Each of those
machines has 8 virtual CPUs, 15 GB of memory and 2 * 80 GB SSDs. The broker ->
partitions mapping was decided by Kafka when I created the topic.

iii) I write about 22 thousand messages per second from across the 6
clients. This number was calculated using distributed counters. I just
increment a distributed counter in the callback from my enqueue job if the
metadata returned is not null. I also increment a number-of-dropped-messages
counter if the callback has a non-null exception or if there was an
exception in the synchronous send call. The number of dropped messages is
pretty much always zero. Out of the 6 clients, 3 are responsible for 95% of
the traffic. Messages are very tiny and have null keys and 27-byte values
(a 2-byte version and a 25-byte payload). Again, these messages are written
using the kafka 0.8.2 client.
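
Concretely, the counting is along these lines (a sketch; the AtomicLongs stand
in for the distributed counters and are not part of any Kafka API):

import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// Count delivered vs. dropped sends via the async callback.
public class CountingSender {
    private final AtomicLong delivered = new AtomicLong();
    private final AtomicLong dropped = new AtomicLong();
    private final KafkaProducer producer;
    private final String topic;

    public CountingSender(KafkaProducer producer, String topic) {
        this.producer = producer;
        this.topic = topic;
    }

    public void send(int partition, byte[] value) {
        try {
            producer.send(new ProducerRecord(topic, partition, null, value), new Callback() {
                @Override
                public void onCompletion(RecordMetadata metadata, Exception exception) {
                    if (metadata != null) {
                        delivered.incrementAndGet();
                    } else {
                        dropped.incrementAndGet();   // exception carries the failure cause
                    }
                }
            });
        } catch (Exception e) {
            dropped.incrementAndGet();               // synchronous failure in send()
        }
    }
}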

iv) I have 3 consumer processes consuming only from this topic. Each
consumer process is assigned a disjoint set of the 1024 partitions by an
external arbiter. Each consumer process then creates a mapping from brokers ->
partitions it has been assigned. It then starts one fetcher thread per
broker. Each thread queries the broker (using the SimpleConsumer) it has
been assigned for partitions such that partitions = (partitions on broker)
∩ (partitions assigned to the process by the arbiter). So in effect there
are 9 consumer threads across 3 consumer processes that query a disjoint
set of partitions for the single topic, and amongst themselves they manage
to consume from every partition. So each thread has a run loop where it asks
for messages, consumes them, then asks for more messages. When a run loop
starts, it starts by querying for the latest message in a partition (i.e.
discards any previous backup) and then maintains a map of partition ->
nextOffsetToRequest in memory to make sure that it consumes messages in
order.
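
A sketch of that per-broker bookkeeping (the arbiter assignment and the
broker -> partitions metadata are inputs produced by my own code, not Kafka
APIs):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// For each broker id: (partitions led by that broker) ∩ (partitions assigned
// to this process by the arbiter). One fetcher thread is started per entry.
public final class PartitionAssignment {
    public static Map<Integer, Set<Integer>> perBroker(Set<Integer> assignedToProcess,
                                                       Map<Integer, Set<Integer>> partitionsLedByBroker) {
        Map<Integer, Set<Integer>> result = new HashMap<Integer, Set<Integer>>();
        for (Map.Entry<Integer, Set<Integer>> entry : partitionsLedByBroker.entrySet()) {
            Set<Integer> intersection = new HashSet<Integer>(entry.getValue());
            intersection.retainAll(assignedToProcess);
            if (!intersection.isEmpty()) {
                result.put(entry.getKey(), intersection);
            }
        }
        return result;
    }
}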

v) Consumption is really simple. Each message is put on an efficient
non-blocking ring buffer. If the ring buffer is full, the message is dropped. I
measure the mean time between fetches and it is the same as the time for
fetches to within a ms, meaning no matter how many messages are dequeued, it
takes almost no time to process them. At the end of processing a fetch
request I increment another distributed counter that counts the number of
messages processed. This counter tells me that on average I am consuming
the same number of messages/sec that I enqueue on the producers, i.e. around
22 thousand messages/sec.

vi) The 99th percentile of the number of messages fetched per fetch request
is about 500 messages. The 99th percentile of the time it takes to fetch a
batch is about 130 - 140 ms. I played around with the buffer and maxWait
settings on the SimpleConsumer, and attempting to consume more messages
caused the 99th percentile of the fetch time to balloon, so I am
consuming in smaller batches right now.
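
For context, each fetch is built with the 0.8 SimpleConsumer API roughly like
this (a sketch; the offset bookkeeping and fetch sizes are mine, and
maxWait/minBytes are the knobs referred to above):

import java.util.Map;
import kafka.api.FetchRequest;
import kafka.api.FetchRequestBuilder;
import kafka.javaapi.FetchResponse;
import kafka.javaapi.consumer.SimpleConsumer;

// One fetch request covering every partition this thread owns on this broker.
public class FetcherSketch {
    public FetchResponse fetchOnce(SimpleConsumer consumer, String topic,
                                   Map<Integer, Long> nextOffsetByPartition) {
        FetchRequestBuilder builder = new FetchRequestBuilder()
                .clientId("latency-test")
                .maxWait(100)      // ms the broker may wait before answering
                .minBytes(1);      // respond as soon as any data is available
        for (Map.Entry<Integer, Long> entry : nextOffsetByPartition.entrySet()) {
            builder.addFetch(topic, entry.getKey(), entry.getValue(), 64 * 1024);  // 64 KB per partition
        }
        FetchRequest request = builder.build();
        return consumer.fetch(request);
    }
}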

vii) Every 30 seconds or so, each of the producers inserts a trace message
into each partition (so 1024 messages per producer every 30 seconds). Each
message only contains the System.currentTimeMillis() at the time of
enqueueing. It is about 9 bytes long. The consumer in step (v) always
checks to see if a message it dequeued is of this trace type (instead of
the regular message type). This is a simple check on the first 2-byte
version that each message buffer contains. If it is a trace message, it
reads its payload as a long, takes the difference between the system
time on the consumer and this payload, and updates a histogram with this
difference value. Ignoring NTP skew (which I've measured to be on the order
of milliseconds), this is the lag between when a message was enqueued on the
producer and when it was read on the consumer.
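
The consumer-side half of that check is, in sketch form (TRACE_VERSION is a
placeholder for whatever value the producers actually write, and the Histogram
is assumed to be a Coda Hale-style histogram with an update(long) method):

import java.nio.ByteBuffer;
import com.codahale.metrics.Histogram;

// Payload layout: 2-byte version, then (for trace messages) the 8-byte
// enqueue timestamp written by the producer.
public class LagCheck {
    private static final short TRACE_VERSION = 2;   // placeholder value

    public static void maybeRecordLag(ByteBuffer payload, Histogram endToEndLag) {
        short version = payload.getShort(payload.position());
        if (version == TRACE_VERSION) {
            long enqueueTimeMillis = payload.getLong(payload.position() + 2);
            endToEndLag.update(System.currentTimeMillis() - enqueueTimeMillis);
        }
    }
}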

So pseudocode for every consumer thread (one per broker per consumer
process; 9 in total across 3 consumers) is:

void run() {

while (running) {

FetchRequest fetchRequest = buildFetchRequest(
partitionsAssignedToThisProcessThatFallOnThisBroker);  // This assignment
is done by an external arbiter.

measureTimeBetweenFetchRequests();   // The 99th of this is 130-140ms

FetchResponse fetchResponse = fetchResponse(fetchRequest);  // The 99th
of this is 130-140ms, which is the same as the time between fetches.

processData(fetchResponse);  // The 99th on this is 2-3 ms.
  }
}

void processData(FetchResponse response) {
  try (Timer.Context _ = processWorkerDataTimer.time()) {  // The 99th on this is 2-3 ms.
  

RE: Trying to figure out kafka latency issues

2014-12-20 Thread Thunder Stumpges
That's a pretty detailed analysis; I'll be very interested to see what the root 
cause is.

Have you had a look at the broker GC logs? The spike in CPU and the long tail 
on the latency make me think garbage collection pauses.

I suppose the large number of partitions may have increased the memory needs? 
Seems like the SSDs should be able to keep up no problem.

Good luck and I'm sure the others will have some other ideas!

Thunder


-Original Message-
From: Rajiv Kurian [ra...@signalfuse.com]
Received: Saturday, 20 Dec 2014, 3:58PM
To: users@kafka.apache.org [users@kafka.apache.org]
Subject: Trying to figure out kafka latency issues

I am trying to replace a Thrift peer to peer API with kafka for a
particular work flow. I am finding the 99th percentile latency to be
unacceptable at this time. This entire work load runs in an Amazon VPC. I'd
greatly appreciate it if some one has any insights on why I am seeing such
poor numbers. Here are some details and measurements taken:

i) I have a single topic with 1024 partitions that I am writing to from six
clients using the kafka 0.8.2 beta kafka producer.

ii) I have 3 brokers, each on  a c3 2x machine on ec2. Each of those
machines has 8 virtual cpus, 15 GB memory and 2 * 80 GB SSDs. The broker -
partitions mapping was decided by Kafka when I created the topic.

iii) I write about 22 thousand messages per second from across the 6
clients. This number was calculated using distributed counters. I just
increment a distributed counter in the callback from my enqueue job if the
metadata returned is not null. I also increment a number o dropped messages
counter if the callback has a non-null exception or if there was an
exception in the synchronous send call. The number of dropped messages is
pretty much always zero. Out of the 6 clients 3 are responsible for 95% of
the traffic. Messages are very tiny and have null keys  and 27 byte values
(2 byte version and 25 byte payload). Again these messages are written
using the kafka 0.8.2 client.

iv) I have 3 consumer processes consuming only from this topic. Each
consumer process is assigned a disjoint set of the 1024 partitions by an
eternal arbiter. Each consumer process then creates a mapping from brokers
- partitions it has been assigned. It then starts one fetcher thread per
broker. Each thread queries the broker (using the SimpleConsumer) it has
been assigned for partitions such that partitions = (partitions on broker)
*∩ *(partitions assigned to the process by the arbiter). So in effect there
are 9 consumer threads across 3 consumer processes that query a disjoint
set of partitions for the single topic and amongst themselves they manage
to consume from every topic. So each thread has a run loop where it asks
for messages, consumes them then asks for more messages. When a run loop
starts, it starts by querying for the latest message in a partition (i.e.
discards any previous backup) and then maintains a map of partition -
nextOffsetToRequest in memory to make sure that it consumes messages in
order.

v) Consumption is really simple. Each message is put on a non blocking
efficient ring buffer. If the ring buffer is full the message is dropped. I
measure the mean time between fetches and it is the same as the time for
fetches up to a ms, meaning no matter how many messages are dequeued, it
takes almost no time to process them. At the end of processing a fetch
request I increment another distributed counter that counts the number of
messages processed. This counter tells me that on average I am consuming
the same number of messages/sec that I enqueue on the producers i.e. around
22 thousand messages/sec.

vi) The 99th percentile of the number of messages fetched per fetch request
is about 500 messages.  The 99th percentile of the time it takes to fetch a
batch is abut 130 - 140 ms. I played around with the buffer and maxWait
settings on the SimpleConsumer and attempting to consume more messages was
leading the 99th percentile of the fetch time to balloon up, so I am
consuming in smaller batches right now.

vii) Every 30 seconds odd each of the producers inserts a trace message
into each partition (so 1024 messages per producer every 30 seconds). Each
message only contains the System.currentTimeMillis() at the time of
enqueueing. It is about 9 bytes long. The consumer in step (v) always
checks to see if a message it dequeued is of this trace type (instead of
the regular message type). This is a simple check on the first 2 byte
version that each message buffer contains. If it is a trace message it
reads it's payload as a long and uses the difference between the system
time on the consumer and this payload and updates a histogram with this
difference value. Ignoring NTP skew (which I've measured to be in the order
of milliseconds) this is the lag between when a message was enqueued on the
producer and when it was read on the consumer.

So pseudocode for every consumer thread (one per broker per consumer
process

Re: Trying to figure out kafka latency issues

2014-12-20 Thread Rajiv Kurian
On Sat, Dec 20, 2014 at 3:49 PM, Rajiv Kurian ra...@signalfuse.com wrote:

 I am trying to replace a Thrift peer to peer API with kafka for a
 particular work flow. I am finding the 99th percentile latency to be
 unacceptable at this time. This entire work load runs in an Amazon VPC. I'd
 greatly appreciate it if some one has any insights on why I am seeing such
 poor numbers. Here are some details and measurements taken:

 i) I have a single topic with 1024 partitions that I am writing to from
 six clients using the kafka 0.8.2 beta kafka producer.

 ii) I have 3 brokers, each on  a c3 2x machine on ec2. Each of those
 machines has 8 virtual cpus, 15 GB memory and 2 * 80 GB SSDs. The broker -
 partitions mapping was decided by Kafka when I created the topic.

 iii) I write about 22 thousand messages per second from across the 6
 clients. This number was calculated using distributed counters. I just
 increment a distributed counter in the callback from my enqueue job if the
 metadata returned is not null. I also increment a number o dropped messages
 counter if the callback has a non-null exception or if there was an
 exception in the synchronous send call. The number of dropped messages is
 pretty much always zero. Out of the 6 clients 3 are responsible for 95% of
 the traffic. Messages are very tiny and have null keys  and 27 byte values
 (2 byte version and 25 byte payload). Again these messages are written
 using the kafka 0.8.2 client.

 iv) I have 3 consumer processes consuming only from this topic. Each
 consumer process is assigned a disjoint set of the 1024 partitions by an
 eternal arbiter. Each consumer process then creates a mapping from brokers
 - partitions it has been assigned. It then starts one fetcher thread per
 broker. Each thread queries the broker (using the SimpleConsumer) it has
 been assigned for partitions such that partitions = (partitions on broker)
 *∩ *(partitions assigned to the process by the arbiter). So in effect
 there are 9 consumer threads across 3 consumer processes that query a
 disjoint set of partitions for the single topic and amongst themselves they
 manage to consume from every topic. So each thread has a run loop where it
 asks for messages, consumes them then asks for more messages. When a run
 loop starts, it starts by querying for the latest message in a partition
 (i.e. discards any previous backup) and then maintains a map of partition
 - nextOffsetToRequest in memory to make sure that it consumes messages in
 order.

Edit: Meant external arbiter.


 v) Consumption is really simple. Each message is put on a non blocking
 efficient ring buffer. If the ring buffer is full the message is dropped. I
 measure the mean time between fetches and it is the same as the time for
 fetches up to a ms, meaning no matter how many messages are dequeued, it
 takes almost no time to process them. At the end of processing a fetch
 request I increment another distributed counter that counts the number of
 messages processed. This counter tells me that on average I am consuming
 the same number of messages/sec that I enqueue on the producers i.e. around
 22 thousand messages/sec.

 vi) The 99th percentile of the number of messages fetched per fetch
 request is about 500 messages.  The 99th percentile of the time it takes to
 fetch a batch is abut 130 - 140 ms. I played around with the buffer and
 maxWait settings on the SimpleConsumer and attempting to consume more
 messages was leading the 99th percentile of the fetch time to balloon up,
 so I am consuming in smaller batches right now.

 vii) Every 30 seconds odd each of the producers inserts a trace message
 into each partition (so 1024 messages per producer every 30 seconds). Each
 message only contains the System.currentTimeMillis() at the time of
 enqueueing. It is about 9 bytes long. The consumer in step (v) always
 checks to see if a message it dequeued is of this trace type (instead of
 the regular message type). This is a simple check on the first 2 byte
 version that each message buffer contains. If it is a trace message it
 reads it's payload as a long and uses the difference between the system
 time on the consumer and this payload and updates a histogram with this
 difference value. Ignoring NTP skew (which I've measured to be in the order
 of milliseconds) this is the lag between when a message was enqueued on the
 producer and when it was read on the consumer.

Edit: Trace messages are 10 bytes long - a 2-byte version + an 8-byte long
timestamp.

So pseudocode for every consumer thread (one per broker per consumer
 process (9 in total across 3 consumers) is:

 void run() {

 while (running) {

 FetchRequest fetchRequest = buildFetchRequest(
 partitionsAssignedToThisProcessThatFallOnThisBroker);  // This assignment
 is done by an external arbiter.

 measureTimeBetweenFetchRequests();   // The 99th of this is 130-140ms

 FetchResponse fetchResponse = fetchResponse(fetchRequest);  // The
 99th of this is 

Re: Trying to figure out kafka latency issues

2014-12-20 Thread Rajiv Kurian
I've done some more measurements. I've also started measuring the latency
between when I ask my producer to send a message and when I get an
acknowledgement via the callback. Here is my code:

// This function is called on every producer once every 30 seconds.

public void addLagMarkers(final Histogram enqueueLag) {
    final int numberOfPartitions = 1024;
    final long timeOfEnqueue = System.currentTimeMillis();
    final Callback callback = new Callback() {
        @Override
        public void onCompletion(RecordMetadata metadata, Exception ex) {
            if (metadata != null) {
                // The difference between ack time from broker and enqueue time.
                final long timeOfAck = System.currentTimeMillis();
                final long lag = timeOfAck - timeOfEnqueue;
                enqueueLag.update(lag);
            }
        }
    };
    for (int i = 0; i < numberOfPartitions; i++) {
        try {
            // 10 bytes - short version + long timestamp.
            // This message is later used by the consumers to measure lag.
            byte[] value = LagMarker.serialize(timeOfEnqueue);
            ProducerRecord record = new ProducerRecord(MY_TOPIC, i, null, value);
            kafkaProducer.send(record, callback);
        } catch (Exception e) {
            // We just dropped a lag marker.
        }
    }
}

The *99th* on this lag is about *350 - 400 ms*. It's not stellar, but it
doesn't account for the *20-30 second 99th* I see on the end-to-end lag. I
am consuming in a tight loop on the consumers (using the SimpleConsumer)
with minimal processing and a *99th fetch time* of *130-140 ms*, so I
don't think that should be a problem either. Completely baffled.


Thanks!



On Sat, Dec 20, 2014 at 5:51 PM, Rajiv Kurian ra...@signalfuse.com wrote:



 On Sat, Dec 20, 2014 at 3:49 PM, Rajiv Kurian ra...@signalfuse.com
 wrote:

 I am trying to replace a Thrift peer to peer API with kafka for a
 particular work flow. I am finding the 99th percentile latency to be
 unacceptable at this time. This entire work load runs in an Amazon VPC. I'd
 greatly appreciate it if some one has any insights on why I am seeing such
 poor numbers. Here are some details and measurements taken:

 i) I have a single topic with 1024 partitions that I am writing to from
 six clients using the kafka 0.8.2 beta kafka producer.

 ii) I have 3 brokers, each on  a c3 2x machine on ec2. Each of those
 machines has 8 virtual cpus, 15 GB memory and 2 * 80 GB SSDs. The broker -
 partitions mapping was decided by Kafka when I created the topic.

 iii) I write about 22 thousand messages per second from across the 6
 clients. This number was calculated using distributed counters. I just
 increment a distributed counter in the callback from my enqueue job if the
 metadata returned is not null. I also increment a number o dropped messages
 counter if the callback has a non-null exception or if there was an
 exception in the synchronous send call. The number of dropped messages is
 pretty much always zero. Out of the 6 clients 3 are responsible for 95% of
 the traffic. Messages are very tiny and have null keys  and 27 byte values
 (2 byte version and 25 byte payload). Again these messages are written
 using the kafka 0.8.2 client.

 iv) I have 3 consumer processes consuming only from this topic. Each
 consumer process is assigned a disjoint set of the 1024 partitions by an
 eternal arbiter. Each consumer process then creates a mapping from brokers
 - partitions it has been assigned. It then starts one fetcher thread per
 broker. Each thread queries the broker (using the SimpleConsumer) it has
 been assigned for partitions such that partitions = (partitions on broker)
 *∩ *(partitions assigned to the process by the arbiter). So in effect
 there are 9 consumer threads across 3 consumer processes that query a
 disjoint set of partitions for the single topic and amongst themselves they
 manage to consume from every topic. So each thread has a run loop where it
 asks for messages, consumes them then asks for more messages. When a run
 loop starts, it starts by querying for the latest message in a partition
 (i.e. discards any previous backup) and then maintains a map of partition
 - nextOffsetToRequest in memory to make sure that it consumes messages in
 order.

 Edit: Mean't external arbiter.


 v) Consumption is really simple. Each message is put on a non-blocking,
 efficient ring buffer. If the ring buffer is full, the message is dropped. I
 measure the mean time between fetches and it matches the fetch time itself
 to within a ms, meaning no matter how many messages are dequeued, it
 takes almost no time to process them. At the end of processing a fetch
 request I increment another distributed counter that counts the number of
 messages processed. This counter tells me that on average I am consuming
 the same number of 

Re: Trying to figure out kafka latency issues

2014-12-20 Thread Rajiv Kurian
Some more work tells me that the end to end latency numbers vary with the
number of partitions I am writing to. I did an experiment, where based on a
run time flag I would dynamically select how many of the *1024 partitions*
I write to. So say I decide I'll write to at most 256 partitions I mod
whatever partition I would actually write to by 256. Basically the number
of partitions for this topic on the broker remains the same at *1024*
partitions but the number of partitions my producers write to changes
dynamically based on a run time flag. So something like this:

int partition = getPartitionForMessage(message);
// This flag can be updated without bringing the application down - just a
// volatile read of some number set externally.
int maxPartitionsToWriteTo = maxPartitionsFlag.get();
int moddedPartition = partition % maxPartitionsToWriteTo;
// Send a message to this Kafka partition.
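
Spelled out as a self-contained sketch (not my actual code - the flag is just an
AtomicInteger here, and this targets the released 0.8.2 java producer API, so the
beta client may need small tweaks):

import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CappedPartitionSender {
    // Stands in for the externally-set run time flag.
    private final AtomicInteger maxPartitionsFlag = new AtomicInteger(1024);
    private final KafkaProducer<byte[], byte[]> producer;
    private final String topic;

    public CappedPartitionSender(Properties producerConfig, String topic) {
        this.producer = new KafkaProducer<byte[], byte[]>(producerConfig);
        this.topic = topic;
    }

    // Flipped at run time without bringing the application down.
    public void setMaxPartitions(int max) {
        maxPartitionsFlag.set(max);
    }

    public void send(int partitionForMessage, byte[] value) {
        // Cap the partition into the currently allowed range; the topic itself
        // still has 1024 partitions on the brokers.
        int moddedPartition = partitionForMessage % maxPartitionsFlag.get();
        producer.send(new ProducerRecord<byte[], byte[]>(topic, moddedPartition, null, value));
    }
}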

Here are some interesting things I've noticed:

i) When I start my client and it *never writes* to more than *8
partitions *(same
data rate but fewer partitions) - the end to end *99th latency is 300-350
ms*. Quite a bit of this (numbers in my previous emails) is the latency
from producer -> broker and the latency from broker -> consumer. Still
nowhere as poor as the *20 - 30* seconds I was seeing.

ii) When I increase the maximum number of partitions, end to end latency
increases dramatically. At *256 partitions* the end to end *99th latency is
still 390 - 418 ms.* Worse than the latency figures for *8 *partitions, but
not by much. When I increase this number to *512 partitions *the end
to end *99th
latency *becomes an intolerable *19-24 seconds*. At *1024* partitions the *99th
latency is at 25 - 30 seconds*.
A table of the numbers:

Max number of partitions written to (out of 1024)    End to end latency
8                                                    300 - 350 ms
256                                                  390 - 418 ms
512                                                  19 - 24 seconds
1024                                                 25 - 30 seconds


iii) Once I make the max number of partitions high enough, reducing it
doesn't help. For example, if I go up from *8* to *512* partitions, the latency
goes up. But while the producer is running, if I go down from *512* to *8*
partitions, it doesn't reduce the latency numbers. My guess is that the producer is
creating some state lazily per partition and this state is causing the
latency. Once this state is created, writing to fewer partitions doesn't
seem to help. Only a restart of the producer calms things down.

So my current plan is to reduce the number of partitions on the topic, but
there seems to be something deeper going on for the latency numbers to be
so poor to begin with and then suffer so much more (non linearly) with
additional partitions.
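
If the lazy per-partition state theory is right, the new producer's per-partition
batching is one thing I plan to poke at: every partition with pending records gets
its own batch, and all of those batches come out of the shared buffer.memory pool.
A sketch of the knobs I'd experiment with (illustrative values, config names from
the released 0.8.2 client):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;

public class ProducerTuningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092"); // placeholder hosts
        props.put("acks", "1");
        // Per-partition batch size; with 1024 partitions, 1024 of these can be live at once.
        props.put("batch.size", "4096");
        // Extra time the producer will wait to fill a batch (default is 0).
        props.put("linger.ms", "0");
        // Memory pool shared by all in-flight batches across all partitions.
        props.put("buffer.memory", "67108864");
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        KafkaProducer<byte[], byte[]> producer = new KafkaProducer<byte[], byte[]>(props);
        producer.close();
    }
}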

Thanks!

On Sat, Dec 20, 2014 at 6:03 PM, Rajiv Kurian ra...@signalfuse.com wrote:

 I've done some more measurements. I've also started measuring the latency
 between when I ask my producer to send a message and when I get an
 acknowledgement via the callback. Here is my code:

 // This function is called on every producer once every 30 seconds.
 public void addLagMarkers(final Histogram enqueueLag) {
     final int numberOfPartitions = 1024;
     final long timeOfEnqueue = System.currentTimeMillis();
     final Callback callback = new Callback() {
         @Override
         public void onCompletion(RecordMetadata metadata, Exception ex) {
             if (metadata != null) {
                 // The difference between ack time from broker and enqueue time.
                 final long timeOfAck = System.currentTimeMillis();
                 final long lag = timeOfAck - timeOfEnqueue;
                 enqueueLag.update(lag);
             }
         }
     };

     for (int i = 0; i < numberOfPartitions; i++) {
         try {
             // 10 bytes - short version + long timestamp.
             byte[] value = LagMarker.serialize(timeOfEnqueue);
             // This message is later used by the consumers to measure lag.
             ProducerRecord record = new ProducerRecord(MY_TOPIC, i, null, value);
             kafkaProducer.send(record, callback);
         } catch (Exception e) {
             // We just dropped a lag marker.
         }
     }
 }

 The* 99th* on this lag is about* 350 - 400* ms. It's not stellar, but
 doesn't account for the *20-30 second 99th* I see on the end to end lag.
 I am consuming in a tight loop on the Consumers (using the SimpleConsumer)
 with minimal processing with a *99th fetch time *of *130-140* ms, so I
 don't think that should be a problem either. Completely baffled.


 Thanks!



 On Sat, Dec 20, 2014 at 5:51 PM, Rajiv Kurian ra...@signalfuse.com
 wrote:



 On Sat, Dec 20, 2014 at 3:49 PM, Rajiv Kurian ra...@signalfuse.com
 wrote:

 I am trying to replace a Thrift peer to peer API with kafka for a
 particular work flow. I am finding the 99th percentile latency to be
 unacceptable at this time. This entire work load runs in an Amazon VPC. I'd
 greatly 

Re: Kafka latency measures

2014-06-19 Thread Jay Kreps
There were actually several patches against trunk since 0.8.1.1 that may
impact latency, however, especially when using acks=-1. So those results in
the blog may be a bit better than what you would see in 0.8.1.1.

-Jay


On Wed, Jun 18, 2014 at 7:58 PM, Supun Kamburugamuva supu...@gmail.com
wrote:

 My machine configuration is not very high. The average one way latency we
 observe is around 10 ~ 15 ms for 50k messages. The outliers doesn't occur
 for small messages. For small messages we observe around 6 ms latency.

 Thanks,
 Supun..


 On Wed, Jun 18, 2014 at 10:18 PM, Neha Narkhede neha.narkh...@gmail.com
 wrote:

  what are the latency numbers you observed, avg as well as worst case?
 Here
  is a blog that we did recently which should reflect latest performance
  metrics for latency -
 
 
 http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
 
 
  On Wed, Jun 18, 2014 at 11:01 AM, Supun Kamburugamuva supu...@gmail.com
 
  wrote:
 
   I've found this performance test.
  
   http://blog.liveramp.com/2013/04/08/kafka-0-8-producer-performance-2/
  
   This performance test has mentioned about the same issue at the end.
  
   Thanks,
   Supun..
  
  
   On Wed, Jun 18, 2014 at 12:43 PM, Supun Kamburugamuva 
 supu...@gmail.com
  
   wrote:
  
The spikes happens without any correlation with the
log.flush.interval.message.
They happen more frequently.
   
I'm using the latest version. I'm sending the messages to Kafka, then
there is a message receiver, it sends the same messages back through
   kafka
to original sender. The round trip latency is measured.
   
Thanks,
Supun..
   
   
On Wed, Jun 18, 2014 at 12:02 PM, Neha Narkhede 
  neha.narkh...@gmail.com
   
wrote:
   
Which version of Kafka did you use? When you say latency, do you
 mean
   the
latency between the producer and consumer? If so, are you using a
timestamp
within the message to compute this latency?
   
   
On Wed, Jun 18, 2014 at 8:15 AM, Magnus Edenhill 
 mag...@edenhill.se
wrote:
   
 Hi,

 do these spikes happen to correlate with
 log.flush.interval.messages
   or
 log.flush.interval.ms?
 If so it's the file system sync blockage you are seeing.

 /Magnus


 2014-06-18 16:31 GMT+02:00 Supun Kamburugamuva supu...@gmail.com
 :

  Hi,
 
  We are trying to evaluate Kafka for a real time application. We
  are
 sending
  50 Kb messages at a fixed rate. The normal messages have a
   reasonable
  latency. But then there are these outliers that takes
  unpredictable
 amount
  of time. This causes the average latency to increase
 dramatically.
   We
are
  running with basically the default configuration. Any
 suggestions
   for
  improving the latency?
 
  Thanks in advance,
  Supun..
 
  --
  Supun Kamburugamuva
  Member, Apache Software Foundation; http://www.apache.org
  E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
  Blog: http://supunk.blogspot.com
 

   
   
   
   
--
Supun Kamburugamuva
Member, Apache Software Foundation; http://www.apache.org
E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
Blog: http://supunk.blogspot.com
   
   
  
  
   --
   Supun Kamburugamuva
   Member, Apache Software Foundation; http://www.apache.org
   E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
   Blog: http://supunk.blogspot.com
  
 



 --
 Supun Kamburugamuva
 Member, Apache Software Foundation; http://www.apache.org
 E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
 Blog: http://supunk.blogspot.com



Kafka latency measures

2014-06-18 Thread Supun Kamburugamuva
Hi,

We are trying to evaluate Kafka for a real-time application. We are sending
50 KB messages at a fixed rate. The normal messages have a reasonable
latency, but then there are these outliers that take an unpredictable amount
of time. This causes the average latency to increase dramatically. We are
running with basically the default configuration. Any suggestions for
improving the latency?

Thanks in advance,
Supun..

-- 
Supun Kamburugamuva
Member, Apache Software Foundation; http://www.apache.org
E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
Blog: http://supunk.blogspot.com


Re: Kafka latency measures

2014-06-18 Thread Magnus Edenhill
Hi,

do these spikes happen to correlate with log.flush.interval.messages or
log.flush.interval.ms?
If so it's the file system sync blockage you are seeing.
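
For reference, both settings live in the broker's server.properties. The values
below are placeholders to show the knobs, not recommendations:

# Flush the log to disk once this many messages have accumulated on a partition.
log.flush.interval.messages=10000
# Or flush once a message has been sitting unflushed for this long (in ms).
log.flush.interval.ms=1000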

/Magnus


2014-06-18 16:31 GMT+02:00 Supun Kamburugamuva supu...@gmail.com:

 Hi,

 We are trying to evaluate Kafka for a real time application. We are sending
 50 Kb messages at a fixed rate. The normal messages have a reasonable
 latency. But then there are these outliers that takes unpredictable amount
 of time. This causes the average latency to increase dramatically. We are
 running with basically the default configuration. Any suggestions for
 improving the latency?

 Thanks in advance,
 Supun..

 --
 Supun Kamburugamuva
 Member, Apache Software Foundation; http://www.apache.org
 E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
 Blog: http://supunk.blogspot.com



Re: Kafka latency measures

2014-06-18 Thread Neha Narkhede
Which version of Kafka did you use? When you say latency, do you mean the
latency between the producer and consumer? If so, are you using a timestamp
within the message to compute this latency?
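
For example, something along these lines is what I mean by a timestamp inside the
message (just a sketch - names made up, untested):

import java.nio.ByteBuffer;

public class LatencyStamp {
    // Producer side: prefix the payload with the wall-clock send time.
    public static byte[] stamp(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(8 + payload.length);
        buf.putLong(System.currentTimeMillis());
        buf.put(payload);
        return buf.array();
    }

    // Consumer side: recover the send time and compute the observed latency.
    // For one-way latency the two clocks must be synchronized; measuring a round
    // trip back to the original sender avoids that problem.
    public static long latencyMillis(byte[] stamped) {
        long sentAt = ByteBuffer.wrap(stamped).getLong();
        return System.currentTimeMillis() - sentAt;
    }
}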


On Wed, Jun 18, 2014 at 8:15 AM, Magnus Edenhill mag...@edenhill.se wrote:

 Hi,

 do these spikes happen to correlate with log.flush.interval.messages or
 log.flush.interval.ms?
 If so it's the file system sync blockage you are seeing.

 /Magnus


 2014-06-18 16:31 GMT+02:00 Supun Kamburugamuva supu...@gmail.com:

  Hi,
 
  We are trying to evaluate Kafka for a real time application. We are
 sending
  50 Kb messages at a fixed rate. The normal messages have a reasonable
  latency. But then there are these outliers that takes unpredictable
 amount
  of time. This causes the average latency to increase dramatically. We are
  running with basically the default configuration. Any suggestions for
  improving the latency?
 
  Thanks in advance,
  Supun..
 
  --
  Supun Kamburugamuva
  Member, Apache Software Foundation; http://www.apache.org
  E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
  Blog: http://supunk.blogspot.com
 



Re: Kafka latency measures

2014-06-18 Thread Supun Kamburugamuva
The spikes happen without any correlation with
log.flush.interval.messages; they happen more frequently.

I'm using the latest version. I'm sending the messages to Kafka, then there
is a message receiver that sends the same messages back through Kafka to the
original sender. The round-trip latency is measured.

Thanks,
Supun..


On Wed, Jun 18, 2014 at 12:02 PM, Neha Narkhede neha.narkh...@gmail.com
wrote:

 Which version of Kafka did you use? When you say latency, do you mean the
 latency between the producer and consumer? If so, are you using a timestamp
 within the message to compute this latency?


 On Wed, Jun 18, 2014 at 8:15 AM, Magnus Edenhill mag...@edenhill.se
 wrote:

  Hi,
 
  do these spikes happen to correlate with log.flush.interval.messages or
  log.flush.interval.ms?
  If so it's the file system sync blockage you are seeing.
 
  /Magnus
 
 
  2014-06-18 16:31 GMT+02:00 Supun Kamburugamuva supu...@gmail.com:
 
   Hi,
  
   We are trying to evaluate Kafka for a real time application. We are
  sending
   50 Kb messages at a fixed rate. The normal messages have a reasonable
   latency. But then there are these outliers that takes unpredictable
  amount
   of time. This causes the average latency to increase dramatically. We
 are
   running with basically the default configuration. Any suggestions for
   improving the latency?
  
   Thanks in advance,
   Supun..
  
   --
   Supun Kamburugamuva
   Member, Apache Software Foundation; http://www.apache.org
   E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
   Blog: http://supunk.blogspot.com
  
 




-- 
Supun Kamburugamuva
Member, Apache Software Foundation; http://www.apache.org
E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
Blog: http://supunk.blogspot.com


Re: Kafka latency measures

2014-06-18 Thread Neha Narkhede
what are the latency numbers you observed, avg as well as worst case? Here
is a blog that we did recently which should reflect latest performance
metrics for latency -
http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines


On Wed, Jun 18, 2014 at 11:01 AM, Supun Kamburugamuva supu...@gmail.com
wrote:

 I've found this performance test.

 http://blog.liveramp.com/2013/04/08/kafka-0-8-producer-performance-2/

 This performance test has mentioned about the same issue at the end.

 Thanks,
 Supun..


 On Wed, Jun 18, 2014 at 12:43 PM, Supun Kamburugamuva supu...@gmail.com
 wrote:

  The spikes happens without any correlation with the
  log.flush.interval.message.
  They happen more frequently.
 
  I'm using the latest version. I'm sending the messages to Kafka, then
  there is a message receiver, it sends the same messages back through
 kafka
  to original sender. The round trip latency is measured.
 
  Thanks,
  Supun..
 
 
  On Wed, Jun 18, 2014 at 12:02 PM, Neha Narkhede neha.narkh...@gmail.com
 
  wrote:
 
  Which version of Kafka did you use? When you say latency, do you mean
 the
  latency between the producer and consumer? If so, are you using a
  timestamp
  within the message to compute this latency?
 
 
  On Wed, Jun 18, 2014 at 8:15 AM, Magnus Edenhill mag...@edenhill.se
  wrote:
 
   Hi,
  
   do these spikes happen to correlate with log.flush.interval.messages
 or
   log.flush.interval.ms?
   If so it's the file system sync blockage you are seeing.
  
   /Magnus
  
  
   2014-06-18 16:31 GMT+02:00 Supun Kamburugamuva supu...@gmail.com:
  
Hi,
   
We are trying to evaluate Kafka for a real time application. We are
   sending
50 Kb messages at a fixed rate. The normal messages have a
 reasonable
latency. But then there are these outliers that takes unpredictable
   amount
of time. This causes the average latency to increase dramatically.
 We
  are
running with basically the default configuration. Any suggestions
 for
improving the latency?
   
Thanks in advance,
Supun..
   
--
Supun Kamburugamuva
Member, Apache Software Foundation; http://www.apache.org
E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
Blog: http://supunk.blogspot.com
   
  
 
 
 
 
  --
  Supun Kamburugamuva
  Member, Apache Software Foundation; http://www.apache.org
  E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
  Blog: http://supunk.blogspot.com
 
 


 --
 Supun Kamburugamuva
 Member, Apache Software Foundation; http://www.apache.org
 E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
 Blog: http://supunk.blogspot.com



Re: Kafka latency measures

2014-06-18 Thread Supun Kamburugamuva
My machine configuration is not very high. The average one-way latency we
observe is around 10 ~ 15 ms for 50 KB messages. The outliers don't occur
for small messages; for small messages we observe around 6 ms latency.

Thanks,
Supun..


On Wed, Jun 18, 2014 at 10:18 PM, Neha Narkhede neha.narkh...@gmail.com
wrote:

 what are the latency numbers you observed, avg as well as worst case? Here
 is a blog that we did recently which should reflect latest performance
 metrics for latency -

 http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines


 On Wed, Jun 18, 2014 at 11:01 AM, Supun Kamburugamuva supu...@gmail.com
 wrote:

  I've found this performance test.
 
  http://blog.liveramp.com/2013/04/08/kafka-0-8-producer-performance-2/
 
  This performance test has mentioned about the same issue at the end.
 
  Thanks,
  Supun..
 
 
  On Wed, Jun 18, 2014 at 12:43 PM, Supun Kamburugamuva supu...@gmail.com
 
  wrote:
 
   The spikes happens without any correlation with the
   log.flush.interval.message.
   They happen more frequently.
  
   I'm using the latest version. I'm sending the messages to Kafka, then
   there is a message receiver, it sends the same messages back through
  kafka
   to original sender. The round trip latency is measured.
  
   Thanks,
   Supun..
  
  
   On Wed, Jun 18, 2014 at 12:02 PM, Neha Narkhede 
 neha.narkh...@gmail.com
  
   wrote:
  
   Which version of Kafka did you use? When you say latency, do you mean
  the
   latency between the producer and consumer? If so, are you using a
   timestamp
   within the message to compute this latency?
  
  
   On Wed, Jun 18, 2014 at 8:15 AM, Magnus Edenhill mag...@edenhill.se
   wrote:
  
Hi,
   
do these spikes happen to correlate with log.flush.interval.messages
  or
log.flush.interval.ms?
If so it's the file system sync blockage you are seeing.
   
/Magnus
   
   
2014-06-18 16:31 GMT+02:00 Supun Kamburugamuva supu...@gmail.com:
   
 Hi,

 We are trying to evaluate Kafka for a real time application. We
 are
sending
 50 Kb messages at a fixed rate. The normal messages have a
  reasonable
 latency. But then there are these outliers that takes
 unpredictable
amount
 of time. This causes the average latency to increase dramatically.
  We
   are
 running with basically the default configuration. Any suggestions
  for
 improving the latency?

 Thanks in advance,
 Supun..

 --
 Supun Kamburugamuva
 Member, Apache Software Foundation; http://www.apache.org
 E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
 Blog: http://supunk.blogspot.com

   
  
  
  
  
   --
   Supun Kamburugamuva
   Member, Apache Software Foundation; http://www.apache.org
   E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
   Blog: http://supunk.blogspot.com
  
  
 
 
  --
  Supun Kamburugamuva
  Member, Apache Software Foundation; http://www.apache.org
  E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
  Blog: http://supunk.blogspot.com
 




-- 
Supun Kamburugamuva
Member, Apache Software Foundation; http://www.apache.org
E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
Blog: http://supunk.blogspot.com