Garry, do you mind sharing the source code you used for the profiling?
On Sun, May 17, 2015 at 4:59 PM, Garry Turkington <g.turking...@improvedigital.com> wrote:

> Hi Guozhang/Jay/Becket,
>
> Thanks for the responses.
>
> Regarding my point on performance dropping when the number of partitions
> was increased, that surprised me too, as on another cluster I had done
> just this to help with the issue of lots of ISR churn and it had been a
> straight win.
>
> I mentioned in my last mail that I had simplified the code that generates
> test messages, which greatly reduced the CPU load per thread. After doing
> this, the performance on the higher partition-count topic was consistent
> with the lower partition-count one and showed no degradation. So the
> sender threads were becoming CPU bound, possibly due to the additional
> locks involved with more partitions, but that needs validation.
>
> I've been running my clients with acks=1 and linger.ms floating between
> 0 and 1, because I want to convince myself that it makes a difference,
> but so far I've not really seen it. Similar to Jay's experience, I
> settled on 64K for batch.size because I just didn't see any benefit
> beyond that, and even the jump from 32K wasn't demonstrably beneficial.
> For this particular application I've already hit the needed performance
> (around 700K/sec at peak), but my workload can be quite a sawtooth,
> moving from peak to idle and back again. So peak becomes the new norm,
> and understanding the head-room in the current setup and how to grow
> beyond it is important.
>
> I've had a few more test boxes put on the same 10GB network as the
> cluster in question, so I'll re-visit this and do deeper profiling over
> the next week and will revert here with findings.
>
> Regards
> Garry
>
> -----Original Message-----
> From: Guozhang Wang [mailto:wangg...@gmail.com]
> Sent: 14 May 2015 18:57
> To: users@kafka.apache.org
> Subject: Re: Experiences testing new producer performance across
> multiple threads/producer counts
>
> Regarding the issue that adding more partitions kills performance: I
> would suspect it may be due to insufficient batching. Note that in the
> new producer batching is done per-partition, and if the linger.ms
> setting is low, partition data may not be batched enough before it gets
> sent to the brokers. Also, since the new producer drains all partitions
> that belong to the same broker when any one of them hits either the
> linger time or the batch size, having only one or a few brokers will
> further exacerbate the insufficient-batching issue. So monitoring the
> average batch size would be a good idea.
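> For example, you can poll the producer's own metrics from inside the
> client. A minimal sketch (the "batch-size-avg" metric name is what I
> expect the new producer to register; do verify it against your version):
>
>     import java.util.Map;
>     import org.apache.kafka.common.Metric;
>     import org.apache.kafka.common.MetricName;
>
>     // given a running KafkaProducer<byte[], byte[]> producer:
>     for (Map.Entry<MetricName, ? extends Metric> entry :
>             producer.metrics().entrySet()) {
>         // average number of bytes sent per partition per request
>         if (entry.getKey().name().equals("batch-size-avg"))
>             System.out.println("batch-size-avg = " + entry.getValue().value());
>     }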
> Guozhang
>
> On Wed, May 13, 2015 at 7:47 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
>
> > Hey Garry,
> >
> > Super interesting. We honestly never did a ton of performance tuning
> > on the producer. I checked the profiles early on in development and we
> > fixed a few issues that popped up in deployment, but I don't think
> > anyone has done a really scientific look. If you (or anyone else) want
> > to dive into things, I suspect it could be improved.
> >
> > Becket is exactly right. There are two possible bottlenecks you can
> > hit in the producer--the single background sender thread and the
> > per-partition lock. You can check utilization on the background thread
> > with jvisualvm (it's named something like
> > kafka-producer-network-thread). The locking is fairly hard to improve.
> >
> > It's a little surprising that adding partitions caused a large
> > decrease in performance. Generally this is only the case if you
> > override the flush settings on the broker to force fsync more
> > frequently.
> >
> > The ISR issues under heavy load are probably fixable; the issue is
> > discussed a bit here:
> >
> > http://blog.confluent.io/2015/04/07/hands-free-kafka-replication-a-lesson-in-operational-simplicity/
> >
> > The producer settings that may matter for performance are:
> >
> > acks
> > batch.size (though beyond 32k I didn't see much improvement)
> > linger.ms (setting >= 1 may help a bit)
> > send.buffer.bytes (maybe, but probably not)
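> > As a concrete starting point, that might look something like the
> > following (the values are just where I'd begin experimenting, and the
> > broker address is a placeholder):
> >
> >     Properties props = new Properties();
> >     props.put("bootstrap.servers", "broker1:9092"); // your brokers here
> >     props.put("acks", "1");
> >     props.put("batch.size", "32768");
> >     props.put("linger.ms", "1");
> >     props.put("key.serializer",
> >         "org.apache.kafka.common.serialization.ByteArraySerializer");
> >     props.put("value.serializer",
> >         "org.apache.kafka.common.serialization.ByteArraySerializer");
> >     KafkaProducer<byte[], byte[]> producer =
> >         new KafkaProducer<byte[], byte[]>(props);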
> > Cheers,
> >
> > -Jay
> >
> > On Wed, May 13, 2015 at 3:42 PM, Jiangjie Qin
> > <j...@linkedin.com.invalid> wrote:
> >
> > > Thanks for sharing this, Garry. I actually did similar tests before
> > > but unfortunately lost the test data because my laptop rebooted and
> > > I forgot to save the data...
> > >
> > > Anyway, several things to verify:
> > >
> > > 1. Remember KafkaProducer holds a lock per partition. So if you have
> > > only one partition in the target topic and many application threads,
> > > lock contention could be an issue.
> > >
> > > 2. It matters how frequently the sender thread wakes up and runs.
> > > You can take a look at the following sensors to further verify
> > > whether the sender thread has really become a bottleneck:
> > > select-rate
> > > io-wait-time-ns-avg
> > > io-time-ns-avg
> > >
> > > 3. Batch size matters, so take a look at the sensor batch-size-avg
> > > and see if the average batch size makes sense or not.
> > >
> > > Looking forward to your further profiling. My thinking is that
> > > unless you are sending very small messages to a small number of
> > > partitions, you don't need to worry about using more than one
> > > producer.
> > >
> > > Thanks.
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On 5/13/15, 2:40 PM, "Garry Turkington"
> > > <g.turking...@improvedigital.com> wrote:
> > >
> > > >Hi,
> > > >
> > > >I talked with Gwen at Strata last week and promised to share some of
> > > >my experiences benchmarking an app reliant on the new producer. I'm
> > > >using relatively meaty boxes to run my producer code (24 core/64GB
> > > >RAM), but I wasn't really pushing them until I got them on the same
> > > >10GB fabric as the Kafka cluster they are using (saturating the
> > > >prior 1GB NICs was just too easy). There are 5 brokers, each 24
> > > >core/192GB RAM/8*2TB disks, running 0.8.2.1.
> > > >
> > > >With lots of cores and a dedicated box, the question was how to
> > > >deploy my application--in particular, how many worker threads to run
> > > >and how many instances of the KafkaProducer to share amongst them. I
> > > >also wanted to see how things would change as I scaled up the thread
> > > >count.
> > > >
> > > >I ripped out the data retrieval part of my app (it reads from S3)
> > > >and replaced it with code that produces random records of average
> > > >size 500 bytes, varying between 250 and 750. I started the app,
> > > >ignored the first 25m messages, then measured the timing for the
> > > >next 100m and calculated the average messages/sec written to Kafka
> > > >across that run.
> > > >
> > > >Starting small, I created 4 application threads with a range of
> > > >approaches to sharing KafkaProducer instances. The records written
> > > >to the Kafka cluster per second were as follows:
> > > >
> > > >4 threads all sharing 1 client: 332K
> > > >4 threads sharing 2 clients: 333K
> > > >4 threads, dedicated client per thread: 310K
> > > >
> > > >Note that when I had 2 KafkaProducer clients, as in the second line
> > > >above, each was used by 2 threads. Similarly below, threads/clients
> > > >gives the max number of threads per KafkaProducer instance.
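> > > >To make the thread-to-client mapping concrete, the harness boils
> > > >down to something like the sketch below (simplified, with
> > > >illustrative names like ShareTest and test-topic--the real code
> > > >does more):
> > > >
> > > >    import java.util.*;
> > > >    import org.apache.kafka.clients.producer.*;
> > > >
> > > >    public class ShareTest {
> > > >        static volatile boolean running = true;
> > > >
> > > >        static void start(Properties props, int numThreads, int numClients) {
> > > >            List<KafkaProducer<byte[], byte[]>> clients =
> > > >                new ArrayList<KafkaProducer<byte[], byte[]>>();
> > > >            for (int i = 0; i < numClients; i++)
> > > >                clients.add(new KafkaProducer<byte[], byte[]>(props));
> > > >            for (int t = 0; t < numThreads; t++) {
> > > >                // round-robin: thread t shares producer t % numClients
> > > >                final KafkaProducer<byte[], byte[]> producer =
> > > >                    clients.get(t % numClients);
> > > >                new Thread(new Runnable() {
> > > >                    public void run() {
> > > >                        Random rnd = new Random();
> > > >                        while (running) {
> > > >                            byte[] key = new byte[8];
> > > >                            rnd.nextBytes(key);
> > > >                            // payload of 250-750 bytes, average ~500
> > > >                            byte[] value = new byte[250 + rnd.nextInt(501)];
> > > >                            rnd.nextBytes(value);
> > > >                            producer.send(new ProducerRecord<byte[], byte[]>(
> > > >                                "test-topic", key, value));
> > > >                        }
> > > >                    }
> > > >                }).start();
> > > >            }
> > > >        }
> > > >    }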
> > > >As can be seen from the above, there's not much in it. Scaling up
> > > >to 8 application threads, the numbers looked like:
> > > >
> > > >8 threads sharing 1 client: 387K
> > > >8 threads sharing 2 clients: 609K
> > > >8 threads sharing 4 clients: 628K
> > > >8 threads with dedicated client per thread: 527K
> > > >
> > > >This time sharing a single producer client across all threads has by
> > > >far the worst performance and isn't much better than with 4
> > > >application threads. The 2- and 4-client options are much better and
> > > >are in the ballpark of 2x the 4-thread performance. A dedicated
> > > >client per thread isn't quite as good, but isn't so far off as to be
> > > >unusable. So then taking it to 16 application threads:
> > > >
> > > >16 threads sharing 1 client: 380K
> > > >16 threads sharing 2 clients: 675K
> > > >16 threads sharing 4 clients: 869K
> > > >16 threads sharing 8 clients: 733K
> > > >16 threads with a dedicated client per thread: 491K
> > > >
> > > >This gives a much clearer performance curve. The 16-thread/4-client
> > > >configuration is by far the best, but it is still well short of 4x
> > > >the 4-thread or 2x the 8-thread mark. At this point I seem to be
> > > >hitting some limiting factor. On the client machine memory was still
> > > >lightly used and the network was peaking just over 4GB/sec, but the
> > > >1-minute CPU load averages were around 18-20. CPU load did seem to
> > > >increase with the number of KafkaProducer instances, but that is
> > > >more a conclusion from memory than one backed by hard numbers.
> > > >
> > > >For completeness' sake I did do a 24-thread test, but the numbers
> > > >are as you'd expect: 1 client and 24 clients both showed poor
> > > >performance; 4, 6 or 8 clients (24 has more ways of dividing by 2!)
> > > >all showed performance around that of the 16-thread/4-client run
> > > >above; the other configs were in between.
> > > >
> > > >With my full application I've found the best deployment so far is to
> > > >have multiple instances running on the same box. I can get much
> > > >better performance from 3 instances each with 8 threads than from 1
> > > >instance with 24 threads. This is almost certainly because, once my
> > > >own application logic and the AWS clients are added in, there is
> > > >just a lot more contention--not to mention many more I/O waits--in
> > > >each application instance. The benchmark variant doesn't have as
> > > >much happening, but just to compare I ran a few concurrent
> > > >instances:
> > > >
> > > >2 copies of 8 threads sharing 4 clients: 780K total
> > > >2 copies of 8 threads sharing 2 clients: 870K total
> > > >3 copies of 8 threads sharing 2 clients: 945K total
> > > >
> > > >So bottom line: around 900K/sec is the max I can get from one of
> > > >these hosts for my application. At that point I brought a 2nd host
> > > >to bear and ran 2 concurrent instances of the best-performing config
> > > >on each:
> > > >
> > > >2 copies of 16 threads sharing 4 clients on 2 hosts: 1458K total
> > > >
> > > >This doesn't quite give 2x the single-box performance, but it does
> > > >show that the cluster has capacity to spare beyond what a single
> > > >client host can drive. This was also backed up by the metrics on the
> > > >brokers: they got busy, but only moderately so given the amount of
> > > >work they were doing.
> > > >
> > > >At this point things did get a bit 'ragged edge'. I noticed a very
> > > >high rate of ISR churn on the brokers; it looked like the replicas
> > > >were having trouble keeping up with the master, and hosts were
> > > >constantly being dropped from and then re-added to the ISR. I had
> > > >set the test topic to have a relatively low partition count (1 per
> > > >spindle), so I doubled that to see if it would help the ISRs remain
> > > >stable. And my performance fell through the floor. So whereas I
> > > >thought this was an equation involving application threads and
> > > >producer instances, perhaps partition count is a third variable. I
> > > >need to look into that some more, but so far it looks like, for my
> > > >application--I'm not suggesting this is a universal truth--sharing a
> > > >KafkaProducer instance amongst around 4 threads is the sweet spot.
> > > >
> > > >I'll be doing further profiling of my application, so I'll flag to
> > > >the list anything that appears to be within the producer itself. And
> > > >because 900K messages/sec was so close to a significant number, I
> > > >modified the message-generation code to keep the key random for each
> > > >message but to reuse message bodies across multiple messages. At
> > > >which point 1.05m messages/sec was possible--from a single box.
> > > >Nice. :)
> > > >
> > > >This turned out much longer than planned; I probably should have
> > > >blogged it somewhere. If anyone reads this far, I hope it is useful
> > > >or of interest. I'd be interested in hearing whether the profiles
> > > >I'm seeing are expected and whether any other tests would be useful.
> > > >
> > > >Regards
> > > >Garry
> > > >
> > >
> >
> --
> -- Guozhang

--
Regards,
Tao