Garry, do you mind sharing the source code you used for the profiling?
On Sun, May 17, 2015 at 4:59 PM, Garry Turkington <g.turking...@improvedigital.com> wrote:

> Hi Guozhang/Jay/Becket,
>
> Thanks for the responses.
>
> Regarding my point on performance dropping when the number of partitions
> was increased, that surprised me too, as on another cluster I had done
> just this to help with the issue of lots of ISR churn and it had been a
> straight win.
>
> I mentioned in my last mail that I had simplified the code that generates
> test messages, which greatly reduced the CPU load per thread. After doing
> this, the performance on the higher partition-count topic was consistent
> with the lower partition-count one and showed no degradation. So the
> sender threads were becoming CPU bound, possibly due to the additional
> locks involved with more partitions, but that needs validation.
>
> I've been running my clients with acks=1 and linger.ms floating between
> 0 and 1, because I want to convince myself that it makes a difference,
> but so far I've not really seen it. Similar to Jay's experience, I
> settled on 64K for batch.size because I just didn't see any benefit
> beyond that, and even the jump from 32K wasn't demonstrably beneficial.
> For this particular application I've already hit the needed performance
> (around 700K/sec at peak), but my workload can be quite a sawtooth,
> moving from peak to idle and back again. So peak becomes the new norm,
> and understanding the head-room in the current setup and how to grow
> beyond it is important.
>
> I've had a few more test boxes put on the same 10GB network as the
> cluster in question, so I'll re-visit this and do deeper profiling over
> the next week and will revert here with findings.
>
> Regards
> Garry
>
> -----Original Message-----
> From: Guozhang Wang [mailto:wangg...@gmail.com]
> Sent: 14 May 2015 18:57
> To: users@kafka.apache.org
> Subject: Re: Experiences testing new producer performance across
> multiple threads/producer counts
>
> Regarding the issue that adding more partitions kills performance: I
> would suspect it may be due to insufficient batching. Note that in the
> new producer batching is done per-partition, and if the linger.ms
> setting is low, partition data may not be batched enough before it gets
> sent to the brokers. Also, since the new producer drains all partitions
> that belong to the same broker when any one of them hits either the
> linger time or the batch size, having only one or a few brokers will
> further exacerbate the insufficient-batching issue. So monitoring the
> average batch size would be a good idea.
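> For example, you can poll the producer's own metrics from inside the
> client. A minimal sketch (the "batch-size-avg" metric name is what I
> expect the new producer to register; do verify it against your version):
>
>     import java.util.Map;
>     import org.apache.kafka.common.Metric;
>     import org.apache.kafka.common.MetricName;
>
>     // given a running KafkaProducer<byte[], byte[]> producer:
>     for (Map.Entry<MetricName, ? extends Metric> entry :
>             producer.metrics().entrySet()) {
>         // average number of bytes sent per partition per request
>         if (entry.getKey().name().equals("batch-size-avg"))
>             System.out.println("batch-size-avg = " + entry.getValue().value());
>     }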
> Guozhang
>
> On Wed, May 13, 2015 at 7:47 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
>
> > Hey Garry,
> >
> > Super interesting. We honestly never did a ton of performance tuning
> > on the producer. I checked the profiles early on in development and we
> > fixed a few issues that popped up in deployment, but I don't think
> > anyone has done a really scientific look. If you (or anyone else) want
> > to dive into things, I suspect it could be improved.
> >
> > Becket is exactly right. There are two possible bottlenecks you can
> > hit in the producer--the single background sender thread and the
> > per-partition lock. You can check utilization on the background thread
> > with jvisualvm (it's named something like
> > kafka-producer-network-thread). The locking is fairly hard to improve.
> >
> > It's a little surprising that adding partitions caused a large
> > decrease in performance. Generally this is only the case if you
> > override the flush settings on the broker to force fsync more
> > frequently.
> >
> > The ISR issues under heavy load are probably fixable; the issue is
> > discussed a bit here:
> >
> > http://blog.confluent.io/2015/04/07/hands-free-kafka-replication-a-lesson-in-operational-simplicity/
> >
> > The producer settings that may matter for performance are:
> >
> > acks
> > batch.size (though beyond 32k I didn't see much improvement)
> > linger.ms (setting >= 1 may help a bit)
> > send.buffer.bytes (maybe, but probably not)
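> > As a concrete starting point, that might look something like the
> > following (the values are just where I'd begin experimenting, and the
> > broker address is a placeholder):
> >
> >     Properties props = new Properties();
> >     props.put("bootstrap.servers", "broker1:9092"); // your brokers here
> >     props.put("acks", "1");
> >     props.put("batch.size", "32768");
> >     props.put("linger.ms", "1");
> >     props.put("key.serializer",
> >         "org.apache.kafka.common.serialization.ByteArraySerializer");
> >     props.put("value.serializer",
> >         "org.apache.kafka.common.serialization.ByteArraySerializer");
> >     KafkaProducer<byte[], byte[]> producer =
> >         new KafkaProducer<byte[], byte[]>(props);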
> > Cheers,
> >
> > -Jay
> >
> > On Wed, May 13, 2015 at 3:42 PM, Jiangjie Qin
> > <j...@linkedin.com.invalid> wrote:
> >
> > > Thanks for sharing this, Garry. I actually did similar tests before
> > > but unfortunately lost the test data because my laptop rebooted and
> > > I forgot to save the data...
> > >
> > > Anyway, several things to verify:
> > >
> > > 1. Remember KafkaProducer holds a lock per partition. So if you have
> > > only one partition in the target topic and many application threads,
> > > lock contention could be an issue.
> > >
> > > 2. It matters how frequently the sender thread wakes up and runs.
> > > You can take a look at the following sensors to further verify
> > > whether the sender thread has really become a bottleneck:
> > > select-rate
> > > io-wait-time-ns-avg
> > > io-time-ns-avg
> > >
> > > 3. Batch size matters, so take a look at the sensor batch-size-avg
> > > and see if the average batch size makes sense or not.
> > >
> > > Looking forward to your further profiling. My thinking is that
> > > unless you are sending very small messages to a small number of
> > > partitions, you don't need to worry about using more than one
> > > producer.
> > >
> > > Thanks.
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On 5/13/15, 2:40 PM, "Garry Turkington"
> > > <g.turking...@improvedigital.com> wrote:
> > >
> > > >Hi,
> > > >
> > > >I talked with Gwen at Strata last week and promised to share some of
> > > >my experiences benchmarking an app reliant on the new producer. I'm
> > > >using relatively meaty boxes to run my producer code (24 core/64GB
> > > >RAM), but I wasn't really pushing them until I got them on the same
> > > >10GB fabric as the Kafka cluster they are using (saturating the
> > > >prior 1GB NICs was just too easy). There are 5 brokers, each 24
> > > >core/192GB RAM/8*2TB disks, running 0.8.2.1.
> > > >
> > > >With lots of cores and a dedicated box, the question was how to
> > > >deploy my application--in particular, how many worker threads to run
> > > >and how many instances of the KafkaProducer to share amongst them. I
> > > >also wanted to see how things would change as I scaled up the thread
> > > >count.
> > > >
> > > >I ripped out the data retrieval part of my app (it reads from S3)
> > > >and replaced it with code that produces random records of average
> > > >size 500 bytes, varying between 250 and 750. I started the app,
> > > >ignored the first 25m messages, then measured the timing for the
> > > >next 100m and calculated the average messages/sec written to Kafka
> > > >across that run.
> > > >
> > > >Starting small, I created 4 application threads with a range of
> > > >approaches to sharing KafkaProducer instances. The records written
> > > >to the Kafka cluster per second were as follows:
> > > >
> > > >4 threads all sharing 1 client: 332K
> > > >4 threads sharing 2 clients: 333K
> > > >4 threads, dedicated client per thread: 310K
> > > >
> > > >Note that when I had 2 KafkaProducer clients, as in the second line
> > > >above, each was used by 2 threads. Similarly below, threads/clients
> > > >gives the max number of threads per KafkaProducer instance.
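> > > >To make the thread-to-client mapping concrete, the harness boils
> > > >down to something like the sketch below (simplified, with
> > > >illustrative names like ShareTest and test-topic--the real code
> > > >does more):
> > > >
> > > >    import java.util.*;
> > > >    import org.apache.kafka.clients.producer.*;
> > > >
> > > >    public class ShareTest {
> > > >        static volatile boolean running = true;
> > > >
> > > >        static void start(Properties props, int numThreads, int numClients) {
> > > >            List<KafkaProducer<byte[], byte[]>> clients =
> > > >                new ArrayList<KafkaProducer<byte[], byte[]>>();
> > > >            for (int i = 0; i < numClients; i++)
> > > >                clients.add(new KafkaProducer<byte[], byte[]>(props));
> > > >            for (int t = 0; t < numThreads; t++) {
> > > >                // round-robin: thread t shares producer t % numClients
> > > >                final KafkaProducer<byte[], byte[]> producer =
> > > >                    clients.get(t % numClients);
> > > >                new Thread(new Runnable() {
> > > >                    public void run() {
> > > >                        Random rnd = new Random();
> > > >                        while (running) {
> > > >                            byte[] key = new byte[8];
> > > >                            rnd.nextBytes(key);
> > > >                            // payload of 250-750 bytes, average ~500
> > > >                            byte[] value = new byte[250 + rnd.nextInt(501)];
> > > >                            rnd.nextBytes(value);
> > > >                            producer.send(new ProducerRecord<byte[], byte[]>(
> > > >                                "test-topic", key, value));
> > > >                        }
> > > >                    }
> > > >                }).start();
> > > >            }
> > > >        }
> > > >    }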
> > > >As can be seen from the above, there's not much in it. Scaling up
> > > >to 8 application threads, the numbers looked like:
> > > >
> > > >8 threads sharing 1 client: 387K
> > > >8 threads sharing 2 clients: 609K
> > > >8 threads sharing 4 clients: 628K
> > > >8 threads with dedicated client per thread: 527K
> > > >
> > > >This time sharing a single producer client across all threads has by
> > > >far the worst performance and isn't much better than with 4
> > > >application threads. The 2- and 4-client options are much better and
> > > >are in the ballpark of 2x the 4-thread performance. A dedicated
> > > >client per thread isn't quite as good, but isn't so far off as to be
> > > >unusable. So then taking it to 16 application threads:
> > > >
> > > >16 threads sharing 1 client: 380K
> > > >16 threads sharing 2 clients: 675K
> > > >16 threads sharing 4 clients: 869K
> > > >16 threads sharing 8 clients: 733K
> > > >16 threads with a dedicated client per thread: 491K
> > > >
> > > >This gives a much clearer performance curve. The 16-thread/4-client
> > > >configuration is by far the best, but it is still well short of 4x
> > > >the 4-thread or 2x the 8-thread mark. At this point I seem to be
> > > >hitting some limiting factor. On the client machine memory was still
> > > >lightly used and the network was peaking just over 4GB/sec, but the
> > > >1-minute CPU load averages were around 18-20. CPU load did seem to
> > > >increase with the number of KafkaProducer instances, but that is
> > > >more a conclusion from memory than one backed by hard numbers.
> > > >
> > > >For completeness' sake I did do a 24-thread test, but the numbers
> > > >are as you'd expect: 1 client and 24 clients both showed poor
> > > >performance; 4, 6 or 8 clients (24 has more ways of dividing by 2!)
> > > >all showed performance around that of the 16-thread/4-client run
> > > >above; the other configs were in between.
> > > >
> > > >With my full application I've found the best deployment so far is to
> > > >have multiple instances running on the same box. I can get much
> > > >better performance from 3 instances each with 8 threads than from 1
> > > >instance with 24 threads. This is almost certainly because, once my
> > > >own application logic and the AWS clients are added in, there is
> > > >just a lot more contention--not to mention many more I/O waits--in
> > > >each application instance. The benchmark variant doesn't have as
> > > >much happening, but just to compare I ran a few concurrent
> > > >instances:
> > > >
> > > >2 copies of 8 threads sharing 4 clients: 780K total
> > > >2 copies of 8 threads sharing 2 clients: 870K total
> > > >3 copies of 8 threads sharing 2 clients: 945K total
> > > >
> > > >So bottom line: around 900K/sec is the max I can get from one of
> > > >these hosts for my application. At that point I brought a 2nd host
> > > >to bear and ran 2 concurrent instances of the best-performing config
> > > >on each:
> > > >
> > > >2 copies of 16 threads sharing 4 clients on 2 hosts: 1458K total
> > > >
> > > >This doesn't quite give 2x the single-box performance, but it does
> > > >show that the cluster has capacity to spare beyond what a single
> > > >client host can drive. This was also backed up by the metrics on the
> > > >brokers: they got busy, but only moderately so given the amount of
> > > >work they were doing.
> > > >
> > > >At this point things did get a bit 'ragged edge'. I noticed a very
> > > >high rate of ISR churn on the brokers; it looked like the replicas
> > > >were having trouble keeping up with the master, and hosts were
> > > >constantly being dropped from and then re-added to the ISR. I had
> > > >set the test topic to have a relatively low partition count (1 per
> > > >spindle), so I doubled that to see if it would help the ISRs remain
> > > >stable. And my performance fell through the floor. So whereas I
> > > >thought this was an equation involving application threads and
> > > >producer instances, perhaps partition count is a third variable. I
> > > >need to look into that some more, but so far it looks like, for my
> > > >application--I'm not suggesting this is a universal truth--sharing a
> > > >KafkaProducer instance amongst around 4 threads is the sweet spot.
> > > >
> > > >I'll be doing further profiling of my application, so I'll flag to
> > > >the list anything that appears to be within the producer itself. And
> > > >because 900K messages/sec was so close to a significant number, I
> > > >modified the message-generation code to keep the key random for each
> > > >message but to reuse message bodies across multiple messages. At
> > > >which point 1.05m messages/sec was possible--from a single box.
> > > >Nice. :)
> > > >
> > > >This turned out much longer than planned; I probably should have
> > > >blogged it somewhere. If anyone reads this far, I hope it is useful
> > > >or of interest. I'd be interested in hearing whether the profiles
> > > >I'm seeing are expected and whether any other tests would be useful.
> > > >
> > > >Regards
> > > >Garry
> > > >
> > >
> >
> --
> -- Guozhang

--
Regards,
Tao