Good point Jun Rao. We've been trying to get things to scale with normal mode first and haven't tried failure scenarios yet.
Thanks for the pointer to KIP-227! It looks promising indeed. I'm working on sprucing up my reproduction environment and tests and hopefully will have more info to share soon. On Wed, Jan 3, 2018 at 4:47 PM, Jun Rao <j...@confluent.io> wrote: > Hi, Andrey, > > If the test is in the normal mode, it would be useful to figure out why ZK > is the bottleneck since the normal mode typically doesn't require ZK > accesses. > > Thanks, > > Jun > > On Wed, Jan 3, 2018 at 3:00 PM, Andrey Falko <afa...@salesforce.com> wrote: > >> Ben Wood: >> 1. We have 5 ZK nodes. >> 2. I only tracked outstanding requests thus far from ZK-side of >> things. At 9.5k topics, I recorded about 5k outstanding requests. I'll >> start tracking this better for my next run. Anything else worth >> tracking? >> >> Jun Rao: >> I'm testing the latest 1.0.0. I'm testing normal mode. I don't take >> anything down. I had a run where I tried to scale the brokers up, but >> it didn't improve things. Thanks for pointing me at the KIP! >> >> On Wed, Jan 3, 2018 at 2:50 PM, Jun Rao <j...@confluent.io> wrote: >> > Hi, Andrey, >> > >> > Thanks for reporting the results. Which version of Kafka are you testing? >> > Also, it would be useful to know if you are testing the normal mode when >> > all replicas are up and in sync, or the failure mode when some of the >> > replicas are being restarted. Typically, ZK is only accessed in the >> failure >> > mode. >> > >> > We have made some significant improvement in the failure mode by reducing >> > the logging overhead (KAFKA-6116) and making the ZK accesses async >> > (KAFKA-5642). These won't necessarily reduce the number of requests to >> ZK, >> > but will allow better pipelining when accessing ZK. >> > >> > In the normal mode, we are now discussing KIP-227, which could reduce the >> > overhead for replication and consumption when there are many partitions. >> > >> > Jun >> > >> > On Wed, Jan 3, 2018 at 1:48 PM, Andrey Falko <afa...@salesforce.com> >> wrote: >> > >> >> Hi everyone, >> >> >> >> We are seeing more and more push from our Kafka users to support well >> >> more than 10k replicated partitions. We'd ideally like to avoid running >> >> multiple >> >> clusters to keep our cluster management and monitoring simple. We >> started >> >> testing kafka to see how many replicated partitions it could handle. >> >> >> >> We found that, to maintain SLAs of under 50ms for produce latency, >> >> Kafka starts going downhill at around 9k topics with 5 brokers. Each >> topic >> >> is >> >> replicated 3x in our test. The bottleneck appears to be zookeeper: >> >> after a certain >> >> period of time, the number of outstanding requests in ZK spikes up at a >> >> linear rate. Slowing down the rate at which we create and produce to >> >> topics, >> >> improves things, but doing that makes the system tougher to manage and >> use. >> >> We are happy to publish our detailed results with reproduction >> >> steps if anyone is interested. >> >> >> >> Has anyone overcome this problem and scaled beyond 9k replicated >> >> partitions? >> >> Does anyone have zookeeper tuning suggestions? Is it even the >> bottleneck? >> >> >> >> According to this we should have at most 300 3x replicated per broker: >> >> https://www.confluent.io/blog/how-to-choose-the-number-of- >> >> topicspartitions-in-a-kafka-cluster/ >> >> Is anyone doing work to have kafka support more than that? >> >> >> >> Best regards, >> >> Andrey Falko >> >> Salesforce.com >> >> >>