Luckily, I was just reviewing a lot of this information for my ApacheCon talk next week. Those slides, and (I hope) the video, will be published as soon as the talk is done. I'll give you the information I have from LinkedIn's point of view, but out of order :)
Our Kafka brokers are all the same model. We use a system with 12 CPU cores, currently 2.6 GHz, with hyperthreading enabled. They have 64 GB of memory and dual 1 Gb network interfaces that are bonded, but operating active/passive. The systems have sixteen 1 TB SAS drives in them: 2 are configured as RAID-1 for the OS, and the other 14 are configured as RAID-10 specifically for the Kafka log segments. This gives us a little under 7 TB of usable space for message retention per broker.

On layout, we try to follow a few rules with varying consistency (we're getting more strict over time):
- We do not colocate other applications with Kafka. It gets the entire system to itself.
- ZooKeeper runs on 5 separate servers (also not colocated with other applications). Those servers have the same CPU, memory, and network spec, but they do not have all the disks. They do have 550 GB SSD drives which are dedicated to the ZK transaction logs.
- We try not to have more than 1 Kafka broker in a cluster in the same rack. This is to minimize the kinds of failures that can take partitions offline.
- All Kafka producers and consumers are local to the datacenter that the cluster is in. We use MirrorMaker and aggregate clusters to copy messages between datacenters.

Our smallest cluster is currently 3 brokers, and our largest is 42. It largely depends on how much retention we need and how much traffic that cluster is getting. Our clusters are separated out by the general type of traffic: queuing, tracking, metrics, and logging. Queuing clusters are generally the smallest, while metrics clusters are the largest (with tracking close behind).

We expand clusters based on the following loose rules:
- Disk usage on the log segments partition should stay under 60% (we have a default 4-day retention)
- Network usage on each broker should stay under 75%
- Partition count (leader and follower combined) on each broker should stay under 4000

As far as topic volume goes, it varies widely.
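To make the expansion rules above concrete, here is a minimal sketch of how you might check a broker against them. The thresholds (60% disk, 75% network, 4000 partitions) come from this post; the function name and the idea of feeding it values from your monitoring system are my own illustration, not anything LinkedIn actually runs.

```python
# Hypothetical check of one broker against the "loose rules" described above.
# The three metric values would come from your own monitoring system.

def expansion_reasons(disk_used_pct, network_used_pct, partition_count):
    """Return the list of loose rules this broker is violating."""
    reasons = []
    if disk_used_pct >= 60:
        reasons.append("disk usage on log segments partition at or above 60%")
    if network_used_pct >= 75:
        reasons.append("network usage at or above 75%")
    if partition_count >= 4000:
        reasons.append("partition count (leader + follower) at or above 4000")
    return reasons

# Example: a broker at 65% disk, 40% network, 3200 partitions
# violates only the disk rule, so the cluster is a candidate for expansion.
print(expansion_reasons(65, 40, 3200))
```

If any broker in the cluster returns a non-empty list, the cluster is a candidate for expansion (or for rebalancing partitions across the existing brokers).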
We have topics that only see a single message per minute (or less). Our largest topic by bytes has a peak rate of about 290 Mbits/sec, and our largest topic by messages has a peak rate of about 225k messages/sec. Note that those are in the same cluster.

When we are sizing topics (number of partitions), we use the following guidelines:
- Have at least as many partitions as there are consumers in the largest consumer group
- Keep partition size on disk under 50 GB per partition (for better balance)
- Take into account any other application requirements (keyed messages, specific topic counts required, etc.)

I hope this helps. I'll be covering some of this at my ApacheCon talk (Kafka at Scale: Multi-Tier Architectures) and at the meetup that Jun has set up at ApacheCon. If you have any questions, just ask!

-Todd

On Mon, Apr 6, 2015 at 9:35 AM, Rama Ramani <rama.ram...@live.com> wrote:
> Hello,
>      I am trying to understand some of the common Kafka deployment sizes
> ("small", "medium", "large") and configuration to come up with a set of
> common templates for deployment on Linux. Some of the Qs to answer are:
>
> - Number of nodes in the cluster
> - Machine specs (CPU, memory, number of disks, network, etc.)
> - Speeds and feeds of messages
> - What are some of the best practices to consider when laying out the
> clusters?
> - Is there a sizing calculator for coming up with this?
>
> If you can please share pointers to existing materials or specific details
> of your deployment, that will be great.
>
> Regards
> Rama