To add a little more context to Shaun's question: we have around 400 customers. Each customer has a stream of events. Some customers generate a lot of data while others don't. We need to ensure that each customer's data is sorted globally by timestamp.

We have two use cases around consumption:

1. A user may consume an individual customer's data.
2. A user may consume data for all customers.

Given these two use cases, I think the better strategy is to have a separate topic per customer, as Todd suggested.
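On the producer side, I'm picturing something like the sketch below. This is illustrative only: the "customer-<id>" topic naming, the broker address, and the event payload are all made up, and each topic would be created with a single partition so the broker's append order preserves per-customer ordering.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PerCustomerProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker list
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            Producer<String, String> producer = new KafkaProducer<>(props);

            // Hypothetical event; in reality this comes off our ingest pipeline.
            String customerId = "42";
            String eventJson = "{\"timestamp\":1443625560,\"payload\":\"...\"}";

            // One single-partition topic per customer: everything for a customer
            // lands in the same partition, so consumers see it in send order.
            producer.send(new ProducerRecord<>("customer-" + customerId,
                                               customerId, eventJson));
            producer.close();
        }
    }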
On Wed, Sep 30, 2015 at 9:26 AM, Todd Palino <tpal...@gmail.com> wrote:

> So I disagree with the idea to use custom partitioning, depending on your requirements. Having a consumer consume from a single partition is not (currently) that easy. If you don't care which consumer gets which partition (group), then it's not that bad. You have 20 partitions, you have 20 consumers, and you use custom partitioning as noted. The consumers use the high-level consumer with a single group, each one will get one partition, and it's pretty straightforward. If a consumer crashes, you will end up with two partitions on one of the remaining consumers. If this is OK, this is a decent solution.
>
> If, however, you require that each consumer always have the same group of data, and you need to know what that group is beforehand, it's more difficult. You need to use the simple consumer to do it, which means you need to implement a lot of logic for error and status code handling yourself, and do it right. In this case, I think your idea of using 400 separate topics is sound. This way you can still use the high-level consumer, which takes care of the error handling for you, and your data is separated out by topic.
>
> Provided it is not an issue to implement it in your producer, I would go with the separate topics. Alternatively, if you're not sure you always want separate topics, you could go with something similar to your second idea, but have a consumer read the single topic and split the data out into 400 separate topics in Kafka (no need for Cassandra or Redis or anything else). Then your real consumers can all consume their separate topics. Reading and writing the data one extra time is much better than rereading all of it 400 times and throwing most of it away.
>
> -Todd
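Todd's fan-out alternative is worth sketching out too. Something like the following would re-partition the firehose into per-customer topics. Again a rough sketch only: the aggregate topic name "all-events" is made up, and the convention that the message key carries the customer id is our own assumption, not anything Kafka enforces.

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class FanOut {
        public static void main(String[] args) {
            // High-level consumer over the aggregate topic.
            Properties cProps = new Properties();
            cProps.put("zookeeper.connect", "localhost:2181"); // placeholder
            cProps.put("group.id", "fan-out");
            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(cProps));

            // Producer for the per-customer topics.
            Properties pProps = new Properties();
            pProps.put("bootstrap.servers", "localhost:9092"); // placeholder
            pProps.put("key.serializer",
                       "org.apache.kafka.common.serialization.ByteArraySerializer");
            pProps.put("value.serializer",
                       "org.apache.kafka.common.serialization.ByteArraySerializer");
            Producer<byte[], byte[]> producer = new KafkaProducer<>(pProps);

            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("all-events", 1));
            ConsumerIterator<byte[], byte[]> it =
                streams.get("all-events").get(0).iterator();

            while (it.hasNext()) {
                MessageAndMetadata<byte[], byte[]> msg = it.next();
                // Assumes the key is the customer id (our convention).
                String customerId = new String(msg.key());
                producer.send(new ProducerRecord<>("customer-" + customerId,
                                                   msg.key(), msg.message()));
            }
        }
    }

As Todd says, this reads and writes the full stream exactly once more, rather than every API-server consumer rereading the whole firehose.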
> On Wed, Sep 30, 2015 at 9:06 AM, Ben Stopford <b...@confluent.io> wrote:
>
> > Hi Shaun
> >
> > You might consider using a custom partition assignment strategy to push your different "groups" to different partitions. This would allow you to walk the middle ground between "all consumers consume everything" and "one topic per consumer" as you vary the number of partitions in the topic, albeit at the cost of a little extra complexity.
> >
> > Also, not sure if you've seen it, but there is quite a good section in the FAQ here <https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowmanytopicscanIhave?> on topic and partition sizing.
> >
> > B
> >
> > On 29 Sep 2015, at 18:48, Shaun Senecal <shaun.sene...@lithium.com> wrote:
> >
> > > Hi
> > >
> > > I have read Jay Kreps' post regarding the number of topics that can be handled by a broker (https://www.quora.com/How-many-topics-can-be-created-in-Apache-Kafka), and it has left me with more questions that I don't see answered anywhere else.
> > >
> > > We have a data stream which will be consumed by many consumers (~400). We also have many "groups" within our data. A group in the data corresponds 1:1 with what the consumers would consume, so consumer A only ever sees group A messages, consumer B only consumes group B messages, etc.
> > >
> > > The downstream consumers will be consuming via a websocket API, so the API server will be the thing consuming from Kafka.
> > >
> > > If I use a single topic with, say, 20 partitions, the consumers in the API server would need to re-read the same messages over and over for each consumer, which seems like a waste of network and a potential bottleneck.
> > >
> > > Alternatively, I could use a single topic with 20 partitions and have a single consumer in the API put the messages into Cassandra/Redis (as suggested by Jay), and serve out the downstream consumer streams that way. However, that requires a secondary sorted store, which seems like a waste (and added complexity) given that Kafka already has the data exactly as I need it, especially if Cassandra/Redis are required to maintain a long TTL on the stream.
> > >
> > > Finally, I could use 1 topic per group, each with a single partition. This would result in 400 topics on the broker, but would allow the API server to simply serve the stream for each consumer directly from Kafka and won't require additional machinery to serve out the requests.
> > >
> > > The 400-topic solution makes the most sense to me (doesn't require extra services, doesn't waste resources), but it seems to conflict with best practices, so I wanted to ask the community for input. Has anyone done this before? What makes the most sense here?
> > >
> > > Thanks
> > >
> > > Shaun
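One more note on our second use case (a user consuming data for all customers): with per-customer topics, a single consumer can still get the full firehose by subscribing with a topic whitelist instead of naming 400 topics explicitly. A rough sketch with the high-level consumer, assuming the "customer-<id>" naming above (ZooKeeper address and group id are placeholders):

    import java.util.List;
    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.consumer.Whitelist;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;

    public class AllCustomersConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181"); // placeholder
            props.put("group.id", "all-customers");
            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            // Subscribe to every topic matching customer-.* rather than
            // listing all 400 topics by name.
            List<KafkaStream<byte[], byte[]>> streams =
                connector.createMessageStreamsByFilter(new Whitelist("customer-.*"), 1);

            ConsumerIterator<byte[], byte[]> it = streams.get(0).iterator();
            while (it.hasNext()) {
                MessageAndMetadata<byte[], byte[]> msg = it.next();
                System.out.printf("%s[%d]@%d%n",
                                  msg.topic(), msg.partition(), msg.offset());
            }
        }
    }

Since the whitelist is a regex, topics for newly added customers should be picked up automatically as the consumer refreshes its metadata, though note this gives no ordering guarantee across customers, only within each customer's topic.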