To add a little more context to Shaun's question, we have around 400
customers. Each customer has a stream of events. Some customers generate a
lot of data while others don't. We need to ensure that each customer's data
is sorted globally by timestamp.

We have two use cases around consumption:

1. A user may consume an individual customer's data
2. A user may consume data for all customers

Given these two use cases, I think the better strategy is to have a
separate topic per customer as Todd suggested.
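
For concreteness, here is a minimal sketch of what the consumer side could
look like with per-customer topics, using the Java consumer client. The
events.<customerId> topic naming is just an assumption for illustration;
both use cases fall out naturally (use case 2 via pattern subscription):

    import java.util.Collections;
    import java.util.Properties;
    import java.util.regex.Pattern;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class CustomerStreamConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("group.id", "api-server");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

            // Use case 1: a single customer's stream is just one topic.
            consumer.subscribe(Collections.singletonList("events.customer-42"));

            // Use case 2: all customers, via a pattern subscription over the
            // per-customer topics, e.g.:
            // consumer.subscribe(Pattern.compile("events\\..*"), listener);

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s: %s%n", record.topic(), record.value());
                }
            }
        }
    }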

On Wed, Sep 30, 2015 at 9:26 AM, Todd Palino <tpal...@gmail.com> wrote:

> So I disagree with the idea to use custom partitioning, depending on your
> requirements. Having a consumer consume from a single partition is not
> (currently) that easy. If you don't care which consumer gets which
> partition (group), then it's not that bad. You have 20 partitions and 20
> consumers, and you use custom partitioning as noted. The consumers use
> the high level consumer with a single group, each one gets one
> partition, and it's pretty straightforward. If a consumer crashes, you
> will end up with two partitions on one of the remaining consumers. If this
> is OK, this is a decent solution.
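>
> (For what it's worth, the custom partitioning piece could be as small as
> the sketch below for the new Java producer. The class name and the
> assumption that the customer ID is the record key are just for
> illustration.)
>
>     import java.util.Map;
>     import org.apache.kafka.clients.producer.Partitioner;
>     import org.apache.kafka.common.Cluster;
>
>     public class CustomerPartitioner implements Partitioner {
>         public void configure(Map<String, ?> configs) {}
>
>         // Hash the customer ID (the record key) so a given customer
>         // always lands on the same partition.
>         public int partition(String topic, Object key, byte[] keyBytes,
>                              Object value, byte[] valueBytes, Cluster cluster) {
>             int numPartitions = cluster.partitionsForTopic(topic).size();
>             return (key.hashCode() & 0x7fffffff) % numPartitions;
>         }
>
>         public void close() {}
>     }
>
>     // Registered on the producer via:
>     // props.put("partitioner.class", "com.example.CustomerPartitioner");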
>
> If, however, you require that each consumer always have the same group of
> data, and you need to know what that group is beforehand, it's more
> difficult. You need to use the simple consumer to do it, which means you
> need to implement a lot of logic for error and status code handling
> yourself, and do it right. In this case, I think your idea of using 400
> separate topics is sound. This way you can still use the high level
> consumer, which takes care of the error handling for you, and your data is
> separated out by topic.
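>
> (For reference, a rough sketch of that pinning with the newer Java
> consumer client, which can take a fixed partition assignment instead of
> hand-rolling the simple consumer. The topic name and partition number
> are hypothetical, and consumer is an already-configured KafkaConsumer:)
>
>     import java.util.Collections;
>     import org.apache.kafka.common.TopicPartition;
>
>     // This consumer always owns partition 7 of "events", independent of
>     // any group rebalancing.
>     consumer.assign(Collections.singletonList(new TopicPartition("events", 7)));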
>
> Provided it is not an issue to implement it in your producer, I would go
> with the separate topics. Alternately, if you're not sure you always want
> separate topics, you could go with something similar to your second idea,
> but have a consumer read the single topic and split the data out into 400
> separate topics in Kafka (no need for Cassandra or Redis or anything else).
> Then your real consumers can all consume their separate topics. Reading and
> writing the data one extra time is much better than rereading all of it 400
> times and throwing most of it away.
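>
> (A sketch of that fan-out step: one process consumes the single firehose
> topic and re-produces each message to a per-customer topic. The topic
> names and the use of the customer ID as the record key are assumptions.)
>
>     import java.util.Collections;
>     import java.util.Properties;
>     import org.apache.kafka.clients.consumer.ConsumerRecord;
>     import org.apache.kafka.clients.consumer.KafkaConsumer;
>     import org.apache.kafka.clients.producer.KafkaProducer;
>     import org.apache.kafka.clients.producer.ProducerRecord;
>
>     public class FanOut {
>         public static void main(String[] args) {
>             Properties cProps = new Properties();
>             cProps.put("bootstrap.servers", "broker1:9092");
>             cProps.put("group.id", "fanout");
>             cProps.put("key.deserializer",
>                     "org.apache.kafka.common.serialization.StringDeserializer");
>             cProps.put("value.deserializer",
>                     "org.apache.kafka.common.serialization.StringDeserializer");
>             Properties pProps = new Properties();
>             pProps.put("bootstrap.servers", "broker1:9092");
>             pProps.put("key.serializer",
>                     "org.apache.kafka.common.serialization.StringSerializer");
>             pProps.put("value.serializer",
>                     "org.apache.kafka.common.serialization.StringSerializer");
>
>             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
>             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps);
>             consumer.subscribe(Collections.singletonList("events"));
>
>             while (true) {
>                 for (ConsumerRecord<String, String> rec : consumer.poll(1000)) {
>                     // Route each message to its customer's topic, preserving
>                     // the key so per-customer ordering carries over.
>                     producer.send(new ProducerRecord<>("events." + rec.key(),
>                             rec.key(), rec.value()));
>                 }
>             }
>         }
>     }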
>
> -Todd
>
>
> On Wed, Sep 30, 2015 at 9:06 AM, Ben Stopford <b...@confluent.io> wrote:
>
> > Hi Shaun
> >
> > You might consider using a custom partition assignment strategy to push
> > your different "groups" to different partitions. This would allow you to
> > walk the middle ground between "all consumers consume everything" and
> > "one topic per consumer" as you vary the number of partitions in the
> > topic, albeit at the cost of a little extra complexity.
> >
> > Also, not sure if you’ve seen it, but there is quite a good section in
> > the FAQ here
> > <https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowmanytopicscanIhave?>
> > on topic and partition sizing.
> >
> > B
> >
> > > On 29 Sep 2015, at 18:48, Shaun Senecal <shaun.sene...@lithium.com> wrote:
> > >
> > > Hi
> > >
> > >
> > > I have read Jay Kreps' post regarding the number of topics that can be
> > > handled by a broker
> > > (https://www.quora.com/How-many-topics-can-be-created-in-Apache-Kafka),
> > > and it has left me with more questions that I don't see answered
> > > anywhere else.
> > >
> > >
> > > We have a data stream which will be consumed by many consumers (~400).
> > > We also have many "groups" within our data.  A group in the data
> > > corresponds 1:1 with what the consumers would consume, so consumer A
> > > only ever sees group A messages, consumer B only consumes group B
> > > messages, etc.
> > >
> > >
> > > The downstream consumers will be consuming via a websocket API, so the
> > > API server will be the thing consuming from Kafka.
> > >
> > >
> > > If I use a single topic with, say, 20 partitions, the consumers in the
> > > API server would need to re-read the same messages over and over for
> > > each downstream consumer, which seems like a waste of network and a
> > > potential bottleneck.
> > >
> > >
> > > Alternatively, I could use a single topic with 20 partitions and have
> > > a single consumer in the API put the messages into Cassandra/Redis (as
> > > suggested by Jay), and serve out the downstream consumer streams that
> > > way.  However, that requires using a secondary sorted store, which
> > > seems like a waste (and added complexity) given that Kafka already has
> > > the data exactly as I need it, especially if Cassandra/Redis are
> > > required to maintain a long TTL on the stream.
> > >
> > >
> > > Finally, I could use one topic per group, each with a single
> > > partition.  This would result in 400 topics on the broker, but would
> > > allow the API server to simply serve the stream for each consumer
> > > directly from Kafka and wouldn't require additional machinery to serve
> > > out the requests.
> > >
> > >
> > > The 400-topic solution makes the most sense to me (it doesn't require
> > > extra services and doesn't waste resources), but it seems to conflict
> > > with best practices, so I wanted to ask the community for input.  Has
> > > anyone done this before?  What makes the most sense here?
> > >
> > >
> > >
> > >
> > > Thanks
> > >
> > >
> > > Shaun
> >
> >
>
