I'm designing a system and need some more clarity regarding Kafka's
recommended limits on the number of topics and/or partitions. At a high
level, our system would work like this:

- A user creates a job X (X is a UUID).
- The user uploads data for X to an input topic: X.in.
- Workers process the data, writing results to an output topic: X.out.
- The user downloads the data from X.out.

It's important for the system that data for different jobs be kept
separate, and that input and output data be kept separate. By "separate" I
mean that there needs to be a reasonable way for users and the system's
workers to query for the data they need (by job-id and by input-vs-output)
and not get the data they don't need.

Based on expected usage and our data retention policy, we would not expect
to need more than 12,000 active jobs at any one time -- in other words,
24,000 topics. If we were to have 5 partitions per topic (our cluster has 5
brokers), that would imply 120,000 partitions. [These number refer only to
main/primary partitions, not any replicas that might exist.]

Those numbers seem to be far larger than the suggested limits I see online.
For example, the Kafka FAQ on these matters seems to imply that the most
relevant limit is the number of partitions (rather than topics) and sort of
implies that 10,000 partitions might be a suggested guideline (
https://goo.gl/fQs2md). Also implied is that systems should use fewer
topics and instead partition the data within topics if further separation
is needed (the FAQ entry uses the example of partitioning by user ID, which
is roughly analogous to job ID in my use case).

The guidance in the FAQ is unclear to me:

- Does the suggested limit of 10,000 refer to the total number of
partitions (ie, main partitions plus any replicas) or just the main
partitions?

- If the most important limitation is number of partitions (rather than
number of topics), how does the suggested strategy of using fewer topics
and then partitioning by some other attribute (ie job ID) help at all?

- Is my use case just a bad fit for Kafka? Or, is there a way for us to use
Kafka while still supporting the kinds of query patterns that we need (ie,
by job ID and by input-vs-output)?

Thanks in advance for any guidance.

Monty

Reply via email to