I'm designing a system and need some more clarity regarding Kafka's recommended limits on the number of topics and/or partitions. At a high level, our system would work like this:
- A user creates a job X (X is a UUID). - The user uploads data for X to an input topic: X.in. - Workers process the data, writing results to an output topic: X.out. - The user downloads the data from X.out. It's important for the system that data for different jobs be kept separate, and that input and output data be kept separate. By "separate" I mean that there needs to be a reasonable way for users and the system's workers to query for the data they need (by job-id and by input-vs-output) and not get the data they don't need. Based on expected usage and our data retention policy, we would not expect to need more than 12,000 active jobs at any one time -- in other words, 24,000 topics. If we were to have 5 partitions per topic (our cluster has 5 brokers), that would imply 120,000 partitions. [These number refer only to main/primary partitions, not any replicas that might exist.] Those numbers seem to be far larger than the suggested limits I see online. For example, the Kafka FAQ on these matters seems to imply that the most relevant limit is the number of partitions (rather than topics) and sort of implies that 10,000 partitions might be a suggested guideline ( https://goo.gl/fQs2md). Also implied is that systems should use fewer topics and instead partition the data within topics if further separation is needed (the FAQ entry uses the example of partitioning by user ID, which is roughly analogous to job ID in my use case). The guidance in the FAQ is unclear to me: - Does the suggested limit of 10,000 refer to the total number of partitions (ie, main partitions plus any replicas) or just the main partitions? - If the most important limitation is number of partitions (rather than number of topics), how does the suggested strategy of using fewer topics and then partitioning by some other attribute (ie job ID) help at all? - Is my use case just a bad fit for Kafka? Or, is there a way for us to use Kafka while still supporting the kinds of query patterns that we need (ie, by job ID and by input-vs-output)? Thanks in advance for any guidance. Monty