Hey Mathias,
Kafka's topic:partition mapping is not quite what you describe. If a topic
were to have 4 partitions, and a cluster had 4 machines, you would not end
up with 4*4=16 partitions. You would end up with 4 partitions, one on each
box. Using your annotation, a topic T with N partitions would have N
partitions (not P*N). These N partitions will be distributed as fairly as
possible across all Kafka brokers.
If a Samza job were to read from this 4 partition topic, it would have 4
tasks, each consuming one partition of the topic.
If a Samza job were reading from Topic A (4 partitions) and Topic B (6
partitions), then the Samza job would have 6 tasks. Tasks 1-4 would read
messages from both topics, and tasks 5-6 would receive messages only from
Topic B.
This is described in some detail in the docs:
http://samza.incubator.apache.org/learn/documentation/0.7.0/container/task-
runner.html
The motivation for this partitioning model is that it allows us to
guarantee that two topics with the same partition count, and same
partitioning key, will be delivered to a single Samza task. For example,
if you have an AdView topic and an AdClick topic with the same partition
count, and both partitioned by member_id, then the Samza task that
receives the AdView events for member 0 will also receive the AdClick
events for member 0. This behavior enables aggregation and joining of data.
Cheers,
Chris
On 8/23/13 12:42 PM, "Mathias Herberts" <[email protected]> wrote:
>Hi,
>
>first of all kudos for putting Samza into the Apache Incubator, it's good
>to have yet another approach to stream processing.
>
>IIRC in a multi-node Kafka cluster (let's assume P nodes), a topic T with
>N
>partitions will have N partitions on each node, so the total number of
>partitions will be P*N.
>
>My question relates to the notion of partition in the Samza stream linked
>to T, will the Samza partition number be N or P*N ?
>
>Mathias.