Thanks for the clarification. My experience with Kafka was with 0.6 and 0.7, and I'm pretty sure each broker had N partitions: we were load balancing brokers using a hardware load balancer with a partition count of 1, and each broker had a partition 0.
I guess this has changed in 0.8 then, or my memory is all messed up!

Anyway if there are N partitions spread over P brokers the Samza job partitions fit the topic ones and everything is good.

Mathias.

On Fri, Aug 23, 2013 at 11:17 PM, Chris Riccomini <[email protected]> wrote:

> Hey Mathias,
>
> Kafka's topic:partition mapping is not quite what you describe. If a topic
> were to have 4 partitions, and a cluster had 4 machines, you would not end
> up with 4*4=16 partitions. You would end up with 4 partitions, one on each
> box. Using your annotation, a topic T with N partitions would have N
> partitions (not P*N). These N partitions will be distributed as fairly as
> possible across all Kafka brokers.
>
> If a Samza job were to read from this 4 partition topic, it would have 4
> tasks, each consuming one partition of the topic.
>
> If a Samza job were reading from Topic A (4 partitions) and Topic B (6
> partitions), then the Samza job would have 6 tasks. Tasks 1-4 would read
> messages from both topics, and tasks 5-6 would receive messages only from
> Topic B.
>
> This is described in some detail in the docs:
>
> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/task-runner.html
>
> The motivation for this partitioning model is that it allows us to
> guarantee that two topics with the same partition count, and same
> partitioning key, will be delivered to a single Samza task. For example,
> if you have an AdView topic and an AdClick topic with the same partition
> count, and both partitioned by member_id, then the Samza task that
> receives the AdView events for member 0 will also receive the AdClick
> events for member 0. This behavior enables aggregation and joining of data.
>
> Cheers,
> Chris
>
> On 8/23/13 12:42 PM, "Mathias Herberts" <[email protected]>
> wrote:
>
> >Hi,
> >
> >first of all kudos for putting Samza into the Apache Incubator, it's good
> >to have yet another approach to stream processing.
> >
> >IIRC in a multi-node Kafka cluster (let's assume P nodes), a topic T with N
> >partitions will have N partitions on each node, so the total number of
> >partitions will be P*N.
> >
> >My question relates to the notion of partition in the Samza stream linked
> >to T, will the Samza partition number be N or P*N ?
> >
> >Mathias.
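
For readers following the thread, the task assignment Chris describes can be sketched in a few lines of Python. This is only an illustration of the model as stated in the message, not Samza or Kafka code: the function names `assign_partitions` and `partition_for` are hypothetical, partitions and tasks are numbered from 0 here (the message counts tasks from 1), and the CRC32-based partitioner is an assumption standing in for whatever partitioner the producers actually use.

```python
import zlib


def assign_partitions(topic_partitions):
    """Map each task index to the (topic, partition) pairs it consumes.

    topic_partitions: dict of topic name -> partition count.
    The task count is the largest partition count among the input topics,
    and task i reads partition i of every topic that has a partition i.
    """
    task_count = max(topic_partitions.values(), default=0)
    return {
        task: [
            (topic, task)
            for topic, count in sorted(topic_partitions.items())
            if task < count
        ]
        for task in range(task_count)
    }


def partition_for(key, partition_count):
    """Hypothetical partitioner: a stable hash of the key, modulo the count."""
    return zlib.crc32(str(key).encode("utf-8")) % partition_count


# Topic A has 4 partitions and Topic B has 6, so the job gets 6 tasks.
assignment = assign_partitions({"TopicA": 4, "TopicB": 6})
for task, inputs in assignment.items():
    print(task, inputs)

# Two topics with the same partition count and the same key land on the
# same partition number, and therefore on the same task.
member_id = 0
ad_view_partition = partition_for(member_id, 4)
ad_click_partition = partition_for(member_id, 4)
print(ad_view_partition == ad_click_partition)
```

With these inputs, tasks 0 to 3 receive a partition from both topics and tasks 4 and 5 receive only a Topic B partition, which matches the "Tasks 1-4" and "tasks 5-6" split in the message once the numbering offset is taken into account.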
