Hi Selina,


What do you mean by "primary key" here? Is it one of the partitions of "input",
or something like "if a message meets condition x, we consider it to have the
primary key"?


If you just want to count the messages, you can count them in one Samza job and
send the result to the "output" topic. Messages that you give the same partition
key will all go to the same partition of "output".
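
If "primary key" is simply a field of each message, here is a rough sketch of
the difference between the two partitioning schemes. It is plain Python, not
the Samza or Kafka API. The field name "pk", the sample messages, and the
crc32-based partitioner are all made up for illustration (Kafka's own default
partitioner uses a different hash), and each partition stands in for the one
task that would consume it.

```python
import zlib
from collections import Counter
from itertools import cycle

NUM_PARTITIONS = 5  # matches the five partitions in the question


def round_robin(messages, n=NUM_PARTITIONS):
    """Spread messages over n partitions in turn, ignoring their keys."""
    partitions = [[] for _ in range(n)]
    for slot, msg in zip(cycle(range(n)), messages):
        partitions[slot].append(msg)
    return partitions


def by_key(messages, n=NUM_PARTITIONS):
    """Send every message with the same key to the same partition.

    crc32 is only a stand-in for a stable hash (an assumption for this sketch).
    """
    partitions = [[] for _ in range(n)]
    for msg in messages:
        slot = zlib.crc32(msg["pk"].encode("utf-8")) % n
        partitions[slot].append(msg)
    return partitions


def count_per_task(partitions):
    """One Counter per partition, as if one task consumed each partition."""
    return [Counter(msg["pk"] for msg in part) for part in partitions]


def tasks_seeing(key, counters):
    """Number of tasks that hold a (possibly partial) count for this key."""
    return sum(1 for counter in counters if key in counter)


# Hypothetical input: "pk" is an assumed name for the primary-key field.
keys = ["a", "b", "a", "c", "a", "b", "a", "c", "b", "a"]
messages = [{"pk": k} for k in keys]

partial = count_per_task(round_robin(messages))  # counting "input" directly
final = count_per_task(by_key(messages))         # counting after re-keying

print("tasks holding a count for 'a', round robin:", tasks_seeing("a", partial))
print("tasks holding a count for 'a', keyed:", tasks_seeing("a", final))
```

With round-robin input, several tasks each hold a partial count for the same
key, so "output" would get more than one count per key. After re-keying, each
key is counted by exactly one task.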


Thanks,
Yan

At 2015-10-22 08:30:15, "Selina Tech" <swucaree...@gmail.com> wrote:
>Hi, All:
>
>        The Samza documentation says, "Each task consumes data from
>one partition for each of the job’s input streams." Does that mean that if
>the data a job needs is spread across more than one partition, the result
>will be wrong?
>
>        Assume my Samza input data on the Kafka topic "input" is
>partitioned by the default, round robin, across five partitions. My
>Samza job should count messages by the primary key of each message in the
>"input" topic and then write the counts to the Kafka topic "output".
>
>       So I would need the following steps:
>      1. read data from Kafka topic "input"
>      2. reset the partition key to "primary key" in Samza
>      3. produce it back to Kafka topic named as "temp"
>      4. read "temp" topic at Samza
>      5. count it in Samza
>      6. write it to Kafka topic named as "output"
>
>      If I just read data from the Kafka topic "input", count it in Samza,
>and write it to the topic "output", the result will not be correct because
>there might be multiple messages for the same "primary key" in the "output"
>topic. Do I understand it correctly?
>
>Sincerely,
>Selina
