Hi Selina,

Your understanding is correct. Yes, you "need to consumer the original input 
and send it back to Kafka and reset the* Key to departmentName *and then 
consume it again 
to count in Samza" if you want to count the number of students in the same 
departmentName. This is a typical aggregation use case. Because after 
aggregating the students in the same department, you can do more than just 
"count". :)


Cheers,
Yan


At 2015-10-25 06:12:50, "Selina Tech" <swucaree...@gmail.com> wrote:
>Hi, Yan:
>
>      Thanks a lot for your reply.
>
>      You mentioned "if you give the msgs the same partition key", which
>mean same partition key value or  same partition key attribute name?
>
>       I mentioned "primary key" as "key" at public
>KeyedMessage(java.lang.String topic, K key, V message) or you can ignore
>it. I explain it in another way below.
>
>       If I need aggregate data, but the data are not in same partition, do
>we need consumer the data, and put it back it to Kafka with with new key
>and then consumer it again and aggregate it in Samza.
>
>      For example,  messages about student GPA information was send to
>Kafka by* K key(String schoolName)*. The message looks like "name,
>schoolName,  departmentName,  grade, GPA", and assuming I have 3
>partitions, With my understanding, all student records in one school should
>go to same partition.
>
>      Right now I need to aggregate data for same department, no matter
>which school.  which mean all the same departmentName message will be in
>three different partition. If I just count it in one samza job, will the
>result correct?  Do I need to consumer the original input and send it back
>to Kafka and reset the* Key to  departmentName *and then consume it again
>to count in Samza?
>
>     If I did not understand the partition and task of Samza, would you
>like to correct me?
>
>Sincerely,
>Selina
>
>On Sat, Oct 24, 2015 at 2:45 AM, Yan Fang <yanfangw...@163.com> wrote:
>
>>
>>
>> Hi Selina,
>>
>>
>> what do you mean by "primary key" here? Is it one of the partitions of
>> "input" or something like "if one msg meets condition x, we think msg has
>> the primary key"?
>>
>>
>> If you just want to count the msgs, you can count in one Samza job and
>> send the result to "output" topic. You can send to any partition of the
>> "output" if you give the msgs the same partition key.
>>
>>
>> Thanks,
>> Yan
>>
>>
>>
>>
>>
>>
>>
>> At 2015-10-22 08:30:15, "Selina Tech" <swucaree...@gmail.com> wrote:
>> >Hi, All:
>> >
>> >        In the Samza document, it mentioned "Each task consumes data from
>> >one partition for each of the job’s input streams." Does it mean if the
>> >data processing one job is not in one partition, the result will be wrong.
>> >
>> >        Assuming my Samza input data on Kafka topic -- "input" is
>> >partitioned by default -- round robin. And I have five partitions. If my
>> >Samza job is to count messages by primary key of the message at "input"
>> >topic, and then output it to kafka topic -- "output".
>> >
>> >       So I need steps as below
>> >      1. read data from Kafka topic "input"
>> >      2. reset the partition key to "primary key" in Samza
>> >      3. produce it back to Kafka topic named as "temp"
>> >      4. read "temp" topic at Samza
>> >      5. count it in Samza
>> >      6. write it to Kafka topic named as "output"
>> >
>> >      If I just read data from Kafka topic "input" and count it in Samza
>> >and write it to topic "output". The result will not be correct because
>> there
>> >might have multiple messages for same "primary key" in "output" topic.  Do
>> >I understand it correctly?
>> >
>> >Sincerely,
>> >Selina
>>

Reply via email to