Hi Selina,
Your understanding is correct. Yes, you "need to consumer the original input and send it back to Kafka and reset the* Key to departmentName *and then consume it again to count in Samza" if you want to count the number of students in the same departmentName. This is a typical aggregation use case. Because after aggregating the students in the same department, you can do more than just "count". :) Cheers, Yan At 2015-10-25 06:12:50, "Selina Tech" <swucaree...@gmail.com> wrote: >Hi, Yan: > > Thanks a lot for your reply. > > You mentioned "if you give the msgs the same partition key", which >mean same partition key value or same partition key attribute name? > > I mentioned "primary key" as "key" at public >KeyedMessage(java.lang.String topic, K key, V message) or you can ignore >it. I explain it in another way below. > > If I need aggregate data, but the data are not in same partition, do >we need consumer the data, and put it back it to Kafka with with new key >and then consumer it again and aggregate it in Samza. > > For example, messages about student GPA information was send to >Kafka by* K key(String schoolName)*. The message looks like "name, >schoolName, departmentName, grade, GPA", and assuming I have 3 >partitions, With my understanding, all student records in one school should >go to same partition. > > Right now I need to aggregate data for same department, no matter >which school. which mean all the same departmentName message will be in >three different partition. If I just count it in one samza job, will the >result correct? Do I need to consumer the original input and send it back >to Kafka and reset the* Key to departmentName *and then consume it again >to count in Samza? > > If I did not understand the partition and task of Samza, would you >like to correct me? > >Sincerely, >Selina > >On Sat, Oct 24, 2015 at 2:45 AM, Yan Fang <yanfangw...@163.com> wrote: > >> >> >> Hi Selina, >> >> >> what do you mean by "primary key" here? Is it one of the partitions of >> "input" or something like "if one msg meets condition x, we think msg has >> the primary key"? >> >> >> If you just want to count the msgs, you can count in one Samza job and >> send the result to "output" topic. You can send to any partition of the >> "output" if you give the msgs the same partition key. >> >> >> Thanks, >> Yan >> >> >> >> >> >> >> >> At 2015-10-22 08:30:15, "Selina Tech" <swucaree...@gmail.com> wrote: >> >Hi, All: >> > >> > In the Samza document, it mentioned "Each task consumes data from >> >one partition for each of the job’s input streams." Does it mean if the >> >data processing one job is not in one partition, the result will be wrong. >> > >> > Assuming my Samza input data on Kafka topic -- "input" is >> >partitioned by default -- round robin. And I have five partitions. If my >> >Samza job is to count messages by primary key of the message at "input" >> >topic, and then output it to kafka topic -- "output". >> > >> > So I need steps as below >> > 1. read data from Kafka topic "input" >> > 2. reset the partition key to "primary key" in Samza >> > 3. produce it back to Kafka topic named as "temp" >> > 4. read "temp" topic at Samza >> > 5. count it in Samza >> > 6. write it to Kafka topic named as "output" >> > >> > If I just read data from Kafka topic "input" and count it in Samza >> >and write it to topic "output". The result will not be correct because >> there >> >might have multiple messages for same "primary key" in "output" topic. Do >> >I understand it correctly? >> > >> >Sincerely, >> >Selina >>