Yan Fang created SAMZA-717:
------------------------------

             Summary: Expose the TaskNameGrouper API
                 Key: SAMZA-717
                 URL: https://issues.apache.org/jira/browse/SAMZA-717
             Project: Samza
          Issue Type: New Feature
            Reporter: Yan Fang
            Priority: Minor


We currently use 
[GroupByContainerCount|https://github.com/apache/samza/blob/master/samza-core/src/main/scala/org/apache/samza/container/grouper/task/GroupByContainerCount.scala],
 which extends 
[TaskNameGrouper|https://github.com/apache/samza/blob/master/samza-core/src/main/scala/org/apache/samza/container/grouper/task/TaskNameGrouper.scala],
 to assign TaskModels to ContainerModels (equivalent to assigning tasks to 
different containers in the YARN world).

I think it also makes sense to expose TaskNameGrouper as an API, so that 
users can implement their own strategy for assigning TaskModels to 
ContainerModels.

This is useful when users know the throughput of their streams, because all 
the TaskInstances in one container share the same consumers. For example, 
users may want to group partitions as (partition-1, partition-3), 
(partition-2, partition-4) instead of (partition-1, partition-2), (partition-3, 
partition-4), which is the current strategy, because partition-1 and 
partition-2 both receive a lot of messages while partition-3 and partition-4 
receive fewer. Of course, when users have enough containers (as many as there 
are tasks) or the messages are evenly divided among the partitions, this 
feature offers no benefit.
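
To make the use case concrete, here is a rough sketch of the two groupings 
in plain Java. It does not use Samza's classes: PartitionGrouper, 
GrouperSketch, CONTIGUOUS and INTERLEAVED are hypothetical names invented for 
illustration, and the real TaskNameGrouper works on TaskModels and 
ContainerModels rather than strings. CONTIGUOUS only mirrors the current 
strategy as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a pluggable grouper; NOT Samza's TaskNameGrouper signature.
interface PartitionGrouper {
    List<List<String>> group(List<String> tasks, int containerCount);
}

final class GrouperSketch {
    // Contiguous blocks: mirrors the current strategy as described in this issue.
    static final PartitionGrouper CONTIGUOUS = (tasks, containerCount) -> {
        List<List<String>> containers = emptyContainers(containerCount);
        int perContainer = (tasks.size() + containerCount - 1) / containerCount; // ceiling
        for (int i = 0; i < tasks.size(); i++) {
            containers.get(i / perContainer).add(tasks.get(i));
        }
        return containers;
    };

    // Interleaved: a user-supplied alternative that spreads neighbouring
    // (here assumed to be equally busy) partitions across containers.
    static final PartitionGrouper INTERLEAVED = (tasks, containerCount) -> {
        List<List<String>> containers = emptyContainers(containerCount);
        for (int i = 0; i < tasks.size(); i++) {
            containers.get(i % containerCount).add(tasks.get(i));
        }
        return containers;
    };

    private static List<List<String>> emptyContainers(int containerCount) {
        List<List<String>> containers = new ArrayList<>();
        for (int i = 0; i < containerCount; i++) {
            containers.add(new ArrayList<>());
        }
        return containers;
    }

    public static void main(String[] args) {
        List<String> tasks = Arrays.asList(
                "partition-1", "partition-2", "partition-3", "partition-4");
        // prints [[partition-1, partition-2], [partition-3, partition-4]]
        System.out.println(CONTIGUOUS.group(tasks, 2));
        // prints [[partition-1, partition-3], [partition-2, partition-4]]
        System.out.println(INTERLEAVED.group(tasks, 2));
    }
}
```

With the API exposed, a user who knows that partition-1 and partition-2 are 
the busy ones could plug in something like the interleaved variant, or any 
other mapping that fits their traffic.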

What do you guys think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
