Yan Fang created SAMZA-717:
------------------------------

Summary: Expose the TaskNameGrouper API
Key: SAMZA-717
URL: https://issues.apache.org/jira/browse/SAMZA-717
Project: Samza
Issue Type: New Feature
Reporter: Yan Fang
Priority: Minor

We currently use [GroupByContainerCount|https://github.com/apache/samza/blob/master/samza-core/src/main/scala/org/apache/samza/container/grouper/task/GroupByContainerCount.scala], which extends [TaskNameGrouper|https://github.com/apache/samza/blob/master/samza-core/src/main/scala/org/apache/samza/container/grouper/task/TaskNameGrouper.scala], to assign TaskModels to ContainerModels (equivalent to assigning tasks to different containers in the YARN world).

I think it also makes sense to expose TaskNameGrouper as an API, so that users can implement their own strategy for assigning TaskModels to ContainerModels. This is useful when users know the throughput of their streams, because all the TaskInstances in one container share the same consumers.

One use case: partition-1 and partition-2 both receive a lot of messages, while partition-3 and partition-4 receive far fewer. The current strategy groups them as (partition-1, partition-2), (partition-3, partition-4), which puts both busy partitions in the same container. A user who knows this would rather group them as (partition-1, partition-3), (partition-2, partition-4).

Of course, when users have enough containers (as many as there are tasks), or when the load is spread evenly across all the partitions, this feature adds nothing.

What do you guys think?
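
To make the use case concrete, here is a minimal sketch of a throughput-aware assignment. It is only an illustration under stated assumptions: the class ThroughputAwareGrouping, its group method, and the message rates are all hypothetical, and it works on plain partition names rather than Samza's TaskModel and ContainerModel, so it does not implement the actual TaskNameGrouper interface. The strategy is a simple greedy one: take the partitions from busiest to quietest and always give the next one to the container with the lowest total rate so far.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Samza: balances partitions across
// containers by an assumed, user-supplied message rate per partition.
final class ThroughputAwareGrouping {

    static List<List<String>> group(Map<String, Long> ratePerPartition, int containerCount) {
        if (containerCount <= 0) {
            throw new IllegalArgumentException("containerCount must be positive");
        }

        // Sort partitions by rate, busiest first; break ties by name so
        // the result is deterministic.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(ratePerPartition.entrySet());
        sorted.sort((a, b) -> {
            int byRate = Long.compare(b.getValue(), a.getValue());
            return byRate != 0 ? byRate : a.getKey().compareTo(b.getKey());
        });

        List<List<String>> containers = new ArrayList<>();
        for (int i = 0; i < containerCount; i++) {
            containers.add(new ArrayList<>());
        }
        long[] load = new long[containerCount];

        for (Map.Entry<String, Long> entry : sorted) {
            // Pick the least-loaded container (lowest index wins a tie).
            int target = 0;
            for (int i = 1; i < containerCount; i++) {
                if (load[i] < load[target]) {
                    target = i;
                }
            }
            containers.get(target).add(entry.getKey());
            load[target] += entry.getValue();
        }
        return containers;
    }
}
```

With assumed rates of 1000 and 900 messages for partition-1 and partition-2, and 50 and 100 for partition-3 and partition-4, calling group with two containers places partition-1 and partition-2 in different containers, so the two busy partitions no longer share consumers. A real implementation would need to get the rates from somewhere (configuration or metrics), which this sketch leaves open.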