[
https://issues.apache.org/jira/browse/SAMZA-123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985842#comment-13985842
]
Jay Kreps commented on SAMZA-123:
---------------------------------
On the topic of cohort vs task id/name, let me try to give my rationale for
what is clearly a somewhat subjective matter.
Task is a well defined thing in Samza. I agree that various systems have
different notions of tasks and that could be confusing, but that is the
terminology we have. A lot of thought actually went into this approach because
we did it wrong the first time (coming from a MapReduce background ourselves,
we conflated tasks and containers), then realized all the flaws and redid it to
what it is now.
I think some of the disagreement may be more around the model than the
terminology.
Let me give a quick background explanation for why we separate tasks and
containers since there are lots of people on this thread. Essentially what we
realized is that you need both a unit of "logical execution" and "physical
parallelism" for a job that runs forever. A batch system can conflate these
because if you want to restart a running job with more resources it starts over
from the very beginning, throwing away whatever intermediate results the tasks
had accumulated. A stream processing job, though, never stops, so if you allow
any kind of state (whether backed by our changelogs or anything else) and you
want to guarantee any kind of correct semantics, your units of processing must
not change. Since you obviously need to be able to scale the parallelism of
processing up and down (how many cpus/processes you get), you really want to
separate the two concepts.
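To make the separation concrete, here is a minimal sketch (not Samza's actual
code; the function name, the task ids, and the round-robin policy are all
assumptions made for illustration). The set of tasks is fixed for the life of
the job, and only the mapping of tasks onto containers changes when the job is
scaled:

```python
from collections import defaultdict


def assign_tasks_to_containers(task_ids, container_count):
    """Spread a fixed set of task ids round-robin over the containers.

    Hypothetical helper for illustration; the assignment policy in a real
    job may differ.
    """
    if container_count < 1:
        raise ValueError("container_count must be at least 1")
    assignment = defaultdict(list)
    # Sort so the mapping is deterministic for a given container count.
    for index, task_id in enumerate(sorted(task_ids)):
        assignment[index % container_count].append(task_id)
    return dict(assignment)


# The logical units: these ids (and any state keyed by them) never change.
TASK_IDS = [f"task-{n}" for n in range(4)]

# The physical parallelism: free to change between restarts.
two_containers = assign_tasks_to_containers(TASK_IDS, 2)
four_containers = assign_tasks_to_containers(TASK_IDS, 4)
```

Going from two containers to four moves tasks between processes, but every
task id exists both before and after, so any state associated with a task can
follow it to its new container.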
So we can call this unit of logical execution any number of things. It could be
a task or a stream processing element, or a processor, or whatever. But if that
unit is called an X, then we should really consider calling the unique
identifier for that unit an X id or, if we must, an X name.
I actually like the term cohort; the issue is just that we aren't using it
elsewhere to denote what we currently call a task.
> Move topic partition grouping to the AM and generalize
> ------------------------------------------------------
>
> Key: SAMZA-123
> URL: https://issues.apache.org/jira/browse/SAMZA-123
> Project: Samza
> Issue Type: Sub-task
> Components: container
> Affects Versions: 0.6.0
> Reporter: Jakob Homan
> Assignee: Jakob Homan
> Attachments: SAMZA-123-design-doc.md, SAMZA-123-design-doc.pdf
>
>
> Currently the AM sends a set of all the topics and partitions to the
> container, which then groups them by partition and assigns each set to a task
> instance. By moving the grouping to the AM, we can assign arbitrary groups to
> task instances, which will allow more partitioning strategies, as discussed
> in SAMZA-71.