[ 
https://issues.apache.org/jira/browse/SAMZA-123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985842#comment-13985842
 ] 

Jay Kreps commented on SAMZA-123:
---------------------------------

On the topic of cohort vs task id/name, let me try to give my rationale for 
what is clearly a somewhat subjective matter.

Task is a well defined thing in Samza. I agree that various systems have 
different notions of tasks and that could be confusing, but that is the 
terminology we have. A lot of thought actually went into this approach: we got 
it wrong at first (we conflated tasks and containers, coming from a MapReduce 
background ourselves), then realized all the flaws and redid it into what it is 
today.

I think some of the disagreement may be more around the model than the 
terminology.

Let me give a quick background explanation for why we separate tasks and 
containers since there are lots of people on this thread. Essentially what we 
realized is that you need both a unit of "logical execution" and "physical 
parallelism" for a job that runs forever. A batch system can conflate these 
because if you want to restart a running job with more resources it starts over 
from the very beginning, throwing away whatever intermediate results the tasks 
had accumulated. A stream processing job, though, never stops, so if you 
allow any kind of state (whether backed by our changelogs or anything else) and 
you want to guarantee any kind of correctness of semantics, you need your 
units of processing to not change. Since you obviously also need to be able to 
scale the parallelism of processing up and down (how many CPUs/processes you 
get), you really want to separate the two concepts.
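
To make the separation concrete, here is a minimal sketch in Python. It is not 
Samza code: the helper name assign_tasks_to_containers, the "task-N" naming, 
and the round-robin placement are all assumptions made for illustration. The 
point is only that the set of tasks stays fixed while the mapping of tasks 
onto containers changes with the container count.

```python
from collections import defaultdict


def assign_tasks_to_containers(task_ids, container_count):
    """Spread a fixed set of task ids over `container_count` containers.

    Hypothetical helper for illustration; round-robin is an assumed policy,
    not necessarily what Samza does.
    """
    if container_count < 1:
        raise ValueError("container_count must be at least 1")
    assignment = defaultdict(list)
    for index, task_id in enumerate(sorted(task_ids)):
        assignment[index % container_count].append(task_id)
    return dict(assignment)


def tasks_of(assignment):
    """Return the sorted list of all task ids present in an assignment."""
    return sorted(t for group in assignment.values() for t in group)


# Four logical tasks (assumed naming scheme, e.g. one per input partition).
tasks = ["task-0", "task-1", "task-2", "task-3"]

two_containers = assign_tasks_to_containers(tasks, 2)
four_containers = assign_tasks_to_containers(tasks, 4)

# The physical layout differs between the two runs...
print(two_containers)
print(four_containers)

# ...but the logical units, and so the identity of any per-task state, do not.
assert tasks_of(two_containers) == tasks_of(four_containers) == tasks
```

Scaling from two containers to four only moves tasks between processes; any 
state keyed by task can follow its task to the new container unchanged.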

So we can call this unit of logical execution any number of things. It could be 
a task or a stream processing element, or a processor, or whatever. But if that 
unit is called an X, then we should really consider calling the unique 
identifier for that unit an X id or, if we must, an X name.

I actually like the term cohort; the issue is just that we aren't using it 
elsewhere to denote what we currently call a task.

> Move topic partition grouping to the AM and generalize
> ------------------------------------------------------
>
>                 Key: SAMZA-123
>                 URL: https://issues.apache.org/jira/browse/SAMZA-123
>             Project: Samza
>          Issue Type: Sub-task
>          Components: container
>    Affects Versions: 0.6.0
>            Reporter: Jakob Homan
>            Assignee: Jakob Homan
>         Attachments: SAMZA-123-design-doc.md, SAMZA-123-design-doc.pdf
>
>
> Currently the AM sends a set of all the topics and partitions to the 
> container, which then groups them by partition and assigns each set to a task 
> instance. By moving the grouping to the AM, we can assign arbitrary groups to 
> task instances, which will allow more partitioning strategies, as discussed 
> in SAMZA-71.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
