[jira] [Commented] (SAMZA-123) Move topic partition grouping to the AM and generalize

Jakob Homan (JIRA) Fri, 25 Apr 2014 11:29:53 -0700

    [ 
https://issues.apache.org/jira/browse/SAMZA-123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981408#comment-13981408
 ]


Jakob Homan commented on SAMZA-123:
-----------------------------------

bq. For example, in the case of GroupIntoNSets, you could have the same 
strategy but changing the number of TIs (in order to scale for changing profile 
over time) would map the partitions to a different task instance. Now are the 
checkpoint and state information for that task valid?
Again, depends on the job itself.
* *A job that uses the state to record per-SSP commutative information* such as 
a job that does per-SSP buffered sums (i.e., counts some property of the SSP, 
stores the count in the data store and emits those values on a regular basis, 
perhaps for further aggregation upstream) would be fine.  The existing values 
would be emitted and then not updated again.  The SSPs in their new task homes 
would start from 0 and go from there.  This is contingent on our current 
at-least-once guarantee; once we have consumer-based checkpointing and an 
idempotent producer, this would continue to work with exactly-once as well.
* *A job that does a join but keeps no state* would be fine, as the keys would 
go ahead and hash to their new homes and still be available for pairing.
* *A job that does a join and requires state* would not be fine (i.e., would 
break) because this amounts to a mid-stream change in the partitioning 
function, which would be invalid in pretty much any case.  This would be 
equivalent to changing the upstream partitioning and expecting the saved state 
to be valid, except that in that case we wouldn't know it had even happened, 
whereas with this we could detect the change.
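
To make the GroupIntoNSets case concrete, here is a minimal sketch of how 
changing the number of task instances remaps partitions, and how that change 
could be detected by comparing the old and new assignments.  It assumes a 
simple modulo grouping (partition id mod N); the actual GroupIntoNSets strategy 
may differ, and {{group_into_n_sets}} and {{moved_partitions}} are hypothetical 
names used only for illustration.

```python
from collections import defaultdict


def group_into_n_sets(ssps, n):
    """Assign each (stream, partition) pair to one of n task instances.

    Assumption: grouping is partition id modulo n. This is an illustrative
    stand-in, not Samza's actual implementation.
    """
    groups = defaultdict(set)
    for stream, partition in ssps:
        groups[partition % n].add((stream, partition))
    return dict(groups)


def moved_partitions(ssps, old_n, new_n):
    """Return the SSPs whose owning task instance differs between old_n and new_n."""
    return sorted(
        (stream, partition)
        for stream, partition in ssps
        if partition % old_n != partition % new_n
    )


# Eight partitions of a single (hypothetical) input stream.
ssps = [("PageViews", p) for p in range(8)]
before = group_into_n_sets(ssps, 2)  # two task instances
after = group_into_n_sets(ssps, 4)   # scaled up to four task instances
moved = moved_partitions(ssps, 2, 4)
```

Under these assumptions, partitions 2, 3, 6 and 7 change owners when going from 
two to four task instances.  Whether the state and checkpoints left behind for 
those SSPs are still usable is what the three cases above distinguish.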

bq. My vote for the naming would be taskName if it is a string and taskId if we 
can map all use cases to ids.
As above, I do not want to use anything that is overloaded or could reasonably 
be confused by a newcomer with existing YARN or Map-Reduce terminology.  Cohort 
is nice because people will have to learn what it is and how to use it.  The 
novelty is a feature, not a bug.

bq. Is the expectation that anyone who extends the grouping strategy needs to 
add their own configs to the framework and wire them in?
Depends on the implementation, but this could certainly be the case.  Some 
implementations wouldn't require any extra configuration, others could have it 
hardwired in and some would need configuration-time definition.  This is true 
with pretty much anything we have pluggable.
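
As a rough illustration of the difference, the sketch below shows one grouper 
that needs no configuration and one that reads a value at configuration time.  
Everything here is hypothetical: the {{Grouper}} interface, both class bodies 
and the {{job.grouper.sets}} key are invented for the example and are not 
Samza's API or config names.

```python
from abc import ABC, abstractmethod


class Grouper(ABC):
    """Hypothetical pluggable strategy mapping SSPs to task-instance groups."""

    @abstractmethod
    def group(self, ssps):
        """Return a dict of group id -> set of (stream, partition) pairs."""


class GroupByPartition(Grouper):
    """Needs no extra configuration: one group per partition id."""

    def group(self, ssps):
        groups = {}
        for stream, partition in ssps:
            groups.setdefault(partition, set()).add((stream, partition))
        return groups


class GroupIntoNSets(Grouper):
    """Needs configuration-time definition: the number of sets comes from config."""

    CONFIG_KEY = "job.grouper.sets"  # hypothetical key, not a real Samza setting

    def __init__(self, config):
        self.n = int(config[self.CONFIG_KEY])

    def group(self, ssps):
        # Assumption: modulo assignment, chosen only to keep the sketch short.
        groups = {}
        for stream, partition in ssps:
            groups.setdefault(partition % self.n, set()).add((stream, partition))
        return groups


ssps = [(s, p) for s in ("Clicks", "Views") for p in range(4)]
by_partition = GroupByPartition().group(ssps)
into_two = GroupIntoNSets({"job.grouper.sets": "2"}).group(ssps)
```

Note that the per-partition grouper places partition p of both streams in the 
same group, which is the kind of co-location a join would rely on.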

bq. Are there any other grouping strategies that might require a lot more 
change in the framework than just implementing this API?
Not that I can think of.  In fact, this feature may be an important help in 
bringing SQL-like capabilities to Samza, as it allows very precise control over 
SSP-to-TI assignment, which would be useful for join optimization, etc.

bq.  Finally, unless we are very sure about this working out well, we should 
not make this a public API.
I personally am, but I don't think that with a project this young we should be 
overly cautious about experimenting or trying to be more feature-rich.  We're 
not yet 1.0 and are quite up-front that the framework is evolving.  We've 
provided no API guarantees thus far.

> Move topic partition grouping to the AM and generalize
> ------------------------------------------------------
>
>                 Key: SAMZA-123
>                 URL: https://issues.apache.org/jira/browse/SAMZA-123
>             Project: Samza
>          Issue Type: Sub-task
>          Components: container
>    Affects Versions: 0.6.0
>            Reporter: Jakob Homan
>            Assignee: Jakob Homan
>         Attachments: SAMZA-123-design-doc.md, SAMZA-123-design-doc.pdf
>
>
> Currently the AM sends a set of all the topics and partitions to the 
> container, which then groups them by partition and assigns each set to a task 
> instance. By moving the grouping to the AM, we can assign arbitrary groups to 
> task instances, which will allow more partitioning strategies, as discussed 
> in SAMZA-71.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
