[
https://issues.apache.org/jira/browse/SAMZA-123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981408#comment-13981408
]
Jakob Homan commented on SAMZA-123:
-----------------------------------
bq. For example, in the case of GroupIntoNSets, you could have the same
strategy but changing the number of TIs (in order to scale for changing profile
over time) would map the partitions to a different task instance. Now are the
checkpoint and state information for that task valid?
Again, depends on the job itself.
* *A job that uses the state to record per-SSP commutative information* such as
a job that does per-SSP buffered sums (i.e., counts some property of the SSP,
stores the count in the data store, and emits those values on a regular basis,
perhaps for further aggregation upstream) would be fine. The existing values
would be emitted and then not updated again. The SSPs in their new task homes
would start from 0 and go from there (contingent on our current
at-least-once guarantee; once we have consumer-based checkpointing and an
idempotent producer, this would continue to work with exactly-once as well).
* *A job that does a join but keeps no state* would be fine as the keys would go
ahead and hash to their new homes and still be available for pairing.
* *A job that does a join and requires state* would not be fine (i.e., would
break) because this amounts to a mid-stream change in the partitioning
function, which would be invalid in pretty much any case. This would be
equivalent to changing the upstream partitioning and expecting the saved state
to be valid, except that in that case we wouldn't know it had even happened,
whereas with this we could detect the change.
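
To make the remapping concrete, here is a minimal Python sketch. It assumes GroupIntoNSets picks a set by partition number modulo N; that rule and the names group_into_n_sets, owner_of and moved_ssps are illustrative assumptions, not Samza's actual implementation.

```python
from collections import defaultdict


def group_into_n_sets(ssps, n):
    """Assign each (system, stream, partition) tuple to one of n sets.

    Assumption: the set is chosen by partition number modulo n.
    """
    groups = defaultdict(set)
    for ssp in ssps:
        _system, _stream, partition = ssp
        groups[partition % n].add(ssp)
    return dict(groups)


def owner_of(groups):
    """Invert a grouping into an SSP -> set id lookup."""
    return {ssp: gid for gid, members in groups.items() for ssp in members}


def moved_ssps(ssps, old_n, new_n):
    """Return the SSPs whose owning set changes when n changes."""
    before = owner_of(group_into_n_sets(ssps, old_n))
    after = owner_of(group_into_n_sets(ssps, new_n))
    return sorted(ssp for ssp in ssps if before[ssp] != after[ssp])


# Hypothetical input: one stream with 8 partitions, scaled from 4 TIs to 3.
ssps = [("kafka", "pageviews", p) for p in range(8)]
moved = moved_ssps(ssps, old_n=4, new_n=3)
```

Any SSP in moved now feeds a task instance whose local state was built from a different set of inputs. For the first two kinds of job that is harmless; for the stateful join it is the breaking case, and a non-empty moved list at startup is one way such a change could be detected.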
bq. My vote for the naming would be taskName if it is a string and taskId if we
can map all use cases to ids.
As above, I do not want to use anything that is overloaded or could reasonably
be confused by a newcomer with existing YARN or Map-Reduce terminology. Cohort
is nice because people will have to learn what it is and how to use it. The
novelty is a feature, not a bug.
bq. Is the expectation that anyone who extends the grouping strategy needs to
add their own configs to the framework and wire them in?
Depends on the implementation, but this could certainly be the case. Some
implementations wouldn't require any extra configuration, others could have it
hardwired in and some would need configuration-time definition. This is true
with pretty much anything we have pluggable.
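
As a rough illustration of those three cases, here is a Python sketch of a pluggable grouper. The class names, the grouper_from_config helper and the config keys "grouper.class" and "grouper.n" are all hypothetical and not part of Samza.

```python
from abc import ABC, abstractmethod


class Grouper(ABC):
    """Hypothetical plug-in point: maps SSP tuples to task-instance groups."""

    @abstractmethod
    def group(self, ssps):
        """Return a dict of group id -> set of (system, stream, partition)."""


class GroupByPartition(Grouper):
    """Needs no configuration: one group per partition number."""

    def group(self, ssps):
        groups = {}
        for ssp in ssps:
            groups.setdefault(ssp[2], set()).add(ssp)
        return groups


class GroupIntoNSets(Grouper):
    """Needs configuration-time definition: the number of sets, n."""

    def __init__(self, n):
        self.n = n

    def group(self, ssps):
        groups = {}
        for ssp in ssps:
            groups.setdefault(ssp[2] % self.n, set()).add(ssp)
        return groups


class GroupIntoTwoSets(GroupIntoNSets):
    """Hardwired variant: n is fixed in code, so no config is read."""

    def __init__(self):
        super().__init__(2)


def grouper_from_config(config):
    """Build a grouper from a flat config dict (keys are made up here)."""
    name = config.get("grouper.class", "by-partition")
    if name == "by-partition":
        return GroupByPartition()
    if name == "into-two-sets":
        return GroupIntoTwoSets()
    if name == "into-n-sets":
        return GroupIntoNSets(int(config["grouper.n"]))
    raise ValueError(f"unknown grouper: {name}")
```

Only the last branch reads an extra key, so only that implementation would have to add and wire in its own config.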
bq. Are there any other grouping strategies that might require a lot more
change in the framework than just implementing this API?
Not that I can think of, and in fact this feature may be an important help in
bringing SQL-like capabilities to Samza, as it allows very precise control over
SSP-to-TI assignment, which would be useful for join optimization, etc.
bq. Finally, unless we are very sure about this working out well, we should
not make this a public API.
I personally am, but I don't think that with a project this young we should be
overly cautious about experimenting or trying to be more feature-rich. We're
not yet 1.0 and are quite up-front that the framework is evolving. We've
provided no API guarantees thus far.
> Move topic partition grouping to the AM and generalize
> ------------------------------------------------------
>
> Key: SAMZA-123
> URL: https://issues.apache.org/jira/browse/SAMZA-123
> Project: Samza
> Issue Type: Sub-task
> Components: container
> Affects Versions: 0.6.0
> Reporter: Jakob Homan
> Assignee: Jakob Homan
> Attachments: SAMZA-123-design-doc.md, SAMZA-123-design-doc.pdf
>
>
> Currently the AM sends a set of all the topics and partitions to the
> container, which then groups them by partition and assigns each set to a task
> instance. By moving the grouping to the AM, we can assign arbitrary groups to
> task instances, which will allow more partitioning strategies, as discussed
> in SAMZA-71.