[
https://issues.apache.org/jira/browse/SAMZA-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211282#comment-14211282
]
Yi Pan (Data Infrastructure) commented on SAMZA-348:
----------------------------------------------------
Hi, referring to Chris's comment before:
https://issues.apache.org/jira/browse/SAMZA-348?focusedCommentId=14134172&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14134172,
we had a discussion on the potential race condition problem w/ the checkpoint
messages from container. The specific issue in the discussion: when a new
container was restarted, how does Samza AM knows that it have consumed all
checkpoint messages from the container's last run? I think that the following
solution would be a cheap and attractive one:
1. Samza AM always assign a monotonically increasing generation # to each
invocation of a specific container
2. Each container will now associate its current generation # w/ the checkpoint
messages published to the ConfigStream
3. When a container failure is detected, Samza AM increment the generation #
and bounce the container
4. On restart, the container will first publish a start token w/ its current
generation # to the ConfigStream
5. Now, the Samza AM receiving messages from the ConfigStream can perform the
following decisions:
a. on reception of the new generation # of the container, Smaza AM knows
that it has consumed all previous checkpoints and recovered the state. Hence,
it can start serving the config to that new generation of container via HTTP API
b. If there is any issue (e.g. network partition between the SamzaAM and the
container) that makes the SamzaAM "thinks" the container has failed and
restarted a new one, the checkpoint messages from the still running old
generation of container can now be safely discarded after receiving the start
token from the new generation of the same container.
Please review and comment to see whether there are other issues I may miss.
Thanks!
> Configure Samza jobs through a stream
> -------------------------------------
>
> Key: SAMZA-348
> URL: https://issues.apache.org/jira/browse/SAMZA-348
> Project: Samza
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Chris Riccomini
> Assignee: Chris Riccomini
> Labels: design, project
> Attachments: DESIGN-SAMZA-348-0.md, DESIGN-SAMZA-348-0.pdf,
> DESIGN-SAMZA-348-1.md, DESIGN-SAMZA-348-1.pdf
>
>
> Samza's existing config setup is problematic for a number of reasons:
> # It's completely immutable once a job starts. This prevents any dynamic
> reconfiguration and auto-scaling. It is debatable whether we want these
> feature or not, but our existing implementation actively prevents it. See
> SAMZA-334 for discussion.
> # We pass existing configuration through environment variables. YARN exports
> environment variables in a shell script, which limits the size to the varargs
> length on the machine. This is usually ~128KB. See SAMZA-333 and SAMZA-337
> for details.
> # User-defined configuration (the Config object) and programmatic
> configuration (checkpoints and TaskName:State mappings (see SAMZA-123)) are
> handled differently. It's debatable whether this makes sense.
> In SAMZA-123, [~jghoman] and I propose implementing a ConfigLog. This log
> would replace both the checkpoint topic and the existing config environment
> variables in SamzaContainer and Samza's YARN AM.
> I'd like to keep this ticket's scope limited to just the implementation of
> the ConfigLog, and not re-designing how Samza's config is used in the code
> (SAMZA-40). We should, however, discuss how this feature would affect dynamic
> reconfiguration/auto-scaling.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)