[jira] [Commented] (SAMZA-348) Configure Samza jobs through a stream

Chris Riccomini (JIRA) Mon, 15 Sep 2014 13:18:18 -0700

    [ 
https://issues.apache.org/jira/browse/SAMZA-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134414#comment-14134414
 ]


Chris Riccomini commented on SAMZA-348:
---------------------------------------

bq. the user might lose track of what the exact config is

For this, I was thinking that configure-job.sh could have a --read switch, to 
get all existing configs for a job. I agree it's super useful to have the AM 
expose them as well, which we can continue to do.

bq. Avoid all concurrency issues.

Isn't there still a concurrency issue if two writers update the AM UI at the 
same time?

bq. We can reflect the current config accurately (for example - if within 
LinkedIn, the user only modifies the config via cfg2, then there's an extra 
overhead of keeping that in sync with the actual config - since config 
mutations might be done via the AM).

For this kind of use case, I was figuring we'd have configure-job.sh behave a 
lot like run-job.sh does today: take a URI and a factory, and resolve configs. 
For example, something like:

{noformat}
$ configure-job.sh --uri kafka://localhost:1025 --job.name foo --job.id bar \
    --config-file=file://... --config-factory=PropertiesConfigFactory
{noformat}

You could have configure-job.sh run against a static config file every time 
run-job.sh is run. This would essentially mirror how Samza currently works.
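
As a rough illustration of the flow described above (resolve configs from a file, write them to a stream, read them back with something like --read), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the in-memory list standing in for the config stream, the record layout, and the function names parse_properties, write_config and read_config are hypothetical and are not part of Samza or of any proposed configure-job.sh implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the config stream: an append-only list of
# (job_name, job_id, key, value) records. A real stream would live elsewhere.
ConfigRecord = Tuple[str, str, str, str]


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    config: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        config[key.strip()] = value.strip()
    return config


def write_config(stream: List[ConfigRecord], job_name: str, job_id: str,
                 config: Dict[str, str]) -> None:
    """Append one record per config entry for the given job."""
    for key, value in sorted(config.items()):
        stream.append((job_name, job_id, key, value))


def read_config(stream: List[ConfigRecord], job_name: str,
                job_id: str) -> Dict[str, str]:
    """Replay the stream in order; the latest record for a key wins."""
    current: Dict[str, str] = {}
    for name, jid, key, value in stream:
        if (name, jid) == (job_name, job_id):
            current[key] = value
    return current


if __name__ == "__main__":
    stream: List[ConfigRecord] = []
    # Keys and values here are illustrative only.
    write_config(stream, "foo", "bar",
                 parse_properties("# static file\na=1\nb = 2\n"))
    write_config(stream, "foo", "bar", {"a": "3"})  # a later override
    print(read_config(stream, "foo", "bar"))
```

Re-running write_config with the same static file before every job start would keep overwriting any later overrides, which is the "mirror how Samza currently works" behaviour mentioned above.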

One other thought: if we depend on a UI (in YARN or otherwise), we get into a 
problem where we might need to edit config while the job is down (the UI is 
unavailable).

None of this is fully baked; it's just along the lines of what I'm thinking 
right now. I think it's OK to live with concurrency issues for config, but for 
offsets, they could be problematic. I haven't spent much time thinking about 
how to fix that yet.
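
To make the config-versus-offsets distinction concrete, here is one possible reading of it as a small Python sketch. This is an assumption, not something spelled out in the comment: it supposes a last-write-wins log shared by several writers, and the writer names, keys and values are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (writer, key, value).
Record = Tuple[str, str, int]


def replay(log: List[Record]) -> Dict[str, int]:
    """Rebuild state from the log; the last record for each key wins."""
    state: Dict[str, int] = {}
    for _writer, key, value in log:
        state[key] = value
    return state


log: List[Record] = []

# Config: two writers race on the same key. Whichever lands last wins,
# but either result is a value that somebody actually asked for.
log.append(("writer-a", "window.ms", 1000))
log.append(("writer-b", "window.ms", 2000))

# Offsets: a delayed writer appends an older checkpoint after a newer one,
# so the replayed offset moves backwards.
log.append(("container-new", "partition-0.offset", 150))
log.append(("container-old", "partition-0.offset", 100))

state = replay(log)
print(state)
```

Under these assumptions the config race is merely surprising, while the offset race silently rewinds the checkpoint and causes reprocessing.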

> Configure Samza jobs through a stream
> -------------------------------------
>
>                 Key: SAMZA-348
>                 URL: https://issues.apache.org/jira/browse/SAMZA-348
>             Project: Samza
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Chris Riccomini
>              Labels: project
>         Attachments: DESIGN-SAMZA-348-0.md, DESIGN-SAMZA-348-0.pdf
>
>
> Samza's existing config setup is problematic for a number of reasons:
> # It's completely immutable once a job starts. This prevents any dynamic 
> reconfiguration and auto-scaling. It is debatable whether we want these 
> features or not, but our existing implementation actively prevents it. See 
> SAMZA-334 for discussion.
> # We pass existing configuration through environment variables. YARN exports 
> environment variables in a shell script, which limits the size to the varargs 
> length on the machine. This is usually ~128KB. See SAMZA-333 and SAMZA-337 
> for details.
> # User-defined configuration (the Config object) and programmatic 
> configuration (checkpoints and TaskName:State mappings (see SAMZA-123)) are 
> handled differently. It's debatable whether this makes sense.
> In SAMZA-123, [~jghoman] and I propose implementing a ConfigLog. This log 
> would replace both the checkpoint topic and the existing config environment 
> variables in SamzaContainer and Samza's YARN AM.
> I'd like to keep this ticket's scope limited to just the implementation of 
> the ConfigLog, and not re-designing how Samza's config is used in the code 
> (SAMZA-40). We should, however, discuss how this feature would affect dynamic 
> reconfiguration/auto-scaling.



