On Fri, Aug 2, 2019 at 11:00 AM Chad Dombrova <[email protected]> wrote:
> Hi all,
> I’m a bit confused about the desire to use json for the environment_config.
>
Note that complex PipelineOptions are already expected to be in JSON
format[1, 2]. This has solved many string parsing and ambiguity issues.
> It’s harder to use json on the command line, such that now we’re talking
> about the value being *either* a docker image name *or* a path to a json
> file (OR maybe yaml too!), which is not only less convenient than just
> typing the docker ags you want, it's also IMHO a dirty/inconsistent design.
>
> The idea of having each key of the json config map to a docker flag seems
> like a maintenance quagmire with little benefit. In that case, Beam devs
> would have to maintain parity with docker options and be the arbiters of
> what's "safe" and what's not, and users would have to read additional beam
> documentation (which may or may not exist) to discover what keys are valid,
> rather than simply doing what they know, and passing the docker args. As
> Sam points out, if security is a concern there are plenty of ways to abuse
> the system already. Security should be handled at the infrastructure
> deployment level, where it’s actually meaningful.
>
Wouldn't supporting every possible docker option also be a backwards
compatibility and portability quagmire?
Users could say that option X worked with Beam Y but no longer with Y+1 or
with runner A since it used docker Z+1 but not with runner B because it
uses docker Z.
Both options have tradeoffs and the important part is whether the
convenience of specifying all options available to users via docker run
outweigh the drawbacks.
> It also seems like there’s already a precedent for encoding environment
> configuration as command line args. Consider the SUBPROCESS_SDK environment:
>
> options = PipelineOptions()
> options.view_as(PortableOptions).environment_type = \
> python_urns.SUBPROCESS_SDK
> options.view_as(PortableOptions).environment_config = \
> b'/usr/bin/python -m apache_beam.runners.worker.sdk_worker_main'
>
> This could be encoded as json to avoid someone passing something nasty,
> but luckily that was *not* the choice that was made, because I think this
> is a fine design.
>
> As a result, I think the original proposal was the most elegant and
> consistent with other environment types:
>
> --environment_type DOCKER --environment_config "-v
> /Volumes/mnt/foo:/Volumes/mnt/foo --user sambvfx MY_CONTAINER_NAME"
>
>
> Note that the help docs for --environment_config heavily suggest that the
intent for the command was always JSON:
Set environment configuration for running the user code. For DOCKER: Url
for the docker image.\n For PROCESS: json of the form {"os": "<OS>",
"arch": "<ARCHITECTURE>", "command": "<process to execute>",
"env":{"<Environment variables 1>": "<ENV_VAL>"} }. All fields in the json
are optional except command.
1:
https://github.com/apache/beam/blob/24e9cedcc768d901de795477fa78c7f357635671/sdks/java/core/src/main/java/org/apache/beam/sdk/options/PipelineOptionsFactory.java#L163
2:
https://github.com/apache/beam/blob/24e9cedcc768d901de795477fa78c7f357635671/sdks/python/apache_beam/options/pipeline_options.py#L822