A benefit of using Docker containers is that (nearly) arbitrary native
dependencies can be installed in the container image itself by either the
user or the SDK. For example, the (minimal, in progress) Python container
Dockerfile is here:


https://github.com/apache/beam/blob/1039f5b9682fa6aa5fba256110c63caf4d0da41f/sdks/python/container/Dockerfile

Any user could simply augment it with "pip install" commands, say, or use
something else entirely (although the corresponding boot program may also
need to change in that case). The Python SDK itself might also include
options/scripts/etc. to make common customizations easier, so that
dependencies don't have to be installed at runtime. Multiple Dockerfiles
can also co-exist. How the container image is actually passed to the runner
is a choice made by each SDK, which is why it's not discussed much in the
portability context. But a uniform flag along the lines of
--sdk_harness_container_image to include the image in the pipeline proto
would seem desirable. That said, I don't think how all these capabilities
would best be exposed to users has been much explored yet in any SDK.
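
To make the two ideas above a bit more concrete, here is a rough Python
sketch. None of this is existing Beam API: the helper names, the example
image names, and the exact flag spelling are assumptions for illustration
only, and layering a derived image with FROM is just one possible way of
augmenting the Dockerfile.

```python
import argparse


def derived_dockerfile(base_image, pip_packages):
    """Return Dockerfile text that layers pip packages on top of base_image.

    Hypothetical helper: base_image stands in for whatever image an SDK
    would publish for its harness container.
    """
    lines = ["FROM " + base_image]
    if pip_packages:
        # Installing at image build time avoids installing at pipeline runtime.
        lines.append("RUN pip install " + " ".join(pip_packages))
    return "\n".join(lines) + "\n"


def parse_container_image(argv):
    """Pick a hypothetical --sdk_harness_container_image flag out of argv.

    Returns (image_or_None, remaining_args). Unrecognized flags are passed
    through untouched so other option parsers can still handle them.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--sdk_harness_container_image", default=None)
    known, remaining = parser.parse_known_args(argv)
    return known.sdk_harness_container_image, remaining


if __name__ == "__main__":
    # Both image names below are placeholders, not real published images.
    print(derived_dockerfile("example.com/beam/python-harness:dev",
                             ["numpy", "scipy"]))
    image, rest = parse_container_image(
        ["--sdk_harness_container_image=example.com/me/python-harness:custom",
         "--runner=SomeRunner"])
    print(image, rest)
```

An SDK that went this way would still have to decide how the parsed image
name ends up in the pipeline proto, which the sketch does not attempt.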

Finally, there have been several thoughts on cross-language pipelines, and
I think it's a very exciting aspect of the portability framework. A doc is
here:

   https://s.apache.org/beam-mixed-language-pipelines

It is also linked from the design section of the portability page.

Thanks,
 Henning


On Sat, Nov 18, 2017 at 6:33 AM, Holden Karau <hol...@pigscanfly.ca> wrote:

> So I was looking through https://beam.apache.org/contribute/portability/
> which led me to BEAM-2900, and then to
> https://docs.google.com/document/d/1n6s3BOxOPct3uF4UgbbI9O9rpdiKWFH9R6mtVmR7xp0/edit#
> .
>
> I was wondering if any consideration is being given to native
> dependencies that user code may have (especially things like Python
> packages which can be super painful to deal with in a Spark cluster unless
> you use one of the vendor solutions)?
>
> Also, and this may be a terrible idea, but has there been thought given to
> the idea of cross-language pipelines (I see these in Spark occasionally
> but with the DL stuff happening I suspect we might see users wanting
> cross-language functionality more often)?
>
> I also saw "Proposal: introduce an option to pass SDK harness container
> image in Beam SDKs" & it seems like Robert brought up the benefits of using
> Docker for Python runners, but I don't see the details on how we would
> expose that to users in the design docs I've found yet (which could very
> well be because I'm not looking at the right docs).
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
>
