Environments and compatibility Was: [Proposal] Beam ML containers

Kenneth Knowles Tue, 15 Jul 2025 13:42:35 -0700

Altered subject because this has come up in a number of contexts and I
wonder if now is a time to have fresh thoughts on it.

"Environment" has become almost-but-not-quite assumed to be a docker
container image specification. TBH I haven't dug in deeply recently but
this is how I see it discussed. I am aware that there are LOOPBACK and
EMBEDDED environments as well.

The intention behind having environments is really that they could be
somewhat abstract and *often* recognized by runners and elided for
efficiency. It is an anit-goal to have the entirety of the contents of a
container be "the spec" for an environment. This is begging to be trapped
by Hyrum's Law.

All that said, we are where we are. My presumption, for a while now, is
that runners would have to recognize and/or parse docker images and
*reinterpret* them as abstract specifications of environments. In other
words the URL for the Beam Java SDK harness container (plus deps) would be
reinterpreted as "default Beam Java SDK harness" and the runner would then
run in any compatible way it desired.

For example it is *intended* that non-portable runners could be seamlessly
reused after verifying that all environments are compatible with their
non-portable execution style. The Flink/Spark/Samza runner executing an
all-Java pipeline via the portable gRPC protocols is a huge missed
opportunity, just throwing existing functionality and performance away.

In such a regime (where so much is hidden inside a docker URL)
compatibility pretty much has to be runner-determined. First the runner
reinterprets docker URLs (what aspects are important is partially
runner-specific) and deps and in Joey's case licenses more abstractly, then
has its own logic to decide which environments it can merge. This logic is
inherently cross-SDK, both in terms of versions and languages. We could
have a spec in docs and proto but it can't live as code in a particular SDK.

Kenn

On Thu, Jul 3, 2025 at 4:34 AM Joey Tran <joey.t...@schrodinger.com> wrote:

>
> On Tue, Jul 1, 2025 at 2:37 PM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> I think it is probably reasonable to automate this when a GPU resource
>> hint is used. I think we still need to expose this as a config option for
>> the ML containers (and it is the same with distroless) since it is pretty
>> difficult to say with confidence that those images are/aren't needed (even
>> if you're using a transform like RunInference, maybe you're using Spacy
>> which isn't a default dependency included in the ML images) and there is a
>> cost to using them (longer startup times).
>>
>> > This being the messy world of ML, would these images be
>> mahine/accelerator agnostic?
>>
>> That is the goal (at least to be agnostic within GPU types), and the
>> images will be as simple as possible to accommodate this. I think building
>> from an Nvidia base should accomplish this for most cases. For anything
>> beyond that, I think it is reasonable to ask users to build their own
>> container.
>>
>> On Tue, Jul 1, 2025 at 1:36 PM Robert Bradshaw <rober...@waymo.com>
>> wrote:
>>
>>> On Tue, Jul 1, 2025 at 10:32 AM Kenneth Knowles <k...@apache.org> wrote:
>>> >
>>> > Obligatory question: can we automate this? Specifically: can we
>>> publish the ML-specific containers and then use them as appropriate without
>>> making it a user-facing knob?
>>>
>>> +1
>>>
>>> Transforms can declare their own environments. The only problem with
>>> this is that distinct environments prohibit fusion--we need a way to
>>> say that a given environment is a superset of another. (We can do this
>>> with dependencies, but not with arbitrary docker images.) (One could
>>> possibly get away with the "AnyOf" environment as the base environment
>>> as well, if we define (and enforce) a preference order.)
>>>
>>>
> This comes up a lot for us (Schrodinger). e.g. our runner allows for
> transforms to specify what licenses they require, but the current rules for
> environment compatibility make it difficult to allow transforms that have
> no license requirements to fuse with environments that do have requirements
> (as a workaround, we just implement this through transform annotations).
>
> It'd also be really convenient for us since we don't ship our software
> with GCP libraries so we need a separate environment for GCP-transforms.
> Allowing fusion of GCP-transforms with non-GCP-transforms will be a bit
> difficult with the current system.
>
>
>
> This being the messy world of ML, would these images be
>>> mahine/accelerator agnostic?
>>>
>>> > Kenn
>>> >
>>> > On Mon, Jun 30, 2025 at 12:07 PM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>> >>
>>> >> Hey everyone, I'd like to propose publishing some ML-specific Beam
>>> containers alongside our normal base containers. The end result would be
>>> allowing users to specify `--sdk_container_image=ml` or
>>> `--sdk_container_image=gpu` so that their jobs run in containers which work
>>> well with ML/GPU jobs.
>>> >>
>>> >> I put together a tiny design, please take a look and let me know what
>>> you think.
>>> >>
>>> >>
>>> https://docs.google.com/document/d/1JcVFJsPbVvtvaYdGi-DzWy9PIIYJhL7LwWGEXt2NZMk/edit?usp=sharing
>>> >>
>>> >> Thanks,
>>> >> Danny
>>>
>>

Environments and compatibility Was: [Proposal] Beam ML containers

Reply via email to