Re: Environments and compatibility Was: [Proposal] Beam ML containers

Chamikara Jayalath via dev Tue, 15 Jul 2025 23:24:49 -0700

On Tue, Jul 15, 2025 at 5:32 PM Joey Tran <joey.t...@schrodinger.com> wrote:


> "Environment" has become almost-but-not-quite assumed to be a docker
>> container image specification. TBH I haven't dug in deeply recently but
>> this is how I see it discussed. I am aware that there are LOOPBACK and
>> EMBEDDED environments as well.
>
>
> Docker container images only cover the software deps part of the
> "Environment", right? A runner still needs to figure out hardware
> requirements for provisioning workers based on the Environment as well. I'm
> not sure how that might be interpretable from a docker URL.
>

I think hardware requirements are usually specified via other pipeline
options (could be runner specific). Environment has been traditionally used
to define software components where the transforms should be executed. In
Docker case, the URN should fully define the software execution environment.


>
>
>> In such a regime (where so much is hidden inside a docker URL)
>> compatibility pretty much has to be runner-determined. First the runner
>> reinterprets docker URLs (what aspects are important is partially
>> runner-specific) and deps and in Joey's case licenses more abstractly, then
>> has its own logic to decide which environments it can merge. This logic is
>> inherently cross-SDK, both in terms of versions and languages. We could
>> have a spec in docs and proto but it can't live as code in a particular SDK.
>
>
> I think the spec may also need ways to implement user-configurable merge
> strategies. e.g. you can imagine that a gpu-rich user may be happy with
> greedy fusion while a user with few GPUs would want only GPU-requiring
> transforms running on their GPU nodes
>

Runners are free to override the environments as they wish but as Kenn
mentioned, once this is done, compatibility completely becomes the runner's
responsibility. I think for supported runners in the Beam repo, we should
do this with care since ensuring compatibility for various transforms can
be a challenge without a well defined environment that executes Beam
transforms. For example, Dataflow does override the Docker environment but
just to a different container that usually is a clone of the original
container so compatibility is guaranteed.

Thanks,
Cham


>
>
> On Tue, Jul 15, 2025 at 4:42 PM Kenneth Knowles <k...@apache.org> wrote:
>
>> Altered subject because this has come up in a number of contexts and I
>> wonder if now is a time to have fresh thoughts on it.
>>
>> "Environment" has become almost-but-not-quite assumed to be a docker
>> container image specification. TBH I haven't dug in deeply recently but
>> this is how I see it discussed. I am aware that there are LOOPBACK and
>> EMBEDDED environments as well.
>>
>> The intention behind having environments is really that they could be
>> somewhat abstract and *often* recognized by runners and elided for
>> efficiency. It is an anit-goal to have the entirety of the contents of a
>> container be "the spec" for an environment. This is begging to be trapped
>> by Hyrum's Law.
>>
>> All that said, we are where we are. My presumption, for a while now, is
>> that runners would have to recognize and/or parse docker images and
>> *reinterpret* them as abstract specifications of environments. In other
>> words the URL for the Beam Java SDK harness container (plus deps) would be
>> reinterpreted as "default Beam Java SDK harness" and the runner would then
>> run in any compatible way it desired.
>>
>> For example it is *intended* that non-portable runners could be
>> seamlessly reused after verifying that all environments are compatible with
>> their non-portable execution style. The Flink/Spark/Samza runner executing
>> an all-Java pipeline via the portable gRPC protocols is a huge missed
>> opportunity, just throwing existing functionality and performance away.
>>
>> In such a regime (where so much is hidden inside a docker URL)
>> compatibility pretty much has to be runner-determined. First the runner
>> reinterprets docker URLs (what aspects are important is partially
>> runner-specific) and deps and in Joey's case licenses more abstractly, then
>> has its own logic to decide which environments it can merge. This logic is
>> inherently cross-SDK, both in terms of versions and languages. We could
>> have a spec in docs and proto but it can't live as code in a particular SDK.
>>
>> Kenn
>>
>> On Thu, Jul 3, 2025 at 4:34 AM Joey Tran <joey.t...@schrodinger.com>
>> wrote:
>>
>>>
>>> On Tue, Jul 1, 2025 at 2:37 PM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> I think it is probably reasonable to automate this when a GPU resource
>>>> hint is used. I think we still need to expose this as a config option for
>>>> the ML containers (and it is the same with distroless) since it is pretty
>>>> difficult to say with confidence that those images are/aren't needed (even
>>>> if you're using a transform like RunInference, maybe you're using Spacy
>>>> which isn't a default dependency included in the ML images) and there is a
>>>> cost to using them (longer startup times).
>>>>
>>>> > This being the messy world of ML, would these images be
>>>> mahine/accelerator agnostic?
>>>>
>>>> That is the goal (at least to be agnostic within GPU types), and the
>>>> images will be as simple as possible to accommodate this. I think building
>>>> from an Nvidia base should accomplish this for most cases. For anything
>>>> beyond that, I think it is reasonable to ask users to build their own
>>>> container.
>>>>
>>>> On Tue, Jul 1, 2025 at 1:36 PM Robert Bradshaw <rober...@waymo.com>
>>>> wrote:
>>>>
>>>>> On Tue, Jul 1, 2025 at 10:32 AM Kenneth Knowles <k...@apache.org>
>>>>> wrote:
>>>>> >
>>>>> > Obligatory question: can we automate this? Specifically: can we
>>>>> publish the ML-specific containers and then use them as appropriate 
>>>>> without
>>>>> making it a user-facing knob?
>>>>>
>>>>> +1
>>>>>
>>>>> Transforms can declare their own environments. The only problem with
>>>>> this is that distinct environments prohibit fusion--we need a way to
>>>>> say that a given environment is a superset of another. (We can do this
>>>>> with dependencies, but not with arbitrary docker images.) (One could
>>>>> possibly get away with the "AnyOf" environment as the base environment
>>>>> as well, if we define (and enforce) a preference order.)
>>>>>
>>>>>
>>> This comes up a lot for us (Schrodinger). e.g. our runner allows for
>>> transforms to specify what licenses they require, but the current rules for
>>> environment compatibility make it difficult to allow transforms that have
>>> no license requirements to fuse with environments that do have requirements
>>> (as a workaround, we just implement this through transform annotations).
>>>
>>> It'd also be really convenient for us since we don't ship our software
>>> with GCP libraries so we need a separate environment for GCP-transforms.
>>> Allowing fusion of GCP-transforms with non-GCP-transforms will be a bit
>>> difficult with the current system.
>>>
>>>
>>>
>>> This being the messy world of ML, would these images be
>>>>> mahine/accelerator agnostic?
>>>>>
>>>>> > Kenn
>>>>> >
>>>>> > On Mon, Jun 30, 2025 at 12:07 PM Danny McCormick via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>> >>
>>>>> >> Hey everyone, I'd like to propose publishing some ML-specific Beam
>>>>> containers alongside our normal base containers. The end result would be
>>>>> allowing users to specify `--sdk_container_image=ml` or
>>>>> `--sdk_container_image=gpu` so that their jobs run in containers which 
>>>>> work
>>>>> well with ML/GPU jobs.
>>>>> >>
>>>>> >> I put together a tiny design, please take a look and let me know
>>>>> what you think.
>>>>> >>
>>>>> >>
>>>>> https://docs.google.com/document/d/1JcVFJsPbVvtvaYdGi-DzWy9PIIYJhL7LwWGEXt2NZMk/edit?usp=sharing
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Danny
>>>>>
>>>>

Re: Environments and compatibility Was: [Proposal] Beam ML containers

Reply via email to