Minor correction: CoGBK broadcast vs. full shuffle is probably not an ideal example, because it still requires grouping the larger PCollection (if it is not already grouped). If we instead take a Join PTransform that acts on the Cartesian product of these groups, then it works well.

Jan

On 11/16/20 8:39 PM, Jan Lukavský wrote:

Hi,

could this proposal be generalized to annotations of PCollections as well? Maybe that reduces to several types of annotations of a PTransform - e.g.

 a) runtime annotations of a PTransform (that might be scheduling hints - e.g. schedule this task to nodes with GPUs, etc.)

 b) output annotations - i.e. annotations that actually apply to PCollections, as every PCollection has at most one producer (this is what has actually been discussed in the referenced mailing list threads)

It would be cool if this added the option to do PTransform expansions based on annotations of input PCollections. We tried to play with this in the Euphoria DSL, but it turned out it would fit best in the Beam SDK itself.

An example of an input-annotation-sensitive expansion might be CoGBK: when one side is annotated with, e.g., FitsInMemoryPerWindow (or SmallPerWindow, or whatever), CoGBK might be expanded using a broadcast instead of a full shuffle.
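To make the idea concrete, here is a rough sketch in plain Python (no Beam APIs), using a keyed join as the transform being expanded. Everything in it is an assumption for illustration: the annotation name is just the one floated above, and `expand_join`, `shuffle_join` and `broadcast_join` are hypothetical helpers. The point is only that the hint selects the strategy, while the output is the same either way.

```python
from collections import defaultdict
from itertools import product

# Hypothetical annotation name; nothing like this is defined in the SDK here.
FITS_IN_MEMORY = "FitsInMemoryPerWindow"


def shuffle_join(left, right):
    """Group both sides by key (a stand-in for a full shuffle), then pair values."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return [
        (key, l, r)
        for key, (ls, rs) in grouped.items()
        for l, r in product(ls, rs)
    ]


def broadcast_join(large, small):
    """Build an in-memory lookup from the small side and stream the large side past it."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return [(key, l, r) for key, l in large for r in lookup.get(key, [])]


def expand_join(left, right, right_annotations=frozenset()):
    """Choose an expansion from the hint on the right input; results are identical."""
    if FITS_IN_MEMORY in right_annotations:
        return broadcast_join(left, right)
    return shuffle_join(left, right)
```

Ignoring the annotation simply means falling back to `shuffle_join`, which is slower for a small side but produces the same joined pairs.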

Absolutely agree that all this must not have anything to do with semantics and correctness, and thus might be safely ignored. That might also answer @Reuven's last question: when there are conflicting annotations, it would be possible to simply drop them as a last resort.
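A minimal sketch of that "drop on conflict" fallback, again in plain Python and under assumptions of my own: `Transform` is a hypothetical stand-in for a node in the PTransform hierarchy, annotations are simple key/value pairs, and leaf names are unique. A leaf inherits hints from its enclosing transforms, and any key whose values disagree along the way is dropped rather than resolved.

```python
from dataclasses import dataclass, field


@dataclass
class Transform:
    """Hypothetical stand-in for a node in the PTransform hierarchy."""
    name: str
    annotations: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def resolve(node, inherited=None, dropped=frozenset()):
    """Return {leaf name: annotations}, dropping any key with conflicting values."""
    merged = dict(inherited or {})
    dropped = set(dropped)
    for key, value in node.annotations.items():
        if key in dropped:
            continue  # already found to be in conflict further up the hierarchy
        if key in merged and merged[key] != value:
            # An enclosing transform disagrees with this one: drop the hint.
            del merged[key]
            dropped.add(key)
        else:
            merged[key] = value
    if not node.children:
        return {node.name: merged}
    leaves = {}
    for child in node.children:
        leaves.update(resolve(child, merged, frozenset(dropped)))
    return leaves
```

Because the hints are optional, dropping one only loses an optimization; a specific annotation could of course define a smarter merge than this.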

Jan

On 11/16/20 8:13 PM, Robert Burke wrote:
I imagine it is up to the specific annotation to define that.

The runner notionally doesn't need to do anything with them, as they are optional, and not required for correctness.

On Mon, Nov 16, 2020, 10:56 AM Reuven Lax <re...@google.com <mailto:re...@google.com>> wrote:

    PTransforms are hierarchical - namely a PTransform contains other
    PTransforms, and so on. Is the runner expected to resolve all
    annotations down to leaf nodes? What happens if that results in
    conflicting annotations?

    On Mon, Nov 16, 2020 at 10:54 AM Robert Burke <rob...@frantil.com
    <mailto:rob...@frantil.com>> wrote:

        That's a good question.

        I think the main difference is a matter of scope. Annotations
        would apply to a PTransform while an environment applies to
        sets of transforms. Another difference is the optional nature
        of annotations: they don't affect correctness. Runners don't
        need to do anything with them and can still execute the
        pipeline correctly.

        Consider a privacy analysis on a pipeline graph. An
        annotation indicating that a transform provides a certain
        level of anonymization can be used in an analysis to
        determine if the downstream transforms are encountering raw
        data or not.

        From my understanding (which can be wrong) environments are
        rigid. Transforms in different environments can't be fused.
        "This is the python env", "this is the java env" can't be
        merged together. It's not clear to me that we have defined
        when environments are safely fuseable outside of equality.
        There's value in that simplicity.

        AFAICT, an environment has less to do with the machines a
        pipeline is executing on than with the kinds of SDK
        pipelines it understands and can execute.



        On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova
        <chad...@gmail.com <mailto:chad...@gmail.com>> wrote:


                Another example of an optional annotation is marking
                a transform to run on secure hardware, or to give
                hints to profiling/dynamic analysis tools.


            There seems to be a lot of overlap between this idea and
            Environments.  Can you talk about how you feel they may
            be different or related?  For example, I could see
            annotations as a way of tagging transforms with an
            Environment, or I could see Environments becoming a
            specialized form of annotation.

            -chad
