Re: PTransform Annotations Proposal

Jan Lukavský Mon, 16 Nov 2020 11:44:53 -0800

Minor correction: the CoGBK broadcast vs. full shuffle is probably not an ideal example, because it still requires grouping the larger PCollection (if not already grouped). If we take a Join PTransform that acts on the Cartesian product of these groups, then it works well.

Jan


On 11/16/20 8:39 PM, Jan Lukavský wrote:

Hi,
could this proposal be generalized to annotations of PCollections as well? Maybe that reduces to several types of annotations of a PTransform - e.g.
a) runtime annotations of a PTransform (that might be scheduling hints - i.e. schedule this task to nodes with GPUs, etc.)
b) output annotations - i.e. annotations that actually apply to PCollections, as every PCollection has at most one producer (this is what has actually been discussed in the referenced mailing list threads)
It would be cool if this added the option to do PTransform expansions based on annotations of input PCollections. We tried to play with this in the Euphoria DSL, but it turned out it would fit best in the Beam SDK.
An example of an input-annotation-sensitive expansion might be CoGBK: when one side is annotated, e.g. FitsInMemoryPerWindow (or SmallPerWindow, or whatever), CoGBK might be expanded using broadcast instead of a full shuffle.
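To make the idea concrete, here is a minimal sketch in plain Python of an expansion that picks a strategy from input annotations. It uses a join rather than CoGBK, per the correction at the top of this message. Nothing here is a Beam API: AnnotatedCollection and choose_join_strategy are hypothetical names, and only the FitsInMemoryPerWindow annotation name comes from the text above.

```python
from dataclasses import dataclass, field

# Annotation name taken from the thread; everything else is hypothetical.
FITS_IN_MEMORY = "FitsInMemoryPerWindow"


@dataclass
class AnnotatedCollection:
    """Stand-in for a PCollection that carries optional output annotations."""
    name: str
    annotations: frozenset = field(default_factory=frozenset)


def choose_join_strategy(left, right):
    """Return (strategy, broadcast_side_name).

    The annotation only selects the execution plan; the join result is
    meant to be identical whichever plan is chosen.
    """
    for side in (left, right):
        if FITS_IN_MEMORY in side.annotations:
            return ("broadcast", side.name)
    # No usable hint (or the hints were dropped): use the always-correct plan.
    return ("shuffle", None)


small = AnnotatedCollection("users", frozenset({FITS_IN_MEMORY}))
large = AnnotatedCollection("events")

print(choose_join_strategy(large, small))
print(choose_join_strategy(large, large))
```

Because the fallback branch is always valid, a runner or SDK that ignores the annotation entirely still produces a correct pipeline.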
Absolutely agree that all this must not have anything to do with semantics and correctness, and thus might be safely ignored. That might answer the last question of @Reuven: when there are conflicting annotations, it would be possible to simply drop them as a last resort.
Jan

On 11/16/20 8:13 PM, Robert Burke wrote:
I imagine it has everything to do with the specific annotation to define that.
The runner notionally doesn't need to do anything with them, as they are optional, and not required for correctness.
On Mon, Nov 16, 2020, 10:56 AM Reuven Lax <re...@google.com<mailto:re...@google.com>> wrote:
    PTransforms are hierarchical - namely a PTransform contains other
    PTransforms, and so on. Is the runner expected to resolve all
    annotations down to leaf nodes? What happens if that results in
    conflicting annotations?
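
One possible policy, sketched in plain Python under stated assumptions: annotations are pushed down from composites to their leaves, and a key whose value differs between an enclosing and a nested transform is dropped for that subtree. The thread does not settle on this policy; the Transform class, the "hardware" key, and resolve_to_leaves are all hypothetical, and leaf names are assumed to be unique.

```python
from dataclasses import dataclass, field


@dataclass
class Transform:
    """Hypothetical composite-transform node; not Beam's pipeline model."""
    name: str
    annotations: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def resolve_to_leaves(node, inherited=None, conflicted=None):
    """Map each leaf name to the annotations that reach it without conflict."""
    inherited = dict(inherited or {})
    conflicted = set(conflicted or ())
    for key, value in node.annotations.items():
        if key in conflicted:
            continue  # already dropped higher up in this subtree
        if key in inherited and inherited[key] != value:
            # Enclosing and nested transforms disagree: drop the hint.
            del inherited[key]
            conflicted.add(key)
        else:
            inherited[key] = value
    if not node.children:
        return {node.name: inherited}
    resolved = {}
    for child in node.children:
        resolved.update(resolve_to_leaves(child, inherited, conflicted))
    return resolved


pipeline = Transform(
    "Composite",
    {"hardware": "gpu"},
    [Transform("Parse"), Transform("Score", {"hardware": "cpu"})],
)
print(resolve_to_leaves(pipeline))
```

Dropping is only safe because the annotations are optional; an annotation that affected correctness could not be handled this way.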

    On Mon, Nov 16, 2020 at 10:54 AM Robert Burke <rob...@frantil.com
    <mailto:rob...@frantil.com>> wrote:

        That's a good question.

        I think the main difference is a matter of scope. Annotations
        would apply to a PTransform while an environment applies to
        sets of transforms. Another difference is the optional nature
        of the annotations: they don't affect correctness. Runners
        don't need to do anything with them and can still execute the
        pipeline correctly.

        Consider a privacy analysis on a pipeline graph. An
        annotation indicating that a transform provides a certain
        level of anonymization can be used in an analysis to
        determine if the downstream transforms are encountering raw
        data or not.
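
A rough sketch of such an analysis in plain Python, assuming the pipeline is available as an adjacency map and that an anonymizing transform carries a boolean "anonymizes" annotation. Both the graph shape and the annotation key are assumptions made for illustration, not anything defined by the proposal.

```python
from collections import deque


def transforms_seeing_raw_data(edges, annotations, sources):
    """Return the transforms that can receive raw (non-anonymized) data.

    edges: transform name -> list of downstream transform names.
    annotations: transform name -> dict of annotations (may be missing).
    sources: transforms that read raw data.
    """
    raw = set()
    queue = deque(sources)
    while queue:
        name = queue.popleft()
        if name in raw:
            continue
        raw.add(name)
        if annotations.get(name, {}).get("anonymizes"):
            # This transform sees raw input, but its output is anonymized,
            # so rawness is not propagated through it.
            continue
        queue.extend(edges.get(name, ()))
    return raw


edges = {
    "Read": ["Anonymize", "Debug"],
    "Anonymize": ["Aggregate"],
    "Debug": [],
}
annotations = {"Anonymize": {"anonymizes": True}}

print(sorted(transforms_seeing_raw_data(edges, annotations, ["Read"])))
```

A transform without the annotation is treated as passing raw data through, so a missing annotation makes the analysis more conservative rather than less.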

        From my understanding (which can be wrong) environments are
        rigid. Transforms in different environments can't be fused.
        "This is the python env", "this is the java env" can't be
        merged together. It's not clear to me that we have defined
        when environments are safely fuseable outside of equality.
        There's value in that simplicity.

        AFAICT an environment has less to do with the machines a
        pipeline is executing on than it does with the kinds of SDK
        pipelines it understands and can execute.



        On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova
        <chad...@gmail.com <mailto:chad...@gmail.com>> wrote:


                Another example of an optional annotation is marking
                a transform to run on secure hardware, or to give
                hints to profiling/dynamic analysis tools.


            There seems to be a lot of overlap between this idea and
            Environments.  Can you talk about how you feel they may
            be different or related?  For example, I could see
            annotations as a way of tagging transforms with an
            Environment, or I could see Environments becoming a
            specialized form of annotation.

            -chad
