Mostly because perfect is the enemy of good enough. We have a proposal, we have clear boundaries for it. It's fine if the discussion continues, but I see no evidence of concerns that should prevent starting an implementation, because it seems we'll need both anyway.
On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw <[email protected]> wrote: > On Wed, Nov 25, 2020 at 10:15 AM Robert Burke <[email protected]> wrote: > > > > It sounds like we've come to the position that non-correctness affecting > Ptransform Annotations are valuable at both leaf and composite levels, and > don't remove the potential need for a similar feature on Environments, to > handle physical concerns equirements for worker processes to have (such as > Ram, CPU, or GPU requirements.) > > > > Kenn, it's not clear what part of the solution (an annotation field on > the Ptransform proto message) would need to change to satisfy your scope > concern, beyond documenting unambiguously that these may not be used for > physical concerns or things that affect correctness. > > I'll let Kenn answer as well, but from my point of view, explicitly > having somewhere better to put these things would help. > > > I'm also unclear your scope concern not matching, given the above. Your > first paragraph reads very supportive of logical annotations on > Ptransforms, and that matches 1-1 with the current proposed solution. Can > you clarify your concern? > > > > I don't wish to scope creep on the physical requirements issue at this > time. It seems we are agreed they should end up on environments, but I'm > not seeing proposals on the right way to execute them at this time.They > seem to be a fruitful topic of discussion, in particular > unifying/consolidating them for efficient use of resources. I don't think > we want to end up in a state where every distinct physical concern means a > distinct environment. > > Why not? Assuming, of course, that runners are free to merge > environments (merging those resource hints they understand and are > otherwise compatible, and discarding those they don't) for efficient > execution. > > > I for one am ready to see a PR. > > +1 > > > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles <[email protected]> wrote: > >> > >> > >> > >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw <[email protected]> > wrote: > >>> > >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran < > [email protected]> wrote: > >>> > > >>> > Thanks everyone so much for their input and for the insightful > discussion. > >>> > > >>> > Not being knowledgeable about Beam's internals, I have to say I am a > bit lost on the PTransform vs. environment discussion. > >>> > > >>> > I do agree with Burke's notion that merge rules are very annotation > dependent, I don't think we can find a one-size-fits-all solution for that. > So this might be actually be an argument in favour of having annotations on > PTransforms, since it avoids the conflation with environments. > >>> > > >>> > Also in general, I feel that having annotations per single transform > (rather than composite) and on PTransforms could lead to a simpler design. > >>> > >>> If we want to use these for privacy, I don't see how attaching them to > >>> leaf transforms alone could work. (Even CombinePerKey is a composite.) > >>> > >>> > Seeing as there are valuable arguments in favour of both (PTransform > and environments) with no clear(?) "best solution", I would propose moving > forward with the initial (PTransform) design to ship the feature and > unblock teams asking for it. If it turns out that there was indeed a need > to have annotations in environments, we could always refactor it. > >>> > >>> I have yet to see any arguments that resource-level hints, such as > >>> memory or GPU, don't better belong on the environment. But moving > >>> forward on PTransform-level ones for logical statements like privacy > >>> declarations makes sense. > >> > >> > >> Exactly this. Properties of transforms make sense. The properties may > hold only of the whole subgraph. Even something as simple as "preserves > keys". This is analogous (but converse) to requirements like "requires > sorted input" which were explicitly excluded from the scope, which was > about hardware environment for execution. > >> > >> The proposed scope and the proposed solution do not match and need to > be reconciled. > >> > >> Kenn > >> > >>> > On 2020/11/17 19:07:22, Robert Bradshaw <[email protected]> wrote: > >>> > > So far we have two distinct usecases for annotations: resource > hints > >>> > > and privacy directives, and I've been trying to figure out how to > >>> > > reconcile them, but they seem to have very different > characteristics. > >>> > > (It would be nice to come up with other uses as well to see if > we're > >>> > > really coming up with a generally useful mode--I think display data > >>> > > could fit into this as a new kind of annotation rather than being a > >>> > > top-level property, and it could make sense on both leaf and > composite > >>> > > transforms.) > >>> > > > >>> > > To me, resource hints like GPU are inextricably tied to the > >>> > > environment. A transform tagged with GPU should reference a Fn that > >>> > > invokes GPU-accelerated code that lives in a particular > environment. > >>> > > Something like high-mem is a bit squishier. Some DoFns take a lot > of > >>> > > memory, but on the other hand one could imagine labeling a CoGBK as > >>> > > high-mem due to knowing that, in this particular usage, there will > be > >>> > > lots of values with the same key. Ideally runners would be > intelligent > >>> > > enough to automatically learn memory usage, but even in this case > it > >>> > > may be a good hint to try and learn the requirements for DoFn A and > >>> > > DoFn B separately (which is difficult if they are always colocated, > >>> > > but valuable if, e.g. A takes a huge amount of memory and B takes a > >>> > > huge amount of wall time). > >>> > > > >>> > > Note that tying things to the environment does not preclude using > them > >>> > > in non-portable runners as they'll still have an SDK-level > >>> > > representation (though I don't think we should have an explicit > goal > >>> > > of feature parity for non-portable runners, e.g. multi-language > isn't > >>> > > happening, and hope that non-portable runners go away soon anyway). > >>> > > > >>> > > Now let's consider privacy annotations. To make things very > concrete, > >>> > > imagine a transform AverageSpendPerZipCode which takes as input > (user, > >>> > > zip, spend), all users unique, and returns (zip, avg(spend)). In > >>> > > Python, this is GroupBy('zip').aggregate_field('spend', > >>> > > MeanCombineFn()). This is not very privacy preserving to those > users > >>> > > who are the only (or one of a few) in a zip code. So we could > define a > >>> > > transform PrivacyPreservingAverageSpendPerZipCode as > >>> > > > >>> > > @ptransform_fn > >>> > > def PrivacyPreservingAverageSpendPerZipCode(spend_per_user, > threshold) > >>> > > counts_per_zip = spend_per_user | > >>> > > GroupBy('zip').aggregate_field('user', CountCombineFn()) > >>> > > spend_per_zip = spend_per_user | > >>> > > GroupBy('zip').aggregate_field('spend', MeanCombineFn()) > >>> > > filtered = spend_per_zip | beam.Filter( > >>> > > lambda x, counts: counts[x.zip] > threshold, > >>> > > counts=AsMap(counts_per_zip)) > >>> > > return filtered > >>> > > > >>> > > We now have a composite that has privacy preserving properties > (i.e. > >>> > > the input may be quite sensitive, but the output is not, depending > on > >>> > > the value of threshold). What is interesting here is that it is > only > >>> > > the composite that has this property--no individual sub-transform > is > >>> > > itself privacy preserving. Furthermore, an optimizer may notice > we're > >>> > > doing aggregation on the same key twice and rewrite this using > >>> > > (logically) > >>> > > > >>> > > GroupBy('zip').aggregate_field('user', > >>> > > CountCombineFn()).aggregate_field('spend', MeanCombineFn()) > >>> > > > >>> > > and then applying the filter, which is semantically equivalent and > >>> > > satisfies the privacy annotations (and notably that does not even > >>> > > require the optimizer to interpret the annotations, just pass them > >>> > > on). To me, this implies that these annotations belong on the > >>> > > composites, and *not* on the leaf nodes (where they would be > >>> > > incorrect). > >>> > > > >>> > > I'll leave aside most questions of API until we figure out the > model > >>> > > semantics, but wanted to throw one possible idea out (though I am > >>> > > ambivalent about it). Instead of attaching things to transforms, we > >>> > > can just wrap transforms in composites that have no role other than > >>> > > declaring information about their contents. E.g. we could have a > >>> > > composite transform whose payload is simply an assertion of the > >>> > > privacy (or resource?) properties of its inner structure. This > would > >>> > > be just as expressive as adding new properties to transforms > >>> > > themselves (but would add an extra level of nesting, and make > >>> > > respecting the precice nesting more important). > >>> > > > >>> > > On Tue, Nov 17, 2020 at 8:12 AM Robert Burke <[email protected]> > wrote: > >>> > > > > >>> > > > +1 to discussing PCollection annotations on a separate thread. > It would be confusing to mix them up. > >>> > > > > >>> > > > ----------- > >>> > > > > >>> > > > The question around conflicts is interesting, but confusing to > me. I don't think they exist in general. I keep coming back around to that > it depends on the annotation and the purpose of composites. Optionality > saves us here too. > >>> > > > > >>> > > > Composites are nothing without their internal hypergraph > structure. Eventually it comes down to executing the leaf nodes. The > alternative to executing the leaf nodes is when the composite represents a > known transform and is replaced by the runner on submission time. Lets > look at each. > >>> > > > > >>> > > > If there's a property that only exists on the leaf nodes, then > it's not possible to bubble up that property to the composite in all cases. > Afterall, it's not necessarily the case that a privacy preserving transform > maintains the property for all output edges as not all such edges pass > through the preserving transform. > >>> > > > > >>> > > > On the other hand, with memory or gpu recommendations, that > might set a low bar on the composite level. > >>> > > > > >>> > > > But, composites (any transform really) can be runner replaced. I > think it's fair to say that a runner replaced composite is not beholden to > the annotations of the original leaf transforms, especially around physical > requirements. The implementations are different. If a known composite at > the composite level requires GPUs and it's known replacement doesn't, I'd > posit that replacement was a choice the runner made since it can't > provision machines with GPUs. > >>> > > > > >>> > > > But, crucially around privacy annotated transforms, a runner > likely shouldn't replace a given subgraph that contains a privacy > annotationed transform unless the replacements provide the same level of > privacy. However, such replacements only happens with well known transforms > with known properties anyway, so this can serve as an additional layer of > validation for a runner aware of the properties. > >>> > > > > >>> > > > This brings me back to my position: that the notion of conflicts > is very annotation dependant, and that defining them as optional is the > most important feature to avoid issues. Conflicts don't exist as an > inherent property of annotations on ptransform of the hypergraph structure. > Am i wrong? No one has come up with an actual example of a conflict as far > as i understand the thread. > >>> > > > > >>> > > > Even Reuven's original question is more about whether the runner > is forced to look at leaf bodes rather than only looking at the composite. > Assuming the composite isn't replaced, the runner needs to look at the leaf > nodes regardless. And as discussed above there's no generalized semantics > that fit for all kinds of annotations, once replacements are also > considered. > >>> > > > > >>> > > > On Tue, Nov 17, 2020, 6:35 AM Ismaël Mejía <[email protected]> > wrote: > >>> > > >> > >>> > > >> +1 Nice to see there is finally interest on this. Annotations > for > >>> > > >> PTransforms make total sense! > >>> > > >> > >>> > > >> The semantics should be strictly optional for runners and > correct > >>> > > >> execution should not be affected by lack of support of any > annotation. > >>> > > >> We should however keep the set of annotations small. > >>> > > >> > >>> > > >> > PTransforms are hierarchical - namely a PTransform contains > other PTransforms, and so on. Is the runner expected to resolve all > annotations down to leaf nodes? What happens if that results in conflicting > annotations? > >>> > > >> > >>> > > >> +1 to this question, This needs to be detailed. > >>> > > >> > >>> > > >> I am curious about how the end user APIs of this will look > maybe in > >>> > > >> Java or Python, just an extra method to inject a Map or via Java > >>> > > >> annotations/Python decorators? > >>> > > >> > >>> > > >> We might prefer not to mix the concepts of annotations and > >>> > > >> environments because this will limit the scope of annotations. > >>> > > >> Annotations are different from environments because they serve > a more > >>> > > >> general idea: to express an intention and it is up to the > runner to > >>> > > >> choose the strategy to accomplish this, for example in the GPU > >>> > > >> assignation case it could be to rewrite resource allocation via > >>> > > >> Environments but it could also just delegate this to a resource > >>> > > >> manager which is what we could do for example for GPU (or data > >>> > > >> locality) cases on the Spark/Flink classic runners. If we tie > this to > >>> > > >> environments we will leave classic runners out of the loop for > no > >>> > > >> major reason and also not cover use cases not related to > resource > >>> > > >> allocation. > >>> > > >> > >>> > > >> I do not understand the use case to justify PCollection > annotations > >>> > > >> but to not mix this thread with them, would you be interested to > >>> > > >> elaborate more about them in a different thread Jan? > >>> > > >> > >>> > > >> On Tue, Nov 17, 2020 at 2:28 AM Robert Bradshaw < > [email protected]> wrote: > >>> > > >> > > >>> > > >> > I agree things like GPU, high-mem, etc. belong to the > environment. If > >>> > > >> > annotations are truly advisory, one can imagine merging > environments > >>> > > >> > by taking the union of annotations and still producing a > correct > >>> > > >> > pipeline. (This would mean that annotations would have to be a > >>> > > >> > multi-map...) > >>> > > >> > > >>> > > >> > On the other hand, this doesn't seem to handle the case of > privacy > >>> > > >> > analysis, which could apply to composites without applying to > any > >>> > > >> > individual component, and don't really make sense as part of a > >>> > > >> > fusion/execution story. > >>> > > >> > > >>> > > >> > On Mon, Nov 16, 2020 at 4:00 PM Robert Burke < > [email protected]> wrote: > >>> > > >> > > > >>> > > >> > > That's good historical context. > >>> > > >> > > > >>> > > >> > > But then we'd still need to codify the annotation would > need to be optional, and not affect correctness. > >>> > > >> > > > >>> > > >> > > Conflicts become easier to manage, (as environments with > conflicting annotations simply don't get merged, and stay as distinct > environments) but are still notionally annotation dependant. Do most > runners handle environments so individually though? > >>> > > >> > > > >>> > > >> > > Reuven's question is a good one though. For the Go SDK, and > the proposed implementation i saw, they only applied to leaf nodes. This is > an artifact of how the Go SDK handles composites. Nothing stops it from > being implemented on the composites Go has, but it didn't make sense > otherwise. AFAICT Composites are generally for organizational convenience > and not for functional aspects. Is this wrong? Afterall, does it make sense > for environments to be on leaves and composites either? It's the same issue > there. > >>> > > >> > > > >>> > > >> > > > >>> > > >> > > On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles < > [email protected]> wrote: > >>> > > >> > >> > >>> > > >> > >> I am +1 to the proposal but believe it should be moved to > the Environment. I could be convinced otherwise, but would want to really > understand the details. > >>> > > >> > >> > >>> > > >> > >> I think we haven't done a great job communicating the > purpose of the Environment proto. It was explicitly created for this > purpose. > >>> > > >> > >> > >>> > > >> > >> 1. It tells the runner things it needs to know to > interpret the DoFn (or other UDF). So these are the existing proto fields > like docker image (in the payload) and required artifacts that were staged. > >>> > > >> > >> 2. It is also the place for additional requirements or > hints like "high mem" or "GPU" etc. > >>> > > >> > >> > >>> > > >> > >> Every user function has an associated environment, and > environments exist only for the purpose of executing user functions. In > fact, Environment originated as inline requirements/attributes for a user > function proto message and was separated just to make the proto smaller. > >>> > > >> > >> > >>> > > >> > >> A PTransform is an abstract concept for organizing the > graph, not an executable thing. So a hint/capability/requirement on a > PTransform only really makes sense as a scoping mechanism for applying a > hint to a bunch of functions within a subgraph. This seems like a user > interface concern and the SDK should own propagating the hints. If the hint > truly applies to the whole PTransform and *not* the parts, then I am > interested in learning about that. > >>> > > >> > >> > >>> > > >> > >> Kenn > >>> > > >> > >> > >>> > > >> > >> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke < > [email protected]> wrote: > >>> > > >> > >>> > >>> > > >> > >>> That's a good question. > >>> > > >> > >>> > >>> > > >> > >>> I think the main difference is a matter of scope. > Annotations would apply to a PTransform while an environment applies to > sets of transforms. A difference is the optional nature of the annotations > they don't affect correctness. Runners don't need to do anything with them > and still execute the pipeline correctly. > >>> > > >> > >>> > >>> > > >> > >>> Consider a privacy analysis on a pipeline graph. An > annotation indicating that a transform provides a certain level of > anonymization can be used in an analysis to determine if the downstream > transforms are encountering raw data or not. > >>> > > >> > >>> > >>> > > >> > >>> From my understanding (which can be wrong) environments > are rigid. Transforms in different environments can't be fused. "This is > the python env", "this is the java env" can't be merged together. It's not > clear to me that we have defined when environments are safely fuseable > outside of equality. There's value in that simplicity. > >>> > > >> > >>> > >>> > > >> > >>> AFIACT environment has less to do with the machines a > pipeline is executing on than it does about the kinds of SDK pipelines it > understands and can execute. > >>> > > >> > >>> > >>> > > >> > >>> > >>> > > >> > >>> > >>> > > >> > >>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova < > [email protected]> wrote: > >>> > > >> > >>>>> > >>> > > >> > >>>>> > >>> > > >> > >>>>> Another example of an optional annotation is marking a > transform to run on secure hardware, or to give hints to profiling/dynamic > analysis tools. > >>> > > >> > >>>> > >>> > > >> > >>>> > >>> > > >> > >>>> There seems to be a lot of overlap between this idea and > Environments. Can you talk about how you feel they may be different or > related? For example, I could see annotations as a way of tagging > transforms with an Environment, or I could see Environments becoming a > specialized form of annotation. > >>> > > >> > >>>> > >>> > > >> > >>>> -chad > >>> > > >> > >>>> > >>> > > >
