Minor correction: CoGBK broadcast vs. full shuffle is probably not an
ideal example, because it still requires grouping the larger PCollection
(if not already grouped). If we take a Join PTransform that acts on the
cartesian product of these groups, then it works well.
Jan
On 11/16/20 8:39 PM, Jan Lukavský wrote:
Hi,
could this proposal be generalized to annotations of PCollections as
well? Maybe that reduces to several types of annotations of a
PTransform - e.g.
a) runtime annotations of a PTransform (that might be scheduling
hints - e.g. schedule this task to nodes with GPUs, etc.)
b) output annotations - i.e. annotations that actually apply to
PCollections, as every PCollection has at most one producer (this is
what has actually been discussed in the referenced mailing list threads)
It would be cool if this added the option to do PTransform expansions
based on annotations of input PCollections. We tried to play with this
in the Euphoria DSL, but it turned out it would fit best in the Beam SDK.
An example of input-annotation-sensitive expansion might be CoGBK: when
one side is annotated with e.g. FitsInMemoryPerWindow (or SmallPerWindow,
or whatever), CoGBK might be expanded using broadcast instead of a
full shuffle.
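A minimal sketch of such an annotation-sensitive expansion, in plain Python rather than the Beam SDK. It uses an inner join on key instead of CoGBK, and the function names (`shuffle_join`, `broadcast_join`, `expand_join`) and the way annotations are passed in are assumptions made for illustration only, not an existing Beam API. The point it tries to show is that the annotation only selects a strategy and never changes the result.

```python
from collections import defaultdict

# Hypothetical annotation name taken from the example above; not a Beam API.
FITS_IN_MEMORY = "FitsInMemoryPerWindow"


def shuffle_join(left, right):
    """Group both sides by key, then pair values per key (full-shuffle style)."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return sorted(
        (key, l, r) for key, (ls, rs) in groups.items() for l in ls for r in rs
    )


def broadcast_join(large, small):
    """Build an in-memory lookup from the small side and stream the large side."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return sorted((key, l, r) for key, l in large for r in lookup.get(key, []))


def expand_join(left, right, right_annotations=frozenset()):
    """Pick an expansion based on the annotations of the right input.

    The annotation is only a hint: both strategies return the same result,
    so ignoring it stays correct.
    """
    if FITS_IN_MEMORY in right_annotations:
        return "broadcast", broadcast_join(left, right)
    return "shuffle", shuffle_join(left, right)
```

Calling `expand_join(orders, names, {FITS_IN_MEMORY})` and `expand_join(orders, names)` on the same inputs should yield identical joined records and differ only in the reported strategy.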
Absolutely agree that all this must not have anything to do with
semantics and correctness, and thus might be safely ignored. That
might answer the last question of @Reuven: when there are conflicting
annotations, it would be possible to simply drop them as a last resort.
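One possible reading of "drop them as a last resort", sketched in plain Python. The nested-dict representation of a transform (`name`, `annotations`, `children`) and the function `resolve_annotations` are hypothetical and only illustrate the idea; nothing here is part of the proposal or of Beam. Annotations are pushed down to leaf transforms, and a key whose value disagrees with an inherited one is dropped for that subtree.

```python
def resolve_annotations(transform, inherited=None, dropped=frozenset()):
    """Return {leaf_name: annotations} for a nested transform.

    A key that conflicts with an inherited value is removed for the whole
    subtree. That is safe only because annotations are optional hints.
    """
    merged = dict(inherited or {})  # copy, so siblings are not affected
    dropped = set(dropped)
    for key, value in transform.get("annotations", {}).items():
        if key in dropped:
            continue  # already dropped higher up in this subtree
        if key in merged and merged[key] != value:
            del merged[key]  # conflicting values: drop the annotation
            dropped.add(key)
        else:
            merged[key] = value

    children = transform.get("children", [])
    if not children:
        return {transform["name"]: merged}

    resolved = {}
    for child in children:
        resolved.update(resolve_annotations(child, merged, frozenset(dropped)))
    return resolved
```

With a composite annotated `hardware=gpu` and a child annotated `hardware=cpu`, that child ends up with no `hardware` annotation at all, while its siblings keep the inherited value.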
Jan
On 11/16/20 8:13 PM, Robert Burke wrote:
I imagine it is up to the specific annotation to define that.
The runner notionally doesn't need to do anything with them, as they
are optional, and not required for correctness.
On Mon, Nov 16, 2020, 10:56 AM Reuven Lax <re...@google.com> wrote:
PTransforms are hierarchical - namely a PTransform contains other
PTransforms, and so on. Is the runner expected to resolve all
annotations down to leaf nodes? What happens if that results in
conflicting annotations?
On Mon, Nov 16, 2020 at 10:54 AM Robert Burke <rob...@frantil.com> wrote:
That's a good question.
I think the main difference is a matter of scope. Annotations
would apply to a PTransform while an environment applies to
sets of transforms. Another difference is the optional nature
of the annotations: they don't affect correctness. Runners don't
need to do anything with them and can still execute the pipeline
correctly.
Consider a privacy analysis on a pipeline graph. An
annotation indicating that a transform provides a certain
level of anonymization can be used in an analysis to
determine if the downstream transforms are encountering raw
data or not.
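A rough sketch of what such an analysis could look like, in plain Python. The graph encoding (a dict of transform name to downstream names), the set of anonymizing transforms, and the function `transforms_seeing_raw_data` are all assumptions for illustration; no existing analysis tool or Beam API is implied.

```python
from collections import deque


def transforms_seeing_raw_data(edges, sources, anonymizing):
    """Return the transforms that receive raw data on at least one path.

    Walks the graph breadth-first from the sources and stops propagating
    "raw" past any transform annotated as anonymizing.
    """
    raw = set()
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if node in raw:
            continue
        raw.add(node)
        if node in anonymizing:
            continue  # its outputs are treated as anonymized
        queue.extend(edges.get(node, []))
    return raw
```

A transform that is reachable only through an anonymizing transform is left out of the result, while one that also has a direct path from a source is reported as seeing raw data.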
From my understanding (which can be wrong) environments are
rigid. Transforms in different environments can't be fused.
"This is the python env", "this is the java env" can't be
merged together. It's not clear to me that we have defined
when environments are safely fuseable outside of equality.
There's value in that simplicity.
AFAICT, an environment has less to do with the machines a
pipeline is executing on than with the kinds of SDK
pipelines it understands and can execute.
On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova <chad...@gmail.com> wrote:
Another example of an optional annotation is marking
a transform to run on secure hardware, or to give
hints to profiling/dynamic analysis tools.
There seems to be a lot of overlap between this idea and
Environments. Can you talk about how you feel they may
be different or related? For example, I could see
annotations as a way of tagging transforms with an
Environment, or I could see Environments becoming a
specialized form of annotation.
-chad