Re: Requiring PTransform to set a coder on its resulting collections

Robert Bradshaw Thu, 27 Jul 2017 00:02:28 -0700

I like the ideal of "no coders during construction." I also like the idea
of a second pass to determine coders (this would work particularly well in
Python). However, this assumes a 1:1 relationship between TypeDescriptors
and Coders, which is not accurate in practice. In particular:


* Some types can be encoded in multiple ways (e.g. VarInt, BegEndianInt,
...). Are we saying that there's never a reason for the user to specify
this? (Maybe, now that we're finally moving away from using Coders for IO.)
* Some Coders can encode multiple types, e.g. SerializableCoder (as
horrible as that one is) and cannot vend their type descriptor. There are
also cases I've seen of List<Object> being encoded with coders that have
special knowledge that the set of possible Objects are limited, or knowing
the (variable-arity) set of corresponding coders, but this can't be encoded
in the type system. (Arguably a special type should be created in this
case, but that's not possible without dynamic classes for some usecases).

Also, we're far from the world of being able to infer a coder for every
TypeDescriptor, especially for user-defined types. If we go this route, we
should allow the binding of a TypeDescriptor to a Coder (fallback or
preferred) to be defined locally and limited in scope (whereas I think the
coder registry is global, or at least pipeline global). That and make it
straightforward to define such bindings for parameterized types. Things can
get particularly messy with wildcards.

Regardless, the referenced PR is a step in the right direction. Whether the
Coder or a TypeDescriptor that determines the Coder is used, having the
producing (primitive) transform be the one that defines it rather than
mutating the PCollection after the fact will be good.


On Wed, Jul 26, 2017 at 9:08 PM, Kenneth Knowles <k...@google.com.invalid>
wrote:

> On Wed, Jul 26, 2017 at 8:58 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> > Hmm, yes, I just noticed that PCollection has a setTypeDescriptor()
> method,
> > and I wonder how much will break if all call sites of setCoder() will
> call
> > setTypeDescriptor() instead - i.e. how far are we from the ideal state of
> > having a coder inferrable for every sufficiently concrete type
> descriptor.
> > It would certainly make a lot of call sites a lot less awkward.
> >
> > Perhaps if we are not too far from that, then the guidance should be that
> > transforms should be configurable with type descriptors rather than with
> > coders.
> >
> > Do you think the current PR still makes sense as a step in this
> direction,
> > removing at least some of the mess?
> >
>
> Yes, indeed. I think all of your arguments apply also to TypeDescriptor and
> we should do the same to .setTypeDescriptor. If you can build a coder, you
> can build a type descriptor (for one, coders can vend the corresponding
> type descriptor...we should be sure the APIs are rich and easy enough to do
> it directly). We should do it when we fabricate a fresh PCollection.
>
> Kenn
>
>
> >
> > On Wed, Jul 26, 2017 at 8:16 PM Kenneth Knowles <k...@google.com.invalid>
> > wrote:
> >
> > > +1 but maybe go ever further
> > >
> > > On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov <
> > > kirpic...@google.com.invalid> wrote:
> > >
> > > > Hello,
> > > >
> > > > I've worked on a few different things recently and ran repeatedly
> into
> > > the
> > > > same issue: that we do not have clear guidance on who should set the
> > > Coder
> > > > on a PCollection: is it responsibility of the PTransform that outputs
> > it,
> > > > or is it responsibility of the user, or is it sometimes one and
> > sometimes
> > > > the other?
> > > >
> > >
> > > I think it is more ambiguous than even this, since often we expect a
> > third
> > > option: the Pipeline propagates them automatically via common type
> > > variables, as in Filter :: PTransform<T, T>.
> > >
> > > The unfortunate part is that this is only a best-effort process due to
> > > erasure and the complexities of type equivalence. I'm not convinced it
> is
> > > unsolvable but I don't necessarily think we should put our effort into
> > > solving it.
> > >
> > >
> > > > I believe that the answer is "it's responsibility of the transform"
> and
> > > > moreover that  ideally PCollection.setCoder() should not exist.
> > Instead:
> > > >
> > >
> > > I would actually like this instead to be a requirement to provide a
> fully
> > > concrete type descriptor when creating a new PCollection. This allows
> > more
> > > flexibility to easily infer a coder for the type. Ideally, .getCoder()
> > > would be deprecated as well and this all moved to a phase after
> > > construction.
> > >
> > > In particular: I'd like to change our current practice of GBK crashing
> > when
> > > a key coder is nondeterministic. Instead, inference should call
> > > "getDeterministicCoder" when it is known to be necessary. Lots of types
> > > might have a deterministic coder that is more costly, and we can enable
> > > this as an optimization choice. The transform producing a PCollection
> > > cannot know that it will be used for a GBK. The pipeline also cannot
> know
> > > when it tries to infer it without knowing what transform is taking it
> as
> > > input. It has to be part of a more holistic analysis.
> > >
> > > And it seems nice to build a whole pipeline and then apply coders as
> > > specified by what is registered. So the same construction can be
> > > transparently improved if a better coder is concocted. This is likely
> not
> > > possible if transforms set up the coders a priori.
> > >
> > > So my perfect world is "no coders during construction".
> > >
> > > Kenn
> > >
> > > - Require that all transforms set a Coder on the PCollection's they
> > produce
> > > > - i.e. it should never be responsibility of the user to "fix up" a
> > coder
> > > on
> > > > a PCollection produced by a transform.
> > > >
> > > > - Since all transforms are composed of primitive transforms, saying
> > > > "transforms must set a Coder" means simply that all *primitive*
> > > transforms
> > > > must set a Coder on their output.
> > > >
> > > > - In some cases, a primitive PTransform currently doesn't have enough
> > > > information to infer a coder for its output collection - e.g.
> > > > ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a coder for
> > > > OutputT. In that case such transforms should allow the user to
> provide
> > a
> > > > coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs
> > from
> > > > requiring the user to set a coder on the resulting collection]
> > > >
> > > > - Corollary: composite transforms need to only configure their
> > primitive
> > > > transforms (and composite sub-transforms) properly, and give them a
> > Coder
> > > > if needed.
> > > >
> > > > - Corollary: a PTransform with type parameters <FooT, BarT, ...>
> needs
> > to
> > > > be configurable with coders for all of these, because the
> > implementation
> > > of
> > > > the transform may change and it may introduce intermediate
> collections
> > > > involving these types. However, in many cases, some of these type
> > > > parameters appear in the type of the transform's input, e.g. a
> > > > PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will
> > always be
> > > > able to extract the coders for FooT and BarT from the input
> > PCollection,
> > > so
> > > > the user does not need to provide them. However, a coder for BarT
> must
> > be
> > > > provided. I think in most cases the transform needs to be
> configurable
> > > only
> > > > with coders for its output.
> > > >
> > > > Here's a smooth migration path to accomplish the above:
> > > > - Make PCollection.createPrimitiveOutputInternal() take a Coder.
> > > > - Make all primitive transforms optionally configurable with a coder
> > for
> > > > their outputs, such as ParDo.of(DoFn).withOutputCoder().
> > > > - By using the above, make all composite transforms shipped with the
> > SDK
> > > > set a Coder on the collections they produce; in some cases, this will
> > > > require adding a withSomethingCoder() option to the transform and
> > > > propagating that coder to its sub-transforms. If the option is unset,
> > > > that's fine for now.
> > > > - As a result of the above, get rid of all setCoder() calls in the
> Beam
> > > > repo. The call will still be there, but it will just not be used
> > anywhere
> > > > in the SDK or examples, and we can mark it deprecated.
> > > > - Add guidance to PTransform Style Guide in line with the above
> > > >
> > > > Does this sound like a good idea? I'm not sure how urgent it would be
> > to
> > > > actually do this, but I'd like to know whether people agree that this
> > is
> > > a
> > > > good goal in general.
> > > >
> > >
> >
>

Re: Requiring PTransform to set a coder on its resulting collections

Reply via email to