Re: Requiring PTransform to set a coder on its resulting collections

Eugene Kirpichov Wed, 26 Jul 2017 17:27:07 -0700

Okay, first PR is in review https://github.com/apache/beam/pull/3649


On Wed, Jul 26, 2017 at 11:58 AM Robert Bradshaw
<[email protected]> wrote:

> +1, I'm a huge fan of moving this direction. Right now there's also
> the ugliness that setCoder() may be called any number of times before
> a PCollection is used (the last setter winning) but is an error to
> call it once it has been used (and here "used" is not clear--if a
> PCollection is returned from a composite operation it may or may not
> have been consumed within that operation as well). Coupled with the
> fact that it can also be an error to use a PCollection without setting
> the coder if it was unable to infer one has bitten me several times
> writing generic transforms.
>
> The fact that this is the one thing that prevents PCollections from
> being immutable objects has bothered me from a purity point of view
> too.
>
> On Wed, Jul 26, 2017 at 11:43 AM, Mingmin Xu <[email protected]> wrote:
> > Second that 'it's responsibility of the transform'. For the case when a
> > PTransform doesn't have enough information(PTransform developer should
> have
> > the knowledge), I would prefer a strict way so users won't forget to call
> > withSomethingCoder(), like
> > - a Coder is required to new the PTransform;
> > - or an interface 'getOutputCoder' to be implemented;
> >
> > On Wed, Jul 26, 2017 at 11:21 AM, Eugene Kirpichov <
> > [email protected]> wrote:
> >
> >> Hm, can you elaborate? I'm not sure how this relates to my suggestion,
> the
> >> gist of which is "PTransform's should set the coder on all of their
> >> outputs, and the user should never have to .setCoder() on a PCollection
> >> obtained from a PTransform"
> >>
> >> On Wed, Jul 26, 2017 at 7:38 AM Lukasz Cwik <[email protected]>
> >> wrote:
> >>
> >> > I'm split between our current one pass model of pipeline construction
> >> and a
> >> > two pass model where all information is gathered and then PTransform
> >> > expansions are performed.
> >> >
> >> >
> >> > On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov <
> >> > [email protected]> wrote:
> >> >
> >> > > Hello,
> >> > >
> >> > > I've worked on a few different things recently and ran repeatedly
> into
> >> > the
> >> > > same issue: that we do not have clear guidance on who should set the
> >> > Coder
> >> > > on a PCollection: is it responsibility of the PTransform that
> outputs
> >> it,
> >> > > or is it responsibility of the user, or is it sometimes one and
> >> sometimes
> >> > > the other?
> >> > >
> >> > > I believe that the answer is "it's responsibility of the transform"
> and
> >> > > moreover that  ideally PCollection.setCoder() should not exist.
> >> Instead:
> >> > >
> >> > > - Require that all transforms set a Coder on the PCollection's they
> >> > produce
> >> > > - i.e. it should never be responsibility of the user to "fix up" a
> >> coder
> >> > on
> >> > > a PCollection produced by a transform.
> >> > >
> >> > > - Since all transforms are composed of primitive transforms, saying
> >> > > "transforms must set a Coder" means simply that all *primitive*
> >> > transforms
> >> > > must set a Coder on their output.
> >> > >
> >> > > - In some cases, a primitive PTransform currently doesn't have
> enough
> >> > > information to infer a coder for its output collection - e.g.
> >> > > ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a coder for
> >> > > OutputT. In that case such transforms should allow the user to
> provide
> >> a
> >> > > coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs
> >> from
> >> > > requiring the user to set a coder on the resulting collection]
> >> > >
> >> > > - Corollary: composite transforms need to only configure their
> >> primitive
> >> > > transforms (and composite sub-transforms) properly, and give them a
> >> Coder
> >> > > if needed.
> >> > >
> >> > > - Corollary: a PTransform with type parameters <FooT, BarT, ...>
> needs
> >> to
> >> > > be configurable with coders for all of these, because the
> >> implementation
> >> > of
> >> > > the transform may change and it may introduce intermediate
> collections
> >> > > involving these types. However, in many cases, some of these type
> >> > > parameters appear in the type of the transform's input, e.g. a
> >> > > PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will
> >> always be
> >> > > able to extract the coders for FooT and BarT from the input
> >> PCollection,
> >> > so
> >> > > the user does not need to provide them. However, a coder for BarT
> must
> >> be
> >> > > provided. I think in most cases the transform needs to be
> configurable
> >> > only
> >> > > with coders for its output.
> >> > >
> >> > > Here's a smooth migration path to accomplish the above:
> >> > > - Make PCollection.createPrimitiveOutputInternal() take a Coder.
> >> > > - Make all primitive transforms optionally configurable with a coder
> >> for
> >> > > their outputs, such as ParDo.of(DoFn).withOutputCoder().
> >> > > - By using the above, make all composite transforms shipped with the
> >> SDK
> >> > > set a Coder on the collections they produce; in some cases, this
> will
> >> > > require adding a withSomethingCoder() option to the transform and
> >> > > propagating that coder to its sub-transforms. If the option is
> unset,
> >> > > that's fine for now.
> >> > > - As a result of the above, get rid of all setCoder() calls in the
> Beam
> >> > > repo. The call will still be there, but it will just not be used
> >> anywhere
> >> > > in the SDK or examples, and we can mark it deprecated.
> >> > > - Add guidance to PTransform Style Guide in line with the above
> >> > >
> >> > > Does this sound like a good idea? I'm not sure how urgent it would
> be
> >> to
> >> > > actually do this, but I'd like to know whether people agree that
> this
> >> is
> >> > a
> >> > > good goal in general.
> >> > >
> >> >
> >>
> >
> >
> >
> > --
> > ----
> > Mingmin
>

Re: Requiring PTransform to set a coder on its resulting collections

Reply via email to