Re: Requiring PTransform to set a coder on its resulting collections

Mingmin Xu Wed, 26 Jul 2017 11:43:45 -0700

Second that 'it's responsibility of the transform'. For the case when a
PTransform doesn't have enough information(PTransform developer should have
the knowledge), I would prefer a strict way so users won't forget to call
withSomethingCoder(), like
- a Coder is required to new the PTransform;
- or an interface 'getOutputCoder' to be implemented;


On Wed, Jul 26, 2017 at 11:21 AM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Hm, can you elaborate? I'm not sure how this relates to my suggestion, the
> gist of which is "PTransform's should set the coder on all of their
> outputs, and the user should never have to .setCoder() on a PCollection
> obtained from a PTransform"
>
> On Wed, Jul 26, 2017 at 7:38 AM Lukasz Cwik <lc...@google.com.invalid>
> wrote:
>
> > I'm split between our current one pass model of pipeline construction
> and a
> > two pass model where all information is gathered and then PTransform
> > expansions are performed.
> >
> >
> > On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov <
> > kirpic...@google.com.invalid> wrote:
> >
> > > Hello,
> > >
> > > I've worked on a few different things recently and ran repeatedly into
> > the
> > > same issue: that we do not have clear guidance on who should set the
> > Coder
> > > on a PCollection: is it responsibility of the PTransform that outputs
> it,
> > > or is it responsibility of the user, or is it sometimes one and
> sometimes
> > > the other?
> > >
> > > I believe that the answer is "it's responsibility of the transform" and
> > > moreover that  ideally PCollection.setCoder() should not exist.
> Instead:
> > >
> > > - Require that all transforms set a Coder on the PCollection's they
> > produce
> > > - i.e. it should never be responsibility of the user to "fix up" a
> coder
> > on
> > > a PCollection produced by a transform.
> > >
> > > - Since all transforms are composed of primitive transforms, saying
> > > "transforms must set a Coder" means simply that all *primitive*
> > transforms
> > > must set a Coder on their output.
> > >
> > > - In some cases, a primitive PTransform currently doesn't have enough
> > > information to infer a coder for its output collection - e.g.
> > > ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a coder for
> > > OutputT. In that case such transforms should allow the user to provide
> a
> > > coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs
> from
> > > requiring the user to set a coder on the resulting collection]
> > >
> > > - Corollary: composite transforms need to only configure their
> primitive
> > > transforms (and composite sub-transforms) properly, and give them a
> Coder
> > > if needed.
> > >
> > > - Corollary: a PTransform with type parameters <FooT, BarT, ...> needs
> to
> > > be configurable with coders for all of these, because the
> implementation
> > of
> > > the transform may change and it may introduce intermediate collections
> > > involving these types. However, in many cases, some of these type
> > > parameters appear in the type of the transform's input, e.g. a
> > > PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will
> always be
> > > able to extract the coders for FooT and BarT from the input
> PCollection,
> > so
> > > the user does not need to provide them. However, a coder for BarT must
> be
> > > provided. I think in most cases the transform needs to be configurable
> > only
> > > with coders for its output.
> > >
> > > Here's a smooth migration path to accomplish the above:
> > > - Make PCollection.createPrimitiveOutputInternal() take a Coder.
> > > - Make all primitive transforms optionally configurable with a coder
> for
> > > their outputs, such as ParDo.of(DoFn).withOutputCoder().
> > > - By using the above, make all composite transforms shipped with the
> SDK
> > > set a Coder on the collections they produce; in some cases, this will
> > > require adding a withSomethingCoder() option to the transform and
> > > propagating that coder to its sub-transforms. If the option is unset,
> > > that's fine for now.
> > > - As a result of the above, get rid of all setCoder() calls in the Beam
> > > repo. The call will still be there, but it will just not be used
> anywhere
> > > in the SDK or examples, and we can mark it deprecated.
> > > - Add guidance to PTransform Style Guide in line with the above
> > >
> > > Does this sound like a good idea? I'm not sure how urgent it would be
> to
> > > actually do this, but I'd like to know whether people agree that this
> is
> > a
> > > good goal in general.
> > >
> >
>



-- 
----
Mingmin

Re: Requiring PTransform to set a coder on its resulting collections

Reply via email to