Okay, first PR is in review https://github.com/apache/beam/pull/3649
On Wed, Jul 26, 2017 at 11:58 AM Robert Bradshaw <rober...@google.com.invalid> wrote: > +1, I'm a huge fan of moving this direction. Right now there's also > the ugliness that setCoder() may be called any number of times before > a PCollection is used (the last setter winning) but is an error to > call it once it has been used (and here "used" is not clear--if a > PCollection is returned from a composite operation it may or may not > have been consumed within that operation as well). Coupled with the > fact that it can also be an error to use a PCollection without setting > the coder if it was unable to infer one has bitten me several times > writing generic transforms. > > The fact that this is the one thing that prevents PCollections from > being immutable objects has bothered me from a purity point of view > too. > > On Wed, Jul 26, 2017 at 11:43 AM, Mingmin Xu <mingm...@gmail.com> wrote: > > Second that 'it's responsibility of the transform'. For the case when a > > PTransform doesn't have enough information(PTransform developer should > have > > the knowledge), I would prefer a strict way so users won't forget to call > > withSomethingCoder(), like > > - a Coder is required to new the PTransform; > > - or an interface 'getOutputCoder' to be implemented; > > > > On Wed, Jul 26, 2017 at 11:21 AM, Eugene Kirpichov < > > kirpic...@google.com.invalid> wrote: > > > >> Hm, can you elaborate? I'm not sure how this relates to my suggestion, > the > >> gist of which is "PTransform's should set the coder on all of their > >> outputs, and the user should never have to .setCoder() on a PCollection > >> obtained from a PTransform" > >> > >> On Wed, Jul 26, 2017 at 7:38 AM Lukasz Cwik <lc...@google.com.invalid> > >> wrote: > >> > >> > I'm split between our current one pass model of pipeline construction > >> and a > >> > two pass model where all information is gathered and then PTransform > >> > expansions are performed. > >> > > >> > > >> > On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov < > >> > kirpic...@google.com.invalid> wrote: > >> > > >> > > Hello, > >> > > > >> > > I've worked on a few different things recently and ran repeatedly > into > >> > the > >> > > same issue: that we do not have clear guidance on who should set the > >> > Coder > >> > > on a PCollection: is it responsibility of the PTransform that > outputs > >> it, > >> > > or is it responsibility of the user, or is it sometimes one and > >> sometimes > >> > > the other? > >> > > > >> > > I believe that the answer is "it's responsibility of the transform" > and > >> > > moreover that ideally PCollection.setCoder() should not exist. > >> Instead: > >> > > > >> > > - Require that all transforms set a Coder on the PCollection's they > >> > produce > >> > > - i.e. it should never be responsibility of the user to "fix up" a > >> coder > >> > on > >> > > a PCollection produced by a transform. > >> > > > >> > > - Since all transforms are composed of primitive transforms, saying > >> > > "transforms must set a Coder" means simply that all *primitive* > >> > transforms > >> > > must set a Coder on their output. > >> > > > >> > > - In some cases, a primitive PTransform currently doesn't have > enough > >> > > information to infer a coder for its output collection - e.g. > >> > > ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a coder for > >> > > OutputT. In that case such transforms should allow the user to > provide > >> a > >> > > coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs > >> from > >> > > requiring the user to set a coder on the resulting collection] > >> > > > >> > > - Corollary: composite transforms need to only configure their > >> primitive > >> > > transforms (and composite sub-transforms) properly, and give them a > >> Coder > >> > > if needed. > >> > > > >> > > - Corollary: a PTransform with type parameters <FooT, BarT, ...> > needs > >> to > >> > > be configurable with coders for all of these, because the > >> implementation > >> > of > >> > > the transform may change and it may introduce intermediate > collections > >> > > involving these types. However, in many cases, some of these type > >> > > parameters appear in the type of the transform's input, e.g. a > >> > > PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will > >> always be > >> > > able to extract the coders for FooT and BarT from the input > >> PCollection, > >> > so > >> > > the user does not need to provide them. However, a coder for BarT > must > >> be > >> > > provided. I think in most cases the transform needs to be > configurable > >> > only > >> > > with coders for its output. > >> > > > >> > > Here's a smooth migration path to accomplish the above: > >> > > - Make PCollection.createPrimitiveOutputInternal() take a Coder. > >> > > - Make all primitive transforms optionally configurable with a coder > >> for > >> > > their outputs, such as ParDo.of(DoFn).withOutputCoder(). > >> > > - By using the above, make all composite transforms shipped with the > >> SDK > >> > > set a Coder on the collections they produce; in some cases, this > will > >> > > require adding a withSomethingCoder() option to the transform and > >> > > propagating that coder to its sub-transforms. If the option is > unset, > >> > > that's fine for now. > >> > > - As a result of the above, get rid of all setCoder() calls in the > Beam > >> > > repo. The call will still be there, but it will just not be used > >> anywhere > >> > > in the SDK or examples, and we can mark it deprecated. > >> > > - Add guidance to PTransform Style Guide in line with the above > >> > > > >> > > Does this sound like a good idea? I'm not sure how urgent it would > be > >> to > >> > > actually do this, but I'd like to know whether people agree that > this > >> is > >> > a > >> > > good goal in general. > >> > > > >> > > >> > > > > > > > > -- > > ---- > > Mingmin >