Hm, can you elaborate? I'm not sure how this relates to my suggestion, the gist of which is "PTransform's should set the coder on all of their outputs, and the user should never have to .setCoder() on a PCollection obtained from a PTransform"
On Wed, Jul 26, 2017 at 7:38 AM Lukasz Cwik <lc...@google.com.invalid> wrote: > I'm split between our current one pass model of pipeline construction and a > two pass model where all information is gathered and then PTransform > expansions are performed. > > > On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov < > kirpic...@google.com.invalid> wrote: > > > Hello, > > > > I've worked on a few different things recently and ran repeatedly into > the > > same issue: that we do not have clear guidance on who should set the > Coder > > on a PCollection: is it responsibility of the PTransform that outputs it, > > or is it responsibility of the user, or is it sometimes one and sometimes > > the other? > > > > I believe that the answer is "it's responsibility of the transform" and > > moreover that ideally PCollection.setCoder() should not exist. Instead: > > > > - Require that all transforms set a Coder on the PCollection's they > produce > > - i.e. it should never be responsibility of the user to "fix up" a coder > on > > a PCollection produced by a transform. > > > > - Since all transforms are composed of primitive transforms, saying > > "transforms must set a Coder" means simply that all *primitive* > transforms > > must set a Coder on their output. > > > > - In some cases, a primitive PTransform currently doesn't have enough > > information to infer a coder for its output collection - e.g. > > ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a coder for > > OutputT. In that case such transforms should allow the user to provide a > > coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs from > > requiring the user to set a coder on the resulting collection] > > > > - Corollary: composite transforms need to only configure their primitive > > transforms (and composite sub-transforms) properly, and give them a Coder > > if needed. > > > > - Corollary: a PTransform with type parameters <FooT, BarT, ...> needs to > > be configurable with coders for all of these, because the implementation > of > > the transform may change and it may introduce intermediate collections > > involving these types. However, in many cases, some of these type > > parameters appear in the type of the transform's input, e.g. a > > PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will always be > > able to extract the coders for FooT and BarT from the input PCollection, > so > > the user does not need to provide them. However, a coder for BarT must be > > provided. I think in most cases the transform needs to be configurable > only > > with coders for its output. > > > > Here's a smooth migration path to accomplish the above: > > - Make PCollection.createPrimitiveOutputInternal() take a Coder. > > - Make all primitive transforms optionally configurable with a coder for > > their outputs, such as ParDo.of(DoFn).withOutputCoder(). > > - By using the above, make all composite transforms shipped with the SDK > > set a Coder on the collections they produce; in some cases, this will > > require adding a withSomethingCoder() option to the transform and > > propagating that coder to its sub-transforms. If the option is unset, > > that's fine for now. > > - As a result of the above, get rid of all setCoder() calls in the Beam > > repo. The call will still be there, but it will just not be used anywhere > > in the SDK or examples, and we can mark it deprecated. > > - Add guidance to PTransform Style Guide in line with the above > > > > Does this sound like a good idea? I'm not sure how urgent it would be to > > actually do this, but I'd like to know whether people agree that this is > a > > good goal in general. > > >