On Wed, Jul 26, 2017 at 8:58 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wrote:
> Hmm, yes, I just noticed that PCollection has a setTypeDescriptor() method, > and I wonder how much will break if all call sites of setCoder() will call > setTypeDescriptor() instead - i.e. how far are we from the ideal state of > having a coder inferrable for every sufficiently concrete type descriptor. > It would certainly make a lot of call sites a lot less awkward. > > Perhaps if we are not too far from that, then the guidance should be that > transforms should be configurable with type descriptors rather than with > coders. > > Do you think the current PR still makes sense as a step in this direction, > removing at least some of the mess? > Yes, indeed. I think all of your arguments apply also to TypeDescriptor and we should do the same to .setTypeDescriptor. If you can build a coder, you can build a type descriptor (for one, coders can vend the corresponding type descriptor...we should be sure the APIs are rich and easy enough to do it directly). We should do it when we fabricate a fresh PCollection. Kenn > > On Wed, Jul 26, 2017 at 8:16 PM Kenneth Knowles <k...@google.com.invalid> > wrote: > > > +1 but maybe go ever further > > > > On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov < > > kirpic...@google.com.invalid> wrote: > > > > > Hello, > > > > > > I've worked on a few different things recently and ran repeatedly into > > the > > > same issue: that we do not have clear guidance on who should set the > > Coder > > > on a PCollection: is it responsibility of the PTransform that outputs > it, > > > or is it responsibility of the user, or is it sometimes one and > sometimes > > > the other? > > > > > > > I think it is more ambiguous than even this, since often we expect a > third > > option: the Pipeline propagates them automatically via common type > > variables, as in Filter :: PTransform<T, T>. > > > > The unfortunate part is that this is only a best-effort process due to > > erasure and the complexities of type equivalence. I'm not convinced it is > > unsolvable but I don't necessarily think we should put our effort into > > solving it. > > > > > > > I believe that the answer is "it's responsibility of the transform" and > > > moreover that ideally PCollection.setCoder() should not exist. > Instead: > > > > > > > I would actually like this instead to be a requirement to provide a fully > > concrete type descriptor when creating a new PCollection. This allows > more > > flexibility to easily infer a coder for the type. Ideally, .getCoder() > > would be deprecated as well and this all moved to a phase after > > construction. > > > > In particular: I'd like to change our current practice of GBK crashing > when > > a key coder is nondeterministic. Instead, inference should call > > "getDeterministicCoder" when it is known to be necessary. Lots of types > > might have a deterministic coder that is more costly, and we can enable > > this as an optimization choice. The transform producing a PCollection > > cannot know that it will be used for a GBK. The pipeline also cannot know > > when it tries to infer it without knowing what transform is taking it as > > input. It has to be part of a more holistic analysis. > > > > And it seems nice to build a whole pipeline and then apply coders as > > specified by what is registered. So the same construction can be > > transparently improved if a better coder is concocted. This is likely not > > possible if transforms set up the coders a priori. > > > > So my perfect world is "no coders during construction". > > > > Kenn > > > > - Require that all transforms set a Coder on the PCollection's they > produce > > > - i.e. it should never be responsibility of the user to "fix up" a > coder > > on > > > a PCollection produced by a transform. > > > > > > - Since all transforms are composed of primitive transforms, saying > > > "transforms must set a Coder" means simply that all *primitive* > > transforms > > > must set a Coder on their output. > > > > > > - In some cases, a primitive PTransform currently doesn't have enough > > > information to infer a coder for its output collection - e.g. > > > ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a coder for > > > OutputT. In that case such transforms should allow the user to provide > a > > > coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs > from > > > requiring the user to set a coder on the resulting collection] > > > > > > - Corollary: composite transforms need to only configure their > primitive > > > transforms (and composite sub-transforms) properly, and give them a > Coder > > > if needed. > > > > > > - Corollary: a PTransform with type parameters <FooT, BarT, ...> needs > to > > > be configurable with coders for all of these, because the > implementation > > of > > > the transform may change and it may introduce intermediate collections > > > involving these types. However, in many cases, some of these type > > > parameters appear in the type of the transform's input, e.g. a > > > PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will > always be > > > able to extract the coders for FooT and BarT from the input > PCollection, > > so > > > the user does not need to provide them. However, a coder for BarT must > be > > > provided. I think in most cases the transform needs to be configurable > > only > > > with coders for its output. > > > > > > Here's a smooth migration path to accomplish the above: > > > - Make PCollection.createPrimitiveOutputInternal() take a Coder. > > > - Make all primitive transforms optionally configurable with a coder > for > > > their outputs, such as ParDo.of(DoFn).withOutputCoder(). > > > - By using the above, make all composite transforms shipped with the > SDK > > > set a Coder on the collections they produce; in some cases, this will > > > require adding a withSomethingCoder() option to the transform and > > > propagating that coder to its sub-transforms. If the option is unset, > > > that's fine for now. > > > - As a result of the above, get rid of all setCoder() calls in the Beam > > > repo. The call will still be there, but it will just not be used > anywhere > > > in the SDK or examples, and we can mark it deprecated. > > > - Add guidance to PTransform Style Guide in line with the above > > > > > > Does this sound like a good idea? I'm not sure how urgent it would be > to > > > actually do this, but I'd like to know whether people agree that this > is > > a > > > good goal in general. > > > > > >