I've updated the guidance in PTransform Style Guide on setting coders https://beam.apache.org/contribute/ptransform-style-guide/#coders according to this discussion. https://github.com/apache/beam-site/pull/279
On Thu, Aug 3, 2017 at 6:27 PM Robert Bradshaw <rober...@google.com.invalid> wrote: > On Thu, Aug 3, 2017 at 6:08 PM, Eugene Kirpichov > <kirpic...@google.com.invalid> wrote: > > https://github.com/apache/beam/pull/3649 has landed. The main > contribution > > of this PR is deprecating PTransform.getDefaultOutputCoder(). > > > > Next steps are to get rid of all setCoder() calls in the SDK, and > deprecate > > setCoder(). > > Nearly all setCoder() calls (perhaps simply all?) I found are on the > output > > of mapping transforms, such as ParDo, Map/FlatMapElements, WithKeys. > > I think we should simply make these transforms optionally configurable > with > > an output coder: e.g. input.apply(ParDo.of(new > > SomeFn<>()).withOutputCoder(SomeCoder.of())) > > For multi-output ParDo this is a little more complex API-wise, but doable > > too. > > > > (another minor next step is to say in PTransform Style Guide that the > > transform must set a coder on all its outputs) > > > > Sounds reasonable? > > +1 > > I'd like to do this in a way that lowers the burden for all PTransform > authors. Can't think of a better way than a special subclass of > PTransform that has the setters that one would subclass... > > > On Thu, Aug 3, 2017 at 5:34 AM Lukasz Cwik <lc...@google.com.invalid> > wrote: > > > >> I'm for (1) and am not sure about the feasibility of (2) without having > an > >> escape hatch that allows a pipeline author to specify a coder to handle > >> their special case. > >> > >> On Tue, Aug 1, 2017 at 2:15 PM, Reuven Lax <re...@google.com.invalid> > >> wrote: > >> > >> > One interesting wrinkle: I'm about to propose a set of semantics for > >> > snapshotting/in-place updating pipelines. Part of this proposal is the > >> > ability to write pipelines to "upgrade" snapshots to make them > compatible > >> > with new graphs. This relies on the ability to have two separate > coders > >> for > >> > the same type - the old coder and the new coder - in order to handle > the > >> > case where the user has changed coders in the new pipeline. > >> > > >> > On Tue, Aug 1, 2017 at 2:12 PM, Robert Bradshaw > >> > <rober...@google.com.invalid > >> > > wrote: > >> > > >> > > There are two concerns in this thread: > >> > > > >> > > (1) Getting rid of PCollection.setCoder(). Everyone seems in favor > of > >> > this > >> > > (right?) > >> > > > >> > > (2) Deprecating specifying Coders in favor of specifying > >> TypeDescriptors. > >> > > I'm generally in favor, but it's unclear how far we can push this > >> > through. > >> > > > >> > > Let's at least do (1), and separately state a preference for (2), > >> seeing > >> > > how fare we can push it. > >> > > > >> > > On Thu, Jul 27, 2017 at 9:13 PM, Kenneth Knowles > >> <k...@google.com.invalid > >> > > > >> > > wrote: > >> > > > >> > > > Another thought on this: setting a custom coder to support a > special > >> > data > >> > > > distribution is likely often a property of the input to the > pipeline. > >> > So > >> > > > setting a coder during pipeline construction - more generally, > when > >> > > writing > >> > > > a composite transform for reuse - you might not actually have the > >> > needed > >> > > > information. But setting up a special indicator type descriptor > lets > >> > your > >> > > > users map that type descriptor to a coder that works well for > their > >> > data. > >> > > > > >> > > > But Robert's example of RawUnionValue seems like a deal breaker > for > >> all > >> > > > approaches. It really requires .getCoder() during expand() and > >> > explicitly > >> > > > building coders encoding information that is cumbersome to get > into a > >> > > > TypeDescriptor. While making up new type languages is a > comfortable > >> > > > activity for me :-) I don't think we should head down that path, > for > >> > our > >> > > > users' sake. So I'll stop hoping we can eliminate this pain point > for > >> > > now. > >> > > > > >> > > > Kenn > >> > > > > >> > > > On Thu, Jul 27, 2017 at 8:48 PM, Kenneth Knowles <k...@google.com> > >> > wrote: > >> > > > > >> > > > > On Thu, Jul 27, 2017 at 11:18 AM, Thomas Groh > >> > <tg...@google.com.invalid > >> > > > > >> > > > > wrote: > >> > > > > > >> > > > >> introduce a > >> > > > >> new, specialized type to represent the restricted > >> > > > >> (alternatively-distributed?) data. The TypeDescriptor for this > >> type > >> > > can > >> > > > >> map > >> > > > >> to the specialized coder, without having to perform a > significant > >> > > degree > >> > > > >> of > >> > > > >> potentially wasted encoding work, plus it includes the > assumptions > >> > > that > >> > > > >> are > >> > > > >> being made about the distribution of data. > >> > > > >> > >> > > > > > >> > > > > This is a very cool idea, in theory :-) > >> > > > > > >> > > > > For complex types with a few allocations involved and/or > nontrivial > >> > > > > deserialization, or when a pipeline does a lot of real work, I > >> think > >> > > the > >> > > > > wrapper cost won't be perceptible. > >> > > > > > >> > > > > But for more primitive types in pipelines that don't really do > >> much > >> > > > > computation but just move data around, I think it could matter. > >> > > Certainly > >> > > > > there are languages with constructs to allow type wrappers at > zero > >> > cost > >> > > > > (Haskell's `newtype`). > >> > > > > > >> > > > > This is all just speculation until we measure, like most of this > >> > > thread. > >> > > > > > >> > > > > Kenn > >> > > > > > >> > > > > > >> > > > >> > On Thu, Jul 27, 2017 at 11:00 AM, Thomas Groh > >> > > > <tg...@google.com.invalid > >> > > > >> > > >> > > > >> > wrote: > >> > > > >> > > >> > > > >> > > +1 on getting rid of setCoder; just from a Java SDK > >> perspective, > >> > > my > >> > > > >> ideal > >> > > > >> > > world contains PCollections which don't have a user-visible > >> way > >> > to > >> > > > >> mutate > >> > > > >> > > them. > >> > > > >> > > > >> > > > >> > > My preference would be to use TypeDescriptors everywhere > >> within > >> > > > >> Pipeline > >> > > > >> > > construction (where possible), and utilize the > CoderRegistry > >> > > > >> everywhere > >> > > > >> > to > >> > > > >> > > actually extract the appropriate type. The unfortunate > >> > difficulty > >> > > of > >> > > > >> > having > >> > > > >> > > to encode a union type and the lack of variable-length > >> generics > >> > > does > >> > > > >> > > complicate that. We could consider some way of constructing > >> > coders > >> > > > in > >> > > > >> the > >> > > > >> > > registry from a collection of type descriptors (which > should > >> be > >> > > > >> > accessible > >> > > > >> > > from the point the union-type is being constructed), e.g. > >> > > something > >> > > > >> like > >> > > > >> > > `getCoder(TypeDescriptor output, TypeDescriptor... > >> components)` > >> > - > >> > > > that > >> > > > >> > does > >> > > > >> > > only permit a single flat level (but since this is being > >> invoked > >> > > by > >> > > > >> the > >> > > > >> > SDK > >> > > > >> > > during construction it could also pass Coder...). > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > On Thu, Jul 27, 2017 at 10:22 AM, Robert Bradshaw < > >> > > > >> > > rober...@google.com.invalid> wrote: > >> > > > >> > > > >> > > > >> > > > On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles > >> > > > >> > > > <k...@google.com.invalid> wrote: > >> > > > >> > > > > On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik > >> > > > >> > <lc...@google.com.invalid > >> > > > >> > > > > >> > > > >> > > > > wrote: > >> > > > >> > > > >> > >> > > > >> > > > >> Ken/Robert, I believe users will want the ability to > set > >> > the > >> > > > >> output > >> > > > >> > > > coder > >> > > > >> > > > >> because coders may have intrinsic properties where the > >> type > >> > > > isn't > >> > > > >> > > enough > >> > > > >> > > > >> information to fully specify what I want as a user. > Some > >> > > cases > >> > > > I > >> > > > >> can > >> > > > >> > > see > >> > > > >> > > > >> are: > >> > > > >> > > > >> 1) I have a cheap and fast non-deterministic coder > but a > >> > > > >> different > >> > > > >> > > > slower > >> > > > >> > > > >> coder when I want to use it as the key to a GBK, For > >> > example > >> > > > >> with a > >> > > > >> > > set > >> > > > >> > > > >> coder, it would need to consistently order the values > of > >> > the > >> > > > set > >> > > > >> > when > >> > > > >> > > > used > >> > > > >> > > > >> as the key. > >> > > > >> > > > >> 2) I know a property of the data which allows me to > have > >> a > >> > > > >> cheaper > >> > > > >> > > > >> encoding. Imagine I know that all the strings have a > >> common > >> > > > >> prefix > >> > > > >> > or > >> > > > >> > > > >> integers that are in a certain range, or that a > matrix is > >> > > > >> > > sparse/dense. > >> > > > >> > > > Not > >> > > > >> > > > >> all PCollections of strings / integers / matrices in > the > >> > > > pipeline > >> > > > >> > will > >> > > > >> > > > have > >> > > > >> > > > >> this property, just some. > >> > > > >> > > > >> 3) Sorting comes up occasionally, traditionally in > Google > >> > > this > >> > > > >> was > >> > > > >> > > done > >> > > > >> > > > by > >> > > > >> > > > >> sorting the encoded version of the object > >> lexicographically > >> > > > >> during a > >> > > > >> > > > GBK. > >> > > > >> > > > >> There are good lexicographical byte representations > for > >> > ASCII > >> > > > >> > strings, > >> > > > >> > > > >> integers, and for some IEEE number representations > which > >> > > could > >> > > > be > >> > > > >> > done > >> > > > >> > > > by > >> > > > >> > > > >> the use of a special coder. > >> > > > >> > > > >> > >> > > > >> > > > > > >> > > > >> > > > > Items (1) and (3) do not require special knowledge from > >> the > >> > > > user. > >> > > > >> > They > >> > > > >> > > > are > >> > > > >> > > > > easily observed properties of a pipeline. My proposal > >> > included > >> > > > >> full > >> > > > >> > > > > automation for both. The suggestion is new methods > >> > > > >> > > > > .getDeterministicCoder(TypeDescriptor) and > >> > > > >> > > > > .getLexicographicCoder(TypeDescriptor). > >> > > > >> > > > > >> > > > >> > > > Completely agree--usecases (1) and (3) are an indirect > use > >> of > >> > > > Coders > >> > > > >> > > > that are used to achieve an effect that would be better > >> > > expressed > >> > > > >> > > > directly. > >> > > > >> > > > > >> > > > >> > > > > (2) is an interesting hypothetical for massive scale > where > >> > > tiny > >> > > > >> > > > incremental > >> > > > >> > > > > optimization represents a lot of cost _and_ your data > has > >> > > > >> sufficient > >> > > > >> > > > > structure to realize a benefit _and_ it needs to be > >> > pinpointed > >> > > > to > >> > > > >> > just > >> > > > >> > > > some > >> > > > >> > > > > PCollections. I think our experience with coders so > far is > >> > > that > >> > > > >> their > >> > > > >> > > > > existence is almost entirely negative. It would be > nice to > >> > > > support > >> > > > >> > this > >> > > > >> > > > > vanishingly rare case without inflicting a terrible > pain > >> > point > >> > > > on > >> > > > >> the > >> > > > >> > > > model > >> > > > >> > > > > and all other users. > >> > > > >> > > > > >> > > > >> > > > (2) is not just about cheapness, sometimes there's other > >> > > structure > >> > > > >> in > >> > > > >> > > > the data we can leverage. Consider the UnionCoder used in > >> > > > >> > > > CoGBK--RawUnionValue has an integer value that specifies > >> > > indicates > >> > > > >> the > >> > > > >> > > > type of it's raw Object field. Unless we want to extend > the > >> > type > >> > > > >> > > > language, there's not a sufficient type descriptor that > can > >> be > >> > > > used > >> > > > >> to > >> > > > >> > > > infer the coder. I'm dubious going down the road of > adding > >> > > special > >> > > > >> > > > cases is the right thing here. > >> > > > >> > > > > >> > > > >> > > > > For example, in those cases you could encode in your > >> > > > >> > > > > DoFn so the type descriptor would just be byte[]. > >> > > > >> > > > > >> > > > >> > > > As well as being an extremely cumbersome API, this would > >> incur > >> > > the > >> > > > >> > > > cost of coding/decoding at that DoFn boundary even if it > is > >> > > fused > >> > > > >> > > > away. > >> > > > >> > > > > >> > > > >> > > > >> On Thu, Jul 27, 2017 at 1:34 AM, Jean-Baptiste Onofré > < > >> > > > >> > > j...@nanthrax.net> > >> > > > >> > > > >> wrote: > >> > > > >> > > > >> > >> > > > >> > > > >> > Hi, > >> > > > >> > > > >> > > >> > > > >> > > > >> > That's an interesting thread and I was wondering the > >> > > > >> relationship > >> > > > >> > > > between > >> > > > >> > > > >> > type descriptor and coder for a while ;) > >> > > > >> > > > >> > > >> > > > >> > > > >> > Today, in a PCollection, we can set the coder and we > >> also > >> > > > have > >> > > > >> a > >> > > > >> > > > >> > getTypeDescriptor(). It sounds weird to me: it > should > >> be > >> > > one > >> > > > or > >> > > > >> > the > >> > > > >> > > > >> other. > >> > > > >> > > > >> > > >> > > > >> > > > >> > Basically, if the Coder is not used to define the > type, > >> > > > than, I > >> > > > >> > > fully > >> > > > >> > > > >> > agree with Eugene. > >> > > > >> > > > >> > > >> > > > >> > > > >> > Basically, the PCollection should define only the > type > >> > > > >> descriptor, > >> > > > >> > > not > >> > > > >> > > > >> the > >> > > > >> > > > >> > coder by itself: the coder can be found using the > type > >> > > > >> descriptor. > >> > > > >> > > > >> > > >> > > > >> > > > >> > With both coder and type descriptor on the > PCollection, > >> > it > >> > > > >> sounds > >> > > > >> > a > >> > > > >> > > > big > >> > > > >> > > > >> > "decoupled" to me and it would be possible to have a > >> > coder > >> > > on > >> > > > >> the > >> > > > >> > > > >> > PCollection that doesn't match the type descriptor. > >> > > > >> > > > >> > > >> > > > >> > > > >> > I think PCollection type descriptor should be > defined, > >> > and > >> > > > the > >> > > > >> > coder > >> > > > >> > > > >> > should be implicit based on this type descriptor. > >> > > > >> > > > >> > > >> > > > >> > > > >> > Thoughts ? > >> > > > >> > > > >> > > >> > > > >> > > > >> > Regards > >> > > > >> > > > >> > JB > >> > > > >> > > > >> > > >> > > > >> > > > >> > > >> > > > >> > > > >> > On 07/26/2017 05:25 AM, Eugene Kirpichov wrote: > >> > > > >> > > > >> > > >> > > > >> > > > >> >> Hello, > >> > > > >> > > > >> >> > >> > > > >> > > > >> >> I've worked on a few different things recently and > ran > >> > > > >> repeatedly > >> > > > >> > > > into > >> > > > >> > > > >> the > >> > > > >> > > > >> >> same issue: that we do not have clear guidance on > who > >> > > should > >> > > > >> set > >> > > > >> > > the > >> > > > >> > > > >> Coder > >> > > > >> > > > >> >> on a PCollection: is it responsibility of the > >> PTransform > >> > > > that > >> > > > >> > > outputs > >> > > > >> > > > >> it, > >> > > > >> > > > >> >> or is it responsibility of the user, or is it > >> sometimes > >> > > one > >> > > > >> and > >> > > > >> > > > >> sometimes > >> > > > >> > > > >> >> the other? > >> > > > >> > > > >> >> > >> > > > >> > > > >> >> I believe that the answer is "it's responsibility > of > >> the > >> > > > >> > transform" > >> > > > >> > > > and > >> > > > >> > > > >> >> moreover that ideally PCollection.setCoder() > should > >> not > >> > > > >> exist. > >> > > > >> > > > Instead: > >> > > > >> > > > >> >> > >> > > > >> > > > >> >> - Require that all transforms set a Coder on the > >> > > > PCollection's > >> > > > >> > they > >> > > > >> > > > >> >> produce > >> > > > >> > > > >> >> - i.e. it should never be responsibility of the > user > >> to > >> > > "fix > >> > > > >> up" > >> > > > >> > a > >> > > > >> > > > coder > >> > > > >> > > > >> >> on > >> > > > >> > > > >> >> a PCollection produced by a transform. > >> > > > >> > > > >> >> > >> > > > >> > > > >> >> - Since all transforms are composed of primitive > >> > > transforms, > >> > > > >> > saying > >> > > > >> > > > >> >> "transforms must set a Coder" means simply that all > >> > > > >> *primitive* > >> > > > >> > > > >> transforms > >> > > > >> > > > >> >> must set a Coder on their output. > >> > > > >> > > > >> >> > >> > > > >> > > > >> >> - In some cases, a primitive PTransform currently > >> > doesn't > >> > > > have > >> > > > >> > > enough > >> > > > >> > > > >> >> information to infer a coder for its output > >> collection - > >> > > > e.g. > >> > > > >> > > > >> >> ParDo.of(DoFn<InputT, OutputT>) might be unable to > >> > infer a > >> > > > >> coder > >> > > > >> > > for > >> > > > >> > > > >> >> OutputT. In that case such transforms should allow > the > >> > > user > >> > > > to > >> > > > >> > > > provide a > >> > > > >> > > > >> >> coder: ParDo.of(DoFn).withOutputCoder(...) [note > that > >> > > this > >> > > > >> > differs > >> > > > >> > > > from > >> > > > >> > > > >> >> requiring the user to set a coder on the resulting > >> > > > collection] > >> > > > >> > > > >> >> > >> > > > >> > > > >> >> - Corollary: composite transforms need to only > >> configure > >> > > > their > >> > > > >> > > > primitive > >> > > > >> > > > >> >> transforms (and composite sub-transforms) properly, > >> and > >> > > give > >> > > > >> > them a > >> > > > >> > > > >> Coder > >> > > > >> > > > >> >> if needed. > >> > > > >> > > > >> >> > >> > > > >> > > > >> >> - Corollary: a PTransform with type parameters > <FooT, > >> > > BarT, > >> > > > >> ...> > >> > > > >> > > > needs > >> > > > >> > > > >> to > >> > > > >> > > > >> >> be configurable with coders for all of these, > because > >> > the > >> > > > >> > > > implementation > >> > > > >> > > > >> >> of > >> > > > >> > > > >> >> the transform may change and it may introduce > >> > intermediate > >> > > > >> > > > collections > >> > > > >> > > > >> >> involving these types. However, in many cases, > some of > >> > > these > >> > > > >> type > >> > > > >> > > > >> >> parameters appear in the type of the transform's > >> input, > >> > > > e.g. a > >> > > > >> > > > >> >> PTransform<PCollection<KV<FooT, BarT>>, > >> > > PCollection<MooT>> > >> > > > >> will > >> > > > >> > > > always > >> > > > >> > > > >> be > >> > > > >> > > > >> >> able to extract the coders for FooT and BarT from > the > >> > > input > >> > > > >> > > > PCollection, > >> > > > >> > > > >> >> so > >> > > > >> > > > >> >> the user does not need to provide them. However, a > >> coder > >> > > for > >> > > > >> BarT > >> > > > >> > > > must > >> > > > >> > > > >> be > >> > > > >> > > > >> >> provided. I think in most cases the transform > needs to > >> > be > >> > > > >> > > > configurable > >> > > > >> > > > >> >> only > >> > > > >> > > > >> >> with coders for its output. > >> > > > >> > > > >> >> > >> > > > >> > > > >> >> Here's a smooth migration path to accomplish the > >> above: > >> > > > >> > > > >> >> - Make PCollection.createPrimitiveOutputInternal() > >> > take a > >> > > > >> Coder. > >> > > > >> > > > >> >> - Make all primitive transforms optionally > >> configurable > >> > > > with a > >> > > > >> > > coder > >> > > > >> > > > for > >> > > > >> > > > >> >> their outputs, such as ParDo.of(DoFn). > >> > withOutputCoder(). > >> > > > >> > > > >> >> - By using the above, make all composite transforms > >> > > shipped > >> > > > >> with > >> > > > >> > > the > >> > > > >> > > > SDK > >> > > > >> > > > >> >> set a Coder on the collections they produce; in > some > >> > > cases, > >> > > > >> this > >> > > > >> > > will > >> > > > >> > > > >> >> require adding a withSomethingCoder() option to the > >> > > > transform > >> > > > >> and > >> > > > >> > > > >> >> propagating that coder to its sub-transforms. If > the > >> > > option > >> > > > is > >> > > > >> > > unset, > >> > > > >> > > > >> >> that's fine for now. > >> > > > >> > > > >> >> - As a result of the above, get rid of all > setCoder() > >> > > calls > >> > > > in > >> > > > >> > the > >> > > > >> > > > Beam > >> > > > >> > > > >> >> repo. The call will still be there, but it will > just > >> not > >> > > be > >> > > > >> used > >> > > > >> > > > >> anywhere > >> > > > >> > > > >> >> in the SDK or examples, and we can mark it > deprecated. > >> > > > >> > > > >> >> - Add guidance to PTransform Style Guide in line > with > >> > the > >> > > > >> above > >> > > > >> > > > >> >> > >> > > > >> > > > >> >> Does this sound like a good idea? I'm not sure how > >> > urgent > >> > > it > >> > > > >> > would > >> > > > >> > > > be to > >> > > > >> > > > >> >> actually do this, but I'd like to know whether > people > >> > > agree > >> > > > >> that > >> > > > >> > > this > >> > > > >> > > > >> is a > >> > > > >> > > > >> >> good goal in general. > >> > > > >> > > > >> >> > >> > > > >> > > > >> >> > >> > > > >> > > > >> > -- > >> > > > >> > > > >> > Jean-Baptiste Onofré > >> > > > >> > > > >> > jbono...@apache.org > >> > > > >> > > > >> > http://blog.nanthrax.net > >> > > > >> > > > >> > Talend - http://www.talend.com > >> > > > >> > > > >> > > >> > > > >> > > > >> > >> > > > >> > > > > >> > > > >> > > > >> > > > >> > > >> > > > >> > >> > > > > > >> > > > > > >> > > > > >> > > > >> > > >> >