On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik <lc...@google.com.invalid> wrote: > > Ken/Robert, I believe users will want the ability to set the output coder > because coders may have intrinsic properties where the type isn't enough > information to fully specify what I want as a user. Some cases I can see > are: > 1) I have a cheap and fast non-deterministic coder but a different slower > coder when I want to use it as the key to a GBK, For example with a set > coder, it would need to consistently order the values of the set when used > as the key. > 2) I know a property of the data which allows me to have a cheaper > encoding. Imagine I know that all the strings have a common prefix or > integers that are in a certain range, or that a matrix is sparse/dense. Not > all PCollections of strings / integers / matrices in the pipeline will have > this property, just some. > 3) Sorting comes up occasionally, traditionally in Google this was done by > sorting the encoded version of the object lexicographically during a GBK. > There are good lexicographical byte representations for ASCII strings, > integers, and for some IEEE number representations which could be done by > the use of a special coder. >
Items (1) and (3) do not require special knowledge from the user. They are easily observed properties of a pipeline. My proposal included full automation for both. The suggestion is new methods .getDeterministicCoder(TypeDescriptor) and .getLexicographicCoder(TypeDescriptor). (2) is an interesting hypothetical for massive scale where tiny incremental optimization represents a lot of cost _and_ your data has sufficient structure to realize a benefit _and_ it needs to be pinpointed to just some PCollections. I think our experience with coders so far is that their existence is almost entirely negative. It would be nice to support this vanishingly rare case without inflicting a terrible pain point on the model and all other users. For example, in those cases you could encode in your DoFn so the type descriptor would just be byte[]. Kenn > > On Thu, Jul 27, 2017 at 1:34 AM, Jean-Baptiste Onofré <j...@nanthrax.net> > wrote: > > > Hi, > > > > That's an interesting thread and I was wondering the relationship between > > type descriptor and coder for a while ;) > > > > Today, in a PCollection, we can set the coder and we also have a > > getTypeDescriptor(). It sounds weird to me: it should be one or the > other. > > > > Basically, if the Coder is not used to define the type, than, I fully > > agree with Eugene. > > > > Basically, the PCollection should define only the type descriptor, not > the > > coder by itself: the coder can be found using the type descriptor. > > > > With both coder and type descriptor on the PCollection, it sounds a big > > "decoupled" to me and it would be possible to have a coder on the > > PCollection that doesn't match the type descriptor. > > > > I think PCollection type descriptor should be defined, and the coder > > should be implicit based on this type descriptor. > > > > Thoughts ? > > > > Regards > > JB > > > > > > On 07/26/2017 05:25 AM, Eugene Kirpichov wrote: > > > >> Hello, > >> > >> I've worked on a few different things recently and ran repeatedly into > the > >> same issue: that we do not have clear guidance on who should set the > Coder > >> on a PCollection: is it responsibility of the PTransform that outputs > it, > >> or is it responsibility of the user, or is it sometimes one and > sometimes > >> the other? > >> > >> I believe that the answer is "it's responsibility of the transform" and > >> moreover that ideally PCollection.setCoder() should not exist. Instead: > >> > >> - Require that all transforms set a Coder on the PCollection's they > >> produce > >> - i.e. it should never be responsibility of the user to "fix up" a coder > >> on > >> a PCollection produced by a transform. > >> > >> - Since all transforms are composed of primitive transforms, saying > >> "transforms must set a Coder" means simply that all *primitive* > transforms > >> must set a Coder on their output. > >> > >> - In some cases, a primitive PTransform currently doesn't have enough > >> information to infer a coder for its output collection - e.g. > >> ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a coder for > >> OutputT. In that case such transforms should allow the user to provide a > >> coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs from > >> requiring the user to set a coder on the resulting collection] > >> > >> - Corollary: composite transforms need to only configure their primitive > >> transforms (and composite sub-transforms) properly, and give them a > Coder > >> if needed. > >> > >> - Corollary: a PTransform with type parameters <FooT, BarT, ...> needs > to > >> be configurable with coders for all of these, because the implementation > >> of > >> the transform may change and it may introduce intermediate collections > >> involving these types. However, in many cases, some of these type > >> parameters appear in the type of the transform's input, e.g. a > >> PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will always > be > >> able to extract the coders for FooT and BarT from the input PCollection, > >> so > >> the user does not need to provide them. However, a coder for BarT must > be > >> provided. I think in most cases the transform needs to be configurable > >> only > >> with coders for its output. > >> > >> Here's a smooth migration path to accomplish the above: > >> - Make PCollection.createPrimitiveOutputInternal() take a Coder. > >> - Make all primitive transforms optionally configurable with a coder for > >> their outputs, such as ParDo.of(DoFn).withOutputCoder(). > >> - By using the above, make all composite transforms shipped with the SDK > >> set a Coder on the collections they produce; in some cases, this will > >> require adding a withSomethingCoder() option to the transform and > >> propagating that coder to its sub-transforms. If the option is unset, > >> that's fine for now. > >> - As a result of the above, get rid of all setCoder() calls in the Beam > >> repo. The call will still be there, but it will just not be used > anywhere > >> in the SDK or examples, and we can mark it deprecated. > >> - Add guidance to PTransform Style Guide in line with the above > >> > >> Does this sound like a good idea? I'm not sure how urgent it would be to > >> actually do this, but I'd like to know whether people agree that this > is a > >> good goal in general. > >> > >> > > -- > > Jean-Baptiste Onofré > > jbono...@apache.org > > http://blog.nanthrax.net > > Talend - http://www.talend.com > > >