Another thought on this: setting a custom coder to support a special data distribution is likely often a property of the input to the pipeline. So setting a coder during pipeline construction - more generally, when writing a composite transform for reuse - you might not actually have the needed information. But setting up a special indicator type descriptor lets your users map that type descriptor to a coder that works well for their data.
But Robert's example of RawUnionValue seems like a deal breaker for all approaches. It really requires .getCoder() during expand() and explicitly building coders encoding information that is cumbersome to get into a TypeDescriptor. While making up new type languages is a comfortable activity for me :-) I don't think we should head down that path, for our users' sake. So I'll stop hoping we can eliminate this pain point for now. Kenn On Thu, Jul 27, 2017 at 8:48 PM, Kenneth Knowles <k...@google.com> wrote: > On Thu, Jul 27, 2017 at 11:18 AM, Thomas Groh <tg...@google.com.invalid> > wrote: > >> introduce a >> new, specialized type to represent the restricted >> (alternatively-distributed?) data. The TypeDescriptor for this type can >> map >> to the specialized coder, without having to perform a significant degree >> of >> potentially wasted encoding work, plus it includes the assumptions that >> are >> being made about the distribution of data. >> > > This is a very cool idea, in theory :-) > > For complex types with a few allocations involved and/or nontrivial > deserialization, or when a pipeline does a lot of real work, I think the > wrapper cost won't be perceptible. > > But for more primitive types in pipelines that don't really do much > computation but just move data around, I think it could matter. Certainly > there are languages with constructs to allow type wrappers at zero cost > (Haskell's `newtype`). > > This is all just speculation until we measure, like most of this thread. > > Kenn > > >> > On Thu, Jul 27, 2017 at 11:00 AM, Thomas Groh <tg...@google.com.invalid >> > >> > wrote: >> > >> > > +1 on getting rid of setCoder; just from a Java SDK perspective, my >> ideal >> > > world contains PCollections which don't have a user-visible way to >> mutate >> > > them. >> > > >> > > My preference would be to use TypeDescriptors everywhere within >> Pipeline >> > > construction (where possible), and utilize the CoderRegistry >> everywhere >> > to >> > > actually extract the appropriate type. The unfortunate difficulty of >> > having >> > > to encode a union type and the lack of variable-length generics does >> > > complicate that. We could consider some way of constructing coders in >> the >> > > registry from a collection of type descriptors (which should be >> > accessible >> > > from the point the union-type is being constructed), e.g. something >> like >> > > `getCoder(TypeDescriptor output, TypeDescriptor... components)` - that >> > does >> > > only permit a single flat level (but since this is being invoked by >> the >> > SDK >> > > during construction it could also pass Coder...). >> > > >> > > >> > > >> > > On Thu, Jul 27, 2017 at 10:22 AM, Robert Bradshaw < >> > > rober...@google.com.invalid> wrote: >> > > >> > > > On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles >> > > > <k...@google.com.invalid> wrote: >> > > > > On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik >> > <lc...@google.com.invalid >> > > > >> > > > > wrote: >> > > > >> >> > > > >> Ken/Robert, I believe users will want the ability to set the >> output >> > > > coder >> > > > >> because coders may have intrinsic properties where the type isn't >> > > enough >> > > > >> information to fully specify what I want as a user. Some cases I >> can >> > > see >> > > > >> are: >> > > > >> 1) I have a cheap and fast non-deterministic coder but a >> different >> > > > slower >> > > > >> coder when I want to use it as the key to a GBK, For example >> with a >> > > set >> > > > >> coder, it would need to consistently order the values of the set >> > when >> > > > used >> > > > >> as the key. >> > > > >> 2) I know a property of the data which allows me to have a >> cheaper >> > > > >> encoding. Imagine I know that all the strings have a common >> prefix >> > or >> > > > >> integers that are in a certain range, or that a matrix is >> > > sparse/dense. >> > > > Not >> > > > >> all PCollections of strings / integers / matrices in the pipeline >> > will >> > > > have >> > > > >> this property, just some. >> > > > >> 3) Sorting comes up occasionally, traditionally in Google this >> was >> > > done >> > > > by >> > > > >> sorting the encoded version of the object lexicographically >> during a >> > > > GBK. >> > > > >> There are good lexicographical byte representations for ASCII >> > strings, >> > > > >> integers, and for some IEEE number representations which could be >> > done >> > > > by >> > > > >> the use of a special coder. >> > > > >> >> > > > > >> > > > > Items (1) and (3) do not require special knowledge from the user. >> > They >> > > > are >> > > > > easily observed properties of a pipeline. My proposal included >> full >> > > > > automation for both. The suggestion is new methods >> > > > > .getDeterministicCoder(TypeDescriptor) and >> > > > > .getLexicographicCoder(TypeDescriptor). >> > > > >> > > > Completely agree--usecases (1) and (3) are an indirect use of Coders >> > > > that are used to achieve an effect that would be better expressed >> > > > directly. >> > > > >> > > > > (2) is an interesting hypothetical for massive scale where tiny >> > > > incremental >> > > > > optimization represents a lot of cost _and_ your data has >> sufficient >> > > > > structure to realize a benefit _and_ it needs to be pinpointed to >> > just >> > > > some >> > > > > PCollections. I think our experience with coders so far is that >> their >> > > > > existence is almost entirely negative. It would be nice to support >> > this >> > > > > vanishingly rare case without inflicting a terrible pain point on >> the >> > > > model >> > > > > and all other users. >> > > > >> > > > (2) is not just about cheapness, sometimes there's other structure >> in >> > > > the data we can leverage. Consider the UnionCoder used in >> > > > CoGBK--RawUnionValue has an integer value that specifies indicates >> the >> > > > type of it's raw Object field. Unless we want to extend the type >> > > > language, there's not a sufficient type descriptor that can be used >> to >> > > > infer the coder. I'm dubious going down the road of adding special >> > > > cases is the right thing here. >> > > > >> > > > > For example, in those cases you could encode in your >> > > > > DoFn so the type descriptor would just be byte[]. >> > > > >> > > > As well as being an extremely cumbersome API, this would incur the >> > > > cost of coding/decoding at that DoFn boundary even if it is fused >> > > > away. >> > > > >> > > > >> On Thu, Jul 27, 2017 at 1:34 AM, Jean-Baptiste Onofré < >> > > j...@nanthrax.net> >> > > > >> wrote: >> > > > >> >> > > > >> > Hi, >> > > > >> > >> > > > >> > That's an interesting thread and I was wondering the >> relationship >> > > > between >> > > > >> > type descriptor and coder for a while ;) >> > > > >> > >> > > > >> > Today, in a PCollection, we can set the coder and we also have >> a >> > > > >> > getTypeDescriptor(). It sounds weird to me: it should be one or >> > the >> > > > >> other. >> > > > >> > >> > > > >> > Basically, if the Coder is not used to define the type, than, I >> > > fully >> > > > >> > agree with Eugene. >> > > > >> > >> > > > >> > Basically, the PCollection should define only the type >> descriptor, >> > > not >> > > > >> the >> > > > >> > coder by itself: the coder can be found using the type >> descriptor. >> > > > >> > >> > > > >> > With both coder and type descriptor on the PCollection, it >> sounds >> > a >> > > > big >> > > > >> > "decoupled" to me and it would be possible to have a coder on >> the >> > > > >> > PCollection that doesn't match the type descriptor. >> > > > >> > >> > > > >> > I think PCollection type descriptor should be defined, and the >> > coder >> > > > >> > should be implicit based on this type descriptor. >> > > > >> > >> > > > >> > Thoughts ? >> > > > >> > >> > > > >> > Regards >> > > > >> > JB >> > > > >> > >> > > > >> > >> > > > >> > On 07/26/2017 05:25 AM, Eugene Kirpichov wrote: >> > > > >> > >> > > > >> >> Hello, >> > > > >> >> >> > > > >> >> I've worked on a few different things recently and ran >> repeatedly >> > > > into >> > > > >> the >> > > > >> >> same issue: that we do not have clear guidance on who should >> set >> > > the >> > > > >> Coder >> > > > >> >> on a PCollection: is it responsibility of the PTransform that >> > > outputs >> > > > >> it, >> > > > >> >> or is it responsibility of the user, or is it sometimes one >> and >> > > > >> sometimes >> > > > >> >> the other? >> > > > >> >> >> > > > >> >> I believe that the answer is "it's responsibility of the >> > transform" >> > > > and >> > > > >> >> moreover that ideally PCollection.setCoder() should not >> exist. >> > > > Instead: >> > > > >> >> >> > > > >> >> - Require that all transforms set a Coder on the PCollection's >> > they >> > > > >> >> produce >> > > > >> >> - i.e. it should never be responsibility of the user to "fix >> up" >> > a >> > > > coder >> > > > >> >> on >> > > > >> >> a PCollection produced by a transform. >> > > > >> >> >> > > > >> >> - Since all transforms are composed of primitive transforms, >> > saying >> > > > >> >> "transforms must set a Coder" means simply that all >> *primitive* >> > > > >> transforms >> > > > >> >> must set a Coder on their output. >> > > > >> >> >> > > > >> >> - In some cases, a primitive PTransform currently doesn't have >> > > enough >> > > > >> >> information to infer a coder for its output collection - e.g. >> > > > >> >> ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a >> coder >> > > for >> > > > >> >> OutputT. In that case such transforms should allow the user to >> > > > provide a >> > > > >> >> coder: ParDo.of(DoFn).withOutputCoder(...) [note that this >> > differs >> > > > from >> > > > >> >> requiring the user to set a coder on the resulting collection] >> > > > >> >> >> > > > >> >> - Corollary: composite transforms need to only configure their >> > > > primitive >> > > > >> >> transforms (and composite sub-transforms) properly, and give >> > them a >> > > > >> Coder >> > > > >> >> if needed. >> > > > >> >> >> > > > >> >> - Corollary: a PTransform with type parameters <FooT, BarT, >> ...> >> > > > needs >> > > > >> to >> > > > >> >> be configurable with coders for all of these, because the >> > > > implementation >> > > > >> >> of >> > > > >> >> the transform may change and it may introduce intermediate >> > > > collections >> > > > >> >> involving these types. However, in many cases, some of these >> type >> > > > >> >> parameters appear in the type of the transform's input, e.g. a >> > > > >> >> PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> >> will >> > > > always >> > > > >> be >> > > > >> >> able to extract the coders for FooT and BarT from the input >> > > > PCollection, >> > > > >> >> so >> > > > >> >> the user does not need to provide them. However, a coder for >> BarT >> > > > must >> > > > >> be >> > > > >> >> provided. I think in most cases the transform needs to be >> > > > configurable >> > > > >> >> only >> > > > >> >> with coders for its output. >> > > > >> >> >> > > > >> >> Here's a smooth migration path to accomplish the above: >> > > > >> >> - Make PCollection.createPrimitiveOutputInternal() take a >> Coder. >> > > > >> >> - Make all primitive transforms optionally configurable with a >> > > coder >> > > > for >> > > > >> >> their outputs, such as ParDo.of(DoFn).withOutputCoder(). >> > > > >> >> - By using the above, make all composite transforms shipped >> with >> > > the >> > > > SDK >> > > > >> >> set a Coder on the collections they produce; in some cases, >> this >> > > will >> > > > >> >> require adding a withSomethingCoder() option to the transform >> and >> > > > >> >> propagating that coder to its sub-transforms. If the option is >> > > unset, >> > > > >> >> that's fine for now. >> > > > >> >> - As a result of the above, get rid of all setCoder() calls in >> > the >> > > > Beam >> > > > >> >> repo. The call will still be there, but it will just not be >> used >> > > > >> anywhere >> > > > >> >> in the SDK or examples, and we can mark it deprecated. >> > > > >> >> - Add guidance to PTransform Style Guide in line with the >> above >> > > > >> >> >> > > > >> >> Does this sound like a good idea? I'm not sure how urgent it >> > would >> > > > be to >> > > > >> >> actually do this, but I'd like to know whether people agree >> that >> > > this >> > > > >> is a >> > > > >> >> good goal in general. >> > > > >> >> >> > > > >> >> >> > > > >> > -- >> > > > >> > Jean-Baptiste Onofré >> > > > >> > jbono...@apache.org >> > > > >> > http://blog.nanthrax.net >> > > > >> > Talend - http://www.talend.com >> > > > >> > >> > > > >> >> > > > >> > > >> > >> > >