I think there's a simpler solution than encoding to byte[]: introduce a new, specialized type to represent the restricted (alternatively-distributed?) data. The TypeDescriptor for this type can map to the specialized coder without a significant amount of potentially wasted encoding work, and it also captures the assumptions being made about the distribution of the data.
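A minimal plain-Java sketch of this idea, assuming a hypothetical wrapper type: the type itself carries the distribution assumption ("values fit in one byte"), so a coder keyed to that type can exploit it. `SmallInt` and its one-byte encoding are illustrations, not Beam APIs.

```java
// Hypothetical example: a specialized type whose existence encodes the
// assumption "values always fall in [0, 255]". A coder registered for this
// type (e.g. via a TypeDescriptor mapping) can then use one byte per value
// instead of a general-purpose four-byte int encoding.
public class SmallInt {
    public final int value;

    public SmallInt(int value) {
        if (value < 0 || value > 255) {
            throw new IllegalArgumentException("out of range: " + value);
        }
        this.value = value;
    }

    // The specialized encoding: a single byte.
    public static byte[] encode(SmallInt v) {
        return new byte[] {(byte) v.value};
    }

    public static SmallInt decode(byte[] bytes) {
        return new SmallInt(bytes[0] & 0xFF);
    }
}
```

Because the assumption lives in the type, only the PCollections that actually hold such data use the compact coder; everything else is untouched.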
On Thu, Jul 27, 2017 at 11:04 AM, Reuven Lax <re...@google.com.invalid> wrote:

> I tend to agree with Robert - it would be unfortunate if a single
> TypeDescriptor were forced to have the same encoding all through the
> pipeline. However, it's also unfortunate if this flexibility impacted every
> part of the programming model. I also think that our experience has been
> that "large scale where tiny incremental optimization represents a lot of
> cost" is far more common than people expect, especially since
> coding/decoding is often a dominant cost for such pipelines.
>
> On Thu, Jul 27, 2017 at 11:00 AM, Thomas Groh <tg...@google.com.invalid> wrote:
>
> > +1 on getting rid of setCoder; just from a Java SDK perspective, my ideal
> > world contains PCollections which don't have a user-visible way to mutate
> > them.
> >
> > My preference would be to use TypeDescriptors everywhere within Pipeline
> > construction (where possible), and utilize the CoderRegistry everywhere
> > to actually extract the appropriate type. The unfortunate difficulty of
> > having to encode a union type and the lack of variable-length generics
> > does complicate that. We could consider some way of constructing coders
> > in the registry from a collection of type descriptors (which should be
> > accessible from the point the union type is being constructed), e.g.
> > something like `getCoder(TypeDescriptor output, TypeDescriptor...
> > components)` - that does only permit a single flat level (but since this
> > is being invoked by the SDK during construction, it could also pass
> > Coder...).
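The `getCoder(TypeDescriptor output, TypeDescriptor... components)` idea (assembling the coder for a composite type from the coders of its components, one flat level deep) can be sketched in plain Java. `ToyCoders`, its string-based encoding, and the pair coder below are illustrative assumptions only, not Beam types.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Toy sketch: the coder for a composite output is constructed from the
// coders of its component types, mirroring the proposed
// getCoder(output, components...) registry call. Strings stand in for
// byte[] to keep the illustration readable.
public class ToyCoders {
    public interface Coder<T> {
        String encode(T value);
        T decode(String data);
    }

    public static final Coder<Integer> INT_CODER = new Coder<Integer>() {
        public String encode(Integer v) { return v.toString(); }
        public Integer decode(String d) { return Integer.parseInt(d); }
    };

    // One flat level of components: a KV-style pair coder built from the
    // coders of its two components.
    public static <K, V> Coder<Map.Entry<K, V>> pairCoder(
            Coder<K> keyCoder, Coder<V> valueCoder) {
        return new Coder<Map.Entry<K, V>>() {
            public String encode(Map.Entry<K, V> kv) {
                String k = keyCoder.encode(kv.getKey());
                // Length-prefix the key so decoding needs no delimiter
                // assumptions about the payloads.
                return k.length() + ":" + k + valueCoder.encode(kv.getValue());
            }
            public Map.Entry<K, V> decode(String data) {
                int colon = data.indexOf(':');
                int keyLen = Integer.parseInt(data.substring(0, colon));
                K key = keyCoder.decode(data.substring(colon + 1, colon + 1 + keyLen));
                V value = valueCoder.decode(data.substring(colon + 1 + keyLen));
                return new SimpleEntry<>(key, value);
            }
        };
    }
}
```

As Thomas notes, this shape only handles a single flat level of components, which is exactly why the union-type case is awkward.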
> > On Thu, Jul 27, 2017 at 10:22 AM, Robert Bradshaw
> > <rober...@google.com.invalid> wrote:
> >
> > > On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles
> > > <k...@google.com.invalid> wrote:
> > >
> > > > On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik
> > > > <lc...@google.com.invalid> wrote:
> > > >
> > > >> Ken/Robert, I believe users will want the ability to set the output
> > > >> coder, because coders may have intrinsic properties where the type
> > > >> isn't enough information to fully specify what I want as a user.
> > > >> Some cases I can see are:
> > > >> 1) I have a cheap and fast non-deterministic coder, but a different,
> > > >> slower coder when I want to use it as the key to a GBK. For example,
> > > >> a set coder would need to consistently order the values of the set
> > > >> when used as the key.
> > > >> 2) I know a property of the data which allows me to have a cheaper
> > > >> encoding. Imagine I know that all the strings have a common prefix,
> > > >> or that integers are in a certain range, or that a matrix is
> > > >> sparse/dense. Not all PCollections of strings / integers / matrices
> > > >> in the pipeline will have this property, just some.
> > > >> 3) Sorting comes up occasionally; traditionally in Google this was
> > > >> done by sorting the encoded version of the object lexicographically
> > > >> during a GBK. There are good lexicographic byte representations for
> > > >> ASCII strings, integers, and some IEEE number representations, which
> > > >> could be done by the use of a special coder.
> > > >
> > > > Items (1) and (3) do not require special knowledge from the user.
> > > > They are easily observed properties of a pipeline. My proposal
> > > > included full automation for both. The suggestion is new methods
> > > > .getDeterministicCoder(TypeDescriptor) and
> > > > .getLexicographicCoder(TypeDescriptor).
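Lukasz's case (3) above, an order-preserving byte encoding so that sorting encoded bytes sorts the values, can be illustrated for signed 32-bit ints. This is a sketch of the general technique, not Beam's or Google's actual coder.

```java
// Order-preserving ("lexicographic") encoding for signed 32-bit ints:
// unsigned byte-wise comparison of the encoded form agrees with numeric
// order, which is the property a sort-during-GBK scheme relies on.
public class LexicographicInt {
    public static byte[] encode(int value) {
        // Flipping the sign bit maps Integer.MIN_VALUE..MAX_VALUE onto
        // 0x00000000..0xFFFFFFFF in increasing order; big-endian bytes of
        // that value then compare correctly as unsigned bytes.
        long v = (value ^ 0x80000000L) & 0xFFFFFFFFL;
        return new byte[] {
            (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v
        };
    }

    // Unsigned lexicographic comparison of two encodings.
    public static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < 4; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return 0;
    }
}
```

Note this is a different encoding from, say, a varint coder, which is compact but not order-preserving; that tension is why case (3) seems to call for a distinct coder for the same type.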
> > > Completely agree--use cases (1) and (3) are an indirect use of Coders
> > > that are used to achieve an effect that would be better expressed
> > > directly.
> > >
> > > > (2) is an interesting hypothetical for massive scale, where tiny
> > > > incremental optimization represents a lot of cost _and_ your data
> > > > has sufficient structure to realize a benefit _and_ it needs to be
> > > > pinpointed to just some PCollections. I think our experience with
> > > > coders so far is that their existence is almost entirely negative.
> > > > It would be nice to support this vanishingly rare case without
> > > > inflicting a terrible pain point on the model and all other users.
> > >
> > > (2) is not just about cheapness; sometimes there's other structure in
> > > the data we can leverage. Consider the UnionCoder used in CoGBK --
> > > RawUnionValue has an integer value that indicates the type of its raw
> > > Object field. Unless we want to extend the type language, there's no
> > > type descriptor sufficient to infer the coder from. I'm dubious that
> > > going down the road of adding special cases is the right thing here.
> > >
> > > > For example, in those cases you could encode in your DoFn, so the
> > > > type descriptor would just be byte[].
> > >
> > > As well as being an extremely cumbersome API, this would incur the
> > > cost of coding/decoding at that DoFn boundary even if it is fused
> > > away.
> > >
> > > >> On Thu, Jul 27, 2017 at 1:34 AM, Jean-Baptiste Onofré
> > > >> <j...@nanthrax.net> wrote:
> > > >>
> > > >> > Hi,
> > > >> >
> > > >> > That's an interesting thread, and I was wondering about the
> > > >> > relationship between type descriptor and coder for a while ;)
> > > >> >
> > > >> > Today, in a PCollection, we can set the coder and we also have a
> > > >> > getTypeDescriptor(). It sounds weird to me: it should be one or
> > > >> > the other.
> > > >> > Basically, if the Coder is not used to define the type, then I
> > > >> > fully agree with Eugene.
> > > >> >
> > > >> > The PCollection should define only the type descriptor, not the
> > > >> > coder by itself: the coder can be found using the type
> > > >> > descriptor.
> > > >> >
> > > >> > With both coder and type descriptor on the PCollection, it sounds
> > > >> > a bit "decoupled" to me, and it would be possible to have a coder
> > > >> > on the PCollection that doesn't match the type descriptor.
> > > >> >
> > > >> > I think the PCollection type descriptor should be defined, and
> > > >> > the coder should be implicit based on this type descriptor.
> > > >> >
> > > >> > Thoughts?
> > > >> >
> > > >> > Regards
> > > >> > JB
> > > >> >
> > > >> > On 07/26/2017 05:25 AM, Eugene Kirpichov wrote:
> > > >> >
> > > >> >> Hello,
> > > >> >>
> > > >> >> I've worked on a few different things recently and ran
> > > >> >> repeatedly into the same issue: we do not have clear guidance on
> > > >> >> who should set the Coder on a PCollection. Is it the
> > > >> >> responsibility of the PTransform that outputs it, or is it the
> > > >> >> responsibility of the user, or is it sometimes one and sometimes
> > > >> >> the other?
> > > >> >>
> > > >> >> I believe that the answer is "it's the responsibility of the
> > > >> >> transform", and moreover that ideally PCollection.setCoder()
> > > >> >> should not exist. Instead:
> > > >> >>
> > > >> >> - Require that all transforms set a Coder on the PCollections
> > > >> >> they produce - i.e. it should never be the responsibility of the
> > > >> >> user to "fix up" a coder on a PCollection produced by a
> > > >> >> transform.
> > > >> >> - Since all transforms are composed of primitive transforms,
> > > >> >> saying "transforms must set a Coder" means simply that all
> > > >> >> *primitive* transforms must set a Coder on their output.
> > > >> >>
> > > >> >> - In some cases, a primitive PTransform currently doesn't have
> > > >> >> enough information to infer a coder for its output collection -
> > > >> >> e.g. ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a
> > > >> >> coder for OutputT. In that case, such transforms should allow
> > > >> >> the user to provide a coder: ParDo.of(DoFn).withOutputCoder(...)
> > > >> >> [note that this differs from requiring the user to set a coder
> > > >> >> on the resulting collection].
> > > >> >>
> > > >> >> - Corollary: composite transforms need only configure their
> > > >> >> primitive transforms (and composite sub-transforms) properly,
> > > >> >> and give them a Coder if needed.
> > > >> >>
> > > >> >> - Corollary: a PTransform with type parameters <FooT, BarT, ...>
> > > >> >> needs to be configurable with coders for all of these, because
> > > >> >> the implementation of the transform may change and it may
> > > >> >> introduce intermediate collections involving these types.
> > > >> >> However, in many cases, some of these type parameters appear in
> > > >> >> the type of the transform's input, e.g. a
> > > >> >> PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will
> > > >> >> always be able to extract the coders for FooT and BarT from the
> > > >> >> input PCollection, so the user does not need to provide them;
> > > >> >> however, a coder for MooT must be provided. I think in most
> > > >> >> cases the transform needs to be configurable only with coders
> > > >> >> for its output.
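The `withOutputCoder(...)` shape described in the points above can be sketched in plain Java. All names here (`ToyParDo`, `resolveOutputCoder`) are hypothetical; this is the resolution policy only, not the actual Beam ParDo API.

```java
// Sketch of "the transform owns the output coder": the user may supply a
// coder at construction time; otherwise the transform falls back to
// inference, and fails at expansion time rather than leaving the user to
// call setCoder() on the produced PCollection afterwards.
public class ToyParDo<OutputT> {
    public interface Coder<T> { /* encoding details elided */ }

    private Coder<OutputT> outputCoder;  // null => rely on inference

    public ToyParDo<OutputT> withOutputCoder(Coder<OutputT> coder) {
        this.outputCoder = coder;
        return this;
    }

    // Called when the transform creates its output collection; `inferred`
    // stands in for whatever the CoderRegistry could work out (or null).
    public Coder<OutputT> resolveOutputCoder(Coder<OutputT> inferred) {
        if (outputCoder != null) {
            return outputCoder;
        }
        if (inferred != null) {
            return inferred;
        }
        throw new IllegalStateException(
            "No coder for OutputT: call withOutputCoder(...)");
    }
}
```

The key property is that the error surfaces during pipeline construction, inside the transform, instead of as a mutation the user must remember to perform on the output.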
> > > >> >> Here's a smooth migration path to accomplish the above:
> > > >> >> - Make PCollection.createPrimitiveOutputInternal() take a Coder.
> > > >> >> - Make all primitive transforms optionally configurable with a
> > > >> >> coder for their outputs, such as
> > > >> >> ParDo.of(DoFn).withOutputCoder().
> > > >> >> - Using the above, make all composite transforms shipped with
> > > >> >> the SDK set a Coder on the collections they produce; in some
> > > >> >> cases, this will require adding a withSomethingCoder() option to
> > > >> >> the transform and propagating that coder to its sub-transforms.
> > > >> >> If the option is unset, that's fine for now.
> > > >> >> - As a result of the above, get rid of all setCoder() calls in
> > > >> >> the Beam repo. The call will still be there, but it will just
> > > >> >> not be used anywhere in the SDK or examples, and we can mark it
> > > >> >> deprecated.
> > > >> >> - Add guidance to the PTransform Style Guide in line with the
> > > >> >> above.
> > > >> >>
> > > >> >> Does this sound like a good idea? I'm not sure how urgent it
> > > >> >> would be to actually do this, but I'd like to know whether
> > > >> >> people agree that this is a good goal in general.
> > > >> >
> > > >> > --
> > > >> > Jean-Baptiste Onofré
> > > >> > jbono...@apache.org
> > > >> > http://blog.nanthrax.net
> > > >> > Talend - http://www.talend.com
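Robert's UnionCoder/RawUnionValue point earlier in the thread (the payload's type is selected by a runtime integer tag, so no single TypeDescriptor can name it) can be sketched as a toy tagged-union coder. `ToyUnion` and its names are illustrative only, not Beam's actual RawUnionValue/UnionCoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Toy analogue of UnionCoder: an integer tag picks which component coder
// handles the payload, so the encoding depends on runtime state that a
// type descriptor alone cannot express -- which is why this coder cannot
// be inferred from types.
public class ToyUnion {
    public interface Coder<T> {
        byte[] encode(T value);
        T decode(byte[] bytes);
    }

    // Analogue of RawUnionValue: a tag plus a raw Object payload.
    public static final class TaggedValue {
        public final int tag;
        public final Object value;
        public TaggedValue(int tag, Object value) {
            this.tag = tag;
            this.value = value;
        }
    }

    public static final Coder<Object> STRING_CODER = new Coder<Object>() {
        public byte[] encode(Object v) {
            return ((String) v).getBytes(StandardCharsets.UTF_8);
        }
        public Object decode(byte[] b) {
            return new String(b, StandardCharsets.UTF_8);
        }
    };

    // One tag byte, then the chosen component coder's bytes.
    public static byte[] encode(TaggedValue v, List<Coder<Object>> components) {
        byte[] payload = components.get(v.tag).encode(v.value);
        byte[] out = new byte[payload.length + 1];
        out[0] = (byte) v.tag;
        System.arraycopy(payload, 0, out, 1, payload.length);
        return out;
    }

    public static TaggedValue decode(byte[] bytes, List<Coder<Object>> components) {
        int tag = bytes[0] & 0xFF;
        return new TaggedValue(
            tag, components.get(tag).decode(Arrays.copyOfRange(bytes, 1, bytes.length)));
    }
}
```

The component coders are supplied at construction time by the SDK (here, as a `List`), which matches Thomas's observation that the construction site has the component coders in hand even though the type system cannot describe the union.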