I think there's a simpler solution than encoding to byte[]: introduce a new, specialized type to represent the restricted (alternatively-distributed?) data. The TypeDescriptor for this type can map to the specialized coder without a significant amount of potentially wasted encoding work, and it also captures the assumptions being made about the distribution of the data.
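A minimal plain-Java sketch of this idea, assuming a hypothetical wrapper type: the type itself carries the distribution assumption ("values fit in one byte"), so a coder keyed to that type can exploit it. `SmallInt` and its one-byte encoding are illustrations, not Beam APIs.

```java
// Hypothetical example: a specialized type whose existence encodes the
// assumption "values always fall in [0, 255]". A coder registered for this
// type (e.g. via a TypeDescriptor mapping) can then use one byte per value
// instead of a general-purpose four-byte int encoding.
public class SmallInt {
    public final int value;

    public SmallInt(int value) {
        if (value < 0 || value > 255) {
            throw new IllegalArgumentException("out of range: " + value);
        }
        this.value = value;
    }

    // The specialized encoding: a single byte.
    public static byte[] encode(SmallInt v) {
        return new byte[] {(byte) v.value};
    }

    public static SmallInt decode(byte[] bytes) {
        return new SmallInt(bytes[0] & 0xFF);
    }
}
```

Because the assumption lives in the type, only the PCollections that actually hold such data use the compact coder; everything else is untouched.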
On Thu, Jul 27, 2017 at 11:04 AM, Reuven Lax <re...@google.com.invalid> wrote:

> I tend to agree with Robert - it would be unfortunate if a single
> TypeDescriptor were forced to have the same encoding all through the
> pipeline. However, it's also unfortunate if this flexibility impacted every
> part of the programming model. I also think that our experience has been
> that "large scale where tiny incremental optimization represents a lot of
> cost" is far more common than people expect, especially since
> coding/decoding is often a dominant cost for such pipelines.
>
> On Thu, Jul 27, 2017 at 11:00 AM, Thomas Groh <tg...@google.com.invalid> wrote:
>
> > +1 on getting rid of setCoder; just from a Java SDK perspective, my ideal
> > world contains PCollections which don't have a user-visible way to mutate
> > them.
> >
> > My preference would be to use TypeDescriptors everywhere within Pipeline
> > construction (where possible), and utilize the CoderRegistry everywhere
> > to actually extract the appropriate type. The unfortunate difficulty of
> > having to encode a union type and the lack of variable-length generics
> > does complicate that. We could consider some way of constructing coders
> > in the registry from a collection of type descriptors (which should be
> > accessible from the point the union type is being constructed), e.g.
> > something like `getCoder(TypeDescriptor output, TypeDescriptor...
> > components)` - that does only permit a single flat level (but since this
> > is being invoked by the SDK during construction, it could also pass
> > Coder...).
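The `getCoder(TypeDescriptor output, TypeDescriptor... components)` idea (assembling the coder for a composite type from the coders of its components, one flat level deep) can be sketched in plain Java. `ToyCoders`, its string-based encoding, and the pair coder below are illustrative assumptions only, not Beam types.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Toy sketch: the coder for a composite output is constructed from the
// coders of its component types, mirroring the proposed
// getCoder(output, components...) registry call. Strings stand in for
// byte[] to keep the illustration readable.
public class ToyCoders {
    public interface Coder<T> {
        String encode(T value);
        T decode(String data);
    }

    public static final Coder<Integer> INT_CODER = new Coder<Integer>() {
        public String encode(Integer v) { return v.toString(); }
        public Integer decode(String d) { return Integer.parseInt(d); }
    };

    // One flat level of components: a KV-style pair coder built from the
    // coders of its two components.
    public static <K, V> Coder<Map.Entry<K, V>> pairCoder(
            Coder<K> keyCoder, Coder<V> valueCoder) {
        return new Coder<Map.Entry<K, V>>() {
            public String encode(Map.Entry<K, V> kv) {
                String k = keyCoder.encode(kv.getKey());
                // Length-prefix the key so decoding needs no delimiter
                // assumptions about the payloads.
                return k.length() + ":" + k + valueCoder.encode(kv.getValue());
            }
            public Map.Entry<K, V> decode(String data) {
                int colon = data.indexOf(':');
                int keyLen = Integer.parseInt(data.substring(0, colon));
                K key = keyCoder.decode(data.substring(colon + 1, colon + 1 + keyLen));
                V value = valueCoder.decode(data.substring(colon + 1 + keyLen));
                return new SimpleEntry<>(key, value);
            }
        };
    }
}
```

As Thomas notes, this shape only handles a single flat level of components, which is exactly why the union-type case is awkward.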
> > On Thu, Jul 27, 2017 at 10:22 AM, Robert Bradshaw
> > <rober...@google.com.invalid> wrote:
> >
> > > On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles
> > > <k...@google.com.invalid> wrote:
> > >
> > > > On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik
> > > > <lc...@google.com.invalid> wrote:
> > > >
> > > >> Ken/Robert, I believe users will want the ability to set the output
> > > >> coder, because coders may have intrinsic properties where the type
> > > >> isn't enough information to fully specify what I want as a user.
> > > >> Some cases I can see are:
> > > >> 1) I have a cheap and fast non-deterministic coder, but a different,
> > > >> slower coder when I want to use it as the key to a GBK. For example,
> > > >> a set coder would need to consistently order the values of the set
> > > >> when used as the key.
> > > >> 2) I know a property of the data which allows me to have a cheaper
> > > >> encoding. Imagine I know that all the strings have a common prefix,
> > > >> or that integers are in a certain range, or that a matrix is
> > > >> sparse/dense. Not all PCollections of strings / integers / matrices
> > > >> in the pipeline will have this property, just some.
> > > >> 3) Sorting comes up occasionally; traditionally in Google this was
> > > >> done by sorting the encoded version of the object lexicographically
> > > >> during a GBK. There are good lexicographic byte representations for
> > > >> ASCII strings, integers, and some IEEE number representations, which
> > > >> could be done by the use of a special coder.
> > > >
> > > > Items (1) and (3) do not require special knowledge from the user.
> > > > They are easily observed properties of a pipeline. My proposal
> > > > included full automation for both. The suggestion is new methods
> > > > .getDeterministicCoder(TypeDescriptor) and
> > > > .getLexicographicCoder(TypeDescriptor).
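Lukasz's case (3) above, an order-preserving byte encoding so that sorting encoded bytes sorts the values, can be illustrated for signed 32-bit ints. This is a sketch of the general technique, not Beam's or Google's actual coder.

```java
// Order-preserving ("lexicographic") encoding for signed 32-bit ints:
// unsigned byte-wise comparison of the encoded form agrees with numeric
// order, which is the property a sort-during-GBK scheme relies on.
public class LexicographicInt {
    public static byte[] encode(int value) {
        // Flipping the sign bit maps Integer.MIN_VALUE..MAX_VALUE onto
        // 0x00000000..0xFFFFFFFF in increasing order; big-endian bytes of
        // that value then compare correctly as unsigned bytes.
        long v = (value ^ 0x80000000L) & 0xFFFFFFFFL;
        return new byte[] {
            (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v
        };
    }

    // Unsigned lexicographic comparison of two encodings.
    public static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < 4; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return 0;
    }
}
```

Note this is a different encoding from, say, a varint coder, which is compact but not order-preserving; that tension is why case (3) seems to call for a distinct coder for the same type.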
> > > Completely agree--use cases (1) and (3) are an indirect use of Coders
> > > that are used to achieve an effect that would be better expressed
> > > directly.
> > >
> > > > (2) is an interesting hypothetical for massive scale, where tiny
> > > > incremental optimization represents a lot of cost _and_ your data
> > > > has sufficient structure to realize a benefit _and_ it needs to be
> > > > pinpointed to just some PCollections. I think our experience with
> > > > coders so far is that their existence is almost entirely negative.
> > > > It would be nice to support this vanishingly rare case without
> > > > inflicting a terrible pain point on the model and all other users.
> > >
> > > (2) is not just about cheapness; sometimes there's other structure in
> > > the data we can leverage. Consider the UnionCoder used in CoGBK --
> > > RawUnionValue has an integer value that indicates the type of its raw
> > > Object field. Unless we want to extend the type language, there's no
> > > type descriptor sufficient to infer the coder from. I'm dubious that
> > > going down the road of adding special cases is the right thing here.
> > >
> > > > For example, in those cases you could encode in your DoFn, so the
> > > > type descriptor would just be byte[].
> > >
> > > As well as being an extremely cumbersome API, this would incur the
> > > cost of coding/decoding at that DoFn boundary even if it is fused
> > > away.
> > >
> > > >> On Thu, Jul 27, 2017 at 1:34 AM, Jean-Baptiste Onofré
> > > >> <j...@nanthrax.net> wrote:
> > > >>
> > > >> > Hi,
> > > >> >
> > > >> > That's an interesting thread, and I was wondering about the
> > > >> > relationship between type descriptor and coder for a while ;)
> > > >> >
> > > >> > Today, in a PCollection, we can set the coder and we also have a
> > > >> > getTypeDescriptor(). It sounds weird to me: it should be one or
> > > >> > the other.
> > > >> > Basically, if the Coder is not used to define the type, then I
> > > >> > fully agree with Eugene.
> > > >> >
> > > >> > The PCollection should define only the type descriptor, not the
> > > >> > coder by itself: the coder can be found using the type
> > > >> > descriptor.
> > > >> >
> > > >> > With both coder and type descriptor on the PCollection, it sounds
> > > >> > a bit "decoupled" to me, and it would be possible to have a coder
> > > >> > on the PCollection that doesn't match the type descriptor.
> > > >> >
> > > >> > I think the PCollection type descriptor should be defined, and
> > > >> > the coder should be implicit based on this type descriptor.
> > > >> >
> > > >> > Thoughts?
> > > >> >
> > > >> > Regards
> > > >> > JB
> > > >> >
> > > >> > On 07/26/2017 05:25 AM, Eugene Kirpichov wrote:
> > > >> >
> > > >> >> Hello,
> > > >> >>
> > > >> >> I've worked on a few different things recently and ran
> > > >> >> repeatedly into the same issue: we do not have clear guidance on
> > > >> >> who should set the Coder on a PCollection. Is it the
> > > >> >> responsibility of the PTransform that outputs it, or is it the
> > > >> >> responsibility of the user, or is it sometimes one and sometimes
> > > >> >> the other?
> > > >> >>
> > > >> >> I believe that the answer is "it's the responsibility of the
> > > >> >> transform", and moreover that ideally PCollection.setCoder()
> > > >> >> should not exist. Instead:
> > > >> >>
> > > >> >> - Require that all transforms set a Coder on the PCollections
> > > >> >> they produce - i.e. it should never be the responsibility of the
> > > >> >> user to "fix up" a coder on a PCollection produced by a
> > > >> >> transform.
> > > >> >> - Since all transforms are composed of primitive transforms,
> > > >> >> saying "transforms must set a Coder" means simply that all
> > > >> >> *primitive* transforms must set a Coder on their output.
> > > >> >>
> > > >> >> - In some cases, a primitive PTransform currently doesn't have
> > > >> >> enough information to infer a coder for its output collection -
> > > >> >> e.g. ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a
> > > >> >> coder for OutputT. In that case, such transforms should allow
> > > >> >> the user to provide a coder: ParDo.of(DoFn).withOutputCoder(...)
> > > >> >> [note that this differs from requiring the user to set a coder
> > > >> >> on the resulting collection].
> > > >> >>
> > > >> >> - Corollary: composite transforms need only configure their
> > > >> >> primitive transforms (and composite sub-transforms) properly,
> > > >> >> and give them a Coder if needed.
> > > >> >>
> > > >> >> - Corollary: a PTransform with type parameters <FooT, BarT, ...>
> > > >> >> needs to be configurable with coders for all of these, because
> > > >> >> the implementation of the transform may change and it may
> > > >> >> introduce intermediate collections involving these types.
> > > >> >> However, in many cases, some of these type parameters appear in
> > > >> >> the type of the transform's input, e.g. a
> > > >> >> PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will
> > > >> >> always be able to extract the coders for FooT and BarT from the
> > > >> >> input PCollection, so the user does not need to provide them;
> > > >> >> however, a coder for MooT must be provided. I think in most
> > > >> >> cases the transform needs to be configurable only with coders
> > > >> >> for its output.
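The `withOutputCoder(...)` shape described in the points above can be sketched in plain Java. All names here (`ToyParDo`, `resolveOutputCoder`) are hypothetical; this is the resolution policy only, not the actual Beam ParDo API.

```java
// Sketch of "the transform owns the output coder": the user may supply a
// coder at construction time; otherwise the transform falls back to
// inference, and fails at expansion time rather than leaving the user to
// call setCoder() on the produced PCollection afterwards.
public class ToyParDo<OutputT> {
    public interface Coder<T> { /* encoding details elided */ }

    private Coder<OutputT> outputCoder;  // null => rely on inference

    public ToyParDo<OutputT> withOutputCoder(Coder<OutputT> coder) {
        this.outputCoder = coder;
        return this;
    }

    // Called when the transform creates its output collection; `inferred`
    // stands in for whatever the CoderRegistry could work out (or null).
    public Coder<OutputT> resolveOutputCoder(Coder<OutputT> inferred) {
        if (outputCoder != null) {
            return outputCoder;
        }
        if (inferred != null) {
            return inferred;
        }
        throw new IllegalStateException(
            "No coder for OutputT: call withOutputCoder(...)");
    }
}
```

The key property is that the error surfaces during pipeline construction, inside the transform, instead of as a mutation the user must remember to perform on the output.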
> > > >> >> Here's a smooth migration path to accomplish the above:
> > > >> >> - Make PCollection.createPrimitiveOutputInternal() take a Coder.
> > > >> >> - Make all primitive transforms optionally configurable with a
> > > >> >> coder for their outputs, such as
> > > >> >> ParDo.of(DoFn).withOutputCoder().
> > > >> >> - Using the above, make all composite transforms shipped with
> > > >> >> the SDK set a Coder on the collections they produce; in some
> > > >> >> cases, this will require adding a withSomethingCoder() option to
> > > >> >> the transform and propagating that coder to its sub-transforms.
> > > >> >> If the option is unset, that's fine for now.
> > > >> >> - As a result of the above, get rid of all setCoder() calls in
> > > >> >> the Beam repo. The call will still be there, but it will just
> > > >> >> not be used anywhere in the SDK or examples, and we can mark it
> > > >> >> deprecated.
> > > >> >> - Add guidance to the PTransform Style Guide in line with the
> > > >> >> above.
> > > >> >>
> > > >> >> Does this sound like a good idea? I'm not sure how urgent it
> > > >> >> would be to actually do this, but I'd like to know whether
> > > >> >> people agree that this is a good goal in general.
> > > >> >
> > > >> > --
> > > >> > Jean-Baptiste Onofré
> > > >> > jbono...@apache.org
> > > >> > http://blog.nanthrax.net
> > > >> > Talend - http://www.talend.com
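Robert's UnionCoder/RawUnionValue point earlier in the thread (the payload's type is selected by a runtime integer tag, so no single TypeDescriptor can name it) can be sketched as a toy tagged-union coder. `ToyUnion` and its names are illustrative only, not Beam's actual RawUnionValue/UnionCoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Toy analogue of UnionCoder: an integer tag picks which component coder
// handles the payload, so the encoding depends on runtime state that a
// type descriptor alone cannot express -- which is why this coder cannot
// be inferred from types.
public class ToyUnion {
    public interface Coder<T> {
        byte[] encode(T value);
        T decode(byte[] bytes);
    }

    // Analogue of RawUnionValue: a tag plus a raw Object payload.
    public static final class TaggedValue {
        public final int tag;
        public final Object value;
        public TaggedValue(int tag, Object value) {
            this.tag = tag;
            this.value = value;
        }
    }

    public static final Coder<Object> STRING_CODER = new Coder<Object>() {
        public byte[] encode(Object v) {
            return ((String) v).getBytes(StandardCharsets.UTF_8);
        }
        public Object decode(byte[] b) {
            return new String(b, StandardCharsets.UTF_8);
        }
    };

    // One tag byte, then the chosen component coder's bytes.
    public static byte[] encode(TaggedValue v, List<Coder<Object>> components) {
        byte[] payload = components.get(v.tag).encode(v.value);
        byte[] out = new byte[payload.length + 1];
        out[0] = (byte) v.tag;
        System.arraycopy(payload, 0, out, 1, payload.length);
        return out;
    }

    public static TaggedValue decode(byte[] bytes, List<Coder<Object>> components) {
        int tag = bytes[0] & 0xFF;
        return new TaggedValue(
            tag, components.get(tag).decode(Arrays.copyOfRange(bytes, 1, bytes.length)));
    }
}
```

The component coders are supplied at construction time by the SDK (here, as a `List`), which matches Thomas's observation that the construction site has the component coders in hand even though the type system cannot describe the union.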