Re: Requiring PTransform to set a coder on its resulting collections

Eugene Kirpichov Thu, 10 Aug 2017 16:58:13 -0700

I've updated the guidance on setting coders in the PTransform Style Guide
https://beam.apache.org/contribute/ptransform-style-guide/#coders
according to this discussion:
https://github.com/apache/beam-site/pull/279
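
For readers skimming the thread, here is a minimal, self-contained sketch of the pattern under discussion: the transform sets the coder on its output, and the user supplies a coder by configuring the transform rather than by mutating the resulting collection. All names here (ToyCoder, ToyCollection, ToyMap) are made up for illustration and are not Beam classes; withOutputCoder() is only the method name proposed in this thread, so treat it as an assumption rather than as SDK API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Toy stand-ins for illustration only; none of these are Beam classes.
interface ToyCoder<T> {
  byte[] encode(T value);
}

final class ToyStringCoder implements ToyCoder<String> {
  @Override
  public byte[] encode(String value) {
    return value.getBytes(StandardCharsets.UTF_8);
  }
}

// The coder is fixed at construction; there is deliberately no setCoder().
final class ToyCollection<T> {
  private final List<T> elements;
  private final ToyCoder<T> coder;

  ToyCollection(List<T> elements, ToyCoder<T> coder) {
    this.elements = Collections.unmodifiableList(new ArrayList<>(elements));
    this.coder = coder;
  }

  List<T> elements() {
    return elements;
  }

  ToyCoder<T> coder() {
    return coder;
  }

  <OutT> ToyCollection<OutT> apply(ToyMap<T, OutT> transform) {
    return transform.expand(this);
  }
}

// A mapping transform that is optionally configurable with an output coder.
final class ToyMap<InT, OutT> {
  private final Function<InT, OutT> fn;
  private final ToyCoder<OutT> outputCoder; // null until configured

  private ToyMap(Function<InT, OutT> fn, ToyCoder<OutT> outputCoder) {
    this.fn = fn;
    this.outputCoder = outputCoder;
  }

  static <InT, OutT> ToyMap<InT, OutT> of(Function<InT, OutT> fn) {
    return new ToyMap<>(fn, null);
  }

  // Returns a new transform; the original stays unconfigured.
  ToyMap<InT, OutT> withOutputCoder(ToyCoder<OutT> coder) {
    return new ToyMap<>(fn, coder);
  }

  // The transform, not the caller, is responsible for the output coder.
  ToyCollection<OutT> expand(ToyCollection<InT> input) {
    if (outputCoder == null) {
      // This toy has no coder inference, so an unset coder is an error.
      throw new IllegalStateException("No output coder; call withOutputCoder()");
    }
    List<OutT> out = new ArrayList<>();
    for (InT element : input.elements()) {
      out.add(fn.apply(element));
    }
    return new ToyCollection<>(out, outputCoder);
  }
}
```

Usage would look like input.apply(ToyMap.<Integer, String>of(i -> "n" + i).withOutputCoder(new ToyStringCoder())). The point of the design is that the returned collection already carries its coder and exposes no way to change it afterwards.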


On Thu, Aug 3, 2017 at 6:27 PM Robert Bradshaw <[email protected]>
wrote:

> On Thu, Aug 3, 2017 at 6:08 PM, Eugene Kirpichov
> <[email protected]> wrote:
> > https://github.com/apache/beam/pull/3649 has landed. The main
> contribution
> > of this PR is deprecating PTransform.getDefaultOutputCoder().
> >
> > Next steps are to get rid of all setCoder() calls in the SDK, and
> deprecate
> > setCoder().
> > Nearly all setCoder() calls (perhaps simply all?) I found are on the
> output
> > of mapping transforms, such as ParDo, Map/FlatMapElements, WithKeys.
> > I think we should simply make these transforms optionally configurable
> with
> > an output coder: e.g. input.apply(ParDo.of(new
> > SomeFn<>()).withOutputCoder(SomeCoder.of()))
> > For multi-output ParDo this is a little more complex API-wise, but doable
> > too.
> >
> > (another minor next step is to say in PTransform Style Guide that the
> > transform must set a coder on all its outputs)
> >
> > Sounds reasonable?
>
> +1
>
> I'd like to do this in a way that lowers the burden for all PTransform
> authors. Can't think of a better way than a special subclass of
> PTransform that has the setters that one would subclass...
>
> > On Thu, Aug 3, 2017 at 5:34 AM Lukasz Cwik <[email protected]>
> wrote:
> >
> >> I'm for (1) and am not sure about the feasibility of (2) without having
> an
> >> escape hatch that allows a pipeline author to specify a coder to handle
> >> their special case.
> >>
> >> On Tue, Aug 1, 2017 at 2:15 PM, Reuven Lax <[email protected]>
> >> wrote:
> >>
> >> > One interesting wrinkle: I'm about to propose a set of semantics for
> >> > snapshotting/in-place updating pipelines. Part of this proposal is the
> >> > ability to write pipelines to "upgrade" snapshots to make them
> compatible
> >> > with new graphs. This relies on the ability to have two separate
> coders
> >> for
> >> > the same type - the old coder and the new coder - in order to handle
> the
> >> > case where the user has changed coders in the new pipeline.
> >> >
> >> > On Tue, Aug 1, 2017 at 2:12 PM, Robert Bradshaw
> >> > <[email protected]
> >> > > wrote:
> >> >
> >> > > There are two concerns in this thread:
> >> > >
> >> > > (1) Getting rid of PCollection.setCoder(). Everyone seems in favor
> of
> >> > this
> >> > > (right?)
> >> > >
> >> > > (2) Deprecating specifying Coders in favor of specifying
> >> TypeDescriptors.
> >> > > I'm generally in favor, but it's unclear how far we can push this
> >> > through.
> >> > >
> >> > > Let's at least do (1), and separately state a preference for (2),
> >> seeing
> >> > > how far we can push it.
> >> > >
> >> > > On Thu, Jul 27, 2017 at 9:13 PM, Kenneth Knowles
> >> <[email protected]
> >> > >
> >> > > wrote:
> >> > >
> >> > > > Another thought on this: setting a custom coder to support a
> special
> >> > data
> >> > > > distribution is likely often a property of the input to the
> pipeline.
> >> > So
> >> > > > setting a coder during pipeline construction - more generally,
> when
> >> > > writing
> >> > > > a composite transform for reuse - you might not actually have the
> >> > needed
> >> > > > information. But setting up a special indicator type descriptor
> lets
> >> > your
> >> > > > users map that type descriptor to a coder that works well for
> their
> >> > data.
> >> > > >
> >> > > > But Robert's example of RawUnionValue seems like a deal breaker
> for
> >> all
> >> > > > approaches. It really requires .getCoder() during expand() and
> >> > explicitly
> >> > > > building coders encoding information that is cumbersome to get
> into a
> >> > > > TypeDescriptor. While making up new type languages is a
> comfortable
> >> > > > activity for me :-) I don't think we should head down that path,
> for
> >> > our
> >> > > > users' sake. So I'll stop hoping we can eliminate this pain point
> for
> >> > > now.
> >> > > >
> >> > > > Kenn
> >> > > >
> >> > > > On Thu, Jul 27, 2017 at 8:48 PM, Kenneth Knowles <[email protected]>
> >> > wrote:
> >> > > >
> >> > > > > On Thu, Jul 27, 2017 at 11:18 AM, Thomas Groh
> >> > <[email protected]
> >> > > >
> >> > > > > wrote:
> >> > > > >
> >> > > > >> introduce a
> >> > > > >> new, specialized type to represent the restricted
> >> > > > >> (alternatively-distributed?) data. The TypeDescriptor for this
> >> type
> >> > > can
> >> > > > >> map
> >> > > > >> to the specialized coder, without having to perform a
> significant
> >> > > degree
> >> > > > >> of
> >> > > > >> potentially wasted encoding work, plus it includes the
> assumptions
> >> > > that
> >> > > > >> are
> >> > > > >> being made about the distribution of data.
> >> > > > >>
> >> > > > >
> >> > > > > This is a very cool idea, in theory :-)
> >> > > > >
> >> > > > > For complex types with a few allocations involved and/or
> nontrivial
> >> > > > > deserialization, or when a pipeline does a lot of real work, I
> >> think
> >> > > the
> >> > > > > wrapper cost won't be perceptible.
> >> > > > >
> >> > > > > But for more primitive types in pipelines that don't really do
> >> much
> >> > > > > computation but just move data around, I think it could matter.
> >> > > Certainly
> >> > > > > there are languages with constructs to allow type wrappers at
> zero
> >> > cost
> >> > > > > (Haskell's `newtype`).
> >> > > > >
> >> > > > > This is all just speculation until we measure, like most of this
> >> > > thread.
> >> > > > >
> >> > > > > Kenn
> >> > > > >
> >> > > > >
> >> > > > >> > On Thu, Jul 27, 2017 at 11:00 AM, Thomas Groh
> >> > > > <[email protected]
> >> > > > >> >
> >> > > > >> > wrote:
> >> > > > >> >
> >> > > > >> > > +1 on getting rid of setCoder; just from a Java SDK
> >> perspective,
> >> > > my
> >> > > > >> ideal
> >> > > > >> > > world contains PCollections which don't have a user-visible
> >> way
> >> > to
> >> > > > >> mutate
> >> > > > >> > > them.
> >> > > > >> > >
> >> > > > >> > > My preference would be to use TypeDescriptors everywhere
> >> within
> >> > > > >> Pipeline
> >> > > > >> > > construction (where possible), and utilize the
> CoderRegistry
> >> > > > >> everywhere
> >> > > > >> > to
> >> > > > >> > > actually extract the appropriate type. The unfortunate
> >> > difficulty
> >> > > of
> >> > > > >> > having
> >> > > > >> > > to encode a union type and the lack of variable-length
> >> generics
> >> > > does
> >> > > > >> > > complicate that. We could consider some way of constructing
> >> > coders
> >> > > > in
> >> > > > >> the
> >> > > > >> > > registry from a collection of type descriptors (which
> should
> >> be
> >> > > > >> > accessible
> >> > > > >> > > from the point the union-type is being constructed), e.g.
> >> > > something
> >> > > > >> like
> >> > > > >> > > `getCoder(TypeDescriptor output, TypeDescriptor...
> >> components)`
> >> > -
> >> > > > that
> >> > > > >> > does
> >> > > > >> > > only permit a single flat level (but since this is being
> >> invoked
> >> > > by
> >> > > > >> the
> >> > > > >> > SDK
> >> > > > >> > > during construction it could also pass Coder...).
> >> > > > >> > >
> >> > > > >> > >
> >> > > > >> > >
> >> > > > >> > > On Thu, Jul 27, 2017 at 10:22 AM, Robert Bradshaw <
> >> > > > >> > > [email protected]> wrote:
> >> > > > >> > >
> >> > > > >> > > > On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles
> >> > > > >> > > > <[email protected]> wrote:
> >> > > > >> > > > > On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik
> >> > > > >> > <[email protected]
> >> > > > >> > > >
> >> > > > >> > > > > wrote:
> >> > > > >> > > > >>
> >> > > > >> > > > >> Ken/Robert, I believe users will want the ability to
> set
> >> > the
> >> > > > >> output
> >> > > > >> > > > coder
> >> > > > >> > > > >> because coders may have intrinsic properties where the
> >> type
> >> > > > isn't
> >> > > > >> > > enough
> >> > > > >> > > > >> information to fully specify what I want as a user.
> Some
> >> > > cases
> >> > > > I
> >> > > > >> can
> >> > > > >> > > see
> >> > > > >> > > > >> are:
> >> > > > >> > > > >> 1) I have a cheap and fast non-deterministic coder
> but a
> >> > > > >> different
> >> > > > >> > > > slower
> >> > > > >> > > > >> coder when I want to use it as the key to a GBK. For
> >> > example
> >> > > > >> with a
> >> > > > >> > > set
> >> > > > >> > > > >> coder, it would need to consistently order the values
> of
> >> > the
> >> > > > set
> >> > > > >> > when
> >> > > > >> > > > used
> >> > > > >> > > > >> as the key.
> >> > > > >> > > > >> 2) I know a property of the data which allows me to
> have
> >> a
> >> > > > >> cheaper
> >> > > > >> > > > >> encoding. Imagine I know that all the strings have a
> >> common
> >> > > > >> prefix
> >> > > > >> > or
> >> > > > >> > > > >> integers that are in a certain range, or that a
> matrix is
> >> > > > >> > > sparse/dense.
> >> > > > >> > > > Not
> >> > > > >> > > > >> all PCollections of strings / integers / matrices in
> the
> >> > > > pipeline
> >> > > > >> > will
> >> > > > >> > > > have
> >> > > > >> > > > >> this property, just some.
> >> > > > >> > > > >> 3) Sorting comes up occasionally, traditionally in
> Google
> >> > > this
> >> > > > >> was
> >> > > > >> > > done
> >> > > > >> > > > by
> >> > > > >> > > > >> sorting the encoded version of the object
> >> lexicographically
> >> > > > >> during a
> >> > > > >> > > > GBK.
> >> > > > >> > > > >> There are good lexicographical byte representations
> for
> >> > ASCII
> >> > > > >> > strings,
> >> > > > >> > > > >> integers, and for some IEEE number representations
> which
> >> > > could
> >> > > > be
> >> > > > >> > done
> >> > > > >> > > > by
> >> > > > >> > > > >> the use of a special coder.
> >> > > > >> > > > >>
> >> > > > >> > > > >
> >> > > > >> > > > > Items (1) and (3) do not require special knowledge from
> >> the
> >> > > > user.
> >> > > > >> > They
> >> > > > >> > > > are
> >> > > > >> > > > > easily observed properties of a pipeline. My proposal
> >> > included
> >> > > > >> full
> >> > > > >> > > > > automation for both. The suggestion is new methods
> >> > > > >> > > > > .getDeterministicCoder(TypeDescriptor) and
> >> > > > >> > > > > .getLexicographicCoder(TypeDescriptor).
> >> > > > >> > > >
> >> > > > >> > > > Completely agree--use cases (1) and (3) are an indirect
> use
> >> of
> >> > > > Coders
> >> > > > >> > > > that are used to achieve an effect that would be better
> >> > > expressed
> >> > > > >> > > > directly.
> >> > > > >> > > >
> >> > > > >> > > > > (2) is an interesting hypothetical for massive scale
> where
> >> > > tiny
> >> > > > >> > > > incremental
> >> > > > >> > > > > optimization represents a lot of cost _and_ your data
> has
> >> > > > >> sufficient
> >> > > > >> > > > > structure to realize a benefit _and_ it needs to be
> >> > pinpointed
> >> > > > to
> >> > > > >> > just
> >> > > > >> > > > some
> >> > > > >> > > > > PCollections. I think our experience with coders so
> far is
> >> > > that
> >> > > > >> their
> >> > > > >> > > > > existence is almost entirely negative. It would be
> nice to
> >> > > > support
> >> > > > >> > this
> >> > > > >> > > > > vanishingly rare case without inflicting a terrible
> pain
> >> > point
> >> > > > on
> >> > > > >> the
> >> > > > >> > > > model
> >> > > > >> > > > > and all other users.
> >> > > > >> > > >
> >> > > > >> > > > (2) is not just about cheapness, sometimes there's other
> >> > > structure
> >> > > > >> in
> >> > > > >> > > > the data we can leverage. Consider the UnionCoder used in
> >> > > > >> > > > CoGBK--RawUnionValue has an integer value that
> >> > > indicates
> >> > > > >> the
> >> > > > >> > > > type of its raw Object field. Unless we want to extend
> the
> >> > type
> >> > > > >> > > > language, there's not a sufficient type descriptor that
> can
> >> be
> >> > > > used
> >> > > > >> to
> >> > > > >> > > > infer the coder. I'm dubious that going down the road of
> adding
> >> > > special
> >> > > > >> > > > cases is the right thing here.
> >> > > > >> > > >
> >> > > > >> > > > > For example, in those cases you could encode in your
> >> > > > >> > > > > DoFn so the type descriptor would just be byte[].
> >> > > > >> > > >
> >> > > > >> > > > As well as being an extremely cumbersome API, this would
> >> incur
> >> > > the
> >> > > > >> > > > cost of coding/decoding at that DoFn boundary even if it
> is
> >> > > fused
> >> > > > >> > > > away.
> >> > > > >> > > >
> >> > > > >> > > > >> On Thu, Jul 27, 2017 at 1:34 AM, Jean-Baptiste Onofré
> <
> >> > > > >> > > [email protected]>
> >> > > > >> > > > >> wrote:
> >> > > > >> > > > >>
> >> > > > >> > > > >> > Hi,
> >> > > > >> > > > >> >
> >> > > > >> > > > >> > That's an interesting thread and I was wondering about the
> >> > > > >> relationship
> >> > > > >> > > > between
> >> > > > >> > > > >> > type descriptor and coder for a while ;)
> >> > > > >> > > > >> >
> >> > > > >> > > > >> > Today, in a PCollection, we can set the coder and we
> >> also
> >> > > > have
> >> > > > >> a
> >> > > > >> > > > >> > getTypeDescriptor(). It sounds weird to me: it
> should
> >> be
> >> > > one
> >> > > > or
> >> > > > >> > the
> >> > > > >> > > > >> other.
> >> > > > >> > > > >> >
> >> > > > >> > > > >> > Basically, if the Coder is not used to define the
> type,
> >> > > > then, I
> >> > > > >> > > fully
> >> > > > >> > > > >> > agree with Eugene.
> >> > > > >> > > > >> >
> >> > > > >> > > > >> > Basically, the PCollection should define only the
> type
> >> > > > >> descriptor,
> >> > > > >> > > not
> >> > > > >> > > > >> the
> >> > > > >> > > > >> > coder by itself: the coder can be found using the
> type
> >> > > > >> descriptor.
> >> > > > >> > > > >> >
> >> > > > >> > > > >> > With both coder and type descriptor on the
> PCollection,
> >> > it
> >> > > > >> sounds
> >> > > > >> > a
> >> > > > >> > > > bit
> >> > > > >> > > > >> > "decoupled" to me and it would be possible to have a
> >> > coder
> >> > > on
> >> > > > >> the
> >> > > > >> > > > >> > PCollection that doesn't match the type descriptor.
> >> > > > >> > > > >> >
> >> > > > >> > > > >> > I think PCollection type descriptor should be
> defined,
> >> > and
> >> > > > the
> >> > > > >> > coder
> >> > > > >> > > > >> > should be implicit based on this type descriptor.
> >> > > > >> > > > >> >
> >> > > > >> > > > >> > Thoughts ?
> >> > > > >> > > > >> >
> >> > > > >> > > > >> > Regards
> >> > > > >> > > > >> > JB
> >> > > > >> > > > >> >
> >> > > > >> > > > >> >
> >> > > > >> > > > >> > On 07/26/2017 05:25 AM, Eugene Kirpichov wrote:
> >> > > > >> > > > >> >
> >> > > > >> > > > >> >> Hello,
> >> > > > >> > > > >> >>
> >> > > > >> > > > >> >> I've worked on a few different things recently and
> ran
> >> > > > >> repeatedly
> >> > > > >> > > > into
> >> > > > >> > > > >> the
> >> > > > >> > > > >> >> same issue: that we do not have clear guidance on
> who
> >> > > should
> >> > > > >> set
> >> > > > >> > > the
> >> > > > >> > > > >> Coder
> >> > > > >> > > > >> >> on a PCollection: is it responsibility of the
> >> PTransform
> >> > > > that
> >> > > > >> > > outputs
> >> > > > >> > > > >> it,
> >> > > > >> > > > >> >> or is it responsibility of the user, or is it
> >> sometimes
> >> > > one
> >> > > > >> and
> >> > > > >> > > > >> sometimes
> >> > > > >> > > > >> >> the other?
> >> > > > >> > > > >> >>
> >> > > > >> > > > >> >> I believe that the answer is "it's responsibility
> of
> >> the
> >> > > > >> > transform"
> >> > > > >> > > > and
> >> > > > >> > > > >> >> moreover that ideally PCollection.setCoder()
> should
> >> not
> >> > > > >> exist.
> >> > > > >> > > > Instead:
> >> > > > >> > > > >> >>
> >> > > > >> > > > >> >> - Require that all transforms set a Coder on the
> >> > > > PCollections
> >> > > > >> > they
> >> > > > >> > > > >> >> produce
> >> > > > >> > > > >> >> - i.e. it should never be responsibility of the
> user
> >> to
> >> > > "fix
> >> > > > >> up"
> >> > > > >> > a
> >> > > > >> > > > coder
> >> > > > >> > > > >> >> on
> >> > > > >> > > > >> >> a PCollection produced by a transform.
> >> > > > >> > > > >> >>
> >> > > > >> > > > >> >> - Since all transforms are composed of primitive
> >> > > transforms,
> >> > > > >> > saying
> >> > > > >> > > > >> >> "transforms must set a Coder" means simply that all
> >> > > > >> *primitive*
> >> > > > >> > > > >> transforms
> >> > > > >> > > > >> >> must set a Coder on their output.
> >> > > > >> > > > >> >>
> >> > > > >> > > > >> >> - In some cases, a primitive PTransform currently
> >> > doesn't
> >> > > > have
> >> > > > >> > > enough
> >> > > > >> > > > >> >> information to infer a coder for its output
> >> collection -
> >> > > > e.g.
> >> > > > >> > > > >> >> ParDo.of(DoFn<InputT, OutputT>) might be unable to
> >> > infer a
> >> > > > >> coder
> >> > > > >> > > for
> >> > > > >> > > > >> >> OutputT. In that case such transforms should allow
> the
> >> > > user
> >> > > > to
> >> > > > >> > > > provide a
> >> > > > >> > > > >> >> coder: ParDo.of(DoFn).withOutputCoder(...) [note
> that
> >> > > this
> >> > > > >> > differs
> >> > > > >> > > > from
> >> > > > >> > > > >> >> requiring the user to set a coder on the resulting
> >> > > > collection]
> >> > > > >> > > > >> >>
> >> > > > >> > > > >> >> - Corollary: composite transforms need to only
> >> configure
> >> > > > their
> >> > > > >> > > > primitive
> >> > > > >> > > > >> >> transforms (and composite sub-transforms) properly,
> >> and
> >> > > give
> >> > > > >> > them a
> >> > > > >> > > > >> Coder
> >> > > > >> > > > >> >> if needed.
> >> > > > >> > > > >> >>
> >> > > > >> > > > >> >> - Corollary: a PTransform with type parameters
> <FooT,
> >> > > BarT,
> >> > > > >> ...>
> >> > > > >> > > > needs
> >> > > > >> > > > >> to
> >> > > > >> > > > >> >> be configurable with coders for all of these,
> because
> >> > the
> >> > > > >> > > > implementation
> >> > > > >> > > > >> >> of
> >> > > > >> > > > >> >> the transform may change and it may introduce
> >> > intermediate
> >> > > > >> > > > collections
> >> > > > >> > > > >> >> involving these types. However, in many cases,
> some of
> >> > > these
> >> > > > >> type
> >> > > > >> > > > >> >> parameters appear in the type of the transform's
> >> input,
> >> > > > e.g. a
> >> > > > >> > > > >> >> PTransform<PCollection<KV<FooT, BarT>>,
> >> > > PCollection<MooT>>
> >> > > > >> will
> >> > > > >> > > > always
> >> > > > >> > > > >> be
> >> > > > >> > > > >> >> able to extract the coders for FooT and BarT from
> the
> >> > > input
> >> > > > >> > > > PCollection,
> >> > > > >> > > > >> >> so
> >> > > > >> > > > >> >> the user does not need to provide them. However, a
> >> coder
> >> > > for
> >> > > > >> MooT
> >> > > > >> > > > must
> >> > > > >> > > > >> be
> >> > > > >> > > > >> >> provided. I think in most cases the transform
> needs to
> >> > be
> >> > > > >> > > > configurable
> >> > > > >> > > > >> >> only
> >> > > > >> > > > >> >> with coders for its output.
> >> > > > >> > > > >> >>
> >> > > > >> > > > >> >> Here's a smooth migration path to accomplish the
> >> above:
> >> > > > >> > > > >> >> - Make PCollection.createPrimitiveOutputInternal()
> >> > take a
> >> > > > >> Coder.
> >> > > > >> > > > >> >> - Make all primitive transforms optionally
> >> configurable
> >> > > > with a
> >> > > > >> > > coder
> >> > > > >> > > > for
> >> > > > >> > > > >> >> their outputs, such as ParDo.of(DoFn).
> >> > withOutputCoder().
> >> > > > >> > > > >> >> - By using the above, make all composite transforms
> >> > > shipped
> >> > > > >> with
> >> > > > >> > > the
> >> > > > >> > > > SDK
> >> > > > >> > > > >> >> set a Coder on the collections they produce; in
> some
> >> > > cases,
> >> > > > >> this
> >> > > > >> > > will
> >> > > > >> > > > >> >> require adding a withSomethingCoder() option to the
> >> > > > transform
> >> > > > >> and
> >> > > > >> > > > >> >> propagating that coder to its sub-transforms. If
> the
> >> > > option
> >> > > > is
> >> > > > >> > > unset,
> >> > > > >> > > > >> >> that's fine for now.
> >> > > > >> > > > >> >> - As a result of the above, get rid of all
> setCoder()
> >> > > calls
> >> > > > in
> >> > > > >> > the
> >> > > > >> > > > Beam
> >> > > > >> > > > >> >> repo. The call will still be there, but it will
> just
> >> not
> >> > > be
> >> > > > >> used
> >> > > > >> > > > >> anywhere
> >> > > > >> > > > >> >> in the SDK or examples, and we can mark it
> deprecated.
> >> > > > >> > > > >> >> - Add guidance to PTransform Style Guide in line
> with
> >> > the
> >> > > > >> above
> >> > > > >> > > > >> >>
> >> > > > >> > > > >> >> Does this sound like a good idea? I'm not sure how
> >> > urgent
> >> > > it
> >> > > > >> > would
> >> > > > >> > > > be to
> >> > > > >> > > > >> >> actually do this, but I'd like to know whether
> people
> >> > > agree
> >> > > > >> that
> >> > > > >> > > this
> >> > > > >> > > > >> is a
> >> > > > >> > > > >> >> good goal in general.
> >> > > > >> > > > >> >>
> >> > > > >> > > > >> >>
> >> > > > >> > > > >> > --
> >> > > > >> > > > >> > Jean-Baptiste Onofré
> >> > > > >> > > > >> > [email protected]
> >> > > > >> > > > >> > http://blog.nanthrax.net
> >> > > > >> > > > >> > Talend - http://www.talend.com
> >> > > > >> > > > >> >
> >> > > > >> > > > >>
> >> > > > >> > > >
> >> > > > >> > >
> >> > > > >> >
> >> > > > >>
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>
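
Case (1) in Lukasz's message above (a set used as a GBK key needs its values ordered consistently) can be illustrated without any Beam dependency. The sketch below is plain Java; SortedStringSetEncoder is a hypothetical helper invented for this note, not an SDK class, and the byte layout (element count followed by each string) is an arbitrary choice for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the Beam SDK.
final class SortedStringSetEncoder {
  private SortedStringSetEncoder() {}

  // Encodes a set as: element count, then each element in sorted order.
  // Sorting first makes the bytes independent of the set's iteration order,
  // which is the property needed when encoded bytes are compared as keys.
  static byte[] encode(Set<String> values) {
    List<String> sorted = new ArrayList<>(values);
    Collections.sort(sorted);
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(sorted.size());
      for (String s : sorted) {
        out.writeUTF(s);
      }
    } catch (IOException e) {
      // Not expected for an in-memory stream; rethrow unchecked.
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }
}
```

The extra sort is the cost Lukasz refers to: a coder that simply wrote elements in iteration order would be cheaper, but two equal sets could then encode to different bytes.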
