Re: Requiring PTransform to set a coder on its resulting collections

Robert Bradshaw Fri, 11 Aug 2017 21:46:38 -0700

On Thu, Aug 10, 2017 at 5:06 PM, Reuven Lax <re...@google.com.invalid> wrote:
> Interestingly I've seen examples of PTransforms where the transform itself
> is unable to easily set its own coder. This happens when the transform is
> parametrized in such a way that its output coder is not determinable except
> by the caller of the PTransform. The caller can of course pass a coder into
> the constructor of the PTransform, but that's not any cleaner than simply
> calling setCoder on the output.


The argument is that having a setCoder on the output is in fact more
problematic than passing a coder to the PTransform (via the
constructor, or a builder-style method). Passing it in does introduce
the ugliness that each such PTransform must manually provide this
capability, though; it'd be nice to reduce this boilerplate.
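
To make the builder-style option concrete, here is a rough sketch. It
deliberately uses toy stand-ins rather than Beam classes: Coder,
ToyCollection, MapTransform and CoderSketch are hypothetical names made
up for illustration and are not the SDK's API. The point it tries to
show is only that the output collection has no setter, so the coder can
only arrive through the transform's configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

public class CoderSketch {
    /** Toy stand-in for a coder (assumption: it only carries a name here). */
    interface Coder<T> {
        String name();
    }

    /** Toy immutable collection: the coder is fixed at construction, there is no setCoder(). */
    static final class ToyCollection<T> {
        private final List<T> elements;
        private final Coder<T> coder;

        ToyCollection(List<T> elements, Coder<T> coder) {
            this.elements = Collections.unmodifiableList(new ArrayList<>(elements));
            this.coder = Objects.requireNonNull(coder);
        }

        List<T> getElements() { return elements; }

        Coder<T> getCoder() { return coder; }

        <O> ToyCollection<O> apply(MapTransform<T, O> transform) {
            return transform.expand(this);
        }
    }

    /** Toy mapping transform that is configured with its output coder, builder-style. */
    static final class MapTransform<I, O> {
        private final Function<I, O> fn;
        private final Coder<O> outputCoder; // null until withOutputCoder() is called

        private MapTransform(Function<I, O> fn, Coder<O> outputCoder) {
            this.fn = fn;
            this.outputCoder = outputCoder;
        }

        static <I, O> MapTransform<I, O> of(Function<I, O> fn) {
            return new MapTransform<>(Objects.requireNonNull(fn), null);
        }

        /** Returns a new transform; the original is left untouched. */
        MapTransform<I, O> withOutputCoder(Coder<O> coder) {
            return new MapTransform<>(fn, Objects.requireNonNull(coder));
        }

        /** The transform, not the caller, puts the coder on its output. */
        ToyCollection<O> expand(ToyCollection<I> input) {
            if (outputCoder == null) {
                throw new IllegalStateException("no output coder configured on the transform");
            }
            List<O> out = new ArrayList<>();
            for (I element : input.getElements()) {
                out.add(fn.apply(element));
            }
            return new ToyCollection<>(out, outputCoder);
        }
    }

    public static void main(String[] args) {
        Coder<String> stringCoder = () -> "StringCoder";
        Coder<Integer> intCoder = () -> "IntCoder";

        ToyCollection<String> words =
            new ToyCollection<>(Arrays.asList("a", "bb", "ccc"), stringCoder);
        ToyCollection<Integer> lengths =
            words.apply(MapTransform.<String, Integer>of(String::length).withOutputCoder(intCoder));

        System.out.println(lengths.getElements() + " coded with " + lengths.getCoder().name());
    }
}
```

The boilerplate complaint shows up even in this toy: every transform
that wants the option has to carry its own field and its own
withOutputCoder() method.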

> On Thu, Aug 10, 2017 at 4:57 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
>> I've updated the guidance in PTransform Style Guide on setting coders
>> https://beam.apache.org/contribute/ptransform-style-guide/#coders
>> according to this discussion.
>> https://github.com/apache/beam-site/pull/279
>>
>> On Thu, Aug 3, 2017 at 6:27 PM Robert Bradshaw <rober...@google.com.invalid
>> >
>> wrote:
>>
>> > On Thu, Aug 3, 2017 at 6:08 PM, Eugene Kirpichov
>> > <kirpic...@google.com.invalid> wrote:
>> > > https://github.com/apache/beam/pull/3649 has landed. The main
>> > contribution
>> > > of this PR is deprecating PTransform.getDefaultOutputCoder().
>> > >
>> > > Next steps are to get rid of all setCoder() calls in the SDK, and
>> > deprecate
>> > > setCoder().
>> > > Nearly all setCoder() calls (perhaps simply all?) I found are on the
>> > output
>> > > of mapping transforms, such as ParDo, Map/FlatMapElements, WithKeys.
>> > > I think we should simply make these transforms optionally configurable
>> > with
>> > > an output coder: e.g. input.apply(ParDo.of(new
>> > > SomeFn<>()).withOutputCoder(SomeCoder.of()))
>> > > For multi-output ParDo this is a little more complex API-wise, but
>> doable
>> > > too.
>> > >
>> > > (another minor next step is to say in PTransform Style Guide that the
>> > > transform must set a coder on all its outputs)
>> > >
>> > > Sounds reasonable?
>> >
>> > +1
>> >
>> > I'd like to do this in a way that lowers the burden for all PTransform
>> > authors. Can't think of a better way than a special subclass of
>> > PTransform that has the setters that one would subclass...
>> >
>> > > On Thu, Aug 3, 2017 at 5:34 AM Lukasz Cwik <lc...@google.com.invalid>
>> > wrote:
>> > >
>> > >> I'm for (1) and am not sure about the feasibility of (2) without
>> having
>> > an
>> > >> escape hatch that allows a pipeline author to specify a coder to
>> handle
>> > >> their special case.
>> > >>
>> > >> On Tue, Aug 1, 2017 at 2:15 PM, Reuven Lax <re...@google.com.invalid>
>> > >> wrote:
>> > >>
>> > >> > One interesting wrinkle: I'm about to propose a set of semantics for
>> > >> > snapshotting/in-place updating pipelines. Part of this proposal is
>> the
>> > >> > ability to write pipelines to "upgrade" snapshots to make them
>> > compatible
>> > >> > with new graphs. This relies on the ability to have two separate
>> > coders
>> > >> for
>> > >> > the same type - the old coder and the new coder - in order to handle
>> > the
>> > >> > case where the user has changed coders in the new pipeline.
>> > >> >
>> > >> > On Tue, Aug 1, 2017 at 2:12 PM, Robert Bradshaw
>> > >> > <rober...@google.com.invalid
>> > >> > > wrote:
>> > >> >
>> > >> > > There are two concerns in this thread:
>> > >> > >
>> > >> > > (1) Getting rid of PCollection.setCoder(). Everyone seems in favor
>> > of
>> > >> > this
>> > >> > > (right?)
>> > >> > >
>> > >> > > (2) Deprecating specifying Coders in favor of specifying
>> > >> TypeDescriptors.
>> > >> > > I'm generally in favor, but it's unclear how far we can push this
>> > >> > through.
>> > >> > >
>> > >> > > Let's at least do (1), and separately state a preference for (2),
>> > >> seeing
>> > >> > > how far we can push it.
>> > >> > >
>> > >> > > On Thu, Jul 27, 2017 at 9:13 PM, Kenneth Knowles
>> > >> <k...@google.com.invalid
>> > >> > >
>> > >> > > wrote:
>> > >> > >
>> > >> > > > Another thought on this: setting a custom coder to support a
>> > special
>> > >> > data
>> > >> > > > distribution is likely often a property of the input to the
>> > pipeline.
>> > >> > So
>> > >> > > > setting a coder during pipeline construction - more generally,
>> > when
>> > >> > > writing
>> > >> > > > a composite transform for reuse - you might not actually have
>> the
>> > >> > needed
>> > >> > > > information. But setting up a special indicator type descriptor
>> > lets
>> > >> > your
>> > >> > > > users map that type descriptor to a coder that works well for
>> > their
>> > >> > data.
>> > >> > > >
>> > >> > > > But Robert's example of RawUnionValue seems like a deal breaker
>> > for
>> > >> all
>> > >> > > > approaches. It really requires .getCoder() during expand() and
>> > >> > explicitly
>> > >> > > > building coders encoding information that is cumbersome to get
>> > into a
>> > >> > > > TypeDescriptor. While making up new type languages is a
>> > comfortable
>> > >> > > > activity for me :-) I don't think we should head down that path,
>> > for
>> > >> > our
>> > >> > > > users' sake. So I'll stop hoping we can eliminate this pain
>> point
>> > for
>> > >> > > now.
>> > >> > > >
>> > >> > > > Kenn
>> > >> > > >
>> > >> > > > On Thu, Jul 27, 2017 at 8:48 PM, Kenneth Knowles <
>> k...@google.com>
>> > >> > wrote:
>> > >> > > >
>> > >> > > > > On Thu, Jul 27, 2017 at 11:18 AM, Thomas Groh
>> > >> > <tg...@google.com.invalid
>> > >> > > >
>> > >> > > > > wrote:
>> > >> > > > >
>> > >> > > > >> introduce a
>> > >> > > > >> new, specialized type to represent the restricted
>> > >> > > > >> (alternatively-distributed?) data. The TypeDescriptor for
>> this
>> > >> type
>> > >> > > can
>> > >> > > > >> map
>> > >> > > > >> to the specialized coder, without having to perform a
>> > significant
>> > >> > > degree
>> > >> > > > >> of
>> > >> > > > >> potentially wasted encoding work, plus it includes the
>> > assumptions
>> > >> > > that
>> > >> > > > >> are
>> > >> > > > >> being made about the distribution of data.
>> > >> > > > >>
>> > >> > > > >
>> > >> > > > > This is a very cool idea, in theory :-)
>> > >> > > > >
>> > >> > > > > For complex types with a few allocations involved and/or
>> > nontrivial
>> > >> > > > > deserialization, or when a pipeline does a lot of real work, I
>> > >> think
>> > >> > > the
>> > >> > > > > wrapper cost won't be perceptible.
>> > >> > > > >
>> > >> > > > > But for more primitive types in pipelines that don't really
>> do
>> > >> much
>> > >> > > > > computation but just move data around, I think it could
>> matter.
>> > >> > > Certainly
>> > >> > > > > there are languages with constructs to allow type wrappers at
>> > zero
>> > >> > cost
>> > >> > > > > (Haskell's `newtype`).
>> > >> > > > >
>> > >> > > > > This is all just speculation until we measure, like most of
>> this
>> > >> > > thread.
>> > >> > > > >
>> > >> > > > > Kenn
>> > >> > > > >
>> > >> > > > >
>> > >> > > > >> > On Thu, Jul 27, 2017 at 11:00 AM, Thomas Groh
>> > >> > > > <tg...@google.com.invalid
>> > >> > > > >> >
>> > >> > > > >> > wrote:
>> > >> > > > >> >
>> > >> > > > >> > > +1 on getting rid of setCoder; just from a Java SDK
>> > >> perspective,
>> > >> > > my
>> > >> > > > >> ideal
>> > >> > > > >> > > world contains PCollections which don't have a
>> user-visible
>> > >> way
>> > >> > to
>> > >> > > > >> mutate
>> > >> > > > >> > > them.
>> > >> > > > >> > >
>> > >> > > > >> > > My preference would be to use TypeDescriptors everywhere
>> > >> within
>> > >> > > > >> Pipeline
>> > >> > > > >> > > construction (where possible), and utilize the
>> > CoderRegistry
>> > >> > > > >> everywhere
>> > >> > > > >> > to
>> > >> > > > >> > > actually extract the appropriate type. The unfortunate
>> > >> > difficulty
>> > >> > > of
>> > >> > > > >> > having
>> > >> > > > >> > > to encode a union type and the lack of variable-length
>> > >> generics
>> > >> > > does
>> > >> > > > >> > > complicate that. We could consider some way of
>> constructing
>> > >> > coders
>> > >> > > > in
>> > >> > > > >> the
>> > >> > > > >> > > registry from a collection of type descriptors (which
>> > should
>> > >> be
>> > >> > > > >> > accessible
>> > >> > > > >> > > from the point the union-type is being constructed), e.g.
>> > >> > > something
>> > >> > > > >> like
>> > >> > > > >> > > `getCoder(TypeDescriptor output, TypeDescriptor...
>> > >> components)`
>> > >> > -
>> > >> > > > that
>> > >> > > > >> > does
>> > >> > > > >> > > only permit a single flat level (but since this is being
>> > >> invoked
>> > >> > > by
>> > >> > > > >> the
>> > >> > > > >> > SDK
>> > >> > > > >> > > during construction it could also pass Coder...).
>> > >> > > > >> > >
>> > >> > > > >> > >
>> > >> > > > >> > >
>> > >> > > > >> > > On Thu, Jul 27, 2017 at 10:22 AM, Robert Bradshaw <
>> > >> > > > >> > > rober...@google.com.invalid> wrote:
>> > >> > > > >> > >
>> > >> > > > >> > > > On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles
>> > >> > > > >> > > > <k...@google.com.invalid> wrote:
>> > >> > > > >> > > > > On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik
>> > >> > > > >> > <lc...@google.com.invalid
>> > >> > > > >> > > >
>> > >> > > > >> > > > > wrote:
>> > >> > > > >> > > > >>
>> > >> > > > >> > > > >> Ken/Robert, I believe users will want the ability to
>> > set
>> > >> > the
>> > >> > > > >> output
>> > >> > > > >> > > > coder
>> > >> > > > >> > > > >> because coders may have intrinsic properties where
>> the
>> > >> type
>> > >> > > > isn't
>> > >> > > > >> > > enough
>> > >> > > > >> > > > >> information to fully specify what I want as a user.
>> > Some
>> > >> > > cases
>> > >> > > > I
>> > >> > > > >> can
>> > >> > > > >> > > see
>> > >> > > > >> > > > >> are:
>> > >> > > > >> > > > >> 1) I have a cheap and fast non-deterministic coder
>> > but a
>> > >> > > > >> different
>> > >> > > > >> > > > slower
>> > >> > > > >> > > > >> coder when I want to use it as the key to a GBK. For
>> > >> > example
>> > >> > > > >> with a
>> > >> > > > >> > > set
>> > >> > > > >> > > > >> coder, it would need to consistently order the
>> values
>> > of
>> > >> > the
>> > >> > > > set
>> > >> > > > >> > when
>> > >> > > > >> > > > used
>> > >> > > > >> > > > >> as the key.
>> > >> > > > >> > > > >> 2) I know a property of the data which allows me to
>> > have
>> > >> a
>> > >> > > > >> cheaper
>> > >> > > > >> > > > >> encoding. Imagine I know that all the strings have a
>> > >> common
>> > >> > > > >> prefix
>> > >> > > > >> > or
>> > >> > > > >> > > > >> integers that are in a certain range, or that a
>> > matrix is
>> > >> > > > >> > > sparse/dense.
>> > >> > > > >> > > > Not
>> > >> > > > >> > > > >> all PCollections of strings / integers / matrices in
>> > the
>> > >> > > > pipeline
>> > >> > > > >> > will
>> > >> > > > >> > > > have
>> > >> > > > >> > > > >> this property, just some.
>> > >> > > > >> > > > >> 3) Sorting comes up occasionally, traditionally in
>> > Google
>> > >> > > this
>> > >> > > > >> was
>> > >> > > > >> > > done
>> > >> > > > >> > > > by
>> > >> > > > >> > > > >> sorting the encoded version of the object
>> > >> lexicographically
>> > >> > > > >> during a
>> > >> > > > >> > > > GBK.
>> > >> > > > >> > > > >> There are good lexicographical byte representations
>> > for
>> > >> > ASCII
>> > >> > > > >> > strings,
>> > >> > > > >> > > > >> integers, and for some IEEE number representations
>> > which
>> > >> > > could
>> > >> > > > be
>> > >> > > > >> > done
>> > >> > > > >> > > > by
>> > >> > > > >> > > > >> the use of a special coder.
>> > >> > > > >> > > > >>
>> > >> > > > >> > > > >
>> > >> > > > >> > > > > Items (1) and (3) do not require special knowledge
>> from
>> > >> the
>> > >> > > > user.
>> > >> > > > >> > They
>> > >> > > > >> > > > are
>> > >> > > > >> > > > > easily observed properties of a pipeline. My proposal
>> > >> > included
>> > >> > > > >> full
>> > >> > > > >> > > > > automation for both. The suggestion is new methods
>> > >> > > > >> > > > > .getDeterministicCoder(TypeDescriptor) and
>> > >> > > > >> > > > > .getLexicographicCoder(TypeDescriptor).
>> > >> > > > >> > > >
>> > >> > > > >> > > > Completely agree--use cases (1) and (3) are an indirect
>> > use
>> > >> of
>> > >> > > > Coders
>> > >> > > > >> > > > that are used to achieve an effect that would be better
>> > >> > > expressed
>> > >> > > > >> > > > directly.
>> > >> > > > >> > > >
>> > >> > > > >> > > > > (2) is an interesting hypothetical for massive scale
>> > where
>> > >> > > tiny
>> > >> > > > >> > > > incremental
>> > >> > > > >> > > > > optimization represents a lot of cost _and_ your data
>> > has
>> > >> > > > >> sufficient
>> > >> > > > >> > > > > structure to realize a benefit _and_ it needs to be
>> > >> > pinpointed
>> > >> > > > to
>> > >> > > > >> > just
>> > >> > > > >> > > > some
>> > >> > > > >> > > > > PCollections. I think our experience with coders so
>> > far is
>> > >> > > that
>> > >> > > > >> their
>> > >> > > > >> > > > > existence is almost entirely negative. It would be
>> > nice to
>> > >> > > > support
>> > >> > > > >> > this
>> > >> > > > >> > > > > vanishingly rare case without inflicting a terrible
>> > pain
>> > >> > point
>> > >> > > > on
>> > >> > > > >> the
>> > >> > > > >> > > > model
>> > >> > > > >> > > > > and all other users.
>> > >> > > > >> > > >
>> > >> > > > >> > > > (2) is not just about cheapness, sometimes there's
>> other
>> > >> > > structure
>> > >> > > > >> in
>> > >> > > > >> > > > the data we can leverage. Consider the UnionCoder used
>> in
>> > >> > > > >> > > > CoGBK--RawUnionValue has an integer value that indicates the
>> > >> > > > >> > > > type of its raw Object field. Unless we want to extend
>> > the
>> > >> > type
>> > >> > > > >> > > > language, there's not a sufficient type descriptor that
>> > can
>> > >> be
>> > >> > > > used
>> > >> > > > >> to
>> > >> > > > >> > > > infer the coder. I'm dubious going down the road of
>> > adding
>> > >> > > special
>> > >> > > > >> > > > cases is the right thing here.
>> > >> > > > >> > > >
>> > >> > > > >> > > > > For example, in those cases you could encode in your
>> > >> > > > >> > > > > DoFn so the type descriptor would just be byte[].
>> > >> > > > >> > > >
>> > >> > > > >> > > > As well as being an extremely cumbersome API, this
>> would
>> > >> incur
>> > >> > > the
>> > >> > > > >> > > > cost of coding/decoding at that DoFn boundary even if
>> it
>> > is
>> > >> > > fused
>> > >> > > > >> > > > away.
>> > >> > > > >> > > >
>> > >> > > > >> > > > >> On Thu, Jul 27, 2017 at 1:34 AM, Jean-Baptiste
>> Onofré
>> > <
>> > >> > > > >> > > j...@nanthrax.net>
>> > >> > > > >> > > > >> wrote:
>> > >> > > > >> > > > >>
>> > >> > > > >> > > > >> > Hi,
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> > That's an interesting thread and I was wondering
>> the
>> > >> > > > >> relationship
>> > >> > > > >> > > > between
>> > >> > > > >> > > > >> > type descriptor and coder for a while ;)
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> > Today, in a PCollection, we can set the coder and
>> we
>> > >> also
>> > >> > > > have
>> > >> > > > >> a
>> > >> > > > >> > > > >> > getTypeDescriptor(). It sounds weird to me: it
>> > should
>> > >> be
>> > >> > > one
>> > >> > > > or
>> > >> > > > >> > the
>> > >> > > > >> > > > >> other.
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> > Basically, if the Coder is not used to define the type, then I fully
>> > >> > > > >> > > > >> > agree with Eugene.
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> > Basically, the PCollection should define only the
>> > type
>> > >> > > > >> descriptor,
>> > >> > > > >> > > not
>> > >> > > > >> > > > >> the
>> > >> > > > >> > > > >> > coder by itself: the coder can be found using the
>> > type
>> > >> > > > >> descriptor.
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> > With both coder and type descriptor on the PCollection, it sounds a bit
>> > >> > > > >> > > > >> > "decoupled" to me, and it would be possible to have a coder on the
>> > >> > > > >> > > > >> > PCollection that doesn't match the type descriptor.
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> > I think PCollection type descriptor should be
>> > defined,
>> > >> > and
>> > >> > > > the
>> > >> > > > >> > coder
>> > >> > > > >> > > > >> > should be implicit based on this type descriptor.
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> > Thoughts ?
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> > Regards
>> > >> > > > >> > > > >> > JB
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> > On 07/26/2017 05:25 AM, Eugene Kirpichov wrote:
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> >> Hello,
>> > >> > > > >> > > > >> >>
>> > >> > > > >> > > > >> >> I've worked on a few different things recently
>> and
>> > ran
>> > >> > > > >> repeatedly
>> > >> > > > >> > > > into
>> > >> > > > >> > > > >> the
>> > >> > > > >> > > > >> >> same issue: that we do not have clear guidance on
>> > who
>> > >> > > should
>> > >> > > > >> set
>> > >> > > > >> > > the
>> > >> > > > >> > > > >> Coder
>> > >> > > > >> > > > >> >> on a PCollection: is it responsibility of the
>> > >> PTransform
>> > >> > > > that
>> > >> > > > >> > > outputs
>> > >> > > > >> > > > >> it,
>> > >> > > > >> > > > >> >> or is it responsibility of the user, or is it
>> > >> sometimes
>> > >> > > one
>> > >> > > > >> and
>> > >> > > > >> > > > >> sometimes
>> > >> > > > >> > > > >> >> the other?
>> > >> > > > >> > > > >> >>
>> > >> > > > >> > > > >> >> I believe that the answer is "it's responsibility
>> > of
>> > >> the
>> > >> > > > >> > transform"
>> > >> > > > >> > > > and
>> > >> > > > >> > > > >> >> moreover that  ideally PCollection.setCoder()
>> > should
>> > >> not
>> > >> > > > >> exist.
>> > >> > > > >> > > > Instead:
>> > >> > > > >> > > > >> >>
>> > >> > > > >> > > > >> >> - Require that all transforms set a Coder on the PCollections they produce
>> > >> > > > >> > > > >> >> - i.e. it should never be responsibility of the
>> > user
>> > >> to
>> > >> > > "fix
>> > >> > > > >> up"
>> > >> > > > >> > a
>> > >> > > > >> > > > coder
>> > >> > > > >> > > > >> >> on
>> > >> > > > >> > > > >> >> a PCollection produced by a transform.
>> > >> > > > >> > > > >> >>
>> > >> > > > >> > > > >> >> - Since all transforms are composed of primitive
>> > >> > > transforms,
>> > >> > > > >> > saying
>> > >> > > > >> > > > >> >> "transforms must set a Coder" means simply that
>> all
>> > >> > > > >> *primitive*
>> > >> > > > >> > > > >> transforms
>> > >> > > > >> > > > >> >> must set a Coder on their output.
>> > >> > > > >> > > > >> >>
>> > >> > > > >> > > > >> >> - In some cases, a primitive PTransform currently
>> > >> > doesn't
>> > >> > > > have
>> > >> > > > >> > > enough
>> > >> > > > >> > > > >> >> information to infer a coder for its output
>> > >> collection -
>> > >> > > > e.g.
>> > >> > > > >> > > > >> >> ParDo.of(DoFn<InputT, OutputT>) might be unable
>> to
>> > >> > infer a
>> > >> > > > >> coder
>> > >> > > > >> > > for
>> > >> > > > >> > > > >> >> OutputT. In that case such transforms should
>> allow
>> > the
>> > >> > > user
>> > >> > > > to
>> > >> > > > >> > > > provide a
>> > >> > > > >> > > > >> >> coder: ParDo.of(DoFn).withOutputCoder(...) [note
>> > that
>> > >> > > this
>> > >> > > > >> > differs
>> > >> > > > >> > > > from
>> > >> > > > >> > > > >> >> requiring the user to set a coder on the
>> resulting
>> > >> > > > collection]
>> > >> > > > >> > > > >> >>
>> > >> > > > >> > > > >> >> - Corollary: composite transforms need to only
>> > >> configure
>> > >> > > > their
>> > >> > > > >> > > > primitive
>> > >> > > > >> > > > >> >> transforms (and composite sub-transforms)
>> properly,
>> > >> and
>> > >> > > give
>> > >> > > > >> > them a
>> > >> > > > >> > > > >> Coder
>> > >> > > > >> > > > >> >> if needed.
>> > >> > > > >> > > > >> >>
>> > >> > > > >> > > > >> >> - Corollary: a PTransform with type parameters
>> > <FooT,
>> > >> > > BarT,
>> > >> > > > >> ...>
>> > >> > > > >> > > > needs
>> > >> > > > >> > > > >> to
>> > >> > > > >> > > > >> >> be configurable with coders for all of these,
>> > because
>> > >> > the
>> > >> > > > >> > > > implementation
>> > >> > > > >> > > > >> >> of
>> > >> > > > >> > > > >> >> the transform may change and it may introduce
>> > >> > intermediate
>> > >> > > > >> > > > collections
>> > >> > > > >> > > > >> >> involving these types. However, in many cases,
>> > some of
>> > >> > > these
>> > >> > > > >> type
>> > >> > > > >> > > > >> >> parameters appear in the type of the transform's
>> > >> input,
>> > >> > > > e.g. a
>> > >> > > > >> > > > >> >> PTransform<PCollection<KV<FooT, BarT>>,
>> > >> > > PCollection<MooT>>
>> > >> > > > >> will
>> > >> > > > >> > > > always
>> > >> > > > >> > > > >> be
>> > >> > > > >> > > > >> >> able to extract the coders for FooT and BarT from
>> > the
>> > >> > > input
>> > >> > > > >> > > > PCollection,
>> > >> > > > >> > > > >> >> so
>> > >> > > > >> > > > >> >> the user does not need to provide them. However, a coder for MooT must be
>> > >> > > > >> > > > >> >> provided. I think in most cases the transform
>> > needs to
>> > >> > be
>> > >> > > > >> > > > configurable
>> > >> > > > >> > > > >> >> only
>> > >> > > > >> > > > >> >> with coders for its output.
>> > >> > > > >> > > > >> >>
>> > >> > > > >> > > > >> >> Here's a smooth migration path to accomplish the
>> > >> above:
>> > >> > > > >> > > > >> >> - Make PCollection.createPrimitiveOutputInternal() take a Coder.
>> > >> > > > >> > > > >> >> - Make all primitive transforms optionally
>> > >> configurable
>> > >> > > > with a
>> > >> > > > >> > > coder
>> > >> > > > >> > > > for
>> > >> > > > >> > > > >> >> their outputs, such as ParDo.of(DoFn).withOutputCoder().
>> > >> > > > >> > > > >> >> - By using the above, make all composite
>> transforms
>> > >> > > shipped
>> > >> > > > >> with
>> > >> > > > >> > > the
>> > >> > > > >> > > > SDK
>> > >> > > > >> > > > >> >> set a Coder on the collections they produce; in
>> > some
>> > >> > > cases,
>> > >> > > > >> this
>> > >> > > > >> > > will
>> > >> > > > >> > > > >> >> require adding a withSomethingCoder() option to
>> the
>> > >> > > > transform
>> > >> > > > >> and
>> > >> > > > >> > > > >> >> propagating that coder to its sub-transforms. If
>> > the
>> > >> > > option
>> > >> > > > is
>> > >> > > > >> > > unset,
>> > >> > > > >> > > > >> >> that's fine for now.
>> > >> > > > >> > > > >> >> - As a result of the above, get rid of all
>> > setCoder()
>> > >> > > calls
>> > >> > > > in
>> > >> > > > >> > the
>> > >> > > > >> > > > Beam
>> > >> > > > >> > > > >> >> repo. The call will still be there, but it will
>> > just
>> > >> not
>> > >> > > be
>> > >> > > > >> used
>> > >> > > > >> > > > >> anywhere
>> > >> > > > >> > > > >> >> in the SDK or examples, and we can mark it
>> > deprecated.
>> > >> > > > >> > > > >> >> - Add guidance to PTransform Style Guide in line
>> > with
>> > >> > the
>> > >> > > > >> above
>> > >> > > > >> > > > >> >>
>> > >> > > > >> > > > >> >> Does this sound like a good idea? I'm not sure
>> how
>> > >> > urgent
>> > >> > > it
>> > >> > > > >> > would
>> > >> > > > >> > > > be to
>> > >> > > > >> > > > >> >> actually do this, but I'd like to know whether
>> > people
>> > >> > > agree
>> > >> > > > >> that
>> > >> > > > >> > > this
>> > >> > > > >> > > > >> is a
>> > >> > > > >> > > > >> >> good goal in general.
>> > >> > > > >> > > > >> >>
>> > >> > > > >> > > > >> >>
>> > >> > > > >> > > > >> > --
>> > >> > > > >> > > > >> > Jean-Baptiste Onofré
>> > >> > > > >> > > > >> > jbono...@apache.org
>> > >> > > > >> > > > >> > http://blog.nanthrax.net
>> > >> > > > >> > > > >> > Talend - http://www.talend.com
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >>
>> > >> > > > >> > > >
>> > >> > > > >> > >
>> > >> > > > >> >
>> > >> > > > >>
>> > >> > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>>
