On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik <lc...@google.com.invalid>
wrote:
>
> Ken/Robert, I believe users will want the ability to set the output coder
> because coders may have intrinsic properties where the type isn't enough
> information to fully specify what I want as a user. Some cases I can see
> are:
> 1) I have a cheap and fast non-deterministic coder but a different slower
> coder when I want to use it as the key to a GBK, For example with a set
> coder, it would need to consistently order the values of the set when used
> as the key.
> 2) I know a property of the data which allows me to have a cheaper
> encoding. Imagine I know that all the strings have a common prefix or
> integers that are in a certain range, or that a matrix is sparse/dense. Not
> all PCollections of strings / integers / matrices in the pipeline will have
> this property, just some.
> 3) Sorting comes up occasionally, traditionally in Google this was done by
> sorting the encoded version of the object lexicographically during a GBK.
> There are good lexicographical byte representations for ASCII strings,
> integers, and for some IEEE number representations which could be done by
> the use of a special coder.
>

Items (1) and (3) do not require special knowledge from the user. They are
easily observed properties of a pipeline. My proposal included full
automation for both. The suggestion is new methods
.getDeterministicCoder(TypeDescriptor) and
.getLexicographicCoder(TypeDescriptor).

(2) is an interesting hypothetical for massive scale where tiny incremental
optimization represents a lot of cost _and_ your data has sufficient
structure to realize a benefit _and_ it needs to be pinpointed to just some
PCollections. I think our experience with coders so far is that their
existence is almost entirely negative. It would be nice to support this
vanishingly rare case without inflicting a terrible pain point on the model
and all other users. For example, in those cases you could encode in your
DoFn so the type descriptor would just be byte[].

Kenn



>
> On Thu, Jul 27, 2017 at 1:34 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> > Hi,
> >
> > That's an interesting thread and I was wondering the relationship between
> > type descriptor and coder for a while ;)
> >
> > Today, in a PCollection, we can set the coder and we also have a
> > getTypeDescriptor(). It sounds weird to me: it should be one or the
> other.
> >
> > Basically, if the Coder is not used to define the type, than, I fully
> > agree with Eugene.
> >
> > Basically, the PCollection should define only the type descriptor, not
> the
> > coder by itself: the coder can be found using the type descriptor.
> >
> > With both coder and type descriptor on the PCollection, it sounds a big
> > "decoupled" to me and it would be possible to have a coder on the
> > PCollection that doesn't match the type descriptor.
> >
> > I think PCollection type descriptor should be defined, and the coder
> > should be implicit based on this type descriptor.
> >
> > Thoughts ?
> >
> > Regards
> > JB
> >
> >
> > On 07/26/2017 05:25 AM, Eugene Kirpichov wrote:
> >
> >> Hello,
> >>
> >> I've worked on a few different things recently and ran repeatedly into
> the
> >> same issue: that we do not have clear guidance on who should set the
> Coder
> >> on a PCollection: is it responsibility of the PTransform that outputs
> it,
> >> or is it responsibility of the user, or is it sometimes one and
> sometimes
> >> the other?
> >>
> >> I believe that the answer is "it's responsibility of the transform" and
> >> moreover that  ideally PCollection.setCoder() should not exist. Instead:
> >>
> >> - Require that all transforms set a Coder on the PCollection's they
> >> produce
> >> - i.e. it should never be responsibility of the user to "fix up" a coder
> >> on
> >> a PCollection produced by a transform.
> >>
> >> - Since all transforms are composed of primitive transforms, saying
> >> "transforms must set a Coder" means simply that all *primitive*
> transforms
> >> must set a Coder on their output.
> >>
> >> - In some cases, a primitive PTransform currently doesn't have enough
> >> information to infer a coder for its output collection - e.g.
> >> ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a coder for
> >> OutputT. In that case such transforms should allow the user to provide a
> >> coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs from
> >> requiring the user to set a coder on the resulting collection]
> >>
> >> - Corollary: composite transforms need to only configure their primitive
> >> transforms (and composite sub-transforms) properly, and give them a
> Coder
> >> if needed.
> >>
> >> - Corollary: a PTransform with type parameters <FooT, BarT, ...> needs
> to
> >> be configurable with coders for all of these, because the implementation
> >> of
> >> the transform may change and it may introduce intermediate collections
> >> involving these types. However, in many cases, some of these type
> >> parameters appear in the type of the transform's input, e.g. a
> >> PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will always
> be
> >> able to extract the coders for FooT and BarT from the input PCollection,
> >> so
> >> the user does not need to provide them. However, a coder for BarT must
> be
> >> provided. I think in most cases the transform needs to be configurable
> >> only
> >> with coders for its output.
> >>
> >> Here's a smooth migration path to accomplish the above:
> >> - Make PCollection.createPrimitiveOutputInternal() take a Coder.
> >> - Make all primitive transforms optionally configurable with a coder for
> >> their outputs, such as ParDo.of(DoFn).withOutputCoder().
> >> - By using the above, make all composite transforms shipped with the SDK
> >> set a Coder on the collections they produce; in some cases, this will
> >> require adding a withSomethingCoder() option to the transform and
> >> propagating that coder to its sub-transforms. If the option is unset,
> >> that's fine for now.
> >> - As a result of the above, get rid of all setCoder() calls in the Beam
> >> repo. The call will still be there, but it will just not be used
> anywhere
> >> in the SDK or examples, and we can mark it deprecated.
> >> - Add guidance to PTransform Style Guide in line with the above
> >>
> >> Does this sound like a good idea? I'm not sure how urgent it would be to
> >> actually do this, but I'd like to know whether people agree that this
> is a
> >> good goal in general.
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>

Reply via email to