Eugene, a two-pass model allows the framework to first gather all the
information the user specified and then, in a second pass, actually perform
the expansion of PTransforms. Since the user isn't adding any additional
data to the pipeline, every transform could inspect all the attributes of
every predecessor PTransform and of its own output PCollections to do
whatever it needs to. The Create transform is the one I always come back to
when thinking about this.
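
To make the idea concrete, here is a toy sketch of a two-pass builder. All names here are hypothetical and illustrative, not real Beam API: pass 1 only records the graph the user described, and pass 2 expands it in application order, so every node can inspect the fully resolved attributes of its predecessor (e.g. to inherit a coder).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical two-pass pipeline builder (illustrative only, not Beam API).
public class TwoPassSketch {
  static class Node {
    final String name;
    final Node input;   // predecessor; null for a source like Create
    String coder;       // resolved only during pass 2
    Node(String name, Node input) { this.name = name; this.input = input; }
  }

  final List<Node> nodes = new ArrayList<>();

  // Pass 1: record structure only; no expansion work happens here.
  Node apply(String name, Node input) {
    Node n = new Node(name, input);
    nodes.add(n);
    return n;
  }

  // Pass 2: nodes were recorded in application order, so each node's
  // predecessor is already fully resolved by the time we reach it.
  void expand(Map<String, String> userSpecifiedCoders) {
    for (Node n : nodes) {
      if (userSpecifiedCoders.containsKey(n.name)) {
        n.coder = userSpecifiedCoders.get(n.name);  // user-specified wins
      } else if (n.input != null) {
        n.coder = n.input.coder;                    // inherit from predecessor
      } else {
        n.coder = "DefaultCoder";                   // e.g. what Create might infer
      }
    }
  }
}
```

The point is only that deferring expansion gives every transform a complete view of its predecessors, which a single eager pass cannot offer.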

Ken/Robert, I believe users will want the ability to set the output coder
because coders may have intrinsic properties where the type isn't enough
information to fully specify what I want as a user. Some cases I can see
are:
1) I have a cheap, fast, non-deterministic coder but need a different,
slower deterministic coder when the value is used as the key of a GBK. For
example, a set coder would need to consistently order the values of the set
when the set is used as a key.
2) I know a property of the data which allows me to have a cheaper
encoding. Imagine I know that all the strings share a common prefix, that
integers fall within a certain range, or that a matrix is sparse/dense. Not
all PCollections of strings / integers / matrices in the pipeline will have
this property, just some.
3) Sorting comes up occasionally; within Google this was traditionally done
by sorting the encoded version of the object lexicographically during a
GBK. There are good lexicographic byte representations for ASCII strings,
integers, and some IEEE number representations, which could be provided by
the use of a special coder.
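
As an illustration of case 3, here is a minimal sketch (not real Beam code; class and method names are made up) of an order-preserving byte encoding for 32-bit signed integers. Flipping the sign bit and writing big-endian makes unsigned lexicographic comparison of the bytes agree with numeric order, so a shuffle that sorts encoded bytes sorts the integers correctly.

```java
// Hypothetical order-preserving coder for signed 32-bit ints.
public class LexIntCoder {
  public static byte[] encode(int v) {
    int u = v ^ 0x80000000;  // flip sign bit: Integer.MIN_VALUE -> 0x00000000
    return new byte[] {
      (byte) (u >>> 24), (byte) (u >>> 16), (byte) (u >>> 8), (byte) u
    };
  }

  public static int decode(byte[] b) {
    int u = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
          | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
    return u ^ 0x80000000;   // undo the sign-bit flip
  }

  // Unsigned lexicographic comparison, as a shuffle would do on raw bytes.
  public static int compareBytes(byte[] a, byte[] b) {
    for (int i = 0; i < a.length; i++) {
      int d = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (d != 0) return d;
    }
    return 0;
  }
}
```

A default big-endian encoding of two's-complement ints would not have this property (negative numbers would sort after positive ones), which is exactly why the type alone isn't enough and the user may want to pick the coder.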

On Thu, Jul 27, 2017 at 1:34 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi,
>
> That's an interesting thread; I've been wondering about the relationship
> between type descriptors and coders for a while ;)
>
> Today, in a PCollection, we can set the coder and we also have a
> getTypeDescriptor(). It sounds weird to me: it should be one or the other.
>
> Basically, if the Coder is not used to define the type, then I fully
> agree with Eugene.
>
> Basically, the PCollection should define only the type descriptor, not the
> coder by itself: the coder can be found using the type descriptor.
>
> With both coder and type descriptor on the PCollection, it seems a bit
> "decoupled" to me, and it would be possible to have a coder on the
> PCollection that doesn't match the type descriptor.
>
> I think PCollection type descriptor should be defined, and the coder
> should be implicit based on this type descriptor.
>
> Thoughts ?
>
> Regards
> JB
>
>
> On 07/26/2017 05:25 AM, Eugene Kirpichov wrote:
>
>> Hello,
>>
>> I've worked on a few different things recently and repeatedly ran into the
>> same issue: we do not have clear guidance on who should set the Coder
>> on a PCollection. Is it the responsibility of the PTransform that outputs
>> it, or the responsibility of the user, or sometimes one and sometimes the
>> other?
>>
>> I believe the answer is "it's the responsibility of the transform" and,
>> moreover, that ideally PCollection.setCoder() should not exist. Instead:
>>
>> - Require that all transforms set a Coder on the PCollections they
>> produce - i.e. it should never be the responsibility of the user to "fix
>> up" a coder on a PCollection produced by a transform.
>>
>> - Since all transforms are composed of primitive transforms, saying
>> "transforms must set a Coder" means simply that all *primitive* transforms
>> must set a Coder on their output.
>>
>> - In some cases, a primitive PTransform currently doesn't have enough
>> information to infer a coder for its output collection - e.g.
>> ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a coder for
>> OutputT. In that case such transforms should allow the user to provide a
>> coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs from
>> requiring the user to set a coder on the resulting collection]
>>
>> - Corollary: composite transforms need to only configure their primitive
>> transforms (and composite sub-transforms) properly, and give them a Coder
>> if needed.
>>
>> - Corollary: a PTransform with type parameters <FooT, BarT, ...> needs to
>> be configurable with coders for all of these, because the implementation
>> of
>> the transform may change and it may introduce intermediate collections
>> involving these types. However, in many cases, some of these type
>> parameters appear in the type of the transform's input, e.g. a
>> PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will always be
>> able to extract the coders for FooT and BarT from the input PCollection,
>> so
>> the user does not need to provide them. However, a coder for MooT must
>> still be provided. I think in most cases the transform needs to be
>> configurable only with coders for its output.
>>
>> Here's a smooth migration path to accomplish the above:
>> - Make PCollection.createPrimitiveOutputInternal() take a Coder.
>> - Make all primitive transforms optionally configurable with a coder for
>> their outputs, such as ParDo.of(DoFn).withOutputCoder().
>> - By using the above, make all composite transforms shipped with the SDK
>> set a Coder on the collections they produce; in some cases, this will
>> require adding a withSomethingCoder() option to the transform and
>> propagating that coder to its sub-transforms. If the option is unset,
>> that's fine for now.
>> - As a result of the above, get rid of all setCoder() calls in the Beam
>> repo. The method will still exist, but it will no longer be used anywhere
>> in the SDK or examples, and we can mark it deprecated.
>> - Add guidance to PTransform Style Guide in line with the above
>>
>> Does this sound like a good idea? I'm not sure how urgent it would be to
>> actually do this, but I'd like to know whether people agree that this is a
>> good goal in general.
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
