Fully agree, Thomas, I think it's way better.

Regards
JB

On 07/27/2017 08:00 PM, Thomas Groh wrote:
+1 on getting rid of setCoder; just from a Java SDK perspective, my ideal
world contains PCollections which don't have a user-visible way to mutate
them.

My preference would be to use TypeDescriptors everywhere within Pipeline construction (where possible), and utilize the CoderRegistry everywhere to actually extract the appropriate type. The unfortunate difficulty of having to encode a union type and the lack of variable-length generics do complicate that. We could consider some way of constructing coders in the registry from a collection of type descriptors (which should be accessible from the point where the union type is being constructed), e.g. something like `getCoder(TypeDescriptor output, TypeDescriptor... components)` - that only permits a single flat level (but since this is being invoked by the SDK during construction, it could also pass Coder...).
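
For illustration, a call to that registry hook might look like the sketch below. Only CoderRegistry.getCoder(TypeDescriptor) exists today; the varargs overload is the proposal above, and the component types are made up for the example:

    // Hypothetical: resolve a coder for a union-like output from the component
    // TypeDescriptors that are in scope where the union type is constructed.
    CoderRegistry registry = pipeline.getCoderRegistry();
    Coder<RawUnionValue> unionCoder =
        registry.getCoder(
            TypeDescriptor.of(RawUnionValue.class),  // output type
            TypeDescriptors.strings(),               // component 0 (example)
            TypeDescriptors.longs());                // component 1 (example)
    // Since the SDK itself would invoke this during construction, the same hook
    // could equally accept Coder... for the components, as noted above.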



On Thu, Jul 27, 2017 at 10:22 AM, Robert Bradshaw <rober...@google.com.invalid> wrote:

On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles <k...@google.com.invalid> wrote:
On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik <lc...@google.com.invalid> wrote:

Ken/Robert, I believe users will want the ability to set the output coder because coders may have intrinsic properties where the type isn't enough information to fully specify what I want as a user. Some cases I can see are:

1) I have a cheap and fast non-deterministic coder but a different, slower coder for when I want to use it as the key to a GBK. For example, with a set coder, it would need to consistently order the values of the set when used as the key.

2) I know a property of the data which allows me to have a cheaper encoding. Imagine I know that all the strings have a common prefix, or that integers are in a certain range, or that a matrix is sparse/dense. Not all PCollections of strings / integers / matrices in the pipeline will have this property, just some (see the sketch after this list).

3) Sorting comes up occasionally; traditionally in Google this was done by sorting the encoded version of the object lexicographically during a GBK. There are good lexicographical byte representations for ASCII strings, integers, and some IEEE number representations, which could be done by the use of a special coder.
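
To make case (2) concrete, here is a minimal sketch of such a coder, assuming every element of one particular PCollection<String> shares a known prefix. The class name and prefix are invented; it just delegates to StringUtf8Coder for the remainder:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.beam.sdk.coders.CustomCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;

    // Sketch only: exploits pipeline-level knowledge that no TypeDescriptor can
    // express, which is why it would have to be chosen explicitly for just this
    // PCollection rather than inferred from the element type.
    public class CommonPrefixStringCoder extends CustomCoder<String> {
      private static final String PREFIX = "gs://my-bucket/";  // hypothetical
      private static final StringUtf8Coder UTF8 = StringUtf8Coder.of();

      @Override
      public void encode(String value, OutputStream outStream) throws IOException {
        // Store only the suffix; the shared prefix is implied.
        UTF8.encode(value.substring(PREFIX.length()), outStream);
      }

      @Override
      public String decode(InputStream inStream) throws IOException {
        return PREFIX + UTF8.decode(inStream);
      }
    }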


Items (1) and (3) do not require special knowledge from the user. They are easily observed properties of a pipeline. My proposal included full automation for both. The suggestion is new methods .getDeterministicCoder(TypeDescriptor) and .getLexicographicCoder(TypeDescriptor).
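
For illustration, usage of those proposed methods might look roughly like this. Both methods are hypothetical, named as in the proposal; the SDK/runner, not the user, would call them at the point where the property is actually needed:

    // Hypothetical methods on CoderRegistry, per the proposal above.
    CoderRegistry registry = pipeline.getCoderRegistry();

    // Where a GroupByKey needs it: a key coder with a deterministic encoding.
    Coder<String> keyCoder = registry.getDeterministicCoder(TypeDescriptors.strings());

    // Where encoded-byte sorting needs it: a coder whose byte order matches value order.
    Coder<String> sortCoder = registry.getLexicographicCoder(TypeDescriptors.strings());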

Completely agree--use cases (1) and (3) are an indirect use of Coders to achieve an effect that would be better expressed directly.

(2) is an interesting hypothetical for massive scale, where tiny incremental optimization represents a lot of cost _and_ your data has sufficient structure to realize a benefit _and_ it needs to be pinpointed to just some PCollections. I think our experience with coders so far is that their existence is almost entirely negative. It would be nice to support this vanishingly rare case without inflicting a terrible pain point on the model and all other users.

(2) is not just about cheapness; sometimes there's other structure in the data we can leverage. Consider the UnionCoder used in CoGBK--RawUnionValue has an integer value that indicates the type of its raw Object field. Unless we want to extend the type language, there's no type descriptor sufficient to infer the coder from. I'm dubious that going down the road of adding special cases is the right thing here.
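
For context, the CoGBK case looks roughly like the sketch below: UnionCoder has to be built from the per-tag component coders, which a TypeDescriptor<RawUnionValue> alone cannot carry. The two component types here are invented for the example:

    // RawUnionValue (org.apache.beam.sdk.transforms.join) is an integer tag plus an
    // untyped Object payload, so its TypeDescriptor says nothing about how to encode
    // the payload; UnionCoder needs the component coders, indexed by tag.
    UnionCoder unionCoder =
        UnionCoder.of(Arrays.<Coder<?>>asList(
            StringUtf8Coder.of(),   // tag 0: a String input to the CoGBK
            VarLongCoder.of()));    // tag 1: a Long input to the CoGBK
    RawUnionValue value = new RawUnionValue(0, "some element");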

For example, in those cases you could encode in your DoFn so the type descriptor would just be byte[].

As well as being an extremely cumbersome API, this would incur the
cost of coding/decoding at that DoFn boundary even if it is fused
away.

On Thu, Jul 27, 2017 at 1:34 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi,

That's an interesting thread, and I have been wondering about the relationship between type descriptor and coder for a while ;)

Today, in a PCollection, we can set the coder and we also have a getTypeDescriptor(). It sounds weird to me: it should be one or the other.

Basically, if the Coder is not used to define the type, then I fully agree with Eugene.

Basically, the PCollection should define only the type descriptor, not the coder by itself: the coder can be found using the type descriptor.

With both a coder and a type descriptor on the PCollection, it sounds a bit "decoupled" to me, and it would be possible to have a coder on the PCollection that doesn't match the type descriptor.
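
To illustrate that decoupling, nothing structurally ties the two together today; with a raw-typed cast (deliberately abusive, and the placeholder PCollection is assumed to come from earlier in the pipeline) the coder and the type descriptor can silently disagree:

    PCollection<String> words = ...;  // placeholder: some PCollection<String> built earlier
    // This compiles with only a raw-type warning, yet the coder now contradicts
    // the type descriptor carried by the very same PCollection.
    words.setCoder((Coder) VarLongCoder.of());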

I think the PCollection type descriptor should be defined, and the coder should be implicit, based on this type descriptor.

Thoughts ?

Regards
JB


On 07/26/2017 05:25 AM, Eugene Kirpichov wrote:

Hello,

I've worked on a few different things recently and ran repeatedly into the same issue: we do not have clear guidance on who should set the Coder on a PCollection. Is it the responsibility of the PTransform that outputs it, or is it the responsibility of the user, or is it sometimes one and sometimes the other?

I believe that the answer is "it's the responsibility of the transform", and moreover that ideally PCollection.setCoder() should not exist. Instead:

- Require that all transforms set a Coder on the PCollections they produce - i.e. it should never be the responsibility of the user to "fix up" a coder on a PCollection produced by a transform.

- Since all transforms are composed of primitive transforms, saying "transforms must set a Coder" means simply that all *primitive* transforms must set a Coder on their output.

- In some cases, a primitive PTransform currently doesn't have enough information to infer a coder for its output collection - e.g. ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a coder for OutputT. In that case such transforms should allow the user to provide a coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs from requiring the user to set a coder on the resulting collection; see the sketch after this list].

- Corollary: composite transforms need to only configure their primitive transforms (and composite sub-transforms) properly, and give them a Coder if needed.

- Corollary: a PTransform with type parameters <FooT, BarT, ...> needs to be configurable with coders for all of these, because the implementation of the transform may change and it may introduce intermediate collections involving these types. However, in many cases, some of these type parameters appear in the type of the transform's input, e.g. a PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will always be able to extract the coders for FooT and BarT from the input PCollection, so the user does not need to provide them. However, a coder for MooT must be provided. I think in most cases the transform needs to be configurable only with coders for its output.
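
As a concrete illustration of the ParDo point above, configuring the transform might read as follows. withOutputCoder is the proposed method and does not exist in the SDK today; lines, ParseFooFn, Foo, and FooCoder are placeholders:

    // Proposed style: the transform owns the output coder...
    PCollection<Foo> foos =
        lines.apply(
            ParDo.of(new ParseFooFn())             // a DoFn<String, Foo> (placeholder)
                .withOutputCoder(FooCoder.of()));  // proposed API, not yet in the SDK

    // ...instead of today's pattern of fixing up the result afterwards:
    PCollection<Foo> fixedUp = lines.apply(ParDo.of(new ParseFooFn()));
    fixedUp.setCoder(FooCoder.of());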

Here's a smooth migration path to accomplish the above:
- Make PCollection.createPrimitiveOutputInternal() take a Coder (see the sketch after this list).
- Make all primitive transforms optionally configurable with a coder for their outputs, such as ParDo.of(DoFn).withOutputCoder().
- By using the above, make all composite transforms shipped with the SDK set a Coder on the collections they produce; in some cases, this will require adding a withSomethingCoder() option to the transform and propagating that coder to its sub-transforms. If the option is unset, that's fine for now.
- As a result of the above, get rid of all setCoder() calls in the Beam repo. The call will still be there, but it will just not be used anywhere in the SDK or examples, and we can mark it deprecated.
- Add guidance to the PTransform Style Guide in line with the above.
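
For the first step, the change might look roughly like the sketch below; the existing parameters are recalled from memory and the trailing Coder is the proposed addition, so treat the signature as approximate:

    // Proposed: a primitive output carries its Coder from the moment it is created,
    // so no PCollection ever exists in a coder-less state waiting for setCoder().
    PCollection<OutputT> output =
        PCollection.createPrimitiveOutputInternal(
            pipeline,
            windowingStrategy,
            isBounded,
            outputCoder);  // the new, proposed parameter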

Does this sound like a good idea? I'm not sure how urgent it would be to actually do this, but I'd like to know whether people agree that this is a good goal in general.


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
