And what about having a kind of map in the CoderRegistry with type -> coder? byte[] could be an entry, a specialized type another.
Thoughts?
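A minimal sketch of what such a table could look like, with hypothetical names (the specialized entry and its coder are placeholders, not existing Beam classes):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.beam.sdk.coders.ByteArrayCoder;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.values.TypeDescriptor;

// Sketch only: an explicit type -> coder table of the kind suggested above.
class CoderTableSketch {
  private final Map<TypeDescriptor<?>, Coder<?>> coders = new HashMap<>();

  CoderTableSketch() {
    coders.put(TypeDescriptor.of(byte[].class), ByteArrayCoder.of());
    // A specialized type would get its own entry, e.g.:
    // coders.put(TypeDescriptor.of(SpecializedRecord.class), new SpecializedRecordCoder());
  }

  @SuppressWarnings("unchecked")
  <T> Coder<T> coderFor(TypeDescriptor<T> type) {
    return (Coder<T>) coders.get(type);
  }
}
```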
Regards
JB
On 07/27/2017 08:18 PM, Thomas Groh wrote:
I think there's a simpler solution than encoding to byte[]: introduce a
new, specialized type to represent the restricted
(alternatively-distributed?) data. The TypeDescriptor for this type can map
to the specialized coder, without having to perform a significant degree of
potentially wasted encoding work, plus it includes the assumptions that are
being made about the distribution of data.
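A minimal sketch of that idea, with hypothetical names (SmallNonNegativeInt and its registration are illustrative, not existing Beam code): the wrapper type is what the registry keys on, so the assumption about the data travels with the type rather than with a manually attached coder.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.DelegateCoder;
import org.apache.beam.sdk.coders.VarIntCoder;

// Hypothetical wrapper type marking ints known to be small and non-negative,
// so the cheaper varint encoding is safe for them.
class SmallNonNegativeInt {
  final int value;

  SmallNonNegativeInt(int value) {
    this.value = value;
  }

  // Register once at construction time; afterwards the TypeDescriptor alone is
  // enough for the registry to pick the specialized coder.
  static void registerCoder(Pipeline pipeline) {
    pipeline
        .getCoderRegistry()
        .registerCoderForClass(
            SmallNonNegativeInt.class,
            DelegateCoder.of(
                VarIntCoder.of(),
                (SmallNonNegativeInt v) -> v.value,
                SmallNonNegativeInt::new));
  }
}
```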
On Thu, Jul 27, 2017 at 11:04 AM, Reuven Lax <re...@google.com.invalid> wrote:
I tend to agree with Robert - it would be unfortunate if a single TypeDescriptor was forced to have the same encoding all through the pipeline. However, it would also be unfortunate if this flexibility impacted every part of the programming model. I also think that our experience has been that "large scale where tiny incremental optimization represents a lot of cost" is far more common than people expect, especially since coding/decoding is often a dominant cost for such pipelines.
On Thu, Jul 27, 2017 at 11:00 AM, Thomas Groh <tg...@google.com.invalid> wrote:
+1 on getting rid of setCoder; just from a Java SDK perspective, my ideal world contains PCollections which don't have a user-visible way to mutate them.

My preference would be to use TypeDescriptors everywhere within Pipeline construction (where possible), and utilize the CoderRegistry everywhere to actually extract the appropriate coder. The unfortunate difficulty of having to encode a union type, and the lack of variable-length generics, does complicate that. We could consider some way of constructing coders in the registry from a collection of type descriptors (which should be accessible from the point the union type is being constructed), e.g. something like `getCoder(TypeDescriptor output, TypeDescriptor... components)` - that does only permit a single flat level (but since this is being invoked by the SDK during construction it could also pass Coder...).
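A sketch of that registry hook; the interface and both signatures are hypothetical, purely to restate the shape being proposed:

```java
import java.util.List;

import org.apache.beam.sdk.coders.CannotProvideCoderException;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.values.TypeDescriptor;

// Hypothetical hook: the SDK, which knows the component types at the point the
// union type is constructed, hands them to the registry along with the output type.
interface UnionAwareCoderLookup {
  <T> Coder<T> getCoder(TypeDescriptor<T> output, TypeDescriptor<?>... components)
      throws CannotProvideCoderException;

  // Variant mentioned above: since the SDK invokes this during construction, it
  // could pass the already-resolved component Coders instead of their descriptors.
  <T> Coder<T> getCoder(TypeDescriptor<T> output, List<? extends Coder<?>> componentCoders)
      throws CannotProvideCoderException;
}
```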
On Thu, Jul 27, 2017 at 10:22 AM, Robert Bradshaw <rober...@google.com.invalid> wrote:
On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles <k...@google.com.invalid> wrote:
On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik <lc...@google.com.invalid> wrote:
Ken/Robert, I believe users will want the ability to set the output coder, because coders may have intrinsic properties where the type isn't enough information to fully specify what I want as a user. Some cases I can see are:

1) I have a cheap and fast non-deterministic coder but a different, slower coder when I want to use it as the key to a GBK. For example, with a set coder, it would need to consistently order the values of the set when used as the key.

2) I know a property of the data which allows me to have a cheaper encoding. Imagine I know that all the strings have a common prefix, or that the integers are in a certain range, or that a matrix is sparse/dense. Not all PCollections of strings / integers / matrices in the pipeline will have this property, just some. (See the sketch after this list.)

3) Sorting comes up occasionally; traditionally in Google this was done by sorting the encoded version of the object lexicographically during a GBK. There are good lexicographical byte representations for ASCII strings, integers, and for some IEEE number representations, which could be handled by the use of a special coder.
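A minimal sketch of case (2), with hypothetical names and assuming the two-argument Coder API; the point is that nothing in the String type itself tells the registry this encoding is safe, so the user has to opt in somewhere:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;

// Hypothetical coder exploiting the knowledge that every element starts with a
// fixed prefix: only the suffix is written to the stream.
class KnownPrefixStringCoder extends CustomCoder<String> {
  private final String prefix;
  private final StringUtf8Coder suffixCoder = StringUtf8Coder.of();

  KnownPrefixStringCoder(String prefix) {
    this.prefix = prefix;
  }

  @Override
  public void encode(String value, OutputStream outStream) throws IOException {
    // Assumes the property really holds; a production coder would validate it.
    suffixCoder.encode(value.substring(prefix.length()), outStream);
  }

  @Override
  public String decode(InputStream inStream) throws IOException {
    return prefix + suffixCoder.decode(inStream);
  }
}
```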
Items (1) and (3) do not require special knowledge from the user. They are easily observed properties of a pipeline. My proposal included full automation for both. The suggestion is new methods .getDeterministicCoder(TypeDescriptor) and .getLexicographicCoder(TypeDescriptor).
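As signatures, that proposal might look roughly like this (hypothetical; neither method exists on CoderRegistry today):

```java
import org.apache.beam.sdk.coders.CannotProvideCoderException;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.values.TypeDescriptor;

// Hypothetical registry surface: callers that need a property (determinism for
// GBK keys, byte-lexicographic ordering for sorts) ask for a coder with that
// property instead of the user hand-picking a special coder.
interface PropertyAwareCoderLookup {
  <T> Coder<T> getDeterministicCoder(TypeDescriptor<T> type) throws CannotProvideCoderException;

  <T> Coder<T> getLexicographicCoder(TypeDescriptor<T> type) throws CannotProvideCoderException;
}
```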
Completely agree - use cases (1) and (3) are an indirect use of Coders that are used to achieve an effect that would be better expressed directly.

(2) is an interesting hypothetical for massive scale where tiny incremental optimization represents a lot of cost _and_ your data has sufficient structure to realize a benefit _and_ it needs to be pinpointed to just some PCollections. I think our experience with coders so far is that their existence is almost entirely negative. It would be nice to support this vanishingly rare case without inflicting a terrible pain point on the model and all other users.
(2) is not just about cheapness; sometimes there's other structure in the data we can leverage. Consider the UnionCoder used in CoGBK: RawUnionValue has an integer value that indicates the type of its raw Object field. Unless we want to extend the type language, there's not a sufficient type descriptor that can be used to infer the coder. I'm dubious that going down the road of adding special cases is the right thing here.
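To make the UnionCoder point concrete (RawUnionValue is the real Beam class; the surrounding scaffolding is illustrative only):

```java
import org.apache.beam.sdk.transforms.join.RawUnionValue;
import org.apache.beam.sdk.values.TypeDescriptor;

class UnionTypeDescriptorExample {
  static void show() {
    // RawUnionValue is just an int tag plus an Object, so its TypeDescriptor says
    // nothing about which Coder belongs to each tag; UnionCoder has to be built
    // from the per-tag component Coders directly.
    RawUnionValue tagged = new RawUnionValue(0, "element from input 0");
    TypeDescriptor<RawUnionValue> type = TypeDescriptor.of(RawUnionValue.class);
    System.out.println(type + " cannot say that tag " + tagged.getUnionTag() + " holds a String");
  }
}
```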
For example, in those cases you could encode in your DoFn so the type descriptor would just be byte[].

As well as being an extremely cumbersome API, this would incur the cost of coding/decoding at that DoFn boundary even if it is fused away.
On Thu, Jul 27, 2017 at 1:34 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
Hi,
That's an interesting thread, and I have been wondering about the relationship between type descriptor and coder for a while ;)

Today, in a PCollection, we can set the coder and we also have a getTypeDescriptor(). It sounds weird to me: it should be one or the other.

Basically, if the Coder is not used to define the type, then I fully agree with Eugene. The PCollection should define only the type descriptor, not the coder by itself: the coder can be found using the type descriptor.

With both coder and type descriptor on the PCollection, it sounds a bit "decoupled" to me and it would be possible to have a coder on the PCollection that doesn't match the type descriptor.

I think the PCollection type descriptor should be defined, and the coder should be implicit based on this type descriptor.

Thoughts?
Regards
JB
On 07/26/2017 05:25 AM, Eugene Kirpichov wrote:
Hello,
I've worked on a few different things recently and ran repeatedly into the same issue: we do not have clear guidance on who should set the Coder on a PCollection. Is it the responsibility of the PTransform that outputs it, or is it the responsibility of the user, or is it sometimes one and sometimes the other?

I believe that the answer is "it's the responsibility of the transform", and moreover that ideally PCollection.setCoder() should not exist. Instead:
- Require that all transforms set a Coder on the PCollections they produce - i.e. it should never be the responsibility of the user to "fix up" a coder on a PCollection produced by a transform.
- Since all transforms are composed of primitive transforms, saying "transforms must set a Coder" means simply that all *primitive* transforms must set a Coder on their output.
- In some cases, a primitive PTransform currently doesn't have enough information to infer a coder for its output collection - e.g. ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a coder for OutputT. In that case such transforms should allow the user to provide a coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs from requiring the user to set a coder on the resulting collection; see the sketch after this list].
- Corollary: composite transforms need only configure their primitive transforms (and composite sub-transforms) properly, and give them a Coder if needed.
- Corollary: a PTransform with type parameters <FooT, BarT, ...> needs to be configurable with coders for all of these, because the implementation of the transform may change and it may introduce intermediate collections involving these types. However, in many cases some of these type parameters appear in the type of the transform's input, e.g. a PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will always be able to extract the coders for FooT and BarT from the input PCollection, so the user does not need to provide them; a coder for MooT, however, must be provided. I think in most cases the transform needs to be configurable only with coders for its output.
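A sketch of the proposed shape from the third bullet above; withOutputCoder() is the proposal, not an existing ParDo method, so it appears only in a comment next to today's setCoder() equivalent:

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

class OutputCoderProposalExample {
  static PCollection<String> expand(PCollection<Integer> input, DoFn<Integer, String> fn) {
    // Proposed: the coder is given to the transform, not patched onto its output:
    //   return input.apply(ParDo.of(fn).withOutputCoder(StringUtf8Coder.of()));
    // Current equivalent, which the proposal would make unnecessary:
    return input.apply(ParDo.of(fn)).setCoder(StringUtf8Coder.of());
  }
}
```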
Here's a smooth migration path to accomplish the above:
- Make PCollection.createPrimitiveOutputInternal() take a Coder.
- Make all primitive transforms optionally configurable with a coder for their outputs, such as ParDo.of(DoFn).withOutputCoder().
- By using the above, make all composite transforms shipped with the SDK set a Coder on the collections they produce; in some cases, this will require adding a withSomethingCoder() option to the transform and propagating that coder to its sub-transforms. If the option is unset, that's fine for now.
- As a result of the above, get rid of all setCoder() calls in the Beam repo. The call will still be there, but it will just not be used anywhere in the SDK or examples, and we can mark it deprecated.
- Add guidance to the PTransform Style Guide in line with the above.
Does this sound like a good idea? I'm not sure how urgent it would be to actually do this, but I'd like to know whether people agree that this is a good goal in general.
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com