Fully agree, Thomas, I think it's way better.

Regards
JB

On 07/27/2017 08:00 PM, Thomas Groh wrote:
+1 on getting rid of setCoder; just from a Java SDK perspective, my ideal
world contains PCollections which don't have a user-visible way to mutate
them.

My preference would be to use TypeDescriptors everywhere within Pipeline construction (where possible), and utilize the CoderRegistry everywhere to actually extract the appropriate type. The unfortunate difficulty of having to encode a union type and the lack of variable-length generics do complicate that. We could consider some way of constructing coders in the registry from a collection of type descriptors (which should be accessible from the point where the union type is being constructed), e.g. something like `getCoder(TypeDescriptor output, TypeDescriptor... components)` - that only permits a single flat level (but since this is being invoked by the SDK during construction, it could also pass Coder...).
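
For illustration, a call to that registry hook might look like the sketch below. Only CoderRegistry.getCoder(TypeDescriptor) exists today; the varargs overload is the proposal above, and the component types are made up for the example:

    // Hypothetical: resolve a coder for a union-like output from the component
    // TypeDescriptors that are in scope where the union type is constructed.
    CoderRegistry registry = pipeline.getCoderRegistry();
    Coder<RawUnionValue> unionCoder =
        registry.getCoder(
            TypeDescriptor.of(RawUnionValue.class),  // output type
            TypeDescriptors.strings(),               // component 0 (example)
            TypeDescriptors.longs());                // component 1 (example)
    // Since the SDK itself would invoke this during construction, the same hook
    // could equally accept Coder... for the components, as noted above.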



On Thu, Jul 27, 2017 at 10:22 AM, Robert Bradshaw <rober...@google.com.invalid> wrote:

On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles <k...@google.com.invalid> wrote:
On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik <lc...@google.com.invalid> wrote:

Ken/Robert, I believe users will want the ability to set the output coder because coders may have intrinsic properties where the type isn't enough information to fully specify what I want as a user. Some cases I can see are:

1) I have a cheap and fast non-deterministic coder but a different, slower coder for when I want to use it as the key to a GBK. For example, with a set coder, it would need to consistently order the values of the set when used as the key.

2) I know a property of the data which allows me to have a cheaper encoding. Imagine I know that all the strings have a common prefix, or that integers are in a certain range, or that a matrix is sparse/dense. Not all PCollections of strings / integers / matrices in the pipeline will have this property, just some (see the sketch after this list).

3) Sorting comes up occasionally; traditionally in Google this was done by sorting the encoded version of the object lexicographically during a GBK. There are good lexicographical byte representations for ASCII strings, integers, and some IEEE number representations, which could be done by the use of a special coder.
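
To make case (2) concrete, here is a minimal sketch of such a coder, assuming every element of one particular PCollection<String> shares a known prefix. The class name and prefix are invented; it just delegates to StringUtf8Coder for the remainder:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.beam.sdk.coders.CustomCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;

    // Sketch only: exploits pipeline-level knowledge that no TypeDescriptor can
    // express, which is why it would have to be chosen explicitly for just this
    // PCollection rather than inferred from the element type.
    public class CommonPrefixStringCoder extends CustomCoder<String> {
      private static final String PREFIX = "gs://my-bucket/";  // hypothetical
      private static final StringUtf8Coder UTF8 = StringUtf8Coder.of();

      @Override
      public void encode(String value, OutputStream outStream) throws IOException {
        // Store only the suffix; the shared prefix is implied.
        UTF8.encode(value.substring(PREFIX.length()), outStream);
      }

      @Override
      public String decode(InputStream inStream) throws IOException {
        return PREFIX + UTF8.decode(inStream);
      }
    }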


Items (1) and (3) do not require special knowledge from the user. They are easily observed properties of a pipeline. My proposal included full automation for both. The suggestion is new methods .getDeterministicCoder(TypeDescriptor) and .getLexicographicCoder(TypeDescriptor).
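
For illustration, usage of those proposed methods might look roughly like this. Both methods are hypothetical, named as in the proposal; the SDK/runner, not the user, would call them at the point where the property is actually needed:

    // Hypothetical methods on CoderRegistry, per the proposal above.
    CoderRegistry registry = pipeline.getCoderRegistry();

    // Where a GroupByKey needs it: a key coder with a deterministic encoding.
    Coder<String> keyCoder = registry.getDeterministicCoder(TypeDescriptors.strings());

    // Where encoded-byte sorting needs it: a coder whose byte order matches value order.
    Coder<String> sortCoder = registry.getLexicographicCoder(TypeDescriptors.strings());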

Completely agree--use cases (1) and (3) are an indirect use of Coders to achieve an effect that would be better expressed directly.

(2) is an interesting hypothetical for massive scale, where tiny incremental optimization represents a lot of cost _and_ your data has sufficient structure to realize a benefit _and_ it needs to be pinpointed to just some PCollections. I think our experience with coders so far is that their existence is almost entirely negative. It would be nice to support this vanishingly rare case without inflicting a terrible pain point on the model and all other users.

(2) is not just about cheapness; sometimes there's other structure in the data we can leverage. Consider the UnionCoder used in CoGBK--RawUnionValue has an integer value that indicates the type of its raw Object field. Unless we want to extend the type language, there's no type descriptor sufficient to infer the coder from. I'm dubious that going down the road of adding special cases is the right thing here.
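
For context, the CoGBK case looks roughly like the sketch below: UnionCoder has to be built from the per-tag component coders, which a TypeDescriptor<RawUnionValue> alone cannot carry. The two component types here are invented for the example:

    // RawUnionValue (org.apache.beam.sdk.transforms.join) is an integer tag plus an
    // untyped Object payload, so its TypeDescriptor says nothing about how to encode
    // the payload; UnionCoder needs the component coders, indexed by tag.
    UnionCoder unionCoder =
        UnionCoder.of(Arrays.<Coder<?>>asList(
            StringUtf8Coder.of(),   // tag 0: a String input to the CoGBK
            VarLongCoder.of()));    // tag 1: a Long input to the CoGBK
    RawUnionValue value = new RawUnionValue(0, "some element");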

For example, in those cases you could encode in your DoFn so the type descriptor would just be byte[].

As well as being an extremely cumbersome API, this would incur the
cost of coding/decoding at that DoFn boundary even if it is fused
away.

On Thu, Jul 27, 2017 at 1:34 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi,

That's an interesting thread, and I have been wondering about the relationship between type descriptor and coder for a while ;)

Today, in a PCollection, we can set the coder and we also have a getTypeDescriptor(). It sounds weird to me: it should be one or the other.

Basically, if the Coder is not used to define the type, then I fully agree with Eugene.

Basically, the PCollection should define only the type descriptor, not the coder by itself: the coder can be found using the type descriptor.

With both a coder and a type descriptor on the PCollection, it sounds a bit "decoupled" to me, and it would be possible to have a coder on the PCollection that doesn't match the type descriptor.
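
To illustrate that decoupling, nothing structurally ties the two together today; with a raw-typed cast (deliberately abusive, and the placeholder PCollection is assumed to come from earlier in the pipeline) the coder and the type descriptor can silently disagree:

    PCollection<String> words = ...;  // placeholder: some PCollection<String> built earlier
    // This compiles with only a raw-type warning, yet the coder now contradicts
    // the type descriptor carried by the very same PCollection.
    words.setCoder((Coder) VarLongCoder.of());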

I think the PCollection type descriptor should be defined, and the coder should be implicit, based on this type descriptor.

Thoughts ?

Regards
JB


On 07/26/2017 05:25 AM, Eugene Kirpichov wrote:

Hello,

I've worked on a few different things recently and ran repeatedly into the same issue: we do not have clear guidance on who should set the Coder on a PCollection. Is it the responsibility of the PTransform that outputs it, or is it the responsibility of the user, or is it sometimes one and sometimes the other?

I believe that the answer is "it's the responsibility of the transform", and moreover that ideally PCollection.setCoder() should not exist. Instead:

- Require that all transforms set a Coder on the PCollections they produce - i.e. it should never be the responsibility of the user to "fix up" a coder on a PCollection produced by a transform.

- Since all transforms are composed of primitive transforms, saying "transforms must set a Coder" means simply that all *primitive* transforms must set a Coder on their output.

- In some cases, a primitive PTransform currently doesn't have enough information to infer a coder for its output collection - e.g. ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a coder for OutputT. In that case such transforms should allow the user to provide a coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs from requiring the user to set a coder on the resulting collection; see the sketch after this list].

- Corollary: composite transforms need to only configure their primitive transforms (and composite sub-transforms) properly, and give them a Coder if needed.

- Corollary: a PTransform with type parameters <FooT, BarT, ...> needs to be configurable with coders for all of these, because the implementation of the transform may change and it may introduce intermediate collections involving these types. However, in many cases, some of these type parameters appear in the type of the transform's input, e.g. a PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will always be able to extract the coders for FooT and BarT from the input PCollection, so the user does not need to provide them. However, a coder for MooT must be provided. I think in most cases the transform needs to be configurable only with coders for its output.
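
As a concrete illustration of the ParDo point above, configuring the transform might read as follows. withOutputCoder is the proposed method and does not exist in the SDK today; lines, ParseFooFn, Foo, and FooCoder are placeholders:

    // Proposed style: the transform owns the output coder...
    PCollection<Foo> foos =
        lines.apply(
            ParDo.of(new ParseFooFn())             // a DoFn<String, Foo> (placeholder)
                .withOutputCoder(FooCoder.of()));  // proposed API, not yet in the SDK

    // ...instead of today's pattern of fixing up the result afterwards:
    PCollection<Foo> fixedUp = lines.apply(ParDo.of(new ParseFooFn()));
    fixedUp.setCoder(FooCoder.of());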

Here's a smooth migration path to accomplish the above:
- Make PCollection.createPrimitiveOutputInternal() take a Coder (see the sketch after this list).
- Make all primitive transforms optionally configurable with a coder for their outputs, such as ParDo.of(DoFn).withOutputCoder().
- By using the above, make all composite transforms shipped with the SDK set a Coder on the collections they produce; in some cases, this will require adding a withSomethingCoder() option to the transform and propagating that coder to its sub-transforms. If the option is unset, that's fine for now.
- As a result of the above, get rid of all setCoder() calls in the Beam repo. The call will still be there, but it will just not be used anywhere in the SDK or examples, and we can mark it deprecated.
- Add guidance to the PTransform Style Guide in line with the above.
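
For the first step, the change might look roughly like the sketch below; the existing parameters are recalled from memory and the trailing Coder is the proposed addition, so treat the signature as approximate:

    // Proposed: a primitive output carries its Coder from the moment it is created,
    // so no PCollection ever exists in a coder-less state waiting for setCoder().
    PCollection<OutputT> output =
        PCollection.createPrimitiveOutputInternal(
            pipeline,
            windowingStrategy,
            isBounded,
            outputCoder);  // the new, proposed parameter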

Does this sound like a good idea? I'm not sure how urgent it would be to actually do this, but I'd like to know whether people agree that this is a good goal in general.


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
