Ok,
Thanks Kenn.
The Flatten javadoc says that by default the coder of the output should
be the coder of the first input. But in the test, the output coder is
set to something different. While waiting for a consensus on this model
point and a common implementation in the runners, I'll just exclude this
test as the other runners do.
Etienne
On 11/12/2019 04:46, Kenneth Knowles wrote:
It is a good point. Nullable(VarLong) and VarLong are two different
types, with a least upper bound that is Nullable(VarLong). BigEndianLong
and VarLong are two different types, with no least upper bound in the
"coders" type system. Yet we understand that the values they encode are
equal. I do not think the rules here are clearly formalized anywhere
(corollary: they have not been thought about carefully).
I think both possibilities are reasonable:
1. Make the rule that Flatten only accepts inputs with identical coders.
This will sometimes be annoying, requiring vacuous "re-encode" no-op
ParDos (they will be fused away on maybe all runners); see the sketch
after this list.
2. Define types as the domain of values, and Flatten accepts sets of
PCollections with the same domain of values. Runners must "do whatever
it takes" to respect the coders on the collection.
2a. For very simple cases, Flatten takes the least upper bound of the
input types. The output coder of Flatten has to be this least upper
bound. For example, a non-nullable output coder would be an error.
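For illustration, here is a minimal sketch of option 1 using the Java
SDK. The helper, the variable names, and the choice of
NullableCoder(VarLongCoder) as the common coder are mine, not from the
test: identity MapElements steps in front of Flatten pin both inputs to
the same coder.

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.NullableCoder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.TypeDescriptors;

class ReEncodeBeforeFlatten {
  // Pin both inputs to one common coder via vacuous identity transforms,
  // then flatten (option 1 above). The common coder is a hypothetical
  // choice wide enough for both element domains.
  static PCollection<Long> flattenWithCommonCoder(
      PCollection<Long> withNulls, PCollection<Long> varLongs) {
    Coder<Long> commonCoder = NullableCoder.of(VarLongCoder.of());

    PCollection<Long> left =
        withNulls.apply("ReEncodeLeft",
            MapElements.into(TypeDescriptors.longs()).via((Long x) -> x));
    left.setCoder(commonCoder);

    PCollection<Long> right =
        varLongs.apply("ReEncodeRight",
            MapElements.into(TypeDescriptors.longs()).via((Long x) -> x));
    right.setCoder(commonCoder);

    // Flatten now sees two inputs carrying identical coders.
    return PCollectionList.of(left).and(right).apply(Flatten.pCollections());
  }
}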
Very interesting and nuanced problem. Flatten just became quite an
interesting transform, for me :-)
Kenn
On Tue, Dec 10, 2019 at 12:37 AM Etienne Chauchot
<echauc...@apache.org> wrote:
Hi all,
I have a question about the testFlattenMultipleCoders test:
This test uses 2 collections:
1. long and null data encoded using NullableCoder(BigEndianLongCoder)
2. long data encoded using VarLongCoder
It then flattens the 2 collections and sets the coder of the resulting
collection to NullableCoder(VarLongCoder); a rough sketch follows.
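For reference, here is my approximation of that setup in the Java SDK,
reconstructed from the description above (not the actual
testFlattenMultipleCoders code):

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.BigEndianLongCoder;
import org.apache.beam.sdk.coders.NullableCoder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class FlattenMultipleCodersSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Collection 1: longs and nulls, NullableCoder(BigEndianLongCoder).
    PCollection<Long> withNulls =
        p.apply("CreateNullable",
            Create.of(Arrays.asList(1L, 2L, null))
                .withCoder(NullableCoder.of(BigEndianLongCoder.of())));

    // Collection 2: longs, VarLongCoder.
    PCollection<Long> varLongs =
        p.apply("CreateVarLong",
            Create.of(3L, 4L).withCoder(VarLongCoder.of()));

    // Flatten both, then set a third, different coder on the result.
    PCollection<Long> flattened =
        PCollectionList.of(withNulls).and(varLongs)
            .apply(Flatten.pCollections());
    flattened.setCoder(NullableCoder.of(VarLongCoder.of()));

    p.run().waitUntilFinish();
  }
}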
Most runners translate Flatten as a simple union of the 2 PCollections
without any re-encoding. As a result, most runners exclude this test
from the test set because of coder issues. For example, Flink raises an
exception in its Flatten translation if the element type of PCollection1
differs from that of PCollection2. Another example is the direct runner
and the Spark (RDD based) runner, which do not exclude this test simply
because they do not need to serialize the elements, so they never even
call the coders.
That means that an output PCollection of Flatten with heterogeneous
input coders is not really tested, so it is not really supported.
Should we drop this test case (which no runner executes) or should we
force each runner to re-encode?
Best
Etienne