Hi all,
I have a question about the testFlattenMultipleCoders test.
This test uses 2 collections:
1. long and null values encoded using NullableCoder(BigEndianLongCoder)
2. long values encoded using VarLongCoder
It then flattens the 2 collections and sets the coder of the resulting
collection to NullableCoder(VarLongCoder).
Most runners translate Flatten as a simple union of the 2 PCollections
without any re-encoding. As a result, the runners that serialize elements
exclude this test from their test set because of coder issues. For
example, Flink raises an exception in its Flatten translation if the
element type of PCollection1 differs from that of PCollection2. The
direct runner and the Spark (RDD-based) runner, on the other hand, do not
exclude this test, but only because they do not need to serialize
elements, so they never call the coders.
That means that a Flatten output PCollection whose inputs have
heterogeneous coders is not really tested, so it is not really supported.
Should we drop this test case (which no runner actually exercises with
its coders) or should we force each runner to re-encode?
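
To make the mismatch concrete, here is a minimal stand-alone sketch. It does not use the Beam SDK: the class FlattenCoderSketch and its methods are hypothetical, and the byte layouts are assumptions that only model the three coders (a presence byte of 0 or 1 for the nullable wrapper, 8 big-endian bytes for the fixed-width long, 7-bit groups with a continuation bit for the variable-length long). It shows that bytes written with either input coder cannot be read back with the output coder unless each element is decoded and encoded again.

```java
import java.nio.ByteBuffer;
import java.io.ByteArrayOutputStream;

/** Stand-alone model of the three encodings; not the Beam implementation. */
final class FlattenCoderSketch {

  /** Models NullableCoder(BigEndianLongCoder): presence byte, then 8 big-endian bytes. */
  static byte[] encodeNullableBigEndian(Long value) {
    if (value == null) {
      return new byte[] {0};
    }
    // ByteBuffer writes in big-endian order by default.
    return ByteBuffer.allocate(9).put((byte) 1).putLong(value).array();
  }

  static Long decodeNullableBigEndian(byte[] encoded) {
    ByteBuffer in = ByteBuffer.wrap(encoded);
    if (in.get() == 0) {
      return null;
    }
    return in.getLong();
  }

  /** Models VarLongCoder: 7 bits per byte, low group first, high bit means "more bytes". */
  static byte[] encodeVarLong(long value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    do {
      int group = (int) (value & 0x7F);
      value >>>= 7;
      out.write(value != 0 ? (group | 0x80) : group);
    } while (value != 0);
    return out.toByteArray();
  }

  /** Models NullableCoder(VarLongCoder), the coder set on the flattened output. */
  static byte[] encodeNullableVarLong(Long value) {
    if (value == null) {
      return new byte[] {0};
    }
    byte[] body = encodeVarLong(value);
    byte[] result = new byte[body.length + 1];
    result[0] = 1;
    System.arraycopy(body, 0, result, 1, body.length);
    return result;
  }

  static Long decodeNullableVarLong(byte[] encoded) {
    int marker = encoded[0];
    if (marker == 0) {
      return null;
    }
    if (marker != 1) {
      throw new IllegalStateException("unexpected presence byte: " + marker);
    }
    long result = 0;
    int shift = 0;
    for (int i = 1; i < encoded.length; i++) {
      result |= (long) (encoded[i] & 0x7F) << shift;
      if ((encoded[i] & 0x80) == 0) {
        return result; // like a stream decoder, ignores any bytes that follow
      }
      shift += 7;
    }
    throw new IllegalStateException("truncated var-long");
  }

  public static void main(String[] args) {
    // Element from collection 1, read with the output coder without re-encoding.
    byte[] fromFirst = encodeNullableBigEndian(300L);
    System.out.println(decodeNullableVarLong(fromFirst)); // prints 0, not 300

    // Same element, re-encoded before the output coder reads it.
    byte[] reencoded = encodeNullableVarLong(decodeNullableBigEndian(fromFirst));
    System.out.println(decodeNullableVarLong(reencoded)); // prints 300

    // Element from collection 2 has no presence byte at all.
    try {
      decodeNullableVarLong(encodeVarLong(5L));
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

In this model, a plain union either silently corrupts values or fails outright, which is the behaviour a runner would have to avoid by re-encoding if the test were kept.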
Best
Etienne