Hi all,

I have a question about the testFlattenMultipleCoders test:

This test uses 2 collections:

1. long and null data encoded using NullableCoder(BigEndianLongCoder)

2. long data encoded using VarLongCoder

It then flattens the 2 collections and sets the coder of the resulting collection to NullableCoder(VarLongCoder).

Most runners translate flatten as a simple union of the 2 PCollections without any re-encoding. As a result, the runners that serialize elements exclude this test from their test set because of coder issues. For example, Flink raises an exception in the flatten translation if the element type of PCollection1 differs from that of PCollection2. Conversely, the direct runner and the Spark (RDD-based) runner do not exclude this test, but only because they do not need to serialize elements, so they never even call the coders.
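
To make the problem concrete, here is a rough standalone sketch of what goes wrong when the bytes of each input are read with the output coder. It does not use Beam: the methods are simplified stand-ins I wrote for illustration, assuming a one-byte null marker (0 = null, 1 = present), an 8-byte big-endian long, and a 7-bits-per-byte variable-length long. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

public class FlattenCoderSketch {

    // Stand-in for NullableCoder(BigEndianLongCoder): marker byte, then 8 big-endian bytes.
    public static byte[] encodeNullableBigEndian(Long value) {
        if (value == null) {
            return new byte[] {0};
        }
        return ByteBuffer.allocate(9).put((byte) 1).putLong(value).array();
    }

    public static Long decodeNullableBigEndian(byte[] data) {
        if (data[0] == 0) {
            return null;
        }
        return ByteBuffer.wrap(data, 1, 8).getLong();
    }

    // Stand-in for VarLongCoder: 7 bits per byte, lowest group first,
    // high bit set when more bytes follow.
    public static byte[] encodeVarLong(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        do {
            int group = (int) (value & 0x7F);
            value >>>= 7;
            out.write(value != 0 ? (group | 0x80) : group);
        } while (value != 0);
        return out.toByteArray();
    }

    // Reads a variable-length long starting at data[offset].
    public static long decodeVarLong(byte[] data, int offset) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (offset >= data.length) {
                throw new IllegalArgumentException("truncated varlong");
            }
            int b = data[offset++] & 0xFF;
            result |= ((long) (b & 0x7F)) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalArgumentException("varlong too long");
    }

    // Stand-in for NullableCoder(VarLongCoder), the coder set on the flatten output.
    public static byte[] encodeNullableVarLong(Long value) {
        if (value == null) {
            return new byte[] {0};
        }
        byte[] body = encodeVarLong(value);
        byte[] result = new byte[body.length + 1];
        result[0] = 1;
        System.arraycopy(body, 0, result, 1, body.length);
        return result;
    }

    public static Long decodeNullableVarLong(byte[] data) {
        int marker = data[0] & 0xFF;
        if (marker == 0) {
            return null;
        }
        if (marker != 1) {
            throw new IllegalArgumentException("unexpected null marker: " + marker);
        }
        return decodeVarLong(data, 1);
    }

    public static void main(String[] args) {
        byte[] fromInput1 = encodeNullableBigEndian(300L);
        byte[] fromInput2 = encodeVarLong(300L);

        // Plain union: input 1 bytes read with the output coder give a wrong value.
        System.out.println("input 1, no re-encoding: " + decodeNullableVarLong(fromInput1));

        // Plain union: input 2 bytes have no null marker, so decoding fails.
        try {
            decodeNullableVarLong(fromInput2);
        } catch (IllegalArgumentException e) {
            System.out.println("input 2, no re-encoding: " + e.getMessage());
        }

        // Re-encoding: decode with each input's own coder, encode with the output coder.
        byte[] re1 = encodeNullableVarLong(decodeNullableBigEndian(fromInput1));
        byte[] re2 = encodeNullableVarLong(decodeVarLong(fromInput2, 0));
        System.out.println("re-encoded: " + decodeNullableVarLong(re1)
                + ", " + decodeNullableVarLong(re2));
    }
}
```

Under these assumptions, the value 300 from the first input silently decodes to 0, and the same value from the second input fails on the null marker. Once each element is decoded with its input coder and re-encoded with the output coder, both decode to 300 again.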

That means that a flatten whose input PCollections have heterogeneous coders is not really tested, so it is not really supported.

Should we drop this test case (which no runner actually exercises), or should we force each runner to re-encode?

Best

Etienne


