Hi folks,

I'm using Flink 1.16.0, and I would like to read a Parquet file (attached) that
has the schema shown in [1].

I can read this file with Spark, but when I try to read it with Flink 1.16.0
(program attached) using the schema in [2], I get the IndexOutOfBoundsException
shown in [3].

My code and Parquet file are attached. Is this:

- the problem described in
  FLINK-28867 <https://issues.apache.org/jira/browse/FLINK-28867>,

- something new that deserves a separate Jira, or

- something wrong with my code?


[1]: Parquet Schema

root
|-- amount: decimal(38,9) (nullable = true)
|-- connectionAccountId: string (nullable = true)
|-- sourceEntity: struct (nullable = true)
|    |-- extendedProperties: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- key: string (nullable = true)
|    |    |    |-- value: string (nullable = true)
|    |-- sourceAccountId: string (nullable = true)
|    |-- sourceEntityId: string (nullable = true)
|    |-- sourceEntityType: string (nullable = true)
|    |-- sourceSystem: string (nullable = true)


[2]: Schema used in Flink:

    static RowType getSchema()
    {
        RowType elementType = RowType.of(
            new LogicalType[] {
                new VarCharType(VarCharType.MAX_LENGTH),
                new VarCharType(VarCharType.MAX_LENGTH)
            },
            new String[] {
                "key",
                "value"
            }
        );

        RowType element = RowType.of(
            new LogicalType[] { elementType },
            new String[] { "element" }
        );

        RowType sourceEntity = RowType.of(
            new LogicalType[] {
                new ArrayType(element),
                new VarCharType(),
                new VarCharType(),
                new VarCharType(),
                new VarCharType()
            },
            new String[] {
                "extendedProperties",
                "sourceAccountId",
                "sourceEntityId",
                "sourceEntityType",
                "sourceSystem"
            }
        );

        return RowType.of(
            new LogicalType[] {
                new DecimalType(),
                new VarCharType(),
                sourceEntity
            },
            new String[] {
                "amount",
                "connectionAccountId",
                "sourceEntity"
            }
        );
    }
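
For reference, here is the same schema written without the intermediate
"element" row wrapper. The "element" level in [1] is part of Parquet's
three-level list encoding, and I'm not sure whether Flink expects it to appear
in the logical type, so this is only a sketch of the alternative, not a
confirmed fix. (It also spells out the DECIMAL(38, 9) precision from [1]
explicitly, since DecimalType's no-arg default is a smaller precision.)

```java
import org.apache.flink.table.types.logical.ArrayType;
import org.apache.flink.table.types.logical.DecimalType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.RowType;
import org.apache.flink.table.types.logical.VarCharType;

class SchemaVariant
{
    // Variant of getSchema() in which extendedProperties is declared as
    // ARRAY<ROW<key, value>> directly, with no extra single-field "element"
    // row in between.
    static RowType getSchema()
    {
        RowType keyValue = RowType.of(
            new LogicalType[] {
                new VarCharType(VarCharType.MAX_LENGTH),
                new VarCharType(VarCharType.MAX_LENGTH)
            },
            new String[] { "key", "value" }
        );

        RowType sourceEntity = RowType.of(
            new LogicalType[] {
                new ArrayType(keyValue), // no intermediate "element" row
                new VarCharType(VarCharType.MAX_LENGTH),
                new VarCharType(VarCharType.MAX_LENGTH),
                new VarCharType(VarCharType.MAX_LENGTH),
                new VarCharType(VarCharType.MAX_LENGTH)
            },
            new String[] {
                "extendedProperties",
                "sourceAccountId",
                "sourceEntityId",
                "sourceEntityType",
                "sourceSystem"
            }
        );

        return RowType.of(
            new LogicalType[] {
                new DecimalType(38, 9), // matches decimal(38,9) in [1]
                new VarCharType(VarCharType.MAX_LENGTH),
                sourceEntity
            },
            new String[] { "amount", "connectionAccountId", "sourceEntity" }
        );
    }
}
```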

[3]: Execution exception:

2022/11/15 11:39:58.657 ERROR o.a.f.c.b.s.r.f.SplitFetcherManager - Received uncaught exception.
java.lang.RuntimeException: SplitFetcher thread 0 received unexpected exception while polling the records
    at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:150)
    ...
Caused by: java.lang.IndexOutOfBoundsException: Index 1 out of bounds for length 1
    at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
    at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
    at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
    at java.base/java.util.Objects.checkIndex(Objects.java:372)
    at java.base/java.util.ArrayList.get(ArrayList.java:459)
    at org.apache.parquet.schema.GroupType.getType(GroupType.java:216)
    at org.apache.flink.formats.parquet.vector.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:536)
    at org.apache.flink.formats.parquet.vector.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:533)
    at org.apache.flink.formats.parquet.vector.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:503)
    at org.apache.flink.formats.parquet.vector.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:533)
    at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createWritableVectors(ParquetVectorizedInputFormat.java:281)
    at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createReaderBatch(ParquetVectorizedInputFormat.java:270)
    at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createPoolOfBatches(ParquetVectorizedInputFormat.java:260)
    at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createReader(ParquetVectorizedInputFormat.java:143)
    at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createReader(ParquetVectorizedInputFormat.java:77)
    at org.apache.flink.connector.file.src.impl.FileSourceSplitReader.checkSplitOrStartNext(FileSourceSplitReader.java:112)
    at org.apache.flink.connector.file.src.impl.FileSourceSplitReader.fetch(FileSourceSplitReader.java:65)
    at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:58)
    at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:142)
    ... 6 common frames omitted

Thanks

Attachment: part-00121.parquet
Attachment: ReadParquetArray1.java
