Alex Sorokoumov created FLINK-35641:
---------------------------------------

             Summary: ParquetSchemaConverter should correctly handle field optionality
                 Key: FLINK-35641
                 URL: https://issues.apache.org/jira/browse/FLINK-35641
             Project: Flink
          Issue Type: Bug
          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
            Reporter: Alex Sorokoumov


At the moment, [ParquetSchemaConverter|https://github.com/apache/flink/blob/99d6fd3c68f46daf0397a35566414e1d19774c3d/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/utils/ParquetSchemaConverter.java#L64] marks all fields as optional. This is incorrect in general, and it is especially problematic for maps, because map keys then end up optional as well.

For example, [parquet-tools|https://pypi.org/project/parquet-tools/] fails on the Parquet file produced by [ParquetRowDataWriterTest#complexTypeTest|https://github.com/apache/flink/blob/99d6fd3c68f46daf0397a35566414e1d19774c3d/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/row/ParquetRowDataWriterTest.java#L140-L151]:

{noformat}
parquet-tools inspect /var/folders/sc/k3hr87fj4x169rdq9n107whw0000gp/T/junit14646865447948471989/3b328592-7315-48c6-8fa9-38da4048fb4e
Traceback (most recent call last):
  File "/Users/asorokoumov/.pyenv/versions/3.12.3/bin/parquet-tools", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/parquet_tools/cli.py", line 26, in main
    args.handler(args)
  File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/parquet_tools/commands/inspect.py", line 55, in _cli
    _execute_simple(
  File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/parquet_tools/commands/inspect.py", line 63, in _execute_simple
    pq_file: pq.ParquetFile = pq.ParquetFile(filename)
                              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 317, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1492, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Map keys must be annotated as required.
{noformat}

The correct behavior is to mark nullable fields as optional and all other fields as required.
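
The proposed rule could be sketched roughly as follows. This is not the actual ParquetSchemaConverter code and it does not depend on Flink or Parquet. The class OptionalitySketch, its Repetition enum, and the helpers repetitionFor and describeMap are hypothetical stand-ins, and the binary/int32 column types in the usage are arbitrary. In the real converter the flag would presumably come from LogicalType#isNullable() and the result would be an org.apache.parquet.schema.Type.Repetition value.

```java
import java.util.Locale;

public class OptionalitySketch {

    /** Hypothetical stand-in for Parquet's REQUIRED / OPTIONAL repetition levels. */
    public enum Repetition { REQUIRED, OPTIONAL }

    /** Nullable fields become OPTIONAL; every other field becomes REQUIRED. */
    public static Repetition repetitionFor(boolean nullable) {
        return nullable ? Repetition.OPTIONAL : Repetition.REQUIRED;
    }

    /**
     * Renders a map column in Parquet's textual schema notation.
     * The key is always emitted as required, whatever the nullability of the
     * map itself or of its values.
     */
    public static String describeMap(
            String name,
            boolean mapNullable,
            String keyType,
            String valueType,
            boolean valueNullable) {
        String mapRepetition = lower(repetitionFor(mapNullable));
        String valueRepetition = lower(repetitionFor(valueNullable));
        return mapRepetition + " group " + name + " (MAP) {\n"
                + "  repeated group key_value {\n"
                + "    required " + keyType + " key;\n"
                + "    " + valueRepetition + " " + valueType + " value;\n"
                + "  }\n"
                + "}";
    }

    private static String lower(Repetition repetition) {
        return repetition.name().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // A nullable map with string keys and nullable int values.
        System.out.println(describeMap("tags", true, "binary", "int32", true));
    }
}
```

With this rule the map key is never written as optional, which is the condition pyarrow checks when it reports "Map keys must be annotated as required."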