Alex Sorokoumov created FLINK-35641:
---------------------------------------

             Summary: ParquetSchemaConverter should correctly handle field optionality
                 Key: FLINK-35641
                 URL: https://issues.apache.org/jira/browse/FLINK-35641
             Project: Flink
          Issue Type: Bug
          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
            Reporter: Alex Sorokoumov


At the moment, 
[ParquetSchemaConverter|https://github.com/apache/flink/blob/99d6fd3c68f46daf0397a35566414e1d19774c3d/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/utils/ParquetSchemaConverter.java#L64]
 marks all fields as optional. This is incorrect in general and is especially 
problematic for maps, whose keys must be required. For example, 
[parquet-tools|https://pypi.org/project/parquet-tools/] fails on the Parquet 
file produced by 
[ParquetRowDataWriterTest#complexTypeTest|https://github.com/apache/flink/blob/99d6fd3c68f46daf0397a35566414e1d19774c3d/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/row/ParquetRowDataWriterTest.java#L140-L151]:

{noformat}
parquet-tools inspect /var/folders/sc/k3hr87fj4x169rdq9n107whw0000gp/T/junit14646865447948471989/3b328592-7315-48c6-8fa9-38da4048fb4e
Traceback (most recent call last):
  File "/Users/asorokoumov/.pyenv/versions/3.12.3/bin/parquet-tools", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/parquet_tools/cli.py", line 26, in main
    args.handler(args)
  File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/parquet_tools/commands/inspect.py", line 55, in _cli
    _execute_simple(
  File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/parquet_tools/commands/inspect.py", line 63, in _execute_simple
    pq_file: pq.ParquetFile = pq.ParquetFile(filename)
                              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 317, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1492, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Map keys must be annotated as required.
{noformat}

The correct behavior is to mark nullable fields as optional and all other 
fields as required.
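
As a rough illustration of the intended rule (not the actual patch), the sketch below derives a field's repetition from its nullability. The Repetition enum is a local stand-in for Parquet's Type.Repetition so that the example compiles without the Parquet or Flink jars. The class RepetitionSketch and the helpers repetitionFor and render are hypothetical names introduced only for this example.

```java
import java.util.Locale;

public class RepetitionSketch {

    // Local stand-in for Parquet's Type.Repetition (assumption: only the two
    // values relevant to this issue are modelled here).
    enum Repetition { REQUIRED, OPTIONAL }

    // Hypothetical helper: nullable fields become OPTIONAL, all others REQUIRED.
    static Repetition repetitionFor(boolean nullable) {
        return nullable ? Repetition.OPTIONAL : Repetition.REQUIRED;
    }

    // Hypothetical helper: render one primitive field in Parquet's
    // message-type notation, for example "required int32 id;".
    static String render(String parquetType, String name, boolean nullable) {
        String repetition = repetitionFor(nullable).name().toLowerCase(Locale.ROOT);
        return repetition + " " + parquetType + " " + name + ";";
    }

    public static void main(String[] args) {
        // A NOT NULL column and a nullable column, respectively.
        System.out.println(render("int32", "id", false));
        System.out.println(render("binary", "comment", true));
    }
}
```

In the real converter the nullable flag would presumably come from the Flink LogicalType of each field, and map keys would need to end up as required for readers such as pyarrow to accept the file.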



