[ 
https://issues.apache.org/jira/browse/FLINK-35641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tzu-Li (Gordon) Tai reassigned FLINK-35641:
-------------------------------------------

    Assignee: Alex Sorokoumov

> ParquetSchemaConverter should correctly handle field optionality
> ----------------------------------------------------------------
>
>                 Key: FLINK-35641
>                 URL: https://issues.apache.org/jira/browse/FLINK-35641
>             Project: Flink
>          Issue Type: Bug
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>            Reporter: Alex Sorokoumov
>            Assignee: Alex Sorokoumov
>            Priority: Major
>
> At the moment, 
> [ParquetSchemaConverter|https://github.com/apache/flink/blob/99d6fd3c68f46daf0397a35566414e1d19774c3d/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/utils/ParquetSchemaConverter.java#L64]
>  marks all fields as optional. This is incorrect in general and especially 
> problematic for maps, whose keys must be required. For example, 
> [parquet-tools|https://pypi.org/project/parquet-tools/] fails on the Parquet 
> file produced by 
> [ParquetRowDataWriterTest#complexTypeTest|https://github.com/apache/flink/blob/99d6fd3c68f46daf0397a35566414e1d19774c3d/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/row/ParquetRowDataWriterTest.java#L140-L151]:
> {noformat}
> parquet-tools inspect /var/folders/sc/k3hr87fj4x169rdq9n107whw0000gp/T/junit14646865447948471989/3b328592-7315-48c6-8fa9-38da4048fb4e
> Traceback (most recent call last):
>   File "/Users/asorokoumov/.pyenv/versions/3.12.3/bin/parquet-tools", line 8, in <module>
>     sys.exit(main())
>              ^^^^^^
>   File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/parquet_tools/cli.py", line 26, in main
>     args.handler(args)
>   File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/parquet_tools/commands/inspect.py", line 55, in _cli
>     _execute_simple(
>   File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/parquet_tools/commands/inspect.py", line 63, in _execute_simple
>     pq_file: pq.ParquetFile = pq.ParquetFile(filename)
>                               ^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 317, in __init__
>     self.reader.open(
>   File "pyarrow/_parquet.pyx", line 1492, in pyarrow._parquet.ParquetReader.open
>   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Map keys must be annotated as required.
> {noformat}
> [The correct thing to 
> do|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps] 
> is to mark nullable fields as optional and non-nullable fields as required.
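
The rule stated in the description can be sketched in a few lines. This is a hypothetical, dependency-free Python illustration, not Flink's ParquetSchemaConverter: the helper names `repetition` and `map_schema` are invented for the example, and the output is only schema text in Parquet's message notation. It assumes the map layout described in the linked LogicalTypes document, where the key is always required while the map itself and its values follow the nullability of the source type.

```python
def repetition(nullable: bool) -> str:
    """Return the Parquet repetition for a field with the given nullability."""
    return "optional" if nullable else "required"


def map_schema(name, key_type, value_type, map_nullable, value_nullable):
    """Render a MAP group as Parquet schema text (hypothetical helper)."""
    return "\n".join([
        f"{repetition(map_nullable)} group {name} (MAP) {{",
        "  repeated group key_value {",
        # Map keys are never nullable, so the key is always required.
        f"    required {key_type} key;",
        f"    {repetition(value_nullable)} {value_type} value;",
        "  }",
        "}",
    ])


if __name__ == "__main__":
    # A nullable map with non-nullable int32 values and binary keys.
    print(map_schema("tags", "binary", "int32",
                     map_nullable=True, value_nullable=False))
```

With this mapping, only fields that can actually hold nulls are emitted as optional, and the key field is never optional, which is the condition the pyarrow error in the traceback complains about.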



--
This message was sent by Atlassian Jira
(v8.20.10#820010)