[ https://issues.apache.org/jira/browse/ARROW-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-10514.
------------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 9649
[https://github.com/apache/arrow/pull/9649]

> [C++][Parquet] Data inconsistency in parquet-reader output modes
> ----------------------------------------------------------------
>
>                 Key: ARROW-10514
>                 URL: https://issues.apache.org/jira/browse/ARROW-10514
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Zosimova Zhanna
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>         Attachments: 0001-Make-the-column-name-the-same-for-both-output-format.patch
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> I tried reading the description of a Parquet
> [file|https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/data_parquet/nested_maps.snappy.parquet]
> with nested maps using the [parquet-reader
> tool|https://github.com/apache/arrow/blob/master/cpp/tools/parquet/parquet_reader.cc].
>
> This file has the following structure:
> {code:java}
> required group field_id=0 spark_schema {
>   optional group field_id=1 a (Map) {
>     repeated group field_id=2 key_value {
>       required binary field_id=3 key (String);
>       optional group field_id=4 value (Map) {
>         repeated group field_id=5 key_value {
>           required int32 field_id=6 key;
>           required boolean field_id=7 value;
>         }
>       }
>     }
>   }
>   required int32 field_id=8 b;
>   required double field_id=9 c;
> } {code}
> When I print it using DebugPrint, I see:
> {code:java}
> $ ./parquet-reader nested_maps.snappy.parquet --only-metadata
> <some text is omitted for the sake of readability>
> Column 0: a.key_value.key (BYTE_ARRAY/UTF8)
> Column 1: a.key_value.value.key_value.key (INT32)
> Column 2: a.key_value.value.key_value.value (BOOLEAN)
> Column 3: b (INT32)
> Column 4: c (DOUBLE)
> </some text is omitted for the sake of readability>{code}
> When I print it using JSONPrint, I see:
> {code:java}
> $ ./parquet-reader nested_maps.snappy.parquet --json
> <some text is omitted for the sake of readability>
> "Columns": [
>   { "Id": "0", "Name": "key", "PhysicalType": "BYTE_ARRAY", "ConvertedType": "UTF8", "LogicalType": {"Type": "String"} },
>   { "Id": "1", "Name": "key", "PhysicalType": "INT32", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
>   { "Id": "2", "Name": "value", "PhysicalType": "BOOLEAN", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
>   { "Id": "3", "Name": "b", "PhysicalType": "INT32", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
>   { "Id": "4", "Name": "c", "PhysicalType": "DOUBLE", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} }
> ]
> </some text is omitted for the sake of readability>{code}
> Column 0 and Column 1 have the same Name in the JSON output, which is very
> confusing. It would be more correct to output the full path of the column
> (key -> a.key_value.key), as DebugPrint already does.
>
> This issue can be corrected by changing a single line:
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/printer.cc#L218]
>
> The proposed patch is in the attachment.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
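
To illustrate why the two output modes disagreed, here is a minimal standalone sketch. It does not use the Parquet C++ API: LeafColumn, LeafName and DottedPath are hypothetical names invented for this example, and the two paths are copied from the DebugPrint output quoted above. Under those assumptions, it shows that printing only the last path component makes two distinct leaf columns look identical, while the dotted path keeps them apart.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a Parquet leaf column: only its schema path,
// one element per nesting level (not the real Parquet C++ descriptor type).
struct LeafColumn {
  std::vector<std::string> path;  // e.g. {"a", "key_value", "key"}
};

// Last path component only. This mirrors what the JSON output showed,
// and it is ambiguous for nested maps.
std::string LeafName(const LeafColumn& col) {
  return col.path.empty() ? std::string() : col.path.back();
}

// All path components joined with '.'. This mirrors what the plain-text
// output showed, and it is unique per leaf column.
std::string DottedPath(const LeafColumn& col) {
  std::string out;
  for (std::size_t i = 0; i < col.path.size(); ++i) {
    if (i > 0) out += '.';
    out += col.path[i];
  }
  return out;
}
```

For columns 0 and 1 of the file in the report, LeafName returns "key" for both, whereas DottedPath returns "a.key_value.key" and "a.key_value.value.key_value.key" respectively.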