[ 
https://issues.apache.org/jira/browse/SPARK-32639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated SPARK-32639:
-------------------------------
    Attachment: 000.snappy.parquet

> Support GroupType parquet mapkey field
> --------------------------------------
>
>                 Key: SPARK-32639
>                 URL: https://issues.apache.org/jira/browse/SPARK-32639
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0
>            Reporter: Chen Zhang
>            Priority: Major
>         Attachments: 000.snappy.parquet
>
>
> I have a parquet file, and the MessageType recorded in the file is:
> {code:java}
> message parquet_schema {
>   optional group value (MAP) {
>     repeated group key_value {
>       required group key {
>         optional binary first (UTF8);
>         optional binary middle (UTF8);
>         optional binary last (UTF8);
>       }
>       optional binary value (UTF8);
>     }
>   }
> }{code}
>  
> Reading the file with +spark.read.parquet("000.snappy.parquet")+ fails: 
> Spark throws an exception while converting the Parquet MessageType to a 
> Spark SQL StructType:
> {code:java}
> AssertionError(Map key type is expected to be a primitive type, but found...)
> {code}
>  
> Reading the file with an explicit schema returns the correct result:
> +spark.read.schema("value MAP<STRUCT<first:STRING, middle:STRING, last:STRING>, STRING>").parquet("000.snappy.parquet")+
> According to the Parquet format specification 
> (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), 
> the map key in the Parquet format does not need to be a primitive type.
>  
> Note: This Parquet file was not written by Spark. Spark writes additional 
> sparkSchema string information into the Parquet files it produces, and when 
> reading such a file it uses that sparkSchema information directly instead of 
> converting the Parquet MessageType to a Spark SQL StructType, so the problem 
> only shows up with files written by other tools.
> I will submit a PR later.
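
The explicit schema used in the workaround mirrors the Parquet MessageType one-to-one: the MAP group becomes MAP<key, value>, the group-typed key becomes a STRUCT, and each binary (UTF8) leaf becomes STRING. The sketch below only illustrates that correspondence. The toy schema representation and the helper name `to_ddl` are assumptions made for this example; this is not Spark's Parquet-to-Catalyst converter.

```python
# Toy model of the schema from the issue (an assumption for illustration):
#   "utf8"               -> a binary (UTF8) leaf
#   dict name -> node    -> a Parquet group (its field order is preserved)
#   ("map", key, value)  -> a group annotated with MAP

def to_ddl(node):
    """Render a toy schema node as a Spark SQL DDL type string."""
    if node == "utf8":
        return "STRING"
    if isinstance(node, dict):
        # A group maps to a STRUCT, including when it is used as a map key.
        fields = ", ".join(f"{name}:{to_ddl(child)}" for name, child in node.items())
        return f"STRUCT<{fields}>"
    if isinstance(node, tuple) and len(node) == 3 and node[0] == "map":
        _, key, value = node
        return f"MAP<{to_ddl(key)}, {to_ddl(value)}>"
    raise ValueError(f"unsupported node: {node!r}")

# The key group from the issue: first, middle, last.
name_key = {"first": "utf8", "middle": "utf8", "last": "utf8"}

# The top-level column is called "value" in the MessageType.
schema_ddl = "value " + to_ddl(("map", name_key, "utf8"))
print(schema_ddl)
```

The resulting string is the same one passed to spark.read.schema(...) in the workaround above, which is why reading with it bypasses the failing MessageType conversion.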



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
