[jira] [Commented] (SPARK-10434) Parquet compatibility with 1.4 is broken when writing arrays that may contain nulls

Cheng Lian (JIRA) Thu, 03 Sep 2015 22:05:17 -0700

    [ https://issues.apache.org/jira/browse/SPARK-10434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14730326#comment-14730326 ]


Cheng Lian commented on SPARK-10434:
------------------------------------

It's true that forwards-compatibility is generally not easy to guarantee, but 
in this specific case, compatibility is broken by an unfortunate typo rather 
than by any reasonable design decision.  However, this probably shouldn't block 
1.5.

cc [~rxin]

> Parquet compatibility with 1.4 is broken when writing arrays that may contain 
> nulls
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-10434
>                 URL: https://issues.apache.org/jira/browse/SPARK-10434
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>            Priority: Minor
>
> When writing arrays that may contain nulls, for example:
> {noformat}
> StructType(Seq(
>   StructField(
>     "f",
>     ArrayType(IntegerType, containsNull = true),
>     nullable = false)))
> {noformat}
> Spark 1.4 uses the following schema:
> {noformat}
> message m {
>   required group f (LIST) {
>     repeated group bag {
>       optional int32 array;
>     }
>   }
> }
> {noformat}
> This behavior is a hybrid of parquet-avro and parquet-hive: the 3-level 
> structure and repeated group name "bag" are borrowed from parquet-hive, while 
> the innermost element field name "array" is borrowed from parquet-avro.
> However, in Spark 1.5, I failed to notice the latter fact and used a schema 
> in pure parquet-hive flavor, namely:
> {noformat}
> message m {
>   required group f (LIST) {
>     repeated group bag {
>       optional int32 array_element;
>     }
>   }
> }
> {noformat}
> One direct consequence is that Parquet files containing such array fields 
> written by Spark 1.5 can't be read by Spark 1.4 (all array elements become 
> null).
> To fix this issue, the name of the innermost field should be changed back to 
> "array".  Notice that this fix doesn't affect interoperability with Hive 
> (saving Parquet files using {{saveAsTable()}} and then reading them using Hive).
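
As a rough illustration of why a one-word rename of the innermost field can
turn every element into null, here is a toy Scala model. It is not Spark or
Parquet code. It assumes the reader resolves the element column purely by its
dotted path and treats a column it cannot find as all nulls; that mechanism is
an assumption made for this sketch, and the names ArrayFieldNameSketch,
ToyFile, write and read are hypothetical.

```scala
// Toy model only: no Spark or Parquet APIs are used here.
object ArrayFieldNameSketch {
  // Leaf column paths implied by the two schemas in the issue description.
  val spark14LeafPath: String = "f.bag.array"
  val spark15LeafPath: String = "f.bag.array_element"

  // Hypothetical "file": maps a leaf column path to the stored element values.
  type ToyFile = Map[String, Seq[Option[Int]]]

  // Store the element values under the leaf path chosen by the writer.
  def write(leafPath: String, values: Seq[Option[Int]]): ToyFile =
    Map(leafPath -> values)

  // Assumed reader behaviour: look the column up by name and, if it is
  // missing, return `count` nulls (None) instead of failing.
  def read(file: ToyFile, requestedPath: String, count: Int): Seq[Option[Int]] =
    file.getOrElse(requestedPath, Seq.fill(count)(Option.empty[Int]))
}
```

In this model, a file written under spark15LeafPath and read back with
spark14LeafPath yields only None values, while a file written under
spark14LeafPath round-trips unchanged, which mirrors the symptom described in
the issue.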



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

