[jira] [Commented] (SPARK-3037) Add ArrayType containing null value support to Parquet.

Michael Armbrust (JIRA) Wed, 20 Aug 2014 13:52:06 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104550#comment-14104550
 ]


Michael Armbrust commented on SPARK-3037:
-----------------------------------------

[~ueshin], thanks for investigating these issues.  Here are my thoughts:

 - As far as I understand, this does not introduce a breaking change: newer 
versions of Spark SQL will still be able to read data written by older 
versions. It does, however, add new functionality, so older versions of Spark 
SQL will not be able to read data written by newer versions.  Please correct 
me if I'm wrong here.
 - Hive compatibility is important, but I think it's also important for us to 
do the 'right' thing and push Hive to fix their implementation if necessary.  
Also, since Hive cannot read data written by us now, we are not introducing a 
regression.

Based on this, here is what I think is the most reasonable thing to do.  Update 
our code to use the "bag" approach when containsNull is true and just use 
repeated fields when containsNull is false.  Create a JIRA with Hive asking 
them to support reading data in both formats.  What do you think?
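
To make the proposal concrete, here is a minimal sketch of the schema selection, written in Python rather than in Spark's own code. The function names are hypothetical, and the non-null layout (a bare repeated field named "array") is an assumption; only the "bag" layout is taken from the issue description.

```python
def array_field_schema(name: str, element_type: str, contains_null: bool) -> str:
    """Return Parquet message-type text for a single array column.

    contains_null=True  -> "bag" layout from the issue description.
    contains_null=False -> bare repeated field (the field name "array"
                           is an assumption made for this sketch).
    """
    if contains_null:
        # Each element is wrapped in a group so that it can be optional (null).
        body = [
            "repeated group bag {",
            f"  optional {element_type} array_element;",
            "}",
        ]
    else:
        # No nulls possible, so the elements can be stored as a repeated field.
        body = [f"repeated {element_type} array;"]

    lines = [f"optional group {name} (LIST) {{"]
    lines += ["  " + line for line in body]
    lines.append("}")
    return "\n".join(lines)


def message_schema(fields: list[str]) -> str:
    """Wrap already-rendered field schemas in a 'message root' block."""
    inner = "\n".join("  " + line for field in fields for line in field.splitlines())
    return "message root {\n" + inner + "\n}"


if __name__ == "__main__":
    print(message_schema([array_field_schema("a", "int32", contains_null=True)]))
    print(message_schema([array_field_schema("a", "int32", contains_null=False)]))
```

With contains_null=True the first call renders the same schema text as the one quoted in the issue description below; the second call shows the shorter layout that would be kept for arrays that cannot hold nulls.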

> Add ArrayType containing null value support to Parquet.
> -------------------------------------------------------
>
>                 Key: SPARK-3037
>                 URL: https://issues.apache.org/jira/browse/SPARK-3037
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Takuya Ueshin
>            Priority: Blocker
>
> Parquet support should handle {{ArrayType}} when {{containsNull}} is {{true}}.
> When {{containsNull}} is {{true}}, the schema should be as follows:
> {noformat}
> message root {
>   optional group a (LIST) {
>     repeated group bag {
>       optional int32 array_element;
>     }
>   }
> }
> {noformat}
> FYI:
> Hive's Parquet writer *always* uses this schema, and its reader can read 
> only this schema, i.e. the current Parquet support in Spark SQL is not 
> compatible with Hive.
> NOTICE:
> If Hive compatibility is the top priority, we also have to use this schema 
> regardless of {{containsNull}}, which will break backward compatibility.
> But using this schema could affect performance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
