[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679684#comment-14679684 ]

Cheng Lian commented on SPARK-9340:
-----------------------------------

[~damianguy] This is actually a Parquet complex type interoperability issue 
rather than a Spark issue alone. The root cause is that the parquet-format spec 
originally didn't specify exactly how LIST and MAP should be represented, so in 
the early days different Parquet libraries wrote LIST and MAP values in 
different formats. The unfortunate consequence is that Parquet files generated 
by different libraries/systems are not fully interoperable, and I believe the 
case you've hit is one of those broken interoperability cases.

From the parquet-format spec side, this issue has been addressed by PARQUET-113, 
which clearly specifies how LIST and MAP should be represented, and how 
different Parquet data models should deal with legacy data systematically via 
various backwards-compatibility rules.
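To make the ambiguity concrete, here is a sketch of two shapes the same logical 
list of int32 values can take on disk. The field names my_list and element are 
illustrative placeholders, and the legacy form shown is only one of several 
variants older writers produced:

```
// Standard (PARQUET-113) three-level representation of a list of int32s:
optional group my_list (LIST) {
  repeated group list {
    optional int32 element;
  }
}

// One legacy two-level form written by some older libraries:
optional group my_list (LIST) {
  repeated int32 element;
}
```

A reader that only understands one of these shapes will build the wrong read 
schema for files written in the other, which is exactly the kind of mismatch 
the backwards-compatibility rules are meant to resolve.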

From the Spark side, the new spec, together with all the backwards-compatibility 
rules, has been implemented in 1.5 via SPARK-6775, SPARK-6776, and SPARK-6777. 
So, would you please try branch-1.5 and see whether it fixes your problem? PR 
#8063, opened against the master branch, should have already fixed this issue.

> ParquetTypeConverter incorrectly handling of repeated types results in schema 
> mismatch
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-9340
>                 URL: https://issues.apache.org/jira/browse/SPARK-9340
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
>            Reporter: Damian Guy
>         Attachments: ParquetTypesConverterTest.scala
>
>
> The way ParquetTypesConverter handles primitive repeated types results in an 
> incompatible schema being used for querying data. For example, given a schema 
> like so:
> message root {
>   repeated int32 repeated_field;
> }
> Spark produces a read schema like:
> message root {
>   optional int32 repeated_field;
> }
> These are incompatible and all attempts to read fail.
> In ParquetTypesConverter.toDataType:
>   if (parquetType.isPrimitive) {
>     toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp)
>   } else {...}
> The if condition should also check 
> !parquetType.isRepetition(Repetition.REPEATED)
> and this case will then need to be handled in the else branch.
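The fix quoted above can be sketched as follows. This is a hedged illustration of 
the suggested guard, not Spark's actual patch: the surrounding names come from 
the quoted snippet, and the else branch is left elided as in the original:

```scala
// Sketch only: convert a primitive directly only when it is NOT repeated.
if (parquetType.isPrimitive &&
    !parquetType.isRepetition(Repetition.REPEATED)) {
  toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp)
} else {
  // A repeated primitive would presumably map to an ArrayType of the
  // corresponding element type here, alongside the existing group handling.
  {...}
}
```

Under this guard, a field like `repeated int32 repeated_field` falls into the 
else branch instead of being misread as a plain (optional) int32.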



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
