[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679719#comment-14679719 ]

Cheng Lian commented on SPARK-9340:
-----------------------------------

Would like to add that, from the perspective of the read path, Spark 1.5 and 
parquet-avro 1.7.0+ are two of the most standard Parquet data model 
implementations: both implement all the backwards-compatibility rules defined 
in the parquet-format spec, and are able to read legacy Parquet data generated 
by various systems. IIRC, no other library is capable of doing so for now.
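
For example, with Spark 1.5 the plain DataFrame reader is enough to pick up 
such legacy data. A minimal sketch, assuming a spark-shell session where 
sqlContext is predefined (the path is made up):

    // The read path resolves legacy LIST/MAP encodings written by older
    // libraries to the same Catalyst types as the standard encodings.
    val df = sqlContext.read.parquet("/tmp/legacy-data.parquet")
    df.printSchema()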

As for the write path, parquet-avro 1.8.1+ is the only library I know of that 
writes fully standard Parquet data according to the most recent parquet-format 
spec. Spark SQL is also refactoring its own Parquet write path in 
https://github.com/apache/spark/pull/7679, but it probably won't be part of 
Spark 1.5.
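
To illustrate, here is a minimal sketch of the parquet-avro 1.8.1 write path 
from Scala (the schema and path are made up, and whether you need the 
write-old-list-structure flag to get the standard LIST layout depends on the 
version defaults):

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.avro.{AvroParquetWriter, AvroWriteSupport}

    // Made-up Avro schema with an array field, the shape that exercises the
    // LIST compatibility rules discussed above.
    val schema = new Schema.Parser().parse(
      """{"type": "record", "name": "Example", "fields": [
        |  {"name": "values", "type": {"type": "array", "items": "int"}}
        |]}""".stripMargin)

    // Depending on parquet-avro defaults, the legacy two-level list layout
    // may still be written unless explicitly disabled.
    val conf = new Configuration()
    conf.setBoolean(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE, false)

    val writer = AvroParquetWriter.builder[GenericRecord](new Path("/tmp/example.parquet"))
      .withSchema(schema)
      .withConf(conf)
      .build()

    val record = new GenericData.Record(schema)
    record.put("values", java.util.Arrays.asList(1, 2, 3))
    writer.write(record)
    writer.close()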

In general, existing Parquet libraries fall into 3 categories:

# Libraries that don't write data in standard Parquet format, and don't 
implement backwards-compatibility rules

  Using these libraries, you'll probably hit issues similar to the one you 
suffered here when reading Parquet data with LIST and MAP fields generated by 
other systems (see the schema sketch after this list). parquet-protobuf 1.4.3 
and Spark SQL (<= 1.4) are both in this category. That's why they don't play 
well with each other.

# Libraries that don't write data in standard Parquet format, but implement 
backwards-compatibility rules

  Using these libraries, you'll be able to read legacy Parquet data generated 
by various systems. Although they don't write standard Parquet data, the 
formats they write are also covered by the backwards-compatibility rules, so 
it's still fine.

  Spark 1.5-SNAPSHOT and parquet-avro 1.7.0 are the only two that I know of 
in this category so far.

# Libraries that write standard Parquet data, and implement 
backwards-compatibility rules

  They are the best citizens among all Parquet libraries. Unfortunately, 
parquet-avro 1.8.1 is the only one that I know of so far. Hopefully Spark 1.6 
will join this category.
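
To make the difference concrete, here is a sketch of how the same int array 
may end up encoded on disk (message and field names are made up). A category 
1 writer such as parquet-protobuf 1.4.3 may emit a bare repeated field with 
no LIST annotation, which is exactly the shape reported below:

    message m {
      repeated int32 values;
    }

The legacy 2-level LIST structure (written by parquet-avro 1.7.0, for 
example) is non-standard, but it is covered by the backwards-compatibility 
rules, so category 2/3 readers handle it:

    message m {
      optional group values (LIST) {
        repeated int32 array;
      }
    }

The standard 3-level LIST structure from the current parquet-format spec 
(e.g. parquet-avro 1.8.1):

    message m {
      optional group values (LIST) {
        repeated group list {
          optional int32 element;
        }
      }
    }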


To summarize, here are some rules to be aware of if you want true Parquet 
interoperability:

- parquet-avro is usually the most standard Parquet data model out there, 
partly because some Parquet committers are also Avro committers (e.g. Ryan 
Blue).
- Always use libraries of category 3, or at least category 2, whenever 
possible.
- If you have to use libraries of category 1, don't use them to read Parquet 
data generated by other libraries (the sketch below shows how to check which 
layout a given file actually uses).
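
For that last check, here is a minimal sketch that dumps the schema stored in 
the file footer, assuming parquet-hadoop 1.7/1.8 on the classpath (the path 
is made up):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader

    // Read only the footer metadata and print the Parquet schema it records,
    // so you can tell standard LIST layouts from legacy encodings.
    val footer = ParquetFileReader.readFooter(new Configuration(),
      new Path("/tmp/example.parquet"))
    println(footer.getFileMetaData.getSchema)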

I'll leave this issue open for now until you confirm that 1.5 solves your 
problem. If it doesn't, let's investigate further. But I'm removing 1.5.0 from 
the "Affects Version/s" field.

> ParquetTypeConverter incorrectly handling of repeated types results in schema 
> mismatch
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-9340
>                 URL: https://issues.apache.org/jira/browse/SPARK-9340
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
>            Reporter: Damian Guy
>         Attachments: ParquetTypesConverterTest.scala
>
>
> The way ParquetTypesConverter handles primitive repeated types results in an 
> incompatible schema being used for querying data. For example, given a schema 
> like so:
> message root {
>   repeated int32 repeated_field;
> }
> Spark produces a read schema like:
> message root {
>   optional int32 repeated_field;
> }
> These are incompatible and all attempts to read fail.
> In ParquetTypesConverter.toDataType:
>   if (parquetType.isPrimitive) {
>     toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp)
>   } else { ... }
> The if condition should also include
>   !parquetType.isRepetition(Repetition.REPEATED)
> and then this case will need to be handled in the else branch.


