[ https://issues.apache.org/jira/browse/SPARK-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802938#comment-14802938 ]
Maximilian Michels commented on SPARK-10289:
--------------------------------------------

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8460

> A direct write API for testing Parquet compatibility
> -----------------------------------------------------
>
>                 Key: SPARK-10289
>                 URL: https://issues.apache.org/jira/browse/SPARK-10289
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL, Tests
>    Affects Versions: 1.5.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>             Fix For: 1.6.0
>
>
> Due to a set of unfortunate historical issues, it is relatively hard to
> achieve full interoperability among the various Parquet data models. Spark 1.5
> implemented all the backwards-compatibility rules defined in the parquet-format
> spec on the read path (SPARK-6774) to improve this. However, testing all those
> corner cases can be really challenging. Currently, we test Parquet
> compatibility/interoperability in two ways:
> # Generate Parquet files with other systems, bundle them into the Spark source
> tree as test resources, and write test cases against them to ensure that we
> interpret them correctly. parquet-thrift and parquet-protobuf compatibility is
> currently tested this way.
> #- Pros: Easy to write test cases, and easy to test against multiple versions
> of a given external system/library (by generating Parquet files with each of
> those versions)
> #- Cons: Hard to track how the test Parquet files were generated
> # Add external libraries as test dependencies and call their APIs directly to
> write Parquet files and verify them. parquet-avro compatibility is currently
> tested this way.
> #- Pros: Easy to track how the test Parquet files are generated
> #- Cons:
> ##- Often requires code generation (Avro/Thrift/ProtoBuf/...), which either
> complicates the build by adding build-time code generation or bloats the code
> base by checking in generated Java files. The former is especially annoying
> because Spark has two build systems and therefore needs two sets of
> code-generation plugins (e.g., for Avro we need both sbt-avro and
> avro-maven-plugin).
> ##- Can only test a single version of a given target library
> Inspired by the
> [{{writeDirect}}|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-avro/src/test/java/org/apache/parquet/avro/TestArrayCompatibility.java#L945-L972]
> method in the parquet-avro test code, a direct write API can be a good
> complement for testing Parquet compatibility. Ideally, this API should
> # make it easy to construct arbitrarily complex Parquet records
> # provide a DSL that reflects the nested nature of Parquet records
> This way, it would be both easy to track how the Parquet files are generated
> and easy to cover various versions of external libraries. However, test case
> authors must be careful when constructing test cases and must ensure that the
> constructed Parquet structures are identical to those generated by the target
> systems/libraries. We are probably not going to replace the two approaches
> above with this API; it is just a complement.
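To make the idea concrete, below is a minimal sketch of what such a direct write API could look like, built on parquet-mr's low-level WriteSupport and RecordConsumer interfaces. It is only an illustration under stated assumptions: the object name DirectParquetWriter, the writeDirect signature, and the file paths and schema used later are hypothetical and are not necessarily what the pull request above implements.

{code:scala}
import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetWriter
import org.apache.parquet.hadoop.api.WriteSupport
import org.apache.parquet.hadoop.api.WriteSupport.WriteContext
import org.apache.parquet.io.api.RecordConsumer
import org.apache.parquet.schema.MessageTypeParser

// Hypothetical helper, not the actual Spark test utility.
object DirectParquetWriter {

  // A "record" is just a function that emits itself through a RecordConsumer.
  type RecordBuilder = RecordConsumer => Unit

  // WriteSupport that hands the low-level RecordConsumer straight to the caller.
  private class DirectWriteSupport(schema: String) extends WriteSupport[RecordBuilder] {
    private var consumer: RecordConsumer = _

    override def init(conf: Configuration): WriteContext =
      new WriteContext(
        MessageTypeParser.parseMessageType(schema),
        Map.empty[String, String].asJava)

    override def prepareForWrite(recordConsumer: RecordConsumer): Unit =
      consumer = recordConsumer

    override def write(record: RecordBuilder): Unit = record(consumer)
  }

  // Writes the given records into a single Parquet file at `path`.
  def writeDirect(path: String, schema: String)(records: RecordBuilder*): Unit = {
    val writer =
      new ParquetWriter[RecordBuilder](new Path(path), new DirectWriteSupport(schema))
    try records.foreach(writer.write) finally writer.close()
  }
}
{code}

A hypothetical test case could then hand-build a record against a legacy two-level LIST layout of the kind that older Parquet writers produced, without depending on those libraries at build time:

{code:scala}
import org.apache.parquet.io.api.Binary

// Legacy two-level LIST layout: a repeated "array" field nested inside the group.
val schema =
  """message root {
    |  required int32 id;
    |  optional group tags (LIST) {
    |    repeated binary array (UTF8);
    |  }
    |}
  """.stripMargin

DirectParquetWriter.writeDirect("target/direct-write-test.parquet", schema) { rc =>
  rc.startMessage()

  rc.startField("id", 0)
  rc.addInteger(42)
  rc.endField("id", 0)

  rc.startField("tags", 1)
  rc.startGroup()
  rc.startField("array", 0)
  rc.addBinary(Binary.fromString("spark"))
  rc.addBinary(Binary.fromString("parquet"))
  rc.endField("array", 0)
  rc.endGroup()
  rc.endField("tags", 1)

  rc.endMessage()
}
{code}

Because each record is just a closure over the RecordConsumer, nesting in the test code mirrors nesting in the Parquet schema, and no code generation or external writer library is needed, which is the property the proposal above is after.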