Cheng Lian created SPARK-10289:
----------------------------------
Summary: A direct write API for testing Parquet compatibility
Key: SPARK-10289
URL: https://issues.apache.org/jira/browse/SPARK-10289
Project: Spark
Issue Type: Test
Components: SQL, Tests
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Due to a set of unfortunate historical issues, it's relatively hard to achieve
full interoperability among the various Parquet data models. Spark 1.5
implemented all backwards-compatibility rules defined in the parquet-format
spec on the read path (SPARK-6774) to improve this. However, testing all those
corner cases can be really challenging. Currently, we test Parquet
compatibility/interoperability in two ways:
# Generate Parquet files with other systems, bundle them into the Spark source
tree as testing resources, and write test cases against them to ensure that we
can interpret them correctly. Currently, parquet-thrift and parquet-protobuf
compatibility is tested this way.
#- Pros: Easy to write test cases, and easy to test against multiple versions
of a given external system/library (by generating Parquet files with those
versions)
#- Cons: Hard to track how the test Parquet files are generated
# Add external libraries as testing dependencies, and call their APIs directly
to write Parquet files and verify them. Currently, parquet-avro compatibility
is tested using this approach.
#- Pros: Easy to track how the test Parquet files are generated
#- Cons:
##- Often requires code generation (Avro/Thrift/ProtoBuf/...), which either
complicates the build system with build-time code generation or bloats the
code base by checking in generated Java files. The former is especially
annoying because Spark has two build systems, and thus requires two sets of
code generation plugins (e.g., for Avro, we need both sbt-avro and
avro-maven-plugin).
##- Can only test a single version of a given target library
Inspired by the
[{{writeDirect}}|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-avro/src/test/java/org/apache/parquet/avro/TestArrayCompatibility.java#L945-L972]
method in the parquet-avro testing code, a direct write API can be a good
complement for testing Parquet compatibility. Ideally, this API should
# make it easy to construct arbitrarily complex Parquet records, and
# provide a DSL that reflects the nested nature of Parquet records.
This way, it would be both easy to track how test Parquet files are generated
and easy to cover various versions of external libraries. However, test case
authors must be careful to ensure that the constructed Parquet structures are
identical to those generated by the target systems/libraries. We will probably
not replace the above two approaches with this API, but rather add it as a
complement.
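To make the idea concrete, here is a minimal, self-contained sketch of what such a direct write DSL could look like. The real parquet-mr {{RecordConsumer}} lives in {{org.apache.parquet.io.api}} and writes actual file data; the {{EventLoggingConsumer}} class below is a hypothetical stand-in that only records the call sequence, so the shape of the nested startMessage/startField/startGroup calls can be shown without Parquet dependencies. All class and method names here are illustrative, not the proposed API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for parquet-mr's RecordConsumer: instead of writing
// a Parquet file, it logs each write event so the call structure is visible.
class EventLoggingConsumer {
    private final List<String> events = new ArrayList<>();

    void startMessage() { events.add("startMessage"); }
    void endMessage() { events.add("endMessage"); }
    void startField(String name, int index) { events.add("startField(" + name + ", " + index + ")"); }
    void endField(String name, int index) { events.add("endField(" + name + ", " + index + ")"); }
    void startGroup() { events.add("startGroup"); }
    void endGroup() { events.add("endGroup"); }
    void addInteger(int v) { events.add("addInteger(" + v + ")"); }
    void addBinary(String v) { events.add("addBinary(" + v + ")"); }

    List<String> events() { return events; }
}

public class DirectWriteSketch {
    // Writes a single record conforming to this (illustrative) schema:
    //   message m {
    //     required int32 id;
    //     optional group tags { repeated binary array; }
    //   }
    // Note how the nested start/end calls mirror the nested schema, which is
    // exactly what lets a test author reproduce a specific legacy layout
    // (e.g., an old two-level list encoding) byte-structure for byte-structure.
    static void writeRecord(EventLoggingConsumer rc) {
        rc.startMessage();

        rc.startField("id", 0);
        rc.addInteger(42);
        rc.endField("id", 0);

        rc.startField("tags", 1);
        rc.startGroup();               // wrapper group for the list
        rc.startField("array", 0);     // repeated inner field
        rc.addBinary("a");
        rc.addBinary("b");
        rc.endField("array", 0);
        rc.endGroup();
        rc.endField("tags", 1);

        rc.endMessage();
    }

    public static void main(String[] args) {
        EventLoggingConsumer rc = new EventLoggingConsumer();
        writeRecord(rc);
        rc.events().forEach(System.out::println);
    }
}
```

In a real implementation, the same nested-call structure would be handed to a {{WriteSupport}} backed by an actual {{RecordConsumer}}, so the test author controls the exact physical structure of the file rather than going through a third-party object model.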
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]