[ https://issues.apache.org/jira/browse/BEAM-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907232#comment-16907232 ]
Ryan Skraba commented on BEAM-5164:
-----------------------------------

OK -- one huge complication :/ Relocating the parquet library *also* requires relocating Avro... ParquetIO uses an Avro method that is not API-compatible with the avro-1.7.7 delivered with Spark 2.2 and 2.3 (specifically, logical types in the Avro write support).

At the same time, pretty much anything using Avro with Beam SQL is also very likely to fail on avro-1.7.7 (and therefore on Spark 2.2 and 2.3) due to logical type conversions. Amusingly enough, *AvroIO* looks like it's fine on Spark 2.2 and 2.3.

How does this sound: instead of trying to satisfy everyone with Parquet shading, I'll create the equivalent doc for the workaround (similar to what [hcatalog already does|https://beam.apache.org/documentation/io/built-in/hcatalog/]) and link it from the ParquetIO javadoc and the built-in sources documentation. In any case, this documentation should exist for users running Spark 2.2 or 2.3 with Beam 2.12.0-2.15.0.

In parallel, I can start investigating what it would take to vendor/shade Avro overall and find/create a JIRA for that. What do you think?
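For readers following along, the relocation being discussed could look roughly like the Maven Shade Plugin fragment below, applied to a user's fat jar so the cluster-provided avro-1.7.7 no longer shadows the versions Beam's ParquetIO needs. This is an illustrative sketch only: the {{shaded.}} prefix and the choice to relocate both packages are assumptions, not Beam's actual build configuration.

```xml
<!-- Sketch (assumptions: relocation prefix "shaded." and relocating both
     org.apache.parquet and org.apache.avro in the user's uber jar). -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.parquet</pattern>
            <shadedPattern>shaded.org.apache.parquet</shadedPattern>
          </relocation>
          <!-- As noted above, relocating parquet alone is not enough:
               the bundled avro-1.7.7 must also be kept out of the way. -->
          <relocation>
            <pattern>org.apache.avro</pattern>
            <shadedPattern>shaded.org.apache.avro</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```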
> ParquetIOIT fails on Spark and Flink
> ------------------------------------
>
>                 Key: BEAM-5164
>                 URL: https://issues.apache.org/jira/browse/BEAM-5164
>             Project: Beam
>          Issue Type: Bug
>          Components: testing
>            Reporter: Lukasz Gajowy
>            Priority: Minor
>
> When run on a Spark or Flink remote cluster, ParquetIOIT fails with the following stacktrace:
>
> {code:java}
> org.apache.beam.sdk.io.parquet.ParquetIOIT > writeThenReadAll FAILED
>     org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NoSuchMethodError: org.apache.parquet.hadoop.ParquetWriter$Builder.<init>(Lorg/apache/parquet/io/OutputFile;)V
>         at org.apache.beam.runners.spark.SparkPipelineResult.beamExceptionFrom(SparkPipelineResult.java:66)
>         at org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:99)
>         at org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:87)
>         at org.apache.beam.runners.spark.TestSparkRunner.run(TestSparkRunner.java:116)
>         at org.apache.beam.runners.spark.TestSparkRunner.run(TestSparkRunner.java:61)
>         at org.apache.beam.sdk.Pipeline.run(Pipeline.java:313)
>         at org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:350)
>         at org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:331)
>         at org.apache.beam.sdk.io.parquet.ParquetIOIT.writeThenReadAll(ParquetIOIT.java:133)
>
>         Caused by:
>         java.lang.NoSuchMethodError: org.apache.parquet.hadoop.ParquetWriter$Builder.<init>(Lorg/apache/parquet/io/OutputFile;)V
> {code}
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.14#76016)
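The {{NoSuchMethodError}} above means the Parquet on the cluster's classpath predates the {{OutputFile}}-based builder constructor (introduced around parquet-mr 1.10) that Beam compiles against. A small reflection probe, sketched below, can confirm which variant a given runner classpath actually provides; this is a diagnostic illustration, not part of Beam, and the class names are taken from the stacktrace. On a classpath with no parquet-hadoop at all it reports the class itself as missing.

```java
// Diagnostic sketch: probe whether the classpath's ParquetWriter.Builder
// has the constructor Beam's ParquetIO expects. Class names come from the
// NoSuchMethodError in the stacktrace; nothing here requires Parquet to be
// present -- a missing class is reported rather than thrown.
public class ParquetBuilderProbe {

    public static String probe() {
        try {
            Class<?> builder =
                Class.forName("org.apache.parquet.hadoop.ParquetWriter$Builder");
            Class<?> outputFile =
                Class.forName("org.apache.parquet.io.OutputFile");
            // Throws NoSuchMethodException when running against an older
            // parquet-mr that lacks Builder(OutputFile).
            builder.getDeclaredConstructor(outputFile);
            return "OK: ParquetWriter.Builder(OutputFile) present";
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return "MISSING: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(probe());
    }
}
```

Running this with the same classpath as the failing job (e.g. via {{spark-submit}}) shows immediately whether the cluster-provided Parquet is the culprit.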