[ https://issues.apache.org/jira/browse/BEAM-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906445#comment-16906445 ]
Luke Cwik commented on BEAM-5164:
---------------------------------

In this specific case I think we could shade the parquet library. Your reasoning listed above is correct: _(1) we should shade to prevent transitive dependency collisions in runners when necessary, but (2) don't shade systematically by default "just in case", and (3) once a dependency has reached a certain threshold, like the extremely common guava and grpc jars, vendor them for reuse._

The downside to shading/vendoring is that it makes it more difficult for users to force a dependency version change without having the Apache Beam folks perform a release; getting the shading/vendoring done correctly is also quite annoying and very error prone. Vendoring requires two releases (the vendored artifact, and then the core Beam projects that are updated to consume it) while shading only needs one, but vendoring is much easier to reason about and builds faster.

The best option is typically to try to get all parts aligned to use the same version, but this is not always possible (such as when you are trying to use multiple versions of Spark and Spark itself is incompatible with the newer version of a library); then you're forced to shade/vendor.
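To illustrate what shading means in practice, here is a minimal sketch of a package relocation rule using the Gradle Shadow plugin. This is not Beam's actual build configuration; the plugin version, the parquet version, and the relocated package prefix are all assumptions for the example:

```groovy
// build.gradle -- hedged sketch of shading via relocation, not Beam's real build file.
plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '7.1.2'
}

dependencies {
    implementation 'org.apache.parquet:parquet-hadoop:1.10.0'
}

shadowJar {
    // Rewrite parquet's package names inside the fat jar so the bundled copy
    // cannot collide with whatever parquet version the runner (Spark/Flink)
    // already ships on its classpath.
    relocate 'org.apache.parquet', 'org.apache.beam.repackaged.parquet'
}
```

The alternative mentioned above, aligning everything on one version instead of shading, would correspond to forcing a single parquet version across all configurations (e.g. via Gradle's `resolutionStrategy.force`) rather than relocating packages.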
> ParquetIOIT fails on Spark and Flink
> ------------------------------------
>
>                 Key: BEAM-5164
>                 URL: https://issues.apache.org/jira/browse/BEAM-5164
>             Project: Beam
>          Issue Type: Bug
>          Components: testing
>            Reporter: Lukasz Gajowy
>            Priority: Minor
>
> When run on a Spark or Flink remote cluster, ParquetIOIT fails with the following stacktrace:
> {code:java}
> org.apache.beam.sdk.io.parquet.ParquetIOIT > writeThenReadAll FAILED
>     org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NoSuchMethodError: org.apache.parquet.hadoop.ParquetWriter$Builder.<init>(Lorg/apache/parquet/io/OutputFile;)V
>         at org.apache.beam.runners.spark.SparkPipelineResult.beamExceptionFrom(SparkPipelineResult.java:66)
>         at org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:99)
>         at org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:87)
>         at org.apache.beam.runners.spark.TestSparkRunner.run(TestSparkRunner.java:116)
>         at org.apache.beam.runners.spark.TestSparkRunner.run(TestSparkRunner.java:61)
>         at org.apache.beam.sdk.Pipeline.run(Pipeline.java:313)
>         at org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:350)
>         at org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:331)
>         at org.apache.beam.sdk.io.parquet.ParquetIOIT.writeThenReadAll(ParquetIOIT.java:133)
>
>         Caused by:
>         java.lang.NoSuchMethodError: org.apache.parquet.hadoop.ParquetWriter$Builder.<init>(Lorg/apache/parquet/io/OutputFile;)V
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)