[ 
https://issues.apache.org/jira/browse/BEAM-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aviem Zur updated BEAM-1145:
----------------------------
    Description: 
Shade plugin configured in spark runner's pom adds a classifier to spark runner 
shaded jar
{code:xml}
<shadedArtifactAttached>true</shadedArtifactAttached>
<shadedClassifierName>spark-app</shadedClassifierName>
{code}
This means, that in order for a user application that is dependent on 
spark-runner to work in cluster mode, they have to add the classifier in their 
dependency declaration, like so:
{code:xml}
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-spark</artifactId>
            <version>0.4.0-incubating-SNAPSHOT</version>
            <classifier>spark-app</classifier>
        </dependency>
{code}
Otherwise, if they do not specify classifier, the jar they get is unshaded, 
which in cluster mode, causes collisions between different guava versions.
Example exception in cluster mode when adding the dependency without classifier:
{code}
16/12/12 06:58:56 WARN TaskSetManager: Lost task 4.0 in stage 8.0 (TID 153, 
***********.com): java.lang.NoSuchMethodError: 
com.google.common.base.Stopwatch.createStarted()Lcom/google/common/base/Stopwatch;
        at 
org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:137)
        at 
org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:98)
        at 
org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:180)
        at 
org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:179)
        at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:57)
        at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
        at 
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:155)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
        at 
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:149)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}
I would suggest that the classifier be removed from the shaded jar, to avoid 
confusion among users, and have a better user experience.

P.S. Looks like Dataflow runner does not add a classifier to its shaded jar.

  was:
Shade plugin configured in spark runner's pom adds a classifier to spark runner 
shaded jar
{code:xml}
<shadedArtifactAttached>true</shadedArtifactAttached>
<shadedClassifierName>spark-app</shadedClassifierName>
{code}
This means, that in order for a user application that is dependent on 
spark-runner to work in cluster mode, they have to add the classifier in their 
dependency declaration, like so:
{code:xml}
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-spark</artifactId>
            <version>0.4.0-incubating-SNAPSHOT</version>
            <classifier>spark-app</classifier>
        </dependency>
{code}
Otherwise, if they do not specify classifier, the jar they get is unshaded, 
which in cluster mode, causes collisions between different guava versions.
Example exception in cluster mode when adding the dependency without classifier:
{code}
16/12/12 06:58:56 WARN TaskSetManager: Lost task 4.0 in stage 8.0 (TID 153, 
lvsriskng02.lvs.paypal.com): java.lang.NoSuchMethodError: 
com.google.common.base.Stopwatch.createStarted()Lcom/google/common/base/Stopwatch;
        at 
org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:137)
        at 
org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:98)
        at 
org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:180)
        at 
org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:179)
        at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:57)
        at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
        at 
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:155)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
        at 
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:149)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}
I would suggest that the classifier be removed from the shaded jar, to avoid 
confusion among users, and have a better user experience.

P.S. Looks like Dataflow runner does not add a classifier to its shaded jar.


> Remove classifier from shaded spark runner artifact
> ---------------------------------------------------
>
>                 Key: BEAM-1145
>                 URL: https://issues.apache.org/jira/browse/BEAM-1145
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-spark
>            Reporter: Aviem Zur
>            Assignee: Aviem Zur
>             Fix For: 0.5.0
>
>
> Shade plugin configured in spark runner's pom adds a classifier to spark 
> runner shaded jar
> {code:xml}
> <shadedArtifactAttached>true</shadedArtifactAttached>
> <shadedClassifierName>spark-app</shadedClassifierName>
> {code}
> This means, that in order for a user application that is dependent on 
> spark-runner to work in cluster mode, they have to add the classifier in 
> their dependency declaration, like so:
> {code:xml}
>         <dependency>
>             <groupId>org.apache.beam</groupId>
>             <artifactId>beam-runners-spark</artifactId>
>             <version>0.4.0-incubating-SNAPSHOT</version>
>             <classifier>spark-app</classifier>
>         </dependency>
> {code}
> Otherwise, if they do not specify classifier, the jar they get is unshaded, 
> which in cluster mode, causes collisions between different guava versions.
> Example exception in cluster mode when adding the dependency without 
> classifier:
> {code}
> 16/12/12 06:58:56 WARN TaskSetManager: Lost task 4.0 in stage 8.0 (TID 153, 
> ***********.com): java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.createStarted()Lcom/google/common/base/Stopwatch;
>       at 
> org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:137)
>       at 
> org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:98)
>       at 
> org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:180)
>       at 
> org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:179)
>       at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:57)
>       at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
>       at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>       at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>       at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>       at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:155)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>       at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
>       at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:149)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>       at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
>       at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>       at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>       at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>       at org.apache.spark.scheduler.Task.run(Task.scala:89)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> I would suggest that the classifier be removed from the shaded jar, to avoid 
> confusion among users, and have a better user experience.
> P.S. Looks like Dataflow runner does not add a classifier to its shaded jar.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to