Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

Well, it's always a good idea to use matched binary versions; here it is more acutely necessary. You can use a pre-built binary -- if you use it to compile and also to run. Why does it not make sense to publish artifacts? Not sure what you mean about core vs assembly, as the assembly contains all of the modules. You don't literally need the same jar file.

On Thu, Dec 18, 2014 at 3:20 AM, Sun, Rui rui@intel.com wrote:

Not using spark-submit. The app directly communicates with the Spark cluster in standalone mode. If you mark the Spark dependency as 'provided', then a spark-core jar from somewhere else must be pointed to on the CLASSPATH. However, the pre-built Spark binary only contains an assembly jar, not individual module jars, so you have no way to point to a module jar that is the same binary as the one inside the pre-built Spark binary. Maybe the Spark distribution should contain not only the assembly jar but also the individual module jars. Any opinion?

From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
Sent: Thursday, December 18, 2014 2:20 AM
To: Sean Owen
Cc: Sun, Rui; user@spark.apache.org
Subject: Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

Just to clarify, are you running the application using spark-submit after packaging with sbt package? One thing that might help is to mark the Spark dependency as 'provided', as then you shouldn't have the Spark classes in your jar.

Thanks
Shivaram

On Wed, Dec 17, 2014 at 4:39 AM, Sean Owen so...@cloudera.com wrote:

You should use the same binaries everywhere. The problem here is that anonymous functions (potentially) get compiled to different names when you build differently, so you actually have one function being called when another function is meant.

On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui rui@intel.com wrote:

Hi,

I encountered a weird bytecode incompatibility issue between the spark-core jar from the Maven repo and the official Spark pre-built binary.

Steps to reproduce:

1. Download the official pre-built Spark binary 1.1.1 at http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop1.tgz
2. Launch the Spark cluster in pseudo cluster mode
3. Run a small Scala app which calls RDD.saveAsObjectFile():

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.1.1"
    )

    val sc = new SparkContext(args(0), "test") // args(0) is the Spark master URI
    val rdd = sc.parallelize(List(1, 2, 3))
    rdd.saveAsObjectFile("/tmp/mysaoftmp")
    sc.stop()

This throws an exception as follows:

[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, ray-desktop.sh.intel.com): java.lang.ClassCastException: scala.Tuple2 cannot be cast to scala.collection.Iterator
[error]   org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
[error]   org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
[error]   org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
[error]   org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
[error]   org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
[error]   org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
[error]   org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
[error]   org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
[error]   org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
[error]   org.apache.spark.scheduler.Task.run(Task.scala:54)
[error]   org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
[error]   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
[error]   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[error]   java.lang.Thread.run(Thread.java:701)

After investigation, I found that this is caused by a bytecode incompatibility between RDD.class in spark-core_2.10-1.1.1.jar and RDD.class in the pre-built Spark assembly. This issue also happens with Spark 1.1.0.

Is there anything wrong in my usage of Spark? Or is there anything wrong in the process of deploying the Spark module jars to the Maven repo?
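The snippet above omits the imports and the enclosing object an sbt project needs, so here is a self-contained sketch of the reproduction app; the object name SaveAsObjectFileRepro is illustrative and not from the original post:

    import org.apache.spark.SparkContext

    object SaveAsObjectFileRepro {
      def main(args: Array[String]): Unit = {
        // args(0) is the Spark master URI, e.g. spark://host:7077
        val sc = new SparkContext(args(0), "test")
        val rdd = sc.parallelize(List(1, 2, 3))
        rdd.saveAsObjectFile("/tmp/mysaoftmp")
        sc.stop()
      }
    }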
Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

@Rui do you mean the spark-core jar in the Maven central repo is incompatible with the same version of the official pre-built Spark binary? That's really weird. I thought they should have been built from the same code.

Best Regards,
Shixiong Zhu

2014-12-18 17:22 GMT+08:00 Sean Owen so...@cloudera.com:

Well, it's always a good idea to use matched binary versions; here it is more acutely necessary. You can use a pre-built binary -- if you use it to compile and also to run. Why does it not make sense to publish artifacts? Not sure what you mean about core vs assembly, as the assembly contains all of the modules. You don't literally need the same jar file.

On Thu, Dec 18, 2014 at 3:20 AM, Sun, Rui rui@intel.com wrote:

[...]
Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

Have a look at https://issues.apache.org/jira/browse/SPARK-2075

It's not quite that the API is different, but indeed building different 'flavors' of the same version (hadoop1 vs hadoop2) can strangely lead to this problem, even though the public API is identical and in theory the API is completely separate from the backend bindings. IIRC the idea is that only submitting via spark-submit is really supported, because there you're definitely running exactly what's on your cluster. That should always work. This sort of gotcha turns up in some specific cases, but you can always work around it by matching your embedded Spark version as well.

On Thu, Dec 18, 2014 at 9:38 AM, Shixiong Zhu zsxw...@gmail.com wrote:

@Rui do you mean the spark-core jar in the Maven central repo is incompatible with the same version of the official pre-built Spark binary? That's really weird. I thought they should have been built from the same code.

Best Regards,
Shixiong Zhu

2014-12-18 17:22 GMT+08:00 Sean Owen so...@cloudera.com:

[...]
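One way to act on "matching your embedded Spark version" when the app cannot go through spark-submit is to make the compile-time dependencies mirror the cluster's build. A rough build.sbt sketch, assuming the cluster runs the hadoop1 pre-built binary from this thread; the hadoop-client version 1.2.1 is only an example and should match the cluster's actual Hadoop version:

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      // Same Spark version as the cluster; 'provided' keeps it out of any fat jar.
      "org.apache.spark" %% "spark-core" % "1.1.1" % "provided",
      // Pin the Hadoop client to the same flavor as the pre-built binary in use
      // (hadoop1 here); the exact version is an assumption.
      "org.apache.hadoop" % "hadoop-client" % "1.2.1" % "provided"
    )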
RE: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

Owen,

Since individual module jars are published to the central Maven repo for an official release, we need to make sure the official Spark assembly jar is assembled from exactly those jars, so that there is no binary compatibility issue. We could also publish the official assembly jar to Maven for convenience. I suspect there is some mistake in the release procedure for an official release.

Yes, you are correct: the assembly contains all of the modules :) But I am not sure, if the app wants to build itself as an assembly including the dependent modules, can it do so in such a case?

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, December 18, 2014 5:23 PM
To: Sun, Rui
Cc: shiva...@eecs.berkeley.edu; user@spark.apache.org
Subject: Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

Well, it's always a good idea to use matched binary versions; here it is more acutely necessary. You can use a pre-built binary -- if you use it to compile and also to run. Why does it not make sense to publish artifacts? Not sure what you mean about core vs assembly, as the assembly contains all of the modules. You don't literally need the same jar file.

On Thu, Dec 18, 2014 at 3:20 AM, Sun, Rui rui@intel.com wrote:

[...]
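On the question of whether the app can build itself as an assembly: with sbt this is typically done through a plugin such as sbt-assembly, and dependencies marked 'provided' (like Spark itself, when the cluster supplies it) are left out of the resulting fat jar by default. A rough sketch under those assumptions; the plugin version and setup keys below are examples and depend on the plugin release in use:

    // project/plugins.sbt (assumed; the version is only an example)
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    // build.sbt (sketch)
    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      // Provided: compiled against, but excluded from the assembly jar.
      "org.apache.spark" %% "spark-core" % "1.1.1" % "provided"
    )

    // With sbt-assembly 0.11.x the build also needs the plugin's settings
    // (import AssemblyKeys._; assemblySettings); running `sbt assembly` then
    // produces the app fat jar without the Spark classes.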
Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

You should use the same binaries everywhere. The problem here is that anonymous functions (potentially) get compiled to different names when you build differently, so you actually have one function being called when another function is meant.

On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui rui@intel.com wrote:

Hi,

I encountered a weird bytecode incompatibility issue between the spark-core jar from the Maven repo and the official Spark pre-built binary.

Steps to reproduce:

1. Download the official pre-built Spark binary 1.1.1 at http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop1.tgz
2. Launch the Spark cluster in pseudo cluster mode
3. Run a small Scala app which calls RDD.saveAsObjectFile():

[...]
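A quick way to see whether the driver and the executors are loading RDD from different binaries is to print where the class was loaded from on each side. This is a hedged diagnostic sketch, not something from the thread; the object name WhichSparkJar is illustrative:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object WhichSparkJar {
      // Returns the jar (or directory) a class was loaded from.
      def locationOf(cls: Class[_]): String =
        cls.getProtectionDomain.getCodeSource.getLocation.toString

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(args(0), "which-spark-jar")
        // Where the driver loaded RDD from (e.g. spark-core_2.10-1.1.1.jar).
        println("driver:   " + locationOf(classOf[RDD[_]]))
        // Where the executors loaded RDD from (e.g. the Spark assembly jar).
        val executorLocs = sc.parallelize(1 to sc.defaultParallelism)
          .map(_ => locationOf(classOf[RDD[_]]))
          .distinct()
          .collect()
        executorLocs.foreach(loc => println("executor: " + loc))
        sc.stop()
      }
    }

If the two locations differ, the driver and the cluster are running different Spark builds, which is exactly the mismatch described above.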
Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

Just to clarify, are you running the application using spark-submit after packaging with sbt package? One thing that might help is to mark the Spark dependency as 'provided', as then you shouldn't have the Spark classes in your jar.

Thanks
Shivaram

On Wed, Dec 17, 2014 at 4:39 AM, Sean Owen so...@cloudera.com wrote:

You should use the same binaries everywhere. The problem here is that anonymous functions (potentially) get compiled to different names when you build differently, so you actually have one function being called when another function is meant.

On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui rui@intel.com wrote:

[...]
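For reference, a minimal sketch of what marking the dependency as 'provided' looks like in build.sbt, using the versions from this thread; with this scope the Spark classes are available at compile time but are not pulled into a fat jar built with an assembly plugin:

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      // Compile against Spark, but rely on the cluster (or spark-submit)
      // to supply the Spark classes at runtime.
      "org.apache.spark" %% "spark-core" % "1.1.1" % "provided"
    )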
RE: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

Sean,

Yes, the problem is exactly the anonymous-function mismatch you described. So if a Spark app (driver) depends on a Spark module jar (for example spark-core) to programmatically communicate with a Spark cluster, the user should not use the pre-built Spark binary, but rather build Spark from source, publish the module jars into a local Maven repo, and then build the app against them to make sure the binaries are the same. It makes no sense to publish Spark module jars into the central Maven repo, because binary compatibility with a Spark cluster of the same version is not ensured. Is my understanding correct?

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday, December 17, 2014 8:39 PM
To: Sun, Rui
Cc: user@spark.apache.org
Subject: Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

You should use the same binaries everywhere. The problem here is that anonymous functions (potentially) get compiled to different names when you build differently, so you actually have one function being called when another function is meant.

On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui rui@intel.com wrote:

[...]
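One way to wire the app against a locally built Spark, as described above, is to resolve it from the local Maven repository. A sketch, assuming Spark was first built and installed locally from source (for example with `mvn -DskipTests install` in the Spark source tree, which is an assumption about the workflow rather than something stated in the thread):

    // build.sbt (sketch)
    scalaVersion := "2.10.4"

    // Look in the local Maven repository (~/.m2), where the source-built Spark
    // module jars were installed, so the app compiles against exactly those jars.
    resolvers += Resolver.mavenLocal

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.1.1" % "provided"
    )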
RE: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

Not using spark-submit. The app directly communicates with the Spark cluster in standalone mode.

If you mark the Spark dependency as 'provided', then a spark-core jar from somewhere else must be pointed to on the CLASSPATH. However, the pre-built Spark binary only contains an assembly jar, not individual module jars, so you have no way to point to a module jar that is the same binary as the one inside the pre-built Spark binary. Maybe the Spark distribution should contain not only the assembly jar but also the individual module jars. Any opinion?

From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
Sent: Thursday, December 18, 2014 2:20 AM
To: Sean Owen
Cc: Sun, Rui; user@spark.apache.org
Subject: Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

Just to clarify, are you running the application using spark-submit after packaging with sbt package? One thing that might help is to mark the Spark dependency as 'provided', as then you shouldn't have the Spark classes in your jar.

Thanks
Shivaram

On Wed, Dec 17, 2014 at 4:39 AM, Sean Owen so...@cloudera.com wrote:

You should use the same binaries everywhere. The problem here is that anonymous functions (potentially) get compiled to different names when you build differently, so you actually have one function being called when another function is meant.

On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui rui@intel.com wrote:

[...]
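For an app that talks to a standalone cluster directly (no spark-submit), one commonly used option is to ship the application's own jar through the SparkConf while letting the cluster's assembly provide the Spark classes. A hedged sketch; the object name and the jar path are illustrative, not from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    object DirectStandaloneApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster(args(0))      // e.g. spark://host:7077
          .setAppName("test")
          // Ship the application's own jar to the executors; the path below
          // is an example produced by sbt package for a hypothetical project.
          .setJars(Seq("target/scala-2.10/myapp_2.10-0.1.jar"))
        val sc = new SparkContext(conf)
        val rdd = sc.parallelize(List(1, 2, 3))
        rdd.saveAsObjectFile("/tmp/mysaoftmp")
        sc.stop()
      }
    }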