Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-18 Thread Sean Owen
Well, it's always a good idea to use matched binary versions. Here it
is more acutely necessary. You can use a pre-built binary -- if you
use it to compile and also run. Why does it not make sense to publish
artifacts?

Not sure what you mean about core vs assembly, as the assembly
contains all of the modules. You don't literally need the same jar
file.

On Thu, Dec 18, 2014 at 3:20 AM, Sun, Rui rui@intel.com wrote:
 Not using spark-submit. The App directly communicates with the Spark cluster
 in standalone mode.



 If I mark the Spark dependency as 'provided', then a spark-core jar
 elsewhere must be pointed to in the CLASSPATH. However, the pre-built Spark
 binary only has an assembly jar, not individual module jars. So you
 don't have a chance to point to a module jar that is the same binary as
 the one in the pre-built Spark binary.



 Maybe the Spark distribution should contain not only the assembly jar but
 also individual module jars. Any opinion?



 From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
 Sent: Thursday, December 18, 2014 2:20 AM
 To: Sean Owen
 Cc: Sun, Rui; user@spark.apache.org
 Subject: Re: weird bytecode incompatibility issue between spark-core jar
 from mvn repo and official spark prebuilt binary



 Just to clarify, are you running the application using spark-submit after
 packaging with sbt package ? One thing that might help is to mark the Spark
 dependency as 'provided' as then you shouldn't have the Spark classes in
 your jar.



 Thanks

 Shivaram



 On Wed, Dec 17, 2014 at 4:39 AM, Sean Owen so...@cloudera.com wrote:

 You should use the same binaries everywhere. The problem here is that
 anonymous functions get compiled to different names when you build
 differently (potentially), so you actually have one function being called
 when another function is meant.


 On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui rui@intel.com wrote:
 Hi,



 I encountered a weird bytecode incompatibility issue between the spark-core jar
 from the mvn repo and the official Spark pre-built binary.



 Steps to reproduce:

 1. Download the official pre-built Spark binary 1.1.1 at
 http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop1.tgz

 2. Launch the Spark cluster in pseudo cluster mode

 3. A small Scala app which calls RDD.saveAsObjectFile()

 scalaVersion := "2.10.4"

 libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-core" % "1.1.1"
 )

 val sc = new SparkContext(args(0), "test") // args(0) is the Spark master URI
 val rdd = sc.parallelize(List(1, 2, 3))
 rdd.saveAsObjectFile("/tmp/mysaoftmp")
 sc.stop



 throws an exception as follows:

 [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to
 stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure:
 Lost
 task 1.3 in stage 0.0 (TID 6, ray-desktop.sh.intel.com):
 java.lang.ClassCastException: scala.Tuple2 cannot be cast to
 scala.collection.Iterator

 [error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)

 [error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)

 [error]
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

 [error]
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 [error] org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

 [error]
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 [error]
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

 [error] org.apache.spark.scheduler.Task.run(Task.scala:54)

 [error]
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

 [error]

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)

 [error]

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

 [error] java.lang.Thread.run(Thread.java:701)



 After investigation, I found that this is caused by a bytecode incompatibility
 between RDD.class in spark-core_2.10-1.1.1.jar and the corresponding class in
 the pre-built Spark assembly.



 This issue also happens with spark 1.1.0.



 Is there anything wrong in my usage of Spark? Or anything wrong in the
 process of deploying Spark module jars to maven repo?






Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-18 Thread Shixiong Zhu
@Rui do you mean the spark-core jar in the maven central repo
is incompatible with the same version of the official pre-built Spark
binary? That's really weird. I thought they should have been built from the same code.

Best Regards,
Shixiong Zhu

2014-12-18 17:22 GMT+08:00 Sean Owen so...@cloudera.com:

 Well, it's always a good idea to use matched binary versions. Here it
 is more acutely necessary. You can use a pre-built binary -- if you
 use it to compile and also run. Why does it not make sense to publish
 artifacts?

 Not sure what you mean about core vs assembly, as the assembly
 contains all of the modules. You don't literally need the same jar
 file.

 On Thu, Dec 18, 2014 at 3:20 AM, Sun, Rui rui@intel.com wrote:
  Not using spark-submit. The App directly communicates with the Spark
 cluster
  in standalone mode.
 
 
 
  If I mark the Spark dependency as 'provided', then a spark-core jar
  elsewhere must be pointed to in the CLASSPATH. However, the pre-built Spark
  binary only has an assembly jar, not individual module jars. So you
  don't have a chance to point to a module jar that is the same binary as
  the one in the pre-built Spark binary.
 
 
 
  Maybe the Spark distribution should contain not only the assembly jar but
  also individual module jars. Any opinion?
 
 
 
  From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
  Sent: Thursday, December 18, 2014 2:20 AM
  To: Sean Owen
  Cc: Sun, Rui; user@spark.apache.org
  Subject: Re: weird bytecode incompatibility issue between spark-core jar
  from mvn repo and official spark prebuilt binary
 
 
 
  Just to clarify, are you running the application using spark-submit after
  packaging with sbt package ? One thing that might help is to mark the
 Spark
  dependency as 'provided' as then you shouldn't have the Spark classes in
  your jar.
 
 
 
  Thanks
 
  Shivaram
 
 
 
  On Wed, Dec 17, 2014 at 4:39 AM, Sean Owen so...@cloudera.com wrote:
 
  You should use the same binaries everywhere. The problem here is that
  anonymous functions get compiled to different names when you build
  differently (potentially), so you actually have one function being called
  when another function is meant.
 
 
  On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui rui@intel.com wrote:
  Hi,
 
 
 
  I encountered a weird bytecode incompatibility issue between the spark-core jar
  from the mvn repo and the official Spark pre-built binary.
 
 
 
  Steps to reproduce:
 
  1. Download the official pre-built Spark binary 1.1.1 at
  http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop1.tgz
 
  2. Launch the Spark cluster in pseudo cluster mode
 
  3. A small Scala app which calls RDD.saveAsObjectFile()

  scalaVersion := "2.10.4"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.1.1"
  )

  val sc = new SparkContext(args(0), "test") // args(0) is the Spark master URI
  val rdd = sc.parallelize(List(1, 2, 3))
  rdd.saveAsObjectFile("/tmp/mysaoftmp")
  sc.stop
 
 
 
  throws an exception as follows:
 
  [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to
  stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure:
  Lost
  task 1.3 in stage 0.0 (TID 6, ray-desktop.sh.intel.com):
  java.lang.ClassCastException: scala.Tuple2 cannot be cast to
  scala.collection.Iterator
 
  [error]
  org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 
  [error]
  org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 
  [error]
  org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 
  [error]
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 
  [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
  [error]
  org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 
  [error]
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 
  [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
  [error]
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 
  [error] org.apache.spark.scheduler.Task.run(Task.scala:54)
 
  [error]
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 
  [error]
 
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
 
  [error]
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 
  [error] java.lang.Thread.run(Thread.java:701)
 
 
 
  After investigation, I found that this is caused by a bytecode incompatibility
  between RDD.class in spark-core_2.10-1.1.1.jar and the corresponding class in
  the pre-built Spark assembly.
 
 
 
  This issue also happens with spark 1.1.0.
 
 
 
  Is there anything wrong in my usage of Spark? Or anything wrong in the
  process of deploying Spark module jars to maven repo?
 
 
 

Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-18 Thread Sean Owen
Have a look at https://issues.apache.org/jira/browse/SPARK-2075

It's not quite that the API is different, but indeed building
different 'flavors' of the same version (hadoop1 vs 2) can strangely
lead to this problem, even though the public API is identical and in
theory the API is completely separate from the backend bindings.

IIRC the idea is that only submitting via spark-submit is really
supported, because there you're definitely running exactly what's on
your cluster. That should always work.

This sort of gotcha turns up in some specific cases but you can always
work around it by matching your embedded Spark version as well.

On Thu, Dec 18, 2014 at 9:38 AM, Shixiong Zhu zsxw...@gmail.com wrote:
 @Rui do you mean the spark-core jar in the maven central repo is
 incompatible with the same version of the official pre-built Spark
 binary? That's really weird. I thought they should have been built from the same code.

 Best Regards,

 Shixiong Zhu

 2014-12-18 17:22 GMT+08:00 Sean Owen so...@cloudera.com:

 Well, it's always a good idea to use matched binary versions. Here it
 is more acutely necessary. You can use a pre-built binary -- if you
 use it to compile and also run. Why does it not make sense to publish
 artifacts?

 Not sure what you mean about core vs assembly, as the assembly
 contains all of the modules. You don't literally need the same jar
 file.

 On Thu, Dec 18, 2014 at 3:20 AM, Sun, Rui rui@intel.com wrote:
  Not using spark-submit. The App directly communicates with the Spark
  cluster
  in standalone mode.
 
 
 
  If I mark the Spark dependency as 'provided', then a spark-core jar
  elsewhere must be pointed to in the CLASSPATH. However, the pre-built Spark
  binary only has an assembly jar, not individual module jars. So you
  don't have a chance to point to a module jar that is the same binary as
  the one in the pre-built Spark binary.
 
 
 
  Maybe the Spark distribution should contain not only the assembly jar
  but
  also individual module jars. Any opinion?
 
 
 
  From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
  Sent: Thursday, December 18, 2014 2:20 AM
  To: Sean Owen
  Cc: Sun, Rui; user@spark.apache.org
  Subject: Re: weird bytecode incompatibility issue between spark-core jar
  from mvn repo and official spark prebuilt binary
 
 
 
  Just to clarify, are you running the application using spark-submit
  after
  packaging with sbt package ? One thing that might help is to mark the
  Spark
  dependency as 'provided' as then you shouldn't have the Spark classes in
  your jar.
 
 
 
  Thanks
 
  Shivaram
 
 
 
  On Wed, Dec 17, 2014 at 4:39 AM, Sean Owen so...@cloudera.com wrote:
 
  You should use the same binaries everywhere. The problem here is that
  anonymous functions get compiled to different names when you build
  differently (potentially), so you actually have one function being called
  when another function is meant.
 
 
  On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui rui@intel.com wrote:
  Hi,
 
 
 
  I encountered a weird bytecode incompatibility issue between the spark-core jar
  from the mvn repo and the official Spark pre-built binary.
 
 
 
  Steps to reproduce:
 
  1. Download the official pre-built Spark binary 1.1.1 at
  http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop1.tgz
 
  2. Launch the Spark cluster in pseudo cluster mode
 
  3. A small Scala app which calls RDD.saveAsObjectFile()

  scalaVersion := "2.10.4"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.1.1"
  )

  val sc = new SparkContext(args(0), "test") // args(0) is the Spark master URI
  val rdd = sc.parallelize(List(1, 2, 3))
  rdd.saveAsObjectFile("/tmp/mysaoftmp")
  sc.stop
 
 
 
  throws an exception as follows:
 
  [error] (run-main-0) org.apache.spark.SparkException: Job aborted due
  to
  stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure:
  Lost
  task 1.3 in stage 0.0 (TID 6, ray-desktop.sh.intel.com):
  java.lang.ClassCastException: scala.Tuple2 cannot be cast to
  scala.collection.Iterator
 
  [error]
  org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 
  [error]
  org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 
  [error]
 
  org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 
  [error]
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 
  [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
  [error]
  org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 
  [error]
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 
  [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
  [error]
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 
  [error] org.apache.spark.scheduler.Task.run(Task.scala:54)
 
  [error]
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 
  [error]
 
 
  java.util.concurrent.ThreadPoolExecutor.runWorker

RE: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-18 Thread Sun, Rui
Owen,

Since we have individual module jars published into the central maven repo for
an official release, we need to make sure the official Spark assembly jar is
assembled exactly from those jars, so that there is no binary compatibility
issue. We could also publish the official assembly jar to maven for
convenience. I suspect there is some mistake in the release procedure for the
official release.

Yes, you are correct: the assembly contains all of the modules :) But I am not
sure about one thing: if the app wants to build itself as an assembly that
includes its dependent modules, can it do so in that case?
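
For the last question, a rough sketch of the scenario, assuming the sbt-assembly
plugin (the plugin version and the example dependency below are assumptions on
my part, not something verified in this thread):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

// build.sbt
import AssemblyKeys._

assemblySettings

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // with "provided", spark-core stays out of the app assembly and comes from
  // the cluster at runtime; dropping "provided" would bundle the Maven
  // spark-core into the fat jar, which is exactly the mixed-binaries
  // situation discussed in this thread
  "org.apache.spark" %% "spark-core" % "1.1.1" % "provided",
  // hypothetical application dependency that does get bundled
  "com.example" %% "some-library" % "1.0"
)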

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Thursday, December 18, 2014 5:23 PM
To: Sun, Rui
Cc: shiva...@eecs.berkeley.edu; user@spark.apache.org
Subject: Re: weird bytecode incompatibility issue between spark-core jar from
mvn repo and official spark prebuilt binary

Well, it's always a good idea to use matched binary versions. Here it is more
acutely necessary. You can use a pre-built binary -- if you use it to compile
and also run. Why does it not make sense to publish artifacts?

Not sure what you mean about core vs assembly, as the assembly contains all of 
the modules. You don't literally need the same jar file.

On Thu, Dec 18, 2014 at 3:20 AM, Sun, Rui rui@intel.com wrote:
 Not using spark-submit. The App directly communicates with the Spark 
 cluster in standalone mode.



 If I mark the Spark dependency as 'provided', then a spark-core jar
 elsewhere must be pointed to in the CLASSPATH. However, the pre-built Spark
 binary only has an assembly jar, not individual module jars. So you
 don't have a chance to point to a module jar that is the same binary as
 the one in the pre-built Spark binary.



 Maybe the Spark distribution should contain not only the assembly jar 
 but also individual module jars. Any opinion?



 From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
 Sent: Thursday, December 18, 2014 2:20 AM
 To: Sean Owen
 Cc: Sun, Rui; user@spark.apache.org
 Subject: Re: weird bytecode incompatibility issue between spark-core
 jar from mvn repo and official spark prebuilt binary



 Just to clarify, are you running the application using spark-submit 
 after packaging with sbt package ? One thing that might help is to 
 mark the Spark dependency as 'provided' as then you shouldn't have the 
 Spark classes in your jar.



 Thanks

 Shivaram



 On Wed, Dec 17, 2014 at 4:39 AM, Sean Owen so...@cloudera.com wrote:

 You should use the same binaries everywhere. The problem here is that
 anonymous functions get compiled to different names when you build
 differently (potentially), so you actually have one function being called
 when another function is meant.


 On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui rui@intel.com wrote:
 Hi,



 I encountered a weird bytecode incompatibility issue between the spark-core jar
 from the mvn repo and the official Spark pre-built binary.



 Steps to reproduce:

 1. Download the official pre-built Spark binary 1.1.1 at
 http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop1.tgz

 2. Launch the Spark cluster in pseudo cluster mode

 3. A small Scala app which calls RDD.saveAsObjectFile()

 scalaVersion := "2.10.4"

 libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-core" % "1.1.1"
 )

 val sc = new SparkContext(args(0), "test") // args(0) is the Spark master URI
 val rdd = sc.parallelize(List(1, 2, 3))
 rdd.saveAsObjectFile("/tmp/mysaoftmp")
 sc.stop



 throws an exception as follows:

 [error] (run-main-0) org.apache.spark.SparkException: Job aborted due 
 to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure:
 Lost
 task 1.3 in stage 0.0 (TID 6, ray-desktop.sh.intel.com):
 java.lang.ClassCastException: scala.Tuple2 cannot be cast to 
 scala.collection.Iterator

 [error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)

 [error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)

 [error]
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:
 35)

 [error]
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 [error] org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

 [error]
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 [error]
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

 [error] org.apache.spark.scheduler.Task.run(Task.scala:54)

 [error]
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

 [error]

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
 java:1146)

 [error]

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
 .java:615)

 [error] java.lang.Thread.run(Thread.java:701)



 After investigation, I found

Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-17 Thread Sean Owen
You should use the same binaries everywhere. The problem here is that
anonymous functions get compiled to different names when you build
differently (potentially), so you actually have one function being called
when another function is meant.
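
To make that concrete, here is a hedged, self-contained sketch (not Rui's actual
app; the object name and master argument are made up). scalac 2.10 compiles each
anonymous function into a synthetic class whose name (something along the lines
of ClosureNames$$anonfun$...) is assigned per build, and Spark's own code is
compiled the same way (hence the RDD$$anonfun$13 frames in the stack trace
quoted below), so mixing a Maven spark-core jar with a differently built
assembly can end up invoking a mismatched function body:

import org.apache.spark.{SparkConf, SparkContext}

object ClosureNames {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("closure-names").setMaster(args(0))
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 3)
    // Each lambda below is compiled into its own synthetic $$anonfun class;
    // the numbering is a compiler detail, not a stable API, so binaries from
    // different builds may not line up.
    val doubled = rdd.map(_ * 2)
    println(doubled.collect().mkString(","))
    sc.stop()
  }
}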

On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui rui@intel.com wrote:
 Hi,



 I encountered a weird bytecode incompatibility issue between the spark-core jar
 from the mvn repo and the official Spark pre-built binary.



 Steps to reproduce:

 1. Download the official pre-built Spark binary 1.1.1 at
 http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop1.tgz

 2. Launch the Spark cluster in pseudo cluster mode

 3. A small Scala app which calls RDD.saveAsObjectFile()

 scalaVersion := "2.10.4"

 libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-core" % "1.1.1"
 )

 val sc = new SparkContext(args(0), "test") // args(0) is the Spark master URI
 val rdd = sc.parallelize(List(1, 2, 3))
 rdd.saveAsObjectFile("/tmp/mysaoftmp")
 sc.stop



 throws an exception as follows:

 [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to
 stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost
 task 1.3 in stage 0.0 (TID 6, ray-desktop.sh.intel.com):
 java.lang.ClassCastException: scala.Tuple2 cannot be cast to
 scala.collection.Iterator

 [error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)

 [error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)

 [error]
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

 [error]
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 [error] org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

 [error]
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 [error]
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

 [error] org.apache.spark.scheduler.Task.run(Task.scala:54)

 [error]
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

 [error]
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)

 [error]
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

 [error] java.lang.Thread.run(Thread.java:701)



 After investigation, I found that this is caused by a bytecode incompatibility
 between RDD.class in spark-core_2.10-1.1.1.jar and the corresponding class in
 the pre-built Spark assembly.



 This issue also happens with spark 1.1.0.



 Is there anything wrong in my usage of Spark? Or anything wrong in the
 process of deploying Spark module jars to maven repo?






Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-17 Thread Shivaram Venkataraman
Just to clarify, are you running the application using spark-submit after
packaging with sbt package ? One thing that might help is to mark the Spark
dependency as 'provided' as then you shouldn't have the Spark classes in
your jar.

Thanks
Shivaram
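
For reference, a minimal build.sbt sketch of that 'provided' setup (the Scala
and Spark version strings are the ones from Rui's report; everything else is
illustrative rather than a prescribed configuration):

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // "provided" keeps spark-core out of the jar produced by `sbt package`;
  // the Spark classes are then expected to come from the cluster at runtime
  "org.apache.spark" %% "spark-core" % "1.1.1" % "provided"
)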

On Wed, Dec 17, 2014 at 4:39 AM, Sean Owen so...@cloudera.com wrote:

 You should use the same binaries everywhere. The problem here is that
 anonymous functions get compiled to different names when you build
 differently (potentially), so you actually have one function being called
 when another function is meant.

 On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui rui@intel.com wrote:
  Hi,
 
 
 
  I encountered a weird bytecode incompatibility issue between the spark-core jar
  from the mvn repo and the official Spark pre-built binary.
 
 
 
  Steps to reproduce:
 
  1. Download the official pre-built Spark binary 1.1.1 at
  http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop1.tgz
 
  2. Launch the Spark cluster in pseudo cluster mode
 
  3. A small Scala app which calls RDD.saveAsObjectFile()

  scalaVersion := "2.10.4"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.1.1"
  )

  val sc = new SparkContext(args(0), "test") // args(0) is the Spark master URI
  val rdd = sc.parallelize(List(1, 2, 3))
  rdd.saveAsObjectFile("/tmp/mysaoftmp")
  sc.stop
 
 
 
  throws an exception as follows:
 
  [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to
  stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure:
 Lost
  task 1.3 in stage 0.0 (TID 6, ray-desktop.sh.intel.com):
  java.lang.ClassCastException: scala.Tuple2 cannot be cast to
  scala.collection.Iterator
 
  [error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 
  [error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 
  [error]
  org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 
  [error]
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 
  [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
  [error]
  org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 
  [error]
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 
  [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
  [error]
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 
  [error] org.apache.spark.scheduler.Task.run(Task.scala:54)
 
  [error]
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 
  [error]
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
 
  [error]
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 
  [error] java.lang.Thread.run(Thread.java:701)
 
 
 
  After investigation, I found that this is caused by a bytecode incompatibility
  between RDD.class in spark-core_2.10-1.1.1.jar and the corresponding class in
  the pre-built Spark assembly.
 
 
 
  This issue also happens with spark 1.1.0.
 
 
 
  Is there anything wrong in my usage of Spark? Or anything wrong in the
  process of deploying Spark module jars to maven repo?
 
 





RE: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-17 Thread Sun, Rui
Sean,

Yes, the problem is exactly the anonymous-function mismatch you described.

So if a Spark app (driver) depends on a Spark module jar (for example,
spark-core) to programmatically communicate with a Spark cluster, the user
should not use the pre-built Spark binary but should build Spark from source,
publish the module jars into the local maven repo, and then build the app
against them so that the binaries are the same. It makes no sense to publish
Spark module jars into the central maven repo because binary compatibility with
a Spark cluster of the same version is not ensured. Is my understanding correct?
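
For illustration, a minimal build.sbt fragment for that workflow, assuming Spark
1.1.1 has already been built from source on the same machine and its module jars
installed locally (for example with Maven's install goal, or sbt's publish-local,
as described in the Spark build docs); only the resolver and the 'provided'
scope are added here:

// only needed if the locally built jars were installed via Maven;
// artifacts published with sbt's publish-local are found by default
resolvers += Resolver.mavenLocal

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1" % "provided"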


-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Wednesday, December 17, 2014 8:39 PM
To: Sun, Rui
Cc: user@spark.apache.org
Subject: Re: weird bytecode incompatibility issue between spark-core jar from
mvn repo and official spark prebuilt binary

You should use the same binaries everywhere. The problem here is that anonymous
functions get compiled to different names when you build differently
(potentially), so you actually have one function being called when another
function is meant.

On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui rui@intel.com wrote:
 Hi,



 I encountered a weird bytecode incompatibility issue between the spark-core jar
 from the mvn repo and the official Spark pre-built binary.



 Steps to reproduce:

 1. Download the official pre-built Spark binary 1.1.1 at
 http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop1.tgz

 2. Launch the Spark cluster in pseudo cluster mode

 3. A small Scala app which calls RDD.saveAsObjectFile()

 scalaVersion := "2.10.4"

 libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-core" % "1.1.1"
 )

 val sc = new SparkContext(args(0), "test") // args(0) is the Spark master URI
 val rdd = sc.parallelize(List(1, 2, 3))
 rdd.saveAsObjectFile("/tmp/mysaoftmp")
 sc.stop



 throws an exception as follows:

 [error] (run-main-0) org.apache.spark.SparkException: Job aborted due 
 to stage failure: Task 1 in stage 0.0 failed 4 times, most recent 
 failure: Lost task 1.3 in stage 0.0 (TID 6, ray-desktop.sh.intel.com):
 java.lang.ClassCastException: scala.Tuple2 cannot be cast to 
 scala.collection.Iterator

 [error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)

 [error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)

 [error]
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:3
 5)

 [error]
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 [error] org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

 [error]
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 [error]
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

 [error] org.apache.spark.scheduler.Task.run(Task.scala:54)

 [error]
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

 [error]
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
 ava:1146)

 [error]
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
 java:615)

 [error] java.lang.Thread.run(Thread.java:701)



 After investigation, I found that this is caused by a bytecode incompatibility
 between RDD.class in spark-core_2.10-1.1.1.jar and the corresponding class in
 the pre-built Spark assembly.



 This issue also happens with spark 1.1.0.



 Is there anything wrong in my usage of Spark? Or anything wrong in the 
 process of deploying Spark module jars to maven repo?




RE: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-17 Thread Sun, Rui
Not using spark-submit. The App directly communicates with the Spark cluster in 
standalone mode.

If I mark the Spark dependency as 'provided', then a spark-core jar elsewhere
must be pointed to in the CLASSPATH. However, the pre-built Spark binary only
has an assembly jar, not individual module jars. So you don't have a chance to
point to a module jar that is the same binary as the one in the pre-built Spark
binary.

Maybe the Spark distribution should contain not only the assembly jar but also 
individual module jars. Any opinion?
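
If only the assembly jar is available, one possible workaround sketch (an
assumption on my part, not something confirmed in this thread) is to compile
against that very jar, so that the driver and the cluster share identical
bytecode; the unpack location and the exact assembly file name below are
guesses, so check the lib/ directory of your own distribution:

unmanagedJars in Compile += Attributed.blank(
  file("/opt/spark-1.1.1-bin-hadoop1/lib/spark-assembly-1.1.1-hadoop1.0.4.jar"))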

From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
Sent: Thursday, December 18, 2014 2:20 AM
To: Sean Owen
Cc: Sun, Rui; user@spark.apache.org
Subject: Re: weird bytecode incompatibility issue between spark-core jar from
mvn repo and official spark prebuilt binary

Just to clarify, are you running the application using spark-submit after 
packaging with sbt package ? One thing that might help is to mark the Spark 
dependency as 'provided' as then you shouldn't have the Spark classes in your 
jar.

Thanks
Shivaram

On Wed, Dec 17, 2014 at 4:39 AM, Sean Owen so...@cloudera.com wrote:
You should use the same binaries everywhere. The problem here is that
anonymous functions get compiled to different names when you build
differently (potentially), so you actually have one function being called
when another function is meant.

On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui rui@intel.com wrote:
 Hi,



 I encountered a weird bytecode incompatibility issue between the spark-core jar
 from the mvn repo and the official Spark pre-built binary.



 Steps to reproduce:

 1. Download the official pre-built Spark binary 1.1.1 at
 http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop1.tgz

 2. Launch the Spark cluster in pseudo cluster mode

 3. A small Scala app which calls RDD.saveAsObjectFile()

 scalaVersion := "2.10.4"

 libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-core" % "1.1.1"
 )

 val sc = new SparkContext(args(0), "test") // args(0) is the Spark master URI
 val rdd = sc.parallelize(List(1, 2, 3))
 rdd.saveAsObjectFile("/tmp/mysaoftmp")
 sc.stop



 throws an exception as follows:

 [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to
 stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost
 task 1.3 in stage 0.0 (TID 6, ray-desktop.sh.intel.com):
 java.lang.ClassCastException: scala.Tuple2 cannot be cast to
 scala.collection.Iterator

 [error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)

 [error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)

 [error]
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

 [error]
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 [error] org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

 [error]
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 [error]
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

 [error] org.apache.spark.scheduler.Task.run(Task.scala:54)

 [error]
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

 [error]
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)

 [error]
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

 [error] java.lang.Thread.run(Thread.java:701)



 After investigation, I found that this is caused by a bytecode incompatibility
 between RDD.class in spark-core_2.10-1.1.1.jar and the corresponding class in
 the pre-built Spark assembly.



 This issue also happens with spark 1.1.0.



 Is there anything wrong in my usage of Spark? Or anything wrong in the
 process of deploying Spark module jars to maven repo?

