[jira] [Commented] (SPARK-4796) Spark does not remove temp files

2015-08-11 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692692#comment-14692692
 ] 

Pat Ferrel commented on SPARK-4796:
---

Why is this marked resolved? Spark does indeed leave around a lot of files, and 
unless you are looking you'd never know. It sounds like the only safe way to 
remove them is to shut down Spark and delete them by hand.

I skimmed the issue, so sorry if I missed something. 15 GB on the MBP and counting 
:-)
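
For reference, a minimal sketch of the mitigation implied above, assuming a standalone 
Scala driver and an illustrative scratch path: pointing spark.local.dir at a dedicated 
directory makes the leftovers easy to find, and stopping the context explicitly gives 
Spark's shutdown hooks a chance to delete the spark-local-* block-manager directories 
(killed or crashed JVMs skip that step, which appears to be how the files accumulate).

{code}
import org.apache.spark.{SparkConf, SparkContext}

object TempDirExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("temp-dir-example")
      // Illustrative path: keep shuffle/spill files somewhere easy to audit and wipe,
      // instead of the default /tmp/spark-local-* directories.
      .set("spark.local.dir", "/var/tmp/spark-scratch")

    val sc = new SparkContext(conf)
    try {
      sc.parallelize(1 to 1000).map(_ * 2).count()
    } finally {
      // An explicit stop lets Spark's cleanup and shutdown hooks remove the
      // directories it created; a killed JVM never gets that far.
      sc.stop()
    }
  }
}
{code}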


> Spark does not remove temp files
> 
>
> Key: SPARK-4796
> URL: https://issues.apache.org/jira/browse/SPARK-4796
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.1.0
> Environment: I'm running Spark on Mesos and the Mesos slaves are Docker 
> containers. Spark 1.1.0, elasticsearch-spark 2.1.0-Beta3, Mesos 0.20.0, 
> Docker 1.2.0.
>Reporter: Ian Babrou
>
> I started a job that could not fit into memory and got "no space left on 
> device". That was fair, because the Docker containers only have 10 GB of disk 
> space and some is already taken by the OS.
> But then I found that when the job failed it didn't release any disk space and 
> left the container without any free disk space.
> Then I decided to check whether Spark removes temp files in any case, because 
> many Mesos slaves had /tmp/spark-local-* directories. Apparently some garbage 
> stays around after a Spark task is finished. I attached strace to the running job:
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/12/temp_8a73fcc2-4baa-499a-8add-0161f918de8a")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/31/temp_47efd04b-d427-4139-8f48-3d5d421e9be4")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/15/temp_619a46dc-40de-43f1-a844-4db146a607c6")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/05/temp_d97d90a7-8bc1-4742-ba9b-41d74ea73c36"
>  
> [pid 30212] <... unlink resumed> )  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/36/temp_a2deb806-714a-457a-90c8-5d9f3247a5d7")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/04/temp_afd558f1-2fd0-48d7-bc65-07b5f4455b22")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/32/temp_a7add910-8dc3-482c-baf5-09d5a187c62a"
>  
> [pid 30212] <... unlink resumed> )  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/21/temp_485612f0-527f-47b0-bb8b-6016f3b9ec19")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/12/temp_bb2b4e06-a9dd-408e-8395-f6c5f4e2d52f")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/1e/temp_825293c6-9d3b-4451-9cb8-91e2abe5a19d"
>  
> [pid 30212] <... unlink resumed> )  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/15/temp_43fbb94c-9163-4aa7-ab83-e7693b9f21fc")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/3d/temp_37f3629c-1b09-4907-b599-61b7df94b898"
>  
> [pid 30212] <... unlink resumed> )  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/35/temp_d18f49f6-1fb1-4c01-a694-0ee0a72294c0")
>  = 0
> And after the job is finished, some files are still there:
> /tmp/spark-local-20141209091330-48b5/
> /tmp/spark-local-20141209091330-48b5/11
> /tmp/spark-local-20141209091330-48b5/11/shuffle_0_1_4
> /tmp/spark-local-20141209091330-48b5/32
> /tmp/spark-local-20141209091330-48b5/04
> /tmp/spark-local-20141209091330-48b5/05
> /tmp/spark-local-20141209091330-48b5/0f
> /tmp/spark-local-20141209091330-48b5/0f/shuffle_0_1_2
> /tmp/spark-local-20141209091330-48b5/3d
> /tmp/spark-local-20141209091330-48b5/0e
> /tmp/spark-local-20141209091330-48b5/0e/shuffle_0_1_1
> /tmp/spark-local-20141209091330-48b5/15
> /tmp/spark-local-20141209091330-48b5/0d
> /tmp/spark-local-20141209091330-48b5/0d/shuffle_0_1_0
> /tmp/spark-local-20141209091330-48b5/36
> /tmp/spark-local-20141209091330-48b5/31
> /tmp/spark-local-20141209091330-48b5/12
> /tmp/spark-local-20141209091330-48b5/21
> /tmp/spark-local-20141209091330-48b5/10
> /tmp/spark-local-20141209091330-48b5/10/shuffle_0_1_3
> /tmp/spark-local-20141209091330-48b5/1e
> /tmp/spark-local-20141209091330-48b5/35
> If I look at my Mesos slaves, the leftovers are mostly "shuffle" files; the 
> overall picture for a single node:
> root@web338:~# find /tmp/spark-local-20141* -type f | fgrep shuffle | wc -l
> 781
> root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle | wc -l
> 10
> root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle
> /tmp/spark-local-20141119144512-67c4/2d/temp_9056f380-3edb-48d6-a7df-d4896f1e1cc3
> /tmp/spark-local-20141119144512-67c4/3d/temp_e005659b-eddf-4a34-947f-4f63fcddf111
> /tmp/spark-local-20141119144512-67c4/16/temp_71eba702-36b4-4e1a-aebc-20d2080f1705
> /tmp/spark-local-20141119144512-67c4/0d/temp_8037b9db-2d8a-4786-a554-a8cad922bf5e
> /

[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFoundException with Kryo, Guava 14

2015-05-02 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14525267#comment-14525267
 ] 

Pat Ferrel commented on SPARK-6069:
---

Same with Mahout, where we were using Guava. Using only Scala collections will 
solve this for us.

The way to work around this is to use spark.executor.extraClassPath, which must 
point to the correct jar on the local filesystem of every worker! So you have to 
copy the dependency jar to every worker. 
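
As a concrete illustration of that workaround, a sketch assuming a Scala driver and 
an illustrative jar path; the jar must already sit at the same location on every worker:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative location; the jar has to be copied here on every worker by hand.
val depJarOnWorkers = "/opt/spark-deps/guava-14.0.1.jar"

val conf = new SparkConf()
  .setAppName("extra-classpath-example")
  .setJars(Seq(depJarOnWorkers))                          // passed in at context creation
  .set("spark.executor.extraClassPath", depJarOnWorkers)  // prepended to the executor JVM classpath

val sc = new SparkContext(conf)
{code}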


> Deserialization Error ClassNotFoundException with Kryo, Guava 14
> 
>
> Key: SPARK-6069
> URL: https://issues.apache.org/jira/browse/SPARK-6069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Standalone one worker cluster on localhost, or any 
> cluster
>Reporter: Pat Ferrel
>Priority: Critical
>
> A class is contained in the jars passed in when creating a context. It is 
> registered with kryo. The class (Guava HashBiMap) is created correctly from 
> an RDD and broadcast but the deserialization fails with ClassNotFound.
> The work around is to hard code the path to the jar and make it available on 
> all workers. Hard code because we are creating a library so there is no easy 
> way to pass in to the app something like:
> spark.executor.extraClassPath  /path/to/some.jar






[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFoundException with Kryo, Guava 14

2015-04-30 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521664#comment-14521664
 ] 

Pat Ferrel commented on SPARK-6069:
---

Didn't mean for those comments to be cross-posted.

Removing the use of JavaSerializer to work around this problem in Spark 1.2, 
which is in wide use.
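
For context, roughly the setup the issue describes (a class registered with Kryo), 
sketched for a Scala driver with Spark 1.x property names; the registrator class name 
here is made up, and the registered class still has to be visible on the executor 
classpath:

{code}
import com.esotericsoftware.kryo.Kryo
import com.google.common.collect.HashBiMap
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator that registers the Guava class mentioned in this issue.
class GuavaRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[HashBiMap[_, _]])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[GuavaRegistrator].getName)
{code}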

> Deserialization Error ClassNotFoundException with Kryo, Guava 14
> 
>
> Key: SPARK-6069
> URL: https://issues.apache.org/jira/browse/SPARK-6069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Standalone one worker cluster on localhost, or any 
> cluster
>Reporter: Pat Ferrel
>Priority: Critical
>
> A class is contained in the jars passed in when creating a context. It is 
> registered with kryo. The class (Guava HashBiMap) is created correctly from 
> an RDD and broadcast but the deserialization fails with ClassNotFound.
> The work around is to hard code the path to the jar and make it available on 
> all workers. Hard code because we are creating a library so there is no easy 
> way to pass in to the app something like:
> spark.executor.extraClassPath  /path/to/some.jar






[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFoundException with Kryo, Guava 14

2015-03-02 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343597#comment-14343597
 ] 

Pat Ferrel commented on SPARK-6069:
---

Great, since it's fixed in 1.3 I'll definitely try that next, and I'll resolve this if 
it flies.

> Deserialization Error ClassNotFoundException with Kryo, Guava 14
> 
>
> Key: SPARK-6069
> URL: https://issues.apache.org/jira/browse/SPARK-6069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Standalone one worker cluster on localhost, or any 
> cluster
>Reporter: Pat Ferrel
>Priority: Critical
>
> A class is contained in the jars passed in when creating a context. It is 
> registered with kryo. The class (Guava HashBiMap) is created correctly from 
> an RDD and broadcast but the deserialization fails with ClassNotFound.
> The work around is to hard code the path to the jar and make it available on 
> all workers. Hard code because we are creating a library so there is no easy 
> way to pass in to the app something like:
> spark.executor.extraClassPath  /path/to/some.jar






[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFoundException with Kryo, Guava 14

2015-03-02 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343372#comment-14343372
 ] 

Pat Ferrel commented on SPARK-6069:
---

OK, I'll try 1.3 as soon as I get a chance.

> Deserialization Error ClassNotFoundException with Kryo, Guava 14
> 
>
> Key: SPARK-6069
> URL: https://issues.apache.org/jira/browse/SPARK-6069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Standalone one worker cluster on localhost, or any 
> cluster
>Reporter: Pat Ferrel
>Priority: Critical
>
> A class is contained in the jars passed in when creating a context. It is 
> registered with kryo. The class (Guava HashBiMap) is created correctly from 
> an RDD and broadcast but the deserialization fails with ClassNotFound.
> The work around is to hard code the path to the jar and make it available on 
> all workers. Hard code because we are creating a library so there is no easy 
> way to pass in to the app something like:
> spark.executor.extraClassPath  /path/to/some.jar






[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFoundException with Kryo, Guava 14

2015-02-28 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341809#comment-14341809
 ] 

Pat Ferrel commented on SPARK-6069:
---

Embarrassed to say we're still on Hadoop 1.2.1, and so no YARN. The packaging is not 
in the app jar but in a separate, pruned-down dependencies-only jar. I can see why 
YARN would throw a unique kink into the situation. So I guess you ran into 
this and had to use the {{user.classpath.first}} workaround, or are you saying 
it doesn't occur in Oryx?

Still, none of this should be necessary, right? Why else would jars be specified 
in context creation? We do have a workaround if someone has to work with 
1.2.1, but because of that it doesn't seem like a good version to recommend. 
Maybe I'll try 1.2 and install Hadoop 2 and YARN--which seems to be what the distros 
support.

> Deserialization Error ClassNotFoundException with Kryo, Guava 14
> 
>
> Key: SPARK-6069
> URL: https://issues.apache.org/jira/browse/SPARK-6069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Standalone one worker cluster on localhost, or any 
> cluster
>Reporter: Pat Ferrel
>Priority: Critical
>
> A class is contained in the jars passed in when creating a context. It is 
> registered with kryo. The class (Guava HashBiMap) is created correctly from 
> an RDD and broadcast but the deserialization fails with ClassNotFound.
> The work around is to hard code the path to the jar and make it available on 
> all workers. Hard code because we are creating a library so there is no easy 
> way to pass in to the app something like:
> spark.executor.extraClassPath  /path/to/some.jar






[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFound

2015-02-28 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341705#comment-14341705
 ] 

Pat Ferrel commented on SPARK-6069:
---

I agree, that part makes me suspicious, which is why I'm not sure I trust my 
builds completely.

No, the 'app' is one of the Spark-Mahout CLI drivers. The jar is a 
dependency-reduced type thing that has only scopt and Guava.

In any case, if I put 
-D:spark.executor.extraClassPath=/Users/pat/mahout/spark/target/mahout-spark_2.10-1.0-SNAPSHOT-dependency-reduced.jar
 on the command line, which passes the key=value to the SparkConf used by the 
Mahout CLI driver, it works. The test setup is a standalone localhost-only 
cluster (not local[n]), started with sbin/start-all.sh. The same jar is 
used to create the context, and I've checked that and the contents of the jar 
quite carefully.
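
For reference, a hedged sketch of that kind of forwarding, assuming a Scala CLI 
driver; the -D:key=value syntax matches the usage above, but the parsing helper and 
the jar path are made up for illustration:

{code}
import org.apache.spark.SparkConf

// Hypothetical helper: copy any -D:key=value options from the command line
// into the SparkConf before the context is created.
def applySparkOverrides(args: Array[String], conf: SparkConf): SparkConf = {
  args.filter(_.startsWith("-D:")).foreach { arg =>
    arg.stripPrefix("-D:").split("=", 2) match {
      case Array(key, value) => conf.set(key, value)
      case _                 => // ignore malformed options
    }
  }
  conf
}

val conf = applySparkOverrides(
  Array("-D:spark.executor.extraClassPath=/path/to/dependency-reduced.jar"),
  new SparkConf().setAppName("cli-driver-example"))
{code}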

On Feb 28, 2015, at 10:09 AM, Sean Owen (JIRA)  wrote:


   [ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341699#comment-14341699
 ] 

Sean Owen commented on SPARK-6069:
--

Hm, the thing is I have been successfully running an app, without spark-submit, 
with Kryo, with Guava 14, just like you, and have never had a problem. I can't 
figure out what the difference is here.

The Kryo not-found exception is stranger still. You aren't packaging Spark 
classes with your app, right?







> Deserialization Error ClassNotFound 
> 
>
> Key: SPARK-6069
> URL: https://issues.apache.org/jira/browse/SPARK-6069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Standalone one worker cluster on localhost, or any 
> cluster
>Reporter: Pat Ferrel
>
> A class is contained in the jars passed in when creating a context. It is 
> registered with kryo. The class (Guava HashBiMap) is created correctly from 
> an RDD and broadcast but the deserialization fails with ClassNotFound.
> The work around is to hard code the path to the jar and make it available on 
> all workers. Hard code because we are creating a library so there is no easy 
> way to pass in to the app something like:
> spark.executor.extraClassPath  /path/to/some.jar






[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFound

2015-02-28 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341679#comment-14341679
 ] 

Pat Ferrel commented on SPARK-6069:
---

No goodness from spark.executor.userClassPathFirst either--same error as above. 
I'll try again Monday when I'm back to my regular cluster.

> Deserialization Error ClassNotFound 
> 
>
> Key: SPARK-6069
> URL: https://issues.apache.org/jira/browse/SPARK-6069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Standalone one worker cluster on localhost, or any 
> cluster
>Reporter: Pat Ferrel
>
> A class is contained in the jars passed in when creating a context. It is 
> registered with kryo. The class (Guava HashBiMap) is created correctly from 
> an RDD and broadcast but the deserialization fails with ClassNotFound.
> The work around is to hard code the path to the jar and make it available on 
> all workers. Hard code because we are creating a library so there is no easy 
> way to pass in to the app something like:
> spark.executor.extraClassPath  /path/to/some.jar






[jira] [Comment Edited] (SPARK-6069) Deserialization Error ClassNotFound

2015-02-28 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341672#comment-14341672
 ] 

Pat Ferrel edited comment on SPARK-6069 at 2/28/15 5:31 PM:


Not sure I completely trust this result--I'm away from my HDFS cluster right 
now, so the standalone Spark setup is not quite the same as before...

Also, I didn't see your spark.executor.userClassPathFirst comment--will try that next. 

I tried: 

 sparkConf.set("spark.files.userClassPathFirst", "true")

But got the following error:

15/02/28 09:23:00 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
(TID 0, 192.168.0.7): java.lang.NoClassDefFoundError: 
org/apache/spark/serializer/KryoRegistrator
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at 
org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42)
at 
org.apache.spark.executor.ChildExecutorURLClassLoader.findClass(ExecutorURLClassLoader.scala:50)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at 
org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$3.apply(KryoSerializer.scala:103)
at 
org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$3.apply(KryoSerializer.scala:103)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:103)
at 
org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:159)
at 
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:121)
at 
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:214)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1090)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:61)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.serializer.KryoRegistrator
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at 
org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 36 more



was (Author: pferrel):
Not sure I completely trust this result--I'm away from my HDFS cluster right 
now and so the standalone Spark is not quite that same as before...

I tried: 

 sparkConf.set("spark.files.userClassPathFirst", "true")

But got the following error:

15/02/28 09:23:00 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
(TID 0, 192.168.0.7): java.lang.NoClassDefFoundError: 
org/apache/spark/serializer/KryoRegistrator
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at 
java.security.SecureClassLoader.defi

[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFound

2015-02-28 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341672#comment-14341672
 ] 

Pat Ferrel commented on SPARK-6069:
---

Not sure I completely trust this result--I'm away from my HDFS cluster right 
now, so the standalone Spark setup is not quite the same as before...

I tried: 

 sparkConf.set("spark.files.userClassPathFirst", "true")

But got the following error:

15/02/28 09:23:00 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
(TID 0, 192.168.0.7): java.lang.NoClassDefFoundError: 
org/apache/spark/serializer/KryoRegistrator
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at 
org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42)
at 
org.apache.spark.executor.ChildExecutorURLClassLoader.findClass(ExecutorURLClassLoader.scala:50)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at 
org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$3.apply(KryoSerializer.scala:103)
at 
org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$3.apply(KryoSerializer.scala:103)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:103)
at 
org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:159)
at 
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:121)
at 
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:214)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1090)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:61)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.serializer.KryoRegistrator
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at 
org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 36 more


> Deserialization Error ClassNotFound 
> 
>
> Key: SPARK-6069
> URL: https://issues.apache.org/jira/browse/SPARK-6069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Standalone one worker cluster on localhost, or any 
> cluster
>Reporter: Pat Ferrel
>
> A class is contained in the jars passed in when creating a context. It is 
> registered with kryo. The class (Guava HashBiMap) is created correctly from 
> an RDD and broadcast but the deserialization fails with ClassNotFound.
> The work around is to hard code the path to the jar and make it available on 
> all workers. Hard 

[jira] [Comment Edited] (SPARK-6069) Deserialization Error ClassNotFound

2015-02-28 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341640#comment-14341640
 ] 

Pat Ferrel edited comment on SPARK-6069 at 2/28/15 4:48 PM:


I can try it. Are you suggesting an app change or a master conf change?

Do I need to add this to conf/spark-defaults.conf?
spark.files.userClassPathFirst  true

Or should I add that to the context via SparkConf?

We have a standalone app that is not launched via spark-submit. But I guess 
your comment suggests an app change via SparkConf, so I'll try that.


was (Author: pferrel):
I can try it. Are you suggesting an app change or a master conf change?

Do I need to add this to conf/spark-defaults.conf?
spark.files.userClassPathFirst  true

Or should I add that to the context via SparkConf?

We have a standalone app that is not launched via spark-submit.

> Deserialization Error ClassNotFound 
> 
>
> Key: SPARK-6069
> URL: https://issues.apache.org/jira/browse/SPARK-6069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Standalone one worker cluster on localhost, or any 
> cluster
>Reporter: Pat Ferrel
>
> A class is contained in the jars passed in when creating a context. It is 
> registered with kryo. The class (Guava HashBiMap) is created correctly from 
> an RDD and broadcast but the deserialization fails with ClassNotFound.
> The work around is to hard code the path to the jar and make it available on 
> all workers. Hard code because we are creating a library so there is no easy 
> way to pass in to the app something like:
> spark.executor.extraClassPath  /path/to/some.jar






[jira] [Comment Edited] (SPARK-6069) Deserialization Error ClassNotFound

2015-02-28 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341640#comment-14341640
 ] 

Pat Ferrel edited comment on SPARK-6069 at 2/28/15 4:47 PM:


I can try it. Are you suggesting an app change or a master conf change?

Do I need to add this to conf/spark-defaults.conf?
spark.files.userClassPathFirst  true

Or should I add that to the context via SparkConf?

We have a standalone app that is not launched via spark-submit.


was (Author: pferrel):
I can try it. Are you suggesting an app change or a master conf change?

Do I need to add this to conf/spark-defaults.conf?
spark.files.userClassPathFirst  true

Or should I add that to the context via SparkConf?

> Deserialization Error ClassNotFound 
> 
>
> Key: SPARK-6069
> URL: https://issues.apache.org/jira/browse/SPARK-6069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Standalone one worker cluster on localhost, or any 
> cluster
>Reporter: Pat Ferrel
>
> A class is contained in the jars passed in when creating a context. It is 
> registered with kryo. The class (Guava HashBiMap) is created correctly from 
> an RDD and broadcast but the deserialization fails with ClassNotFound.
> The work around is to hard code the path to the jar and make it available on 
> all workers. Hard code because we are creating a library so there is no easy 
> way to pass in to the app something like:
> spark.executor.extraClassPath  /path/to/some.jar






[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFound

2015-02-28 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341640#comment-14341640
 ] 

Pat Ferrel commented on SPARK-6069:
---

I can try it. Are you suggesting an app change or a master conf change?

Do I need to add this to conf/spark-defaults.conf?
spark.files.userClassPathFirst  true

Or should I add that to the context via SparkConf?
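
A sketch of the two placements being asked about, assuming Spark 1.2-era property 
names (the app name is illustrative); as far as I know the conf-file route is only 
picked up by spark-submit, so a standalone app would set the property on its 
SparkConf instead:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Option 1 (when launching through spark-submit): add this line to
// conf/spark-defaults.conf on the submitting machine:
//   spark.files.userClassPathFirst  true

// Option 2 (app-side, works without spark-submit): set the same property on
// the SparkConf before the context is created.
val conf = new SparkConf()
  .setAppName("class-path-first-example")
  .set("spark.files.userClassPathFirst", "true")
val sc = new SparkContext(conf)
{code}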

> Deserialization Error ClassNotFound 
> 
>
> Key: SPARK-6069
> URL: https://issues.apache.org/jira/browse/SPARK-6069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Standalone one worker cluster on localhost, or any 
> cluster
>Reporter: Pat Ferrel
>
> A class is contained in the jars passed in when creating a context. It is 
> registered with kryo. The class (Guava HashBiMap) is created correctly from 
> an RDD and broadcast but the deserialization fails with ClassNotFound.
> The work around is to hard code the path to the jar and make it available on 
> all workers. Hard code because we are creating a library so there is no easy 
> way to pass in to the app something like:
> spark.executor.extraClassPath  /path/to/some.jar






[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFound

2015-02-28 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341608#comment-14341608
 ] 

Pat Ferrel commented on SPARK-6069:
---

It may be a dup; [~vanzin] said as much, but I couldn't find the obvious JIRA.

Any time the workaround is to use "spark-submit --conf 
spark.executor.extraClassPath=/guava.jar blah", that means a standalone app 
must have hard-coded paths that are honored on every worker. And as you know, a 
lib is pretty much blocked from using this version of Spark--hence the blocker 
severity. We'll have to warn people not to use this version of Spark.

I could easily be wrong, but userClassPathFirst doesn't seem to be the issue. 
There is no class conflict.


> Deserialization Error ClassNotFound 
> 
>
> Key: SPARK-6069
> URL: https://issues.apache.org/jira/browse/SPARK-6069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Standalone one worker cluster on localhost, or any 
> cluster
>Reporter: Pat Ferrel
>
> A class is contained in the jars passed in when creating a context. It is 
> registered with kryo. The class (Guava HashBiMap) is created correctly from 
> an RDD and broadcast but the deserialization fails with ClassNotFound.
> The work around is to hard code the path to the jar and make it available on 
> all workers. Hard code because we are creating a library so there is no easy 
> way to pass in to the app something like:
> spark.executor.extraClassPath  /path/to/some.jar






[jira] [Created] (SPARK-6069) Deserialization Error ClassNotFound

2015-02-27 Thread Pat Ferrel (JIRA)
Pat Ferrel created SPARK-6069:
-

 Summary: Deserialization Error ClassNotFound 
 Key: SPARK-6069
 URL: https://issues.apache.org/jira/browse/SPARK-6069
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: Standalone one worker cluster on localhost, or any cluster
Reporter: Pat Ferrel
Priority: Blocker
 Fix For: 1.2.2


A class is contained in the jars passed in when creating a context. It is 
registered with kryo. The class (Guava HashBiMap) is created correctly from an 
RDD and broadcast but the deserialization fails with ClassNotFound.

The work around is to hard code the path to the jar and make it available on 
all workers. Hard code because we are creating a library so there is no easy 
way to pass in to the app something like:

spark.executor.extraClassPath  /path/to/some.jar








[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-12-10 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241701#comment-14241701
 ] 

Pat Ferrel commented on SPARK-2075:
---

If the explanation is correct, this needs to be filed against Spark as publishing 
the wrong, or not enough, artifacts to the Maven repos. There would need to be a 
different artifact for every config option that changes internal naming.

I can't understand why lots of people aren't running into this; all it requires 
is that you link against the repo artifact and run against a user-compiled 
Spark.

> Anonymous classes are missing from Spark distribution
> -
>
> Key: SPARK-2075
> URL: https://issues.apache.org/jira/browse/SPARK-2075
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.0.0
>Reporter: Paul R. Brown
>Priority: Critical
>
> Running a job built against the Maven dep for 1.0.0 and the hadoop1 
> distribution produces:
> {code}
> java.lang.ClassNotFoundException:
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
> {code}
> Here's what's in the Maven dep as of 1.0.0:
> {code}
> jar tvf 
> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>  | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}
> And here's what's in the hadoop1 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
> {code}
> I.e., it's not there.  It is in the hadoop2 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}






[jira] [Comment Edited] (SPARK-2292) NullPointerException in JavaPairRDD.mapToPair

2014-10-21 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178879#comment-14178879
 ] 

Pat Ferrel edited comment on SPARK-2292 at 10/21/14 7:21 PM:
-

If this is related to SPARK-2075, the answer is:

If you need to build Spark, use "mvn install", not "mvn package", and build Spark 
before you build your project.

Then when you build your project it will get the exact Spark bits it needs from 
your Maven cache instead of potentially incompatible ones from the repos.

If this applies, it's a nasty one for the whole Maven ecosystem to solve. A 
first step might be changing those build instructions on the Spark site.


was (Author: pferrel):
If this is related to SPARK-2075 the answer is:

If you need to build Spark, use "mvn install" not "mvn package" and build Spark 
before you build you-project.

Then when you build your-project it will get the exact Spark bits needed from 
your Maven cache instead of potentially incompatible ones from the repos.

If this applies it's a nasty one for the whole maven echo-system to solve. 
First step might be changing those build instructions on the Spark site.

> NullPointerException in JavaPairRDD.mapToPair
> -
>
> Key: SPARK-2292
> URL: https://issues.apache.org/jira/browse/SPARK-2292
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: Spark 1.0.0, Standalone with the master & single slave 
> running on Ubuntu on a laptop. 4G mem and 8 cores were available to the 
> executor .
>Reporter: Bharath Ravi Kumar
> Attachments: SPARK-2292-aash-repro.tar.gz
>
>
> Correction: Invoking JavaPairRDD.mapToPair results in an NPE:
> {noformat}
> 14/06/26 21:05:35 WARN scheduler.TaskSetManager: Loss was due to 
> java.lang.NullPointerException
> java.lang.NullPointerException
>   at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>   at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
>   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at org.apache.spark.scheduler.Task.run(Task.scala:51)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> {noformat}
>  This occurs only after migrating to the 1.0.0 API. The details of the code 
> the data file used to test are included in this gist : 
> https://gist.github.com/reachbach/d8977c8eb5f71f889301






[jira] [Commented] (SPARK-2292) NullPointerException in JavaPairRDD.mapToPair

2014-10-21 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178879#comment-14178879
 ] 

Pat Ferrel commented on SPARK-2292:
---

If this is related to SPARK-2075, the answer is:

If you need to build Spark, use "mvn install", not "mvn package", and build Spark 
before you build your project.

Then when you build your project it will get the exact Spark bits it needs from 
your Maven cache instead of potentially incompatible ones from the repos.

If this applies, it's a nasty one for the whole Maven ecosystem to solve. A 
first step might be changing those build instructions on the Spark site.

> NullPointerException in JavaPairRDD.mapToPair
> -
>
> Key: SPARK-2292
> URL: https://issues.apache.org/jira/browse/SPARK-2292
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: Spark 1.0.0, Standalone with the master & single slave 
> running on Ubuntu on a laptop. 4G mem and 8 cores were available to the 
> executor .
>Reporter: Bharath Ravi Kumar
> Attachments: SPARK-2292-aash-repro.tar.gz
>
>
> Correction: Invoking JavaPairRDD.mapToPair results in an NPE:
> {noformat}
> 14/06/26 21:05:35 WARN scheduler.TaskSetManager: Loss was due to 
> java.lang.NullPointerException
> java.lang.NullPointerException
>   at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>   at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
>   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at org.apache.spark.scheduler.Task.run(Task.scala:51)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> {noformat}
>  This occurs only after migrating to the 1.0.0 API. The details of the code 
> the data file used to test are included in this gist : 
> https://gist.github.com/reachbach/d8977c8eb5f71f889301






[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-10-21 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178852#comment-14178852
 ] 

Pat Ferrel commented on SPARK-2075:
---

OK, solved. The WAG worked.

Instead of 'mvn package ...' use 'mvn install ...' so you get the exact 
version of Spark you need into your local Maven cache at ~/.m2.

This should be changed in the instructions until some way to reliably point to 
the right version of Spark in the Maven repos is implemented. In my case I had 
to build Spark to target Hadoop 1.2.1, but the Maven repo artifact was not built for that.

> Anonymous classes are missing from Spark distribution
> -
>
> Key: SPARK-2075
> URL: https://issues.apache.org/jira/browse/SPARK-2075
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.0.0
>Reporter: Paul R. Brown
>Priority: Critical
> Fix For: 1.0.1
>
>
> Running a job built against the Maven dep for 1.0.0 and the hadoop1 
> distribution produces:
> {code}
> java.lang.ClassNotFoundException:
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
> {code}
> Here's what's in the Maven dep as of 1.0.0:
> {code}
> jar tvf 
> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>  | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}
> And here's what's in the hadoop1 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
> {code}
> I.e., it's not there.  It is in the hadoop2 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}






[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-10-21 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178720#comment-14178720
 ] 

Pat Ferrel commented on SPARK-2075:
---

Trying mvn install instead of the documented mvn package to put Spark in the 
Maven cache, so that when building Mahout it will get exactly the same bits as 
will later be run on the cluster--WAG.

> Anonymous classes are missing from Spark distribution
> -
>
> Key: SPARK-2075
> URL: https://issues.apache.org/jira/browse/SPARK-2075
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.0.0
>Reporter: Paul R. Brown
>Priority: Critical
> Fix For: 1.0.1
>
>
> Running a job built against the Maven dep for 1.0.0 and the hadoop1 
> distribution produces:
> {code}
> java.lang.ClassNotFoundException:
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
> {code}
> Here's what's in the Maven dep as of 1.0.0:
> {code}
> jar tvf 
> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>  | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}
> And here's what's in the hadoop1 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
> {code}
> I.e., it's not there.  It is in the hadoop2 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}






[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-10-21 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178688#comment-14178688
 ] 

Pat Ferrel commented on SPARK-2075:
---

Oops, right. But the function name is being constructed at Mahout build time, 
right? So either the rules for constructing the name are different when building 
Spark and Mahout, OR the function is not being put in the Spark jars?

Rebuilding yet again, so I can't check the jars just yet. This error was from a 
clean build of Spark 1.1.0 and Mahout, in that sequence. I cleaned the .m2 repo 
and see that it is filled in only when Mahout is built, and it is filled in with a 
repo version of Spark 1.1.0.

Could the problem be related to the fact that the Spark executed on my 
standalone cluster is from a local build, but when Mahout is built it uses the 
repo version of Spark? Since I am using Hadoop 1.2.1, I suspect the repo version 
may not be exactly what I built.

> Anonymous classes are missing from Spark distribution
> -
>
> Key: SPARK-2075
> URL: https://issues.apache.org/jira/browse/SPARK-2075
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.0.0
>Reporter: Paul R. Brown
>Priority: Critical
> Fix For: 1.0.1
>
>
> Running a job built against the Maven dep for 1.0.0 and the hadoop1 
> distribution produces:
> {code}
> java.lang.ClassNotFoundException:
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
> {code}
> Here's what's in the Maven dep as of 1.0.0:
> {code}
> jar tvf 
> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>  | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}
> And here's what's in the hadoop1 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
> {code}
> I.e., it's not there.  It is in the hadoop2 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}






[jira] [Comment Edited] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-10-20 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177748#comment-14177748
 ] 

Pat Ferrel edited comment on SPARK-2075 at 10/21/14 12:57 AM:
--

Is there any more on this?

Building Spark from the 1.1.0 tar for Hadoop 1.2.1--all is well. Trying to 
upgrade Mahout to use Spark 1.1.0. The Mahout 1.0-SNAPSHOT source builds and the 
build tests pass with Spark 1.1.0 as a Maven dependency. I'm running the Mahout 
build on some bigger data using my dev machine as a standalone single-node 
Spark cluster, so the same code is running as executed in the build tests, just 
in single-node cluster mode. Also, since I built Spark, I assume it is using the 
artifact from my .m2 Maven cache, but I'm not 100% sure of that. Anyway, I get 
the class-not-found error below.

I assume the missing function is the anon function passed in

{code}
rdd.map(
  {anon function}
).saveAsTextFile(...)
{code}

so shouldn't the function be in the Mahout jar (it isn't)? Isn't this function 
passed in from Mahout, so I don't understand why it matters how Spark was built?

Several other users are getting this for Spark 1.0.2. If we are doing something 
wrong in our build process we'd appreciate a pointer.

Here's the error I get:

14/10/20 17:21:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 8.0 
(TID 16, 192.168.0.2): java.lang.ClassNotFoundException: 
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
java.net.URLClassLoader$1.run(URLClassLoader.java:202)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findClass(URLClassLoader.java:190)
java.lang.ClassLoader.loadClass(ClassLoader.java:306)
java.lang.ClassLoader.loadClass(ClassLoader.java:247)
java.lang.Class.forName0(Native Method)
java.lang.Class.forName(Class.java:249)

org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1591)
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1496)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1750)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
java.lang.Thread.run(Thread.java:695)
  


was (Author: pferrel):
Is there any more on this?

Building Spark from the 1.1.0 tar for Hadoop 1.2.1--all is well. Trying to 
upgrade Mahout to use Spark 1.1.0. The Mahout 1.0-snapshot source builds and 
build tests pass with spark 1.1.0 as a maven dependency. Running the Mahout 
build on some bigger data using my dev machine as a standalone single node 
Spark cluster. So the same code is running as executed the build tests, just in 
single node cluster mode. Also since I built Spark i assume it is using the 
artifact from my .m2 maven cache, but not 100% on that. Anyway I get the class 
not found error below.

I assume the missing function is the anon function passed to the 

```
rdd.map(
  {anon function}
)saveAsTextFile  
```

so shouldn't the function be in the Mahout jar (it isn't)? Isn't this function 
passed in from Mahout so I don't understand why it matters how Spark was built. 

Several other users are getting this for Spark 1.0.2. If we are doing something 
wrong in our build process we'd appreciate a pointer.

Here's the error I get:

14/10/20 17:21:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 8.0 
(TID 16, 192.168.0.2): java.lang.ClassNotFoundException: 
org.apache.spark.rdd.RDD$$anonfun$saveAsTextF

[jira] [Comment Edited] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-10-20 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177748#comment-14177748
 ] 

Pat Ferrel edited comment on SPARK-2075 at 10/21/14 12:56 AM:
--

Is there any more on this?

Building Spark from the 1.1.0 tar for Hadoop 1.2.1--all is well. Trying to 
upgrade Mahout to use Spark 1.1.0. The Mahout 1.0-SNAPSHOT source builds and the 
build tests pass with Spark 1.1.0 as a Maven dependency. I'm running the Mahout 
build on some bigger data using my dev machine as a standalone single-node 
Spark cluster, so the same code is running as executed in the build tests, just 
in single-node cluster mode. Also, since I built Spark, I assume it is using the 
artifact from my .m2 Maven cache, but I'm not 100% sure of that. Anyway, I get 
the class-not-found error below.

I assume the missing function is the anon function passed in

rdd.map(
  {anon function}
).saveAsTextFile(...)

so shouldn't the function be in the Mahout jar (it isn't)? Isn't this function 
passed in from Mahout, so I don't understand why it matters how Spark was built?

Several other users are getting this for Spark 1.0.2. If we are doing something 
wrong in our build process we'd appreciate a pointer.

Here's the error I get:

14/10/20 17:21:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 8.0 
(TID 16, 192.168.0.2): java.lang.ClassNotFoundException: 
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
java.net.URLClassLoader$1.run(URLClassLoader.java:202)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findClass(URLClassLoader.java:190)
java.lang.ClassLoader.loadClass(ClassLoader.java:306)
java.lang.ClassLoader.loadClass(ClassLoader.java:247)
java.lang.Class.forName0(Native Method)
java.lang.Class.forName(Class.java:249)

org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1591)
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1496)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1750)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
java.lang.Thread.run(Thread.java:695)
  


was (Author: pferrel):
Is there any more on this?

Building Spark from the 1.1.0 tar for Hadoop 1.2.1--all is well. Trying to 
upgrade Mahout to use Spark 1.1.0. The Mahout 1.0-snapshot source builds and 
build tests pass with spark 1.1.0 as a maven dependency. Running the Mahout 
build on some bigger data using my dev machine as a standalone single node 
Spark cluster. So the same code is running as executed the build tests, just in 
single node cluster mode. Also since I built Spark i assume it is using the 
artifact from my .m2 maven cache, but not 100% on that. Anyway I get the class 
not found error below.

I assume the missing function is the anon function passed to the rdd.map({anon 
function})saveAsTextFile  so shouldn't the function be in the Mahout jar 
(it isn't)? Isn't this function passed in from Mahout so I don't understand why 
it matters how Spark was built. 

Several other users are getting this for Spark 1.0.2. If we are doing something 
wrong in our build process we'd appreciate a pointer.

Here's the error I get:

14/10/20 17:21:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 8.0 
(TID 16, 192.168.0.2): java.lang.ClassNotFoundException: 
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
java.net.URLClassLoader$1.

[jira] [Comment Edited] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-10-20 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177748#comment-14177748
 ] 

Pat Ferrel edited comment on SPARK-2075 at 10/21/14 12:57 AM:
--

Is there any more on this?

Building Spark from the 1.1.0 tar for Hadoop 1.2.1--all is well. Trying to 
upgrade Mahout to use Spark 1.1.0. The Mahout 1.0-snapshot source builds and 
build tests pass with spark 1.1.0 as a maven dependency. Running the Mahout 
build on some bigger data using my dev machine as a standalone single node 
Spark cluster. So the same code is running as executed the build tests, just in 
single node cluster mode. Also since I built Spark i assume it is using the 
artifact from my .m2 maven cache, but not 100% on that. Anyway I get the class 
not found error below.

I assume the missing function is the anon function passed to the 

```
rdd.map(
  {anon function}
)saveAsTextFile  
```

so shouldn't the function be in the Mahout jar (it isn't)? Isn't this function 
passed in from Mahout so I don't understand why it matters how Spark was built. 

Several other users are getting this for Spark 1.0.2. If we are doing something 
wrong in our build process we'd appreciate a pointer.

Here's the error I get:

14/10/20 17:21:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 8.0 
(TID 16, 192.168.0.2): java.lang.ClassNotFoundException: 
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
java.net.URLClassLoader$1.run(URLClassLoader.java:202)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findClass(URLClassLoader.java:190)
java.lang.ClassLoader.loadClass(ClassLoader.java:306)
java.lang.ClassLoader.loadClass(ClassLoader.java:247)
java.lang.Class.forName0(Native Method)
java.lang.Class.forName(Class.java:249)

org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1591)
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1496)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1750)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
java.lang.Thread.run(Thread.java:695)
  


was (Author: pferrel):
Is there any more on this?

Building Spark from the 1.1.0 tar for Hadoop 1.2.1--all is well. Trying to 
upgrade Mahout to use Spark 1.1.0. The Mahout 1.0-snapshot source builds and 
build tests pass with spark 1.1.0 as a maven dependency. Running the Mahout 
build on some bigger data using my dev machine as a standalone single node 
Spark cluster. So the same code is running as executed the build tests, just in 
single node cluster mode. Also since I built Spark i assume it is using the 
artifact from my .m2 maven cache, but not 100% on that. Anyway I get the class 
not found error below.

I assume the missing function is the anon function passed to the 

rdd.map(
  {anon function}
)saveAsTextFile  

so shouldn't the function be in the Mahout jar (it isn't)? Isn't this function 
passed in from Mahout so I don't understand why it matters how Spark was built. 

Several other users are getting this for Spark 1.0.2. If we are doing something 
wrong in our build process we'd appreciate a pointer.

Here's the error I get:

14/10/20 17:21:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 8.0 
(TID 16, 192.168.0.2): java.lang.ClassNotFoundException: 
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1

[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-10-20 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177748#comment-14177748
 ] 

Pat Ferrel commented on SPARK-2075:
---

Is there any more on this?

Building Spark from the 1.1.0 tar for Hadoop 1.2.1 works fine, and I'm trying to 
upgrade Mahout to use Spark 1.1.0. The Mahout 1.0-snapshot source builds, and its 
build tests pass with Spark 1.1.0 as a Maven dependency. I'm now running the Mahout 
build on some bigger data, using my dev machine as a standalone single-node Spark 
cluster, so the same code that passed the build tests is running, just in 
single-node cluster mode. Also, since I built Spark myself, I assume it is using 
the artifact from my .m2 Maven cache, but I'm not 100% sure of that. Either way, 
I get the class-not-found error below.

I assume the missing function is the anonymous function passed in 
rdd.map({anon function}).saveAsTextFile, so shouldn't that function be in the 
Mahout jar? (It isn't.) Isn't this function passed in from Mahout? I don't 
understand why it matters how Spark was built.
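
For concreteness, a minimal sketch of the call pattern in question (illustrative 
only; the app name, master, paths, and closure are placeholders). The class named 
in the error, RDD$$anonfun$saveAsTextFile$1, appears to be the anonymous function 
compiled from inside spark-core's own RDD.saveAsTextFile (the issue description 
quoted below lists it in spark-core_2.10-1.0.0.jar), which is presumably why the 
Spark build matters even though the map closure itself ships in the application jar:

```
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only; the app name, master, and paths are placeholders.
val sc = new SparkContext(new SparkConf().setAppName("example").setMaster("local[2]"))

// The closure passed to map() is compiled into the application (Mahout) jar;
// RDD$$anonfun$saveAsTextFile$1, by contrast, is generated from a closure inside
// RDD.saveAsTextFile in spark-core, so it has to be available to the executors
// via the Spark assembly.
sc.textFile("hdfs:///tmp/input")
  .map(line => line.toUpperCase)
  .saveAsTextFile("hdfs:///tmp/output")

sc.stop()
```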

Several other users are getting this with Spark 1.0.2. If we are doing something 
wrong in our build process, we'd appreciate a pointer.

Here's the error I get:

14/10/20 17:21:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 8.0 
(TID 16, 192.168.0.2): java.lang.ClassNotFoundException: 
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
java.net.URLClassLoader$1.run(URLClassLoader.java:202)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findClass(URLClassLoader.java:190)
java.lang.ClassLoader.loadClass(ClassLoader.java:306)
java.lang.ClassLoader.loadClass(ClassLoader.java:247)
java.lang.Class.forName0(Native Method)
java.lang.Class.forName(Class.java:249)
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1591)
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1496)
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1750)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
java.lang.Thread.run(Thread.java:695)
  

> Anonymous classes are missing from Spark distribution
> -
>
> Key: SPARK-2075
> URL: https://issues.apache.org/jira/browse/SPARK-2075
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.0.0
>Reporter: Paul R. Brown
>Priority: Critical
> Fix For: 1.0.1
>
>
> Running a job built against the Maven dep for 1.0.0 and the hadoop1 
> distribution produces:
> {code}
> java.lang.ClassNotFoundException:
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
> {code}
> Here's what's in the Maven dep as of 1.0.0:
> {code}
> jar tvf 
> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>  | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}
> And here's what's in the hadoop1 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
> {code}
> I.e., it's not there.  It is in the hadoop2 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon