[jira] [Commented] (SPARK-4796) Spark does not remove temp files
[ https://issues.apache.org/jira/browse/SPARK-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692692#comment-14692692 ] Pat Ferrel commented on SPARK-4796: --- Why is this marked resolved? Spark does indeed leave around a lot of files and unless you are looking you'd never know. It sounds like the only safe method to remove these is to shutdown Spark and delete them. I skimmed the issue so sorry if I missed something. 15G on the MBP and counting :-) > Spark does not remove temp files > > > Key: SPARK-4796 > URL: https://issues.apache.org/jira/browse/SPARK-4796 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.1.0 > Environment: I'm runnin spark on mesos and mesos slaves are docker > containers. Spark 1.1.0, elasticsearch spark 2.1.0-Beta3, mesos 0.20.0, > docker 1.2.0. >Reporter: Ian Babrou > > I started a job that cannot fill into memory and got "no space left on > device". That was fair, because docker containers only have 10gb of disk > space and some is taken by OS already. > But then I found out when job failed it didn't release any disk space and > left container without any free disk space. > Then I decided to check if spark removes temp files in any case, because many > mesos slaves had /tmp/spark-local-*. Apparently some garbage stays after > spark task is finished. I attached with strace to running job: > [pid 30212] > unlink("/tmp/spark-local-20141209091330-48b5/12/temp_8a73fcc2-4baa-499a-8add-0161f918de8a") > = 0 > [pid 30212] > unlink("/tmp/spark-local-20141209091330-48b5/31/temp_47efd04b-d427-4139-8f48-3d5d421e9be4") > = 0 > [pid 30212] > unlink("/tmp/spark-local-20141209091330-48b5/15/temp_619a46dc-40de-43f1-a844-4db146a607c6") > = 0 > [pid 30212] > unlink("/tmp/spark-local-20141209091330-48b5/05/temp_d97d90a7-8bc1-4742-ba9b-41d74ea73c36" > > [pid 30212] <... unlink resumed> ) = 0 > [pid 30212] > unlink("/tmp/spark-local-20141209091330-48b5/36/temp_a2deb806-714a-457a-90c8-5d9f3247a5d7") > = 0 > [pid 30212] > unlink("/tmp/spark-local-20141209091330-48b5/04/temp_afd558f1-2fd0-48d7-bc65-07b5f4455b22") > = 0 > [pid 30212] > unlink("/tmp/spark-local-20141209091330-48b5/32/temp_a7add910-8dc3-482c-baf5-09d5a187c62a" > > [pid 30212] <... unlink resumed> ) = 0 > [pid 30212] > unlink("/tmp/spark-local-20141209091330-48b5/21/temp_485612f0-527f-47b0-bb8b-6016f3b9ec19") > = 0 > [pid 30212] > unlink("/tmp/spark-local-20141209091330-48b5/12/temp_bb2b4e06-a9dd-408e-8395-f6c5f4e2d52f") > = 0 > [pid 30212] > unlink("/tmp/spark-local-20141209091330-48b5/1e/temp_825293c6-9d3b-4451-9cb8-91e2abe5a19d" > > [pid 30212] <... unlink resumed> ) = 0 > [pid 30212] > unlink("/tmp/spark-local-20141209091330-48b5/15/temp_43fbb94c-9163-4aa7-ab83-e7693b9f21fc") > = 0 > [pid 30212] > unlink("/tmp/spark-local-20141209091330-48b5/3d/temp_37f3629c-1b09-4907-b599-61b7df94b898" > > [pid 30212] <... 
unlink resumed> ) = 0 > [pid 30212] > unlink("/tmp/spark-local-20141209091330-48b5/35/temp_d18f49f6-1fb1-4c01-a694-0ee0a72294c0") > = 0 > And after job is finished, some files are still there: > /tmp/spark-local-20141209091330-48b5/ > /tmp/spark-local-20141209091330-48b5/11 > /tmp/spark-local-20141209091330-48b5/11/shuffle_0_1_4 > /tmp/spark-local-20141209091330-48b5/32 > /tmp/spark-local-20141209091330-48b5/04 > /tmp/spark-local-20141209091330-48b5/05 > /tmp/spark-local-20141209091330-48b5/0f > /tmp/spark-local-20141209091330-48b5/0f/shuffle_0_1_2 > /tmp/spark-local-20141209091330-48b5/3d > /tmp/spark-local-20141209091330-48b5/0e > /tmp/spark-local-20141209091330-48b5/0e/shuffle_0_1_1 > /tmp/spark-local-20141209091330-48b5/15 > /tmp/spark-local-20141209091330-48b5/0d > /tmp/spark-local-20141209091330-48b5/0d/shuffle_0_1_0 > /tmp/spark-local-20141209091330-48b5/36 > /tmp/spark-local-20141209091330-48b5/31 > /tmp/spark-local-20141209091330-48b5/12 > /tmp/spark-local-20141209091330-48b5/21 > /tmp/spark-local-20141209091330-48b5/10 > /tmp/spark-local-20141209091330-48b5/10/shuffle_0_1_3 > /tmp/spark-local-20141209091330-48b5/1e > /tmp/spark-local-20141209091330-48b5/35 > If I look into my mesos slaves, there are mostly "shuffle" files, overall > picture for single node: > root@web338:~# find /tmp/spark-local-20141* -type f | fgrep shuffle | wc -l > 781 > root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle | wc -l > 10 > root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle > /tmp/spark-local-20141119144512-67c4/2d/temp_9056f380-3edb-48d6-a7df-d4896f1e1cc3 > /tmp/spark-local-20141119144512-67c4/3d/temp_e005659b-eddf-4a34-947f-4f63fcddf111 > /tmp/spark-local-20141119144512-67c4/16/temp_71eba702-36b4-4e1a-aebc-20d2080f1705 > /tmp/spark-local-20141119144512-67c4/0d/temp_8037b9db-2d8a-4786-a554-a8cad922bf5e > /
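For anyone cleaning up after this by hand, a hedged housekeeping sketch (the path, retention window, and config value are assumptions, not taken from this issue):

{code}
# Remove spark-local-* scratch directories older than a day; assumes the default
# spark.local.dir of /tmp and that no job older than that is still running.
find /tmp -maxdepth 1 -type d -name 'spark-local-*' -mtime +1 -exec rm -rf {} +

# Or point Spark at a dedicated scratch location so leftovers are easy to spot,
# e.g. in conf/spark-defaults.conf:
#   spark.local.dir  /data/spark-scratch
{code}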
[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFoundException with Kryo, Guava 14
[ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14525267#comment-14525267 ] Pat Ferrel commented on SPARK-6069: --- Same with Mahout, where we were using Guava; using Scala collections only will solve this for us. The way to work around this is spark.executor.extraClassPath, which must point to the correct jar in the native filesystem on every worker, so you have to copy a dependency jar to every worker. > Deserialization Error ClassNotFoundException with Kryo, Guava 14 > > > Key: SPARK-6069 > URL: https://issues.apache.org/jira/browse/SPARK-6069 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.2.1 > Environment: Standalone one-worker cluster on localhost, or any cluster > Reporter: Pat Ferrel > Priority: Critical > > A class is contained in the jars passed in when creating a context. It is registered with Kryo. The class (Guava HashBiMap) is created correctly from an RDD and broadcast, but deserialization fails with ClassNotFound. > The workaround is to hard-code the path to the jar and make it available on all workers. Hard-code because we are creating a library, so there is no easy way to pass in to the app something like: > spark.executor.extraClassPath /path/to/some.jar
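A minimal sketch of that workaround, assuming a standalone app that builds its own SparkConf; the jar path is a placeholder and must exist at the same location on every worker:

{code}
// Sketch only: set the executor classpath before the context is created.
// The path is an example, not from this issue.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("guava-classpath-example")
  .set("spark.executor.extraClassPath", "/opt/libs/guava-14.0.1.jar")
val sc = new SparkContext(conf)
{code}

The same setting can go in conf/spark-defaults.conf instead (spark.executor.extraClassPath /opt/libs/guava-14.0.1.jar), which is exactly what makes this awkward for a library: somebody has to own that path on every worker.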
[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFoundException with Kryo, Guava 14
[ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521664#comment-14521664 ] Pat Ferrel commented on SPARK-6069: --- Didn't mean for those comments to be cross-posted. Removing the use of the JavaSerializer to work around this problem in Spark 1.2, which is in wide use.
[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFoundException with Kryo, Guava 14
[ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343597#comment-14343597 ] Pat Ferrel commented on SPARK-6069: --- Great--since it's fixed in 1.3 I'll definitely try that next and resolve this if it flies.
[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFoundException with Kryo, Guava 14
[ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343372#comment-14343372 ] Pat Ferrel commented on SPARK-6069: --- ok, I'll try 1.3 as soon as I get a chance.
[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFoundException with Kryo, Guava 14
[ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341809#comment-14341809 ] Pat Ferrel commented on SPARK-6069: --- Embarrassed to say we are still on Hadoop 1.2.1 and so no YARN. The packaging is not in the app jar but in a separate, pruned-down dependencies-only jar. I can see why YARN would throw a unique kink into the situation. So I guess you ran into this and had to use the {{user.classpath.first}} workaround, or are you saying it doesn't occur in Oryx? Still, none of this should be necessary, right? Why else would jars be passed in to context creation? We do have a workaround if someone has to work with 1.2.1, but because of that it doesn't seem like a good version to recommend. Maybe I'll try 1.2 and install H2 and YARN--which seems to be what the distros support.
[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFound
[ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341705#comment-14341705 ] Pat Ferrel commented on SPARK-6069: --- I agree, that part makes me suspicious, which is why I'm not sure I trust my builds completely. No, the 'app' is one of the Spark-Mahout CLI drivers. The jar is a dependency-reduced type thing that has only scopt and Guava. In any case, if I put -D:spark.executor.extraClassPath=/Users/pat/mahout/spark/target/mahout-spark_2.10-1.0-SNAPSHOT-dependency-reduced.jar on the command line, which passes the key=value pair to the SparkConf in the Mahout CLI driver, it works. The test setup is a standalone localhost-only cluster (not local[n]), started with sbin/start-all.sh. The same jar is used to create the context, and I've checked that and the contents of the jar quite carefully. On Feb 28, 2015, at 10:09 AM, Sean Owen (JIRA) wrote: [ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341699#comment-14341699 ] Sean Owen commented on SPARK-6069: -- Hm, the thing is I have been successfully running an app, without spark-submit, with Kryo, with Guava 14 just like you and have never had a problem. I can't figure out what the difference is here. The Kryo not-found exception is stranger still. You aren't packaging Spark classes with your app, right?
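For context, a sketch of the kind of plumbing being described above: a driver launched without spark-submit that forwards -D:key=value options straight into the SparkConf. The option prefix follows the command line quoted in the comment; everything else is illustrative, not Mahout's actual driver code.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: forward "-D:key=value" CLI options into the SparkConf before the
// context is created.
def buildContext(args: Array[String]): SparkContext = {
  val conf = new SparkConf().setAppName("cli-driver-example")
  args.filter(_.startsWith("-D:")).foreach { opt =>
    val Array(k, v) = opt.stripPrefix("-D:").split("=", 2)
    conf.set(k, v) // e.g. spark.executor.extraClassPath=/path/to/dependency-reduced.jar
  }
  new SparkContext(conf)
}
{code}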
[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFound
[ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341679#comment-14341679 ] Pat Ferrel commented on SPARK-6069: --- No goodness from spark.executor.userClassPathFirst either--same error as above. I'll try again Monday when I'm back to my regular cluster.
[jira] [Comment Edited] (SPARK-6069) Deserialization Error ClassNotFound
[ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341672#comment-14341672 ] Pat Ferrel edited comment on SPARK-6069 at 2/28/15 5:31 PM: Not sure I completely trust this result--I'm away from my HDFS cluster right now and so the standalone Spark is not quite that same as before... Also didn't see you spark.executor.userClassPathFirst comment--will try next. I tried: sparkConf.set("spark.files.userClassPathFirst", "true") But got the following error: 15/02/28 09:23:00 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.0.7): java.lang.NoClassDefFoundError: org/apache/spark/serializer/KryoRegistrator at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:800) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42) at org.apache.spark.executor.ChildExecutorURLClassLoader.findClass(ExecutorURLClassLoader.scala:50) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$3.apply(KryoSerializer.scala:103) at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$3.apply(KryoSerializer.scala:103) at scala.Option.map(Option.scala:145) at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:103) at org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:159) at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:121) at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:214) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1090) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:61) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: org.apache.spark.serializer.KryoRegistrator at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at 
java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 36 more
[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFound
[ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341672#comment-14341672 ] Pat Ferrel commented on SPARK-6069: --- Not sure I completely trust this result--I'm away from my HDFS cluster right now and so the standalone Spark is not quite that same as before... I tried: sparkConf.set("spark.files.userClassPathFirst", "true") But got the following error: 15/02/28 09:23:00 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.0.7): java.lang.NoClassDefFoundError: org/apache/spark/serializer/KryoRegistrator at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:800) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42) at org.apache.spark.executor.ChildExecutorURLClassLoader.findClass(ExecutorURLClassLoader.scala:50) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$3.apply(KryoSerializer.scala:103) at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$3.apply(KryoSerializer.scala:103) at scala.Option.map(Option.scala:145) at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:103) at org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:159) at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:121) at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:214) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1090) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:61) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: org.apache.spark.serializer.KryoRegistrator at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at 
java.net.URLClassLoader.findClass(URLClassLoader.java:354) at org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 36 more
[jira] [Comment Edited] (SPARK-6069) Deserialization Error ClassNotFound
[ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341640#comment-14341640 ] Pat Ferrel edited comment on SPARK-6069 at 2/28/15 4:48 PM: I can try it. Are you suggesting an app change or a master conf change? Do I need to add this to conf/spark-defaults.conf? spark.files.userClassPathFirst true Or should I add it to the context via SparkConf? We have a standalone app that is not launched via spark-submit. But I guess your comment suggests an app change via SparkConf, so I'll try that.
[jira] [Comment Edited] (SPARK-6069) Deserialization Error ClassNotFound
[ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341640#comment-14341640 ] Pat Ferrel edited comment on SPARK-6069 at 2/28/15 4:47 PM: I can try it. Are you suggesting an app change or a master conf change? Do I need to add this to conf/spark-defaults.conf? spark.files.userClassPathFirst true Or should I add it to the context via SparkConf? We have a standalone app that is not launched via spark-submit.
[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFound
[ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341640#comment-14341640 ] Pat Ferrel commented on SPARK-6069: --- I can try it. Are you suggesting an app change or a master conf change? Do I need to add this to conf/spark-defaults.conf? spark.files.userClassPathFirst true Or should I add it to the context via SparkConf?
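Both options from the question above, as a sketch; spark.files.userClassPathFirst is the experimental 1.2-era name (later superseded by spark.executor.userClassPathFirst), and values are illustrative:

{code}
// Per-application, in the standalone app itself (no spark-submit involved):
val conf = new org.apache.spark.SparkConf()
  .setAppName("example")
  .set("spark.files.userClassPathFirst", "true")
val sc = new org.apache.spark.SparkContext(conf)

// Cluster-wide alternative: one line in conf/spark-defaults.conf on the
// machine that creates the context:
//   spark.files.userClassPathFirst  true
{code}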
[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFound
[ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341608#comment-14341608 ] Pat Ferrel commented on SPARK-6069: --- It may be a dup; [~vanzin] said as much, but I couldn't find the obvious JIRA. Any time the workaround is to use "spark-submit --conf spark.executor.extraClassPath=/guava.jar blah" that means standalone apps must have hard-coded paths that are honored on every worker. And as you know, a lib is pretty much blocked from use of this version of Spark--hence the blocker severity. We'll have to warn people not to use this version of Spark. I could easily be wrong, but userClassPathFirst doesn't seem to be the issue. There is no class conflict.
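Spelled out, the workaround being objected to looks roughly like this (class name, master URL, and paths are placeholders):

{code}
spark-submit \
  --class org.example.MyDriver \
  --master spark://master:7077 \
  --conf spark.executor.extraClassPath=/opt/libs/guava-14.0.1.jar \
  my-app.jar
{code}

A library cannot emit that command for its users, which is the point of the complaint.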
[jira] [Created] (SPARK-6069) Deserialization Error ClassNotFound
Pat Ferrel created SPARK-6069: - Summary: Deserialization Error ClassNotFound Key: SPARK-6069 URL: https://issues.apache.org/jira/browse/SPARK-6069 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Standalone one-worker cluster on localhost, or any cluster Reporter: Pat Ferrel Priority: Blocker Fix For: 1.2.2 A class is contained in the jars passed in when creating a context. It is registered with Kryo. The class (Guava HashBiMap) is created correctly from an RDD and broadcast, but deserialization fails with ClassNotFound. The workaround is to hard-code the path to the jar and make it available on all workers. Hard-code because we are creating a library, so there is no easy way to pass in to the app something like: spark.executor.extraClassPath /path/to/some.jar
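A minimal Scala sketch of the failing pattern described in this issue (data and names are illustrative): register Guava's HashBiMap with Kryo, broadcast one, and read it in a task, which on 1.2.1 fails with ClassNotFound on the executors unless the Guava jar is also on spark.executor.extraClassPath.

{code}
import com.google.common.collect.HashBiMap
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("bimap-broadcast-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[HashBiMap[String, String]]))
val sc = new SparkContext(conf)

// Build the bidirectional dictionary on the driver and broadcast it.
val dictionary = HashBiMap.create[String, String]()
dictionary.put("apple", "fruit")
val bcast = sc.broadcast(dictionary)

// Deserializing the broadcast value inside the task is where ClassNotFound appears.
sc.parallelize(Seq("apple")).map(word => bcast.value.get(word)).collect()
{code}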
[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241701#comment-14241701 ] Pat Ferrel commented on SPARK-2075: --- If the explanation is correct, this needs to be filed against Spark as putting the wrong or not enough artifacts into the Maven repos. There would need to be a different artifact for every config option that changes internal naming. I can't understand why lots of people aren't running into this; all it requires is that you link against the repo artifact and run against a user-compiled Spark. > Anonymous classes are missing from Spark distribution > - > > Key: SPARK-2075 > URL: https://issues.apache.org/jira/browse/SPARK-2075 > Project: Spark > Issue Type: Bug > Components: Build, Spark Core > Affects Versions: 1.0.0 > Reporter: Paul R. Brown > Priority: Critical > > Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: > {code} > java.lang.ClassNotFoundException: > org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 > {code} > Here's what's in the Maven dep as of 1.0.0: > {code} > jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' > 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class > 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class > {code} > And here's what's in the hadoop1 distribution: > {code} > jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar | grep 'rdd/RDD' | grep 'saveAs' > {code} > I.e., it's not there. It is in the hadoop2 distribution: > {code} > jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar | grep 'rdd/RDD' | grep 'saveAs' > 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class > 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class > {code}
[jira] [Comment Edited] (SPARK-2292) NullPointerException in JavaPairRDD.mapToPair
[ https://issues.apache.org/jira/browse/SPARK-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178879#comment-14178879 ] Pat Ferrel edited comment on SPARK-2292 at 10/21/14 7:21 PM: - If this is related to SPARK-2075 the answer is: If you need to build Spark, use "mvn install" not "mvn package" and build Spark before you build your-project. Then when you build your-project it will get the exact Spark bits needed from your Maven cache instead of potentially incompatible ones from the repos. If this applies it's a nasty one for the whole maven echo-system to solve. First step might be changing those build instructions on the Spark site. was (Author: pferrel): If this is related to SPARK-2075 the answer is: If you need to build Spark, use "mvn install" not "mvn package" and build Spark before you build you-project. Then when you build your-project it will get the exact Spark bits needed from your Maven cache instead of potentially incompatible ones from the repos. If this applies it's a nasty one for the whole maven echo-system to solve. First step might be changing those build instructions on the Spark site. > NullPointerException in JavaPairRDD.mapToPair > - > > Key: SPARK-2292 > URL: https://issues.apache.org/jira/browse/SPARK-2292 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: Spark 1.0.0, Standalone with the master & single slave > running on Ubuntu on a laptop. 4G mem and 8 cores were available to the > executor . >Reporter: Bharath Ravi Kumar > Attachments: SPARK-2292-aash-repro.tar.gz > > > Correction: Invoking JavaPairRDD.mapToPair results in an NPE: > {noformat} > 14/06/26 21:05:35 WARN scheduler.TaskSetManager: Loss was due to > java.lang.NullPointerException > java.lang.NullPointerException > at > org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750) > at > org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95) > at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582) > at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at org.apache.spark.scheduler.Task.run(Task.scala:51) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > This occurs only after migrating to the 1.0.0 API. 
The details of the code > the data file used to test are included in this gist : > https://gist.github.com/reachbach/d8977c8eb5f71f889301
[jira] [Commented] (SPARK-2292) NullPointerException in JavaPairRDD.mapToPair
[ https://issues.apache.org/jira/browse/SPARK-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178879#comment-14178879 ] Pat Ferrel commented on SPARK-2292: --- If this is related to SPARK-2075 the answer is: if you need to build Spark, use "mvn install", not "mvn package", and build Spark before you build your project. Then when you build your project it will get the exact Spark bits needed from your Maven cache instead of potentially incompatible ones from the repos. If this applies, it's a nasty one for the whole Maven ecosystem to solve. A first step might be changing the build instructions on the Spark site.
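In command form, the build order being recommended (directory names are illustrative):

{code}
# Build the Spark you will actually deploy and install it into the local cache:
cd spark
mvn -DskipTests clean install   # "install", not "package"

# Then build the downstream project; it now resolves Spark from ~/.m2,
# not from a potentially differently-built artifact in the public repos:
cd ../your-project
mvn clean package
{code}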
[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178852#comment-14178852 ] Pat Ferrel commented on SPARK-2075: --- OK, solved. The WAG worked. Instead of 'mvn package ...' use 'mvn install ...' so you get the exact version of Spark needed in your local Maven cache at ~/.m2. This should be changed in the instructions until some way to reliably point to the right version of Spark in the Maven repos is implemented. In my case I had to build Spark to target Hadoop 1.2.1, but the Maven repo artifact was not built for that.
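For this particular case the install step would look something like the following (flags follow the Spark 1.x build docs for a Hadoop 1.2.1 target; adjust to your environment):

{code}
mvn -Dhadoop.version=1.2.1 -DskipTests clean install
{code}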
[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178720#comment-14178720 ] Pat Ferrel commented on SPARK-2075: --- Trying mvn install instead of the documented mvn package to put Spark in the Maven cache, so that when building Mahout it will get exactly the same bits as will later run on the cluster--a WAG.
[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178688#comment-14178688 ] Pat Ferrel commented on SPARK-2075: --- Oops, right. But the function name is being constructed at Mahout build time, right? So either the rules for constructing the name are different when building Spark and Mahout, OR the function is not being put in the Spark jars? Rebuilding yet again, so I can't check the jars just yet. This error was from a clean build of Spark 1.1.0 and Mahout, in that sequence. I cleaned the .m2 repo and see that it is filled in only when Mahout is built, and is filled in with a repo version of Spark 1.1.0. Could the problem be related to the fact that the Spark executed on my standalone cluster is from a local build, but when Mahout is built it uses the repo version of Spark? Since I am using Hadoop 1.2.1, I suspect the repo version may not be exactly what I built.
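One way to test that suspicion is to compare the artifact the downstream build compiled against with the assembly the cluster actually runs; a sketch, with example paths for a Spark 1.1.0 / Hadoop 1.2.1 build:

{code}
# Classes the downstream build linked against:
jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.1.0/spark-core_2.10-1.1.0.jar \
  | grep 'rdd/RDD' | grep 'saveAs'

# Classes the workers will actually load (path depends on how Spark was deployed):
jar tvf $SPARK_HOME/lib/spark-assembly-1.1.0-hadoop1.2.1.jar \
  | grep 'rdd/RDD' | grep 'saveAs'
{code}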
[jira] [Comment Edited] (SPARK-2075) Anonymous classes are missing from Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177748#comment-14177748 ] Pat Ferrel edited comment on SPARK-2075 at 10/21/14 12:57 AM: -- Is there any more on this? Building Spark from the 1.1.0 tar for Hadoop 1.2.1--all is well. Trying to upgrade Mahout to use Spark 1.1.0. The Mahout 1.0-snapshot source builds and build tests pass with spark 1.1.0 as a maven dependency. Running the Mahout build on some bigger data using my dev machine as a standalone single node Spark cluster. So the same code is running as executed the build tests, just in single node cluster mode. Also since I built Spark i assume it is using the artifact from my .m2 maven cache, but not 100% on that. Anyway I get the class not found error below. I assume the missing function is the anon function passed to the {code} rdd.map( {anon function} )saveAsTextFile {code} so shouldn't the function be in the Mahout jar (it isn't)? Isn't this function passed in from Mahout so I don't understand why it matters how Spark was built. Several other users are getting this for Spark 1.0.2. If we are doing something wrong in our build process we'd appreciate a pointer. Here's the error I get: 14/10/20 17:21:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 8.0 (TID 16, 192.168.0.2): java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 java.net.URLClassLoader$1.run(URLClassLoader.java:202) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:190) java.lang.ClassLoader.loadClass(ClassLoader.java:306) java.lang.ClassLoader.loadClass(ClassLoader.java:247) java.lang.Class.forName0(Native Method) java.lang.Class.forName(Class.java:249) org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1591) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1496) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1750) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) java.io.ObjectInputStream.readObject(ObjectInputStream.java:349) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) java.lang.Thread.run(Thread.java:695) was (Author: pferrel): Is there any more on this? Building Spark from the 1.1.0 tar for Hadoop 1.2.1--all is well. Trying to upgrade Mahout to use Spark 1.1.0. 
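For reference, a sketch of the user-side call in question (names and paths are illustrative). The missing RDD$$anonfun$saveAsTextFile$1 is a closure compiled inside Spark's own RDD.saveAsTextFile, so it ships in the Spark assembly on the workers rather than in the Mahout jar, which is why a mismatch between the compile-time Spark artifact and the deployed Spark shows up here.

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("save-example"))
val lines = sc.parallelize(Seq("a\t1", "b\t2"))
// The user-supplied anonymous function ends up in the application jar;
// the saveAsTextFile internals end up in spark-core / the assembly.
lines.map(line => line.toUpperCase).saveAsTextFile("hdfs:///tmp/example-output")
{code}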
[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177748#comment-14177748 ] Pat Ferrel commented on SPARK-2075:
---
Is there any more on this? Building Spark from the 1.1.0 tar for Hadoop 1.2.1 works fine. I'm trying to upgrade Mahout to use Spark 1.1.0.

The Mahout 1.0-snapshot source builds and the build tests pass with Spark 1.1.0 as a Maven dependency. I'm running the Mahout build on some bigger data using my dev machine as a standalone single-node Spark cluster, so the same code that passed the build tests is running, just in single-node cluster mode. Also, since I built Spark, I assume it is using the artifact from my .m2 Maven cache, but I'm not 100% sure of that.

Anyway, I get the class-not-found error below. I assume the missing function is the anonymous function passed to rdd.map({anon function}).saveAsTextFile(...), so shouldn't that function be in the Mahout jar (it isn't)? Isn't this function passed in from Mahout? I don't understand why it matters how Spark was built. Several other users are getting this for Spark 1.0.2. If we are doing something wrong in our build process we'd appreciate a pointer.

Here's the error I get:
{code}
14/10/20 17:21:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 8.0 (TID 16, 192.168.0.2): java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
    java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    java.security.AccessController.doPrivileged(Native Method)
    java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    java.lang.Class.forName0(Native Method)
    java.lang.Class.forName(Class.java:249)
    org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
    java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1591)
    java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1496)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1750)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
    java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
    org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    java.lang.Thread.run(Thread.java:695)
{code}
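For reference, the call pattern being described looks roughly like the following (a minimal Scala sketch with illustrative paths and variable names, not code from the Mahout build). The closure handed to map() is compiled into the application jar, whereas RDD$$anonfun$saveAsTextFile$1 is an anonymous class generated inside Spark's own RDD.saveAsTextFile, so it has to come from the Spark jars on the executor classpath rather than from the Mahout jar:
{code}
// Minimal sketch, assuming an existing SparkContext `sc` (e.g. in spark-shell).
val lines = sc.textFile("hdfs:///tmp/input")          // illustrative input path

// This anonymous function is compiled into the application (Mahout) jar...
val cleaned = lines.map(line => line.trim.toLowerCase)

// ...but saveAsTextFile internally relies on Spark's own anonymous class
// org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1, which the executor must
// be able to load from the Spark assembly on its classpath.
cleaned.saveAsTextFile("hdfs:///tmp/output")          // illustrative output path
{code}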
> Anonymous classes are missing from Spark distribution
> ------------------------------------------------------
>
>                 Key: SPARK-2075
>                 URL: https://issues.apache.org/jira/browse/SPARK-2075
>             Project: Spark
>          Issue Type: Bug
>          Components: Build, Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Paul R. Brown
>            Priority: Critical
>             Fix For: 1.0.1
>
>
> Running a job built against the Maven dep for 1.0.0 and the hadoop1
> distribution produces:
> {code}
> java.lang.ClassNotFoundException:
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
> {code}
> Here's what's in the Maven dep as of 1.0.0:
> {code}
> jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs'
> 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
> 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}
> And here's what's in the hadoop1 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar | grep 'rdd/RDD' | grep 'saveAs'
> {code}
> I.e., it's not there. It is in the hadoop2 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar | grep 'rdd/RDD' | grep 'saveAs'
> 1519 Mon
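The jar tvf checks above inspect the assemblies on disk; since the comment also mentions not being 100% sure which Spark artifact the job is actually using, a complementary check (a sketch, not part of the original report) is to ask the running JVM where it loaded Spark's RDD class from, e.g. in spark-shell:
{code}
// Prints the jar org.apache.spark.rdd.RDD was loaded from on the driver.
// (getCodeSource can be null for bootstrap-classpath classes, but not for
// classes loaded from Spark's own jars.)
println(classOf[org.apache.spark.rdd.RDD[_]]
  .getProtectionDomain.getCodeSource.getLocation)
{code}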