Re: Upgrade to Spark 1.2.1 using Guava

2015-02-28 Thread Erlend Hamnaberg
Yes. I ran into this problem with a Mahout snapshot and Spark 1.2.0, and never
really tried to figure out why it happened, since there were already too many
moving parts in my app. Obviously there is a classpath issue somewhere.

/Erlend
On 27 Feb 2015 22:30, Pat Ferrel p...@occamsmachete.com wrote:

 @Erlend hah, we were trying to merge your PR and ran into this—small
 world. You actually compile the JavaSerializer source in your project?

 @Marcelo do you mean by modifying spark.executor.extraClassPath on all
 workers? That didn’t seem to work.

 On Feb 27, 2015, at 1:23 PM, Erlend Hamnaberg erl...@hamnaberg.net
 wrote:

 Hi.

 I have had a similar issue. I had to pull the JavaSerializer source into
 my own project, just so I got the classloading of this class under control.

 This must be a class loader issue with Spark.

 -E

 On Fri, Feb 27, 2015 at 8:52 PM, Pat Ferrel p...@occamsmachete.com wrote:

 I understand that I need to supply Guava to Spark. The HashBiMap is
 created in the client and broadcast to the workers. So it is needed in
 both. To achieve this there is a deps.jar with Guava (and Scopt but that is
 only for the client). Scopt is found so I know the jar is fine for the
 client.

 I pass in the deps.jar to the context creation code. I’ve checked the
 content of the jar and have verified that it is used at context creation
 time.

 I register the serializer as follows:

 import com.esotericsoftware.kryo.Kryo
 import com.esotericsoftware.kryo.serializers.JavaSerializer
 import com.google.common.collect.HashBiMap
 import org.apache.spark.serializer.KryoRegistrator

 class MyKryoRegistrator extends KryoRegistrator {

   override def registerClasses(kryo: Kryo) = {
     val h: HashBiMap[String, Int] = HashBiMap.create[String, Int]()
     // kryo.addDefaultSerializer(h.getClass, new JavaSerializer())
     // just to be sure this does indeed get logged
     log.info("\n\n\nRegister Serializer for " + h.getClass.getCanonicalName + "\n\n\n")
     kryo.register(classOf[com.google.common.collect.HashBiMap[String, Int]], new JavaSerializer())
   }
 }
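
 A registrator like this only takes effect if Spark is actually told to use Kryo and pointed at the class. A minimal sketch of that wiring, assuming the registrator lives in package my and the dependency jar is the deps.jar described above (both names are assumptions):

 import org.apache.spark.{SparkConf, SparkContext}

 // Sketch only: the package name "my" and the deps.jar path are assumptions.
 val conf = new SparkConf()
   .setAppName("guava-broadcast-example")
   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   .set("spark.kryo.registrator", "my.MyKryoRegistrator")
   .setJars(Seq("deps.jar")) // ships Guava (and Scopt) to the executors

 val sc = new SparkContext(conf)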

 The job proceeds up until the broadcast value, a HashBiMap, is
 deserialized, which is where I get the following error.

 Have I missed a step for deserialization of broadcast values? Odd that
 serialization happened but deserialization failed. I’m running on a
 standalone localhost-only cluster.
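
 For reference, the broadcast pattern being described looks roughly like the sketch below; the HashBiMap contents and the surrounding job are made up, and the comments mark where the failure reported next occurs:

 import com.google.common.collect.HashBiMap
 import org.apache.spark.SparkContext

 def broadcastExample(sc: SparkContext): Unit = {
   // Built on the driver (the "client" side)...
   val dict: HashBiMap[String, Int] = HashBiMap.create[String, Int]()
   dict.put("alpha", 0)
   dict.put("beta", 1)

   // ...serialized here with whatever serializer is registered for HashBiMap...
   val bcast = sc.broadcast(dict)

   // ...and deserialized lazily on each executor the first time .value is read.
   // That is the point where the ClassNotFoundException below appears if Guava
   // is missing from the executor classpath.
   val ids = sc.parallelize(Seq("alpha", "beta", "alpha"))
     .map(token => bcast.value.get(token))

   println(ids.collect().mkString(", "))
 }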


 15/02/27 11:40:34 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 (TID 9, 192.168.0.2): java.io.IOException: com.esotericsoftware.kryo.KryoException: Error during Java deserialization.
     at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1093)
     at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
     at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
     at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
     at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
     at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
     at my.TDIndexedDatasetReader$$anonfun$5.apply(TextDelimitedReaderWriter.scala:95)
     at my.TDIndexedDatasetReader$$anonfun$5.apply(TextDelimitedReaderWriter.scala:94)
     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
     at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:366)
     at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
     at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
     at org.apache.spark.scheduler.Task.run(Task.scala:56)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:745)
 Caused by: com.esotericsoftware.kryo.KryoException: Error during Java deserialization.
     at com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:42)
     at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
     at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:144)
     at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216)
     at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177)
     at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1090)
     ... 19 more

  == root error ==
 Caused by: java.lang.ClassNotFoundException: com.google.common.collect.HashBiMap
     at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Erlend Hamnaberg
Hi.

I have had a similar issue. I had to pull the JavaSerializer source into my
own project, just so I got the classloading of this class under control.
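
A rough sketch of that kind of workaround, assuming Kryo 2.x (the version Spark 1.2 ships with): a Java-serialization-based Kryo Serializer that resolves classes through the thread context classloader, which on an executor includes jars shipped with the job. The class name is illustrative, not the actual code that was pulled in:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream, ObjectStreamClass}

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}

class ContextClassLoaderJavaSerializer extends Serializer[AnyRef] {

  override def write(kryo: Kryo, output: Output, obj: AnyRef): Unit = {
    val bytes = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bytes)
    oos.writeObject(obj)
    oos.flush()
    output.writeInt(bytes.size())
    output.writeBytes(bytes.toByteArray)
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[AnyRef]): AnyRef = {
    val length = input.readInt()
    val in = new ByteArrayInputStream(input.readBytes(length))
    val ois = new ObjectInputStream(in) {
      // The crucial difference from plain Java deserialization: resolve classes
      // against the context classloader instead of the stream's default loader.
      override def resolveClass(desc: ObjectStreamClass): Class[_] =
        Class.forName(desc.getName, false, Thread.currentThread().getContextClassLoader)
    }
    ois.readObject()
  }
}

Such a serializer would then be registered in the KryoRegistrator in place of Kryo's JavaSerializer.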

This must be a class loader issue with Spark.

-E

On Fri, Feb 27, 2015 at 8:52 PM, Pat Ferrel p...@occamsmachete.com wrote:

 I understand that I need to supply Guava to Spark. The HashBiMap is
 created in the client and broadcast to the workers. So it is needed in
 both. To achieve this there is a deps.jar with Guava (and Scopt but that is
 only for the client). Scopt is found so I know the jar is fine for the
 client.

 I pass in the deps.jar to the context creation code. I’ve checked the
 content of the jar and have verified that it is used at context creation
 time.

 I register the serializer as follows:

 import com.esotericsoftware.kryo.Kryo
 import com.esotericsoftware.kryo.serializers.JavaSerializer
 import com.google.common.collect.HashBiMap
 import org.apache.spark.serializer.KryoRegistrator

 class MyKryoRegistrator extends KryoRegistrator {

   override def registerClasses(kryo: Kryo) = {
     val h: HashBiMap[String, Int] = HashBiMap.create[String, Int]()
     // kryo.addDefaultSerializer(h.getClass, new JavaSerializer())
     // just to be sure this does indeed get logged
     log.info("\n\n\nRegister Serializer for " + h.getClass.getCanonicalName + "\n\n\n")
     kryo.register(classOf[com.google.common.collect.HashBiMap[String, Int]], new JavaSerializer())
   }
 }

 The job proceeds up until the broadcast value, a HashBiMap, is
 deserialized, which is where I get the following error.

 Have I missed a step for deserialization of broadcast values? Odd that
 serialization happened but deserialization failed. I’m running on a
 standalone localhost-only cluster.


 15/02/27 11:40:34 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 (TID 9, 192.168.0.2): java.io.IOException: com.esotericsoftware.kryo.KryoException: Error during Java deserialization.
     at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1093)
     at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
     at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
     at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
     at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
     at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
     at my.TDIndexedDatasetReader$$anonfun$5.apply(TextDelimitedReaderWriter.scala:95)
     at my.TDIndexedDatasetReader$$anonfun$5.apply(TextDelimitedReaderWriter.scala:94)
     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
     at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:366)
     at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
     at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
     at org.apache.spark.scheduler.Task.run(Task.scala:56)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:745)
 Caused by: com.esotericsoftware.kryo.KryoException: Error during Java deserialization.
     at com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:42)
     at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
     at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:144)
     at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216)
     at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177)
     at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1090)
     ... 19 more

  == root error ==
 Caused by: java.lang.ClassNotFoundException: com.google.common.collect.HashBiMap
     at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
     at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  ...








 On Feb 25, 2015, at 5:24 PM, Marcelo Vanzin van...@cloudera.com wrote:

 Guava is not in Spark. (Well, long version: it's in Spark but it's
 relocated to a different package except for some special classes
 leaked through the public API.)

 If your app needs Guava, it needs to package Guava with it (e.g. by
 using maven-shade-plugin, or using --jars if only executors use
 Guava).
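
 For the --jars route, the programmatic equivalent looks roughly like the sketch below, with /path/to/deps.jar standing in for whatever jar bundles Guava, and the spark.executor.extraClassPath alternative from earlier in the thread shown alongside:

 import org.apache.spark.{SparkConf, SparkContext}

 // Sketch: /path/to/deps.jar is a placeholder for the jar that bundles Guava.

 // If the jar already exists at the same path on every worker, it can go
 // straight on the executor classpath (what spark.executor.extraClassPath
 // controls):
 val conf = new SparkConf()
   .setAppName("guava-example")
   .set("spark.executor.extraClassPath", "/path/to/deps.jar")

 val sc = new SparkContext(conf)

 // Otherwise, ship it with the job so executors fetch it at startup -- the
 // programmatic equivalent of spark-submit --jars. As noted above, this only
 // helps code running on executors; the driver still needs Guava on its own
 // classpath (e.g. via shading).
 sc.addJar("/path/to/deps.jar")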

 On Wed, Feb 25, 2015 at 5:17 PM, Pat Ferrel p...@occamsmachete.com wrote:
  The root Spark pom has guava set at a certain version number. It’s very
 hard
  to read the