OutOfMemoryError after changing to .persist(StorageLevel.MEMORY_ONLY_SER)

2016-03-03 Thread Jake Yoon
Hi, Spark users.

I am getting the OutOfMemoryError: Java heap space shown below after changing
to StorageLevel.MEMORY_ONLY_SER.
MEMORY_AND_DISK_SER throws the same error.
I thought the DISK option would spill blocks that do not fit in memory to disk.
What could cause an OOM in this situation?

Is there a good approach to solving this issue?

FYI, I am using Spark 1.6.0 in standalone mode, with the master and a single
worker on the same node (16 cores, 64 GB of memory).
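For context, a minimal sketch of the persist change described above, together with one common mitigation; the RDD, app name, and partition count are placeholders, not part of the original job:

val sc = new SparkContext(new SparkConf().setAppName("persist-sketch").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 1000000)   // stand-in for the real RDD being cached

// With the *_SER levels each partition is serialized into one in-memory byte
// array before it is stored (visible in the trace below: MemoryStore.putArray ->
// dataSerialize -> ByteArrayOutputStream), so a single oversized partition can
// still exhaust the heap. Using more, smaller partitions is one common mitigation.
val persisted = rdd
  .repartition(200)                        // arbitrary count, tune to data size
  .persist(StorageLevel.MEMORY_ONLY_SER)   // MEMORY_AND_DISK_SER serializes the same way first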


16/03/03 11:09:27 WARN BlockManager: Putting block rdd_39_0 failed
16/03/03 11:09:27 ERROR Executor: Exception in task 0.0 in stage 17.0 (TID 37)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1196)
at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1202)
at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:136)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:800)
at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:676)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/03/03 11:09:27 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1196)
at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1202)
at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:136)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:800)
at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:676)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

Unshaded Google Guava classes in spark-network-common jar

2016-01-11 Thread Jake Yoon
I found unshaded Google Guava classes used internally in
spark-network-common while working with Elasticsearch.

The following link discusses the duplicate-dependency conflict caused by the
Guava classes and how I solved the build conflict:

https://discuss.elastic.co/t/exception-when-using-elasticsearch-spark-and-elasticsearch-core-together/38471/4
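For anyone hitting the same clash, one common workaround is to relocate Guava inside the application jar so it cannot conflict with the Guava classes Spark ships. A hedged sketch in build.sbt using sbt-assembly; the rename prefix "shaded.guava" is an arbitrary placeholder and may differ from what the linked thread describes:

// build.sbt fragment (sketch only): rewrite the application's Guava packages
// at assembly time so they never collide with Spark's copy on the classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.guava.@1").inAll
)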

Is this worth raising an issue?

-- 
Dynamicscope


Occasionally getting RpcTimeoutException

2015-11-01 Thread Jake Yoon
Hi Sparkers.

I am very new to Spark, and I am occasionally getting an RpcTimeoutException
with the following error.

15/11/01 22:19:46 WARN HeartbeatReceiver: Removing executor 0 with no recent heartbeats: 321792 ms exceeds timeout 30 ms
15/11/01 22:19:46 ERROR TaskSchedulerImpl: Lost executor 0 on 172.31.11.1: Executor heartbeat timed out after 321792 ms
15/11/01 22:19:46 WARN TaskSetManager: Lost task 0.0 in stage 755.0 (TID 755, 172.31.11.1): ExecutorLostFailure (executor 0 lost)
15/11/01 22:20:18 ERROR ContextCleaner: Error cleaning RDD 1775
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [300 seconds]. This timeout is controlled by spark.network.timeout
...
15/11/01 22:20:18 WARN BlockManagerMaster: Failed to remove RDD 1775 - Ask timed out on [Actor[akka.tcp://sparkExecutor@172.31.11.1:34987/user/BlockManagerEndpoint1#-787212020]] after [30 ms]. This timeout is controlled by spark.network.timeout
org.apache.spark.rpc.RpcTimeoutException: Ask timed out on [Actor[akka.tcp://sparkExecutor@172.31.11.1:34987/user/BlockManagerEndpoint1#-787212020]] after [30 ms]. This timeout is controlled by spark.network.timeout


And the following is my code:

val sessionsRDD = sessions.mapPartitions { valueIterator =>
  val conf = new SparkConf()
    .set("com.couchbase.nodes", confBd.value.get("com.couchbase.nodes").get)
    .set("com.couchbase.bucket.default", confBd.value.get("com.couchbase.bucket.default").get)
  val cbConf = CouchbaseConfig(conf)
  val bucket = CouchbaseConnection().bucket(cbConf, "default").async()
  if (valueIterator.isEmpty) {
    Iterator[JsonDocument]()
  } else LazyIterator {
    Observable
      .from(OnceIterable(valueIterator).toSeq)
      .flatMap(id => {
        Observable.defer[JsonDocument](toScalaObservable(bucket.get(id, classOf[JsonDocument])))
          .retryWhen(RetryBuilder.anyOf(classOf[BackpressureException])
            .max(5)
            .delay(Delay.exponential(TimeUnit.MILLISECONDS, 500, 1))
            .build())
      })
      .toBlocking
      .toIterable
      .iterator
  }
}

sessionsRDD.cache()

val sessionInfo = sessionsRDD
  .map(doc => {
    (0, 0, 0)
  })
  .count()
println(sessionInfo)

sessionsRDD.unpersist()



And this part gives me the error:

val sessionInfo = sessionsRDD
  .map(doc => {
    (0, 0, 0)
  })
  .count()
println(sessionInfo)

I tried increasing the timeout, but it has not really helped.
How can I solve this issue?
Any hint would be helpful.
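
For reference, the timeout named in the log is governed by spark.network.timeout, and executor heartbeats by spark.executor.heartbeatInterval. A minimal sketch of raising them on the SparkConf; the app name and values are arbitrary placeholders, not recommendations:

import org.apache.spark.SparkConf

// Sketch only: example values, tune to the actual workload.
val sparkConf = new SparkConf()
  .setAppName("couchbase-session-job")            // placeholder app name
  .set("spark.network.timeout", "600s")           // governs the "Ask timed out" / heartbeat timeouts in the log
  .set("spark.executor.heartbeatInterval", "60s") // keep well below spark.network.timeout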

Jake