Help: Driver OOM when shuffle large amount of data

2015-12-27 Thread kendal
My driver is running OOM with my 4T data set... I don't collect any data to
the driver. All the program does is map - reduce - saveAsTextFile. But the
number of partitions to be shuffled is quite large - 20K+.
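
For reference, here is a minimal sketch of the job shape in Scala; the paths,
the key extraction, and the exact partition count are placeholders, not my real
code:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of the job: map -> reduce (by key) -> saveAsTextFile.
// Paths and the key function are made up; only the shape matches my real job.
object ShuffleJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-job-sketch"))

    sc.textFile("hdfs:///data/input-4t")          // placeholder input path
      .map(line => (line.split("\t")(0), 1L))     // placeholder key extraction
      .reduceByKey(_ + _, 20000)                  // 20K+ partitions in the shuffle
      .saveAsTextFile("hdfs:///data/output")      // placeholder output path

    sc.stop()
  }
}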

The symptom I'm seeing is a timeout when the executors fetch the
GetMapOutputStatuses from the driver:
15/12/24 02:04:21 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 0, fetching them
15/12/24 02:04:21 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@10.115.58.55:52077/user/MapOutputTracker#-1937024516])
15/12/24 02:06:21 WARN akka.AkkaRpcEndpointRef: Error sending message [message = GetMapOutputStatuses(0)] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)

But the root cause is OOM:
15/12/24 02:05:36 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-24] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:178)
    at akka.serialization.JavaSerializer.toBinary(Serializer.scala:131)
    at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36)
    at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
    at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at akka.remote.EndpointWriter.serializeMessage(Endpoint.scala:842)
    at akka.remote.EndpointWriter.writeSend(Endpoint.scala:743)
    at akka.remote.EndpointWriter$$anonfun$4.applyOrElse(Endpoint.scala:718)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)

I've already allocated 16G of memory for my driver - which is the hard upper
limit of my YARN cluster - and I've also enabled Kryo serialization... Any idea
how to reduce the memory footprint?
And what confuses me is: even though I have 20K+ partitions to shuffle, why do
I need so much memory?!
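
For reference, a sketch of the settings I am referring to (the config keys are
the standard Spark ones; driver memory itself is passed to spark-submit in my
case, since it has to be set before the driver JVM starts):

import org.apache.spark.SparkConf

// Sketch of the relevant settings; spark.driver.memory=16g is actually given
// on the command line, e.g. spark-submit --driver-memory 16g ...
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo, as mentioned
  .set("spark.rpc.askTimeout", "120s") // the timeout named in the warning above (120 seconds)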

Thank you so much for any help!



RE: FileNotFoundException in appcache shuffle files

2015-12-10 Thread kendal
I have similar issues... the exception only shows up with very large data.
And I tried to double the memory and the number of partitions, as suggested by
some Google searches, but in vain...
Any idea?






Help: Get Timeout error and FileNotFoundException when shuffling large files

2015-12-10 Thread kendal
Hi there, 
My application is quite simple: it just reads huge files from HDFS with
textFile(), maps them to tuples, then does a reduceByKey(), and finally a
saveAsTextFile().

The problem is that with large inputs (2.5T), when the application enters the
2nd stage - the reduce by key - it fails with a FileNotFoundException while
trying to fetch the temp shuffle files. I also see a timeout (120s) error
before that exception. There is no other exception or error (OOM, too many
open files, etc.).

I have done a lot of Google searching and tried to increase the executor
memory, repartition the RDD into more splits, etc., but in vain (see the
sketch after this paragraph).
I also found another post here:
http://permalink.gmane.org/gmane.comp.lang.scala.spark.user/5449
which has exactly the same problem as mine.
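
Here is a rough sketch of those attempts; the partition count, memory value,
paths, and key extraction are illustrative only:

import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch of the tuning attempts; numbers and names are placeholders.
// More executor memory is a launch-time setting, e.g.
//   spark-submit --executor-memory 8g ...
val sc = new SparkContext(new SparkConf().setAppName("shuffle-tuning-sketch"))

val tuples = sc.textFile("hdfs:///data/input-2.5t")   // placeholder input path
  .map(line => (line.split("\t")(0), 1L))             // placeholder key extraction

// Spread the shuffle over more, smaller splits:
val reduced = tuples
  .repartition(40000)                                 // more splits before the shuffle
  .reduceByKey(_ + _)                                 // or reduceByKey(_ + _, 40000) directly

reduced.saveAsTextFile("hdfs:///data/output")         // placeholder output path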

Any idea? Thanks so much for the help.


