The data may not be large, but the driver needs to do a lot of bookkeeping. In 
your case, it is possible that the driver's control plane is taking too much memory.

I think you can find a Java developer to look at the coredump. Otherwise, it is 
hard to tell exactly which part is using all the memory.

Thanks.

Zhan Zhang


On Apr 20, 2016, at 1:38 AM, 李明伟 <kramer2...@126.com> wrote:

Hi

The input data size is less than 10M. The task result size should be smaller, I 
think, because I am doing aggregation and reduce on the data.





At 2016-04-20 16:18:31, "Jeff Zhang" <zjf...@gmail.com> wrote:
Do you mean the input data size is 10M, or the task result size?

>>> But my way is to set up a forever loop to handle continuously incoming data. Not sure if it is the right way to use Spark.
Not sure what this means. Do you use Spark Streaming, or do you run batch jobs in the forever loop?



On Wed, Apr 20, 2016 at 3:55 PM, 李明伟 <kramer2...@126.com> wrote:
Hi Jeff

The total size of my data is less than 10M. I already set the driver memory to 
4GB.







On 2016-04-20 13:42:25, "Jeff Zhang" <zjf...@gmail.com> wrote:
It seems the OOM is on the driver side, when fetching task results.

You can try increasing spark.driver.memory and spark.driver.maxResultSize.
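
A minimal sketch of how those two settings could be applied (the 4g/2g values are only illustrative, not recommendations):

# Illustrative only. spark.driver.memory has to be set before the driver JVM
# starts, so it is normally passed at launch time, e.g.:
#   spark-submit --driver-memory 4g --conf spark.driver.maxResultSize=2g my_job.py
# spark.driver.maxResultSize can also be set when the SparkContext is created:
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.driver.maxResultSize", "2g")
sc = SparkContext(conf=conf)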

On Tue, Apr 19, 2016 at 4:06 PM, 李明伟 <kramer2...@126.com> wrote:
Hi Zhan Zhang


Please see the exception trace below. It is reporting a GC overhead limit error.
I am not a Java or Scala developer, so it is hard for me to understand this information.
Reading a coredump is also too difficult for me.

I am not sure if the way I am using Spark is correct. I understand that Spark can 
do batch or stream computation, but my way is to set up a forever loop to handle 
continuously incoming data.
Not sure if it is the right way to use Spark.


16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread task-result-getter-2
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)





At 2016-04-19 13:10:20, "Zhan Zhang" <zzh...@hortonworks.com> wrote:
What kind of OOM, driver side or executor side? You can use a coredump to find 
what caused the OOM.

Thanks.

Zhan Zhang

On Apr 18, 2016, at 9:44 PM, 李明伟 <kramer2...@126.com> wrote:

Hi Samaga

Thanks very much for your reply, and sorry for the delayed response.

Cassandra or Hive is a good suggestion. However, I am not sure it makes sense in my situation.

My requirement is to generate a report from the most recent 24 hours of data, every 5 minutes.
So if I use Cassandra or Hive, Spark will have to read 24 hours of data every 
5 minutes, and a large part of that data (23 hours or more) will be read repeatedly.

Windowing in Spark is for stream computing. I have not used it yet, but I will 
consider it.
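
If windowing is tried, a minimal sketch with Spark Streaming could look like the code below. The input directory, the 5-minute batch interval, and the placeholder count are assumptions for illustration only, not something from this thread:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="RollingReport")
ssc = StreamingContext(sc, 300)                 # one micro-batch every 5 minutes

# Pick up new files as they land in the directory (hypothetical path).
lines = ssc.textFileStream("hdfs:///data/incoming")

# Sliding window covering the last 24 hours, advanced every 5 minutes.
# Spark has to keep the batches inside the window around, so this trades
# re-reading old files for holding the window's data in the cluster.
windowed = lines.window(24 * 3600, 300)

def report(rdd):
    # Placeholder for the real aggregation/reduce that produces the report.
    print("records in the last 24 hours: %d" % rdd.count())

windowed.foreachRDD(report)

ssc.start()
ssc.awaitTermination()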


Thanks again

Regards
Mingwei





At 2016-04-11 19:09:48, "Lohith Samaga M" <lohith.sam...@mphasis.com> wrote:
>Hi Kramer,
>       Some options:
>       1. Store in Cassandra with TTL = 24 hours. When you read the full table, you get the latest 24 hours of data.
>       2. Store in Hive as an ORC file and use a timestamp field to filter out the old data.
>       3. Try windowing in Spark or Flink (I have not used either).
>
>
>Best regards / Mit freundlichen Grüßen / Sincères salutations
>M. Lohith Samaga
>
>
>-----Original Message-----
>From: kramer2...@126.com [mailto:kramer2...@126.com]
>Sent: Monday, April 11, 2016 16.18
>To: user@spark.apache.org
>Subject: Why Spark having OutOfMemory Exception?
>
>I use Spark to do some very simple calculations. The description is like below (pseudo code):
>
>
>while True:    # runs every 5 minutes
>
>    df = read_hdf()                  # read HDFS to get a new dataframe every 5 minutes
>
>    my_dict[timestamp] = df          # put the data frame into a dict
>
>    delete_old_dataframe(my_dict)    # delete data frames whose timestamp is more than 24 hours old
>
>    big_df = merge(my_dict)          # merge the recent 24 hours of data frames
>
>To explain:
>
>New files come in every 5 minutes, but I need to generate a report on the most 
>recent 24 hours of data.
>The 24-hour window means I need to delete the oldest data frame every time I put 
>a new one in.
>So I maintain a dict (my_dict in the code above) that maps timestamp to dataframe. 
>Every time I put a dataframe into the dict, I go through the dict and delete the 
>old data frames whose timestamps are more than 24 hours ago.
>After the delete and insert, I merge the data frames in the dict into one big 
>data frame and run SQL on it to get my report.
>
>*
>I want to know if there is anything wrong with this model, because the job gets 
>very slow after running for a while and then hits OutOfMemory. I know that my 
>memory is enough, and the files are very small for test purposes, so there should 
>not be a memory problem.
>
>I am wondering if there is a lineage issue, but I am not sure.
>
>*
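>
>A minimal, concrete version of the loop described above might look like the sketch below. The HDFS path pattern, the JSON format and the COUNT(*) query are made-up placeholders; this only mirrors the dict/merge model for illustration, it is not a recommended design:
>
>from datetime import datetime, timedelta
>from functools import reduce
>import time
>
>def run_loop(sqlContext):
>    # sqlContext is an existing pyspark.sql.SQLContext (or HiveContext).
>    my_dict = {}                                      # timestamp -> DataFrame
>    while True:
>        now = datetime.now()
>        # Hypothetical layout: one directory of JSON per 5-minute interval.
>        df = sqlContext.read.json("hdfs:///data/%s" % now.strftime("%Y%m%d%H%M"))
>        my_dict[now] = df
>
>        # Drop data frames older than 24 hours so the dict stays bounded.
>        cutoff = now - timedelta(hours=24)
>        for ts in [t for t in my_dict if t < cutoff]:
>            del my_dict[ts]
>
>        # Merge the last 24 hours. The union of up to 288 data frames is
>        # rebuilt on every iteration, which makes for a fairly large query plan.
>        big_df = reduce(lambda a, b: a.unionAll(b), my_dict.values())
>        big_df.registerTempTable("recent")
>        sqlContext.sql("SELECT COUNT(*) FROM recent").show()   # placeholder report
>
>        time.sleep(300)                               # wake up every 5 minutes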
>
>
>
>--
>View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-having-OutOfMemory-Exception-tp26743.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>

--
Best Regards

Jeff Zhang

--
Best Regards

Jeff Zhang



