Seems it is an OOM on the driver side when fetching task results. You can try increasing spark.driver.memory and spark.driver.maxResultSize.
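For example, something like the following at submit time (the 4g/2g values and the script name are illustrative, not recommendations for any particular cluster; note spark.driver.memory has to be set before the driver JVM starts, so it belongs on spark-submit rather than in session code):

```shell
# Raise the driver heap, and raise the cap on the total size of
# serialized task results that may be pulled back to the driver.
spark-submit \
  --conf spark.driver.memory=4g \
  --conf spark.driver.maxResultSize=2g \
  your_job.py   # hypothetical script name
```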
On Tue, Apr 19, 2016 at 4:06 PM, 李明伟 <kramer2...@126.com> wrote:

> Hi Zhan Zhang
>
> Please see the exception trace below. It is saying there is a GC overhead
> limit error. I am not a Java or Scala developer, so it is hard for me to
> understand this information, and reading a core dump is too difficult for me.
>
> I am not sure if the way I am using Spark is correct. I understand that
> Spark can do batch or stream calculation, but my way is to set up a forever
> loop to handle continuously incoming data. Not sure if that is the right
> way to use Spark.
>
> 16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread task-result-getter-2
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>     at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
>     at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
>     at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
>     at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>     at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
>     at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
>     at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>     at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>     at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>     at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
>     at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>     at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
>     at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>     at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>     at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
> Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
>     [stack trace identical to the one above]
>
> At 2016-04-19 13:10:20, "Zhan Zhang" <zzh...@hortonworks.com> wrote:
>
> What kind of OOM?
> Driver or executor side? You can use a core dump to find out what caused
> the OOM.
>
> Thanks.
>
> Zhan Zhang
>
> On Apr 18, 2016, at 9:44 PM, 李明伟 <kramer2...@126.com> wrote:
>
> Hi Samaga
>
> Thanks very much for your reply, and sorry for the late response.
>
> Cassandra or Hive is a good suggestion. However, in my situation I am not
> sure it will make sense. My requirement is to generate a report from the
> most recent 24 hours of data, every 5 minutes. So with Cassandra or Hive,
> Spark would have to read 24 hours of data every 5 minutes, and a big part
> of that data (23 hours or more) would be read repeatedly.
>
> The window in Spark is for stream computing. I did not use it, but I will
> consider it.
>
> Thanks again
>
> Regards
> Mingwei
>
> At 2016-04-11 19:09:48, "Lohith Samaga M" <lohith.sam...@mphasis.com> wrote:
>
> Hi Kramer,
>     Some options:
>     1. Store in Cassandra with TTL = 24 hours. When you read the full
>        table, you get the latest 24 hours of data.
>     2. Store in Hive as an ORC file and use a timestamp field to filter
>        out the old data.
>     3. Try windowing in Spark or Flink (I have not used either).
>
> Best regards / Mit freundlichen Grüßen / Sincères salutations
> M. Lohith Samaga
>
> -----Original Message-----
> From: kramer2...@126.com [mailto:kramer2...@126.com]
> Sent: Monday, April 11, 2016 16.18
> To: user@spark.apache.org
> Subject: Why Spark having OutOfMemory Exception?
>
> I use Spark to do some very simple calculation. The description is like
> below (pseudo code):
>
>     while timestamp == 5 minutes:         # forever loop, once per 5-minute window
>         df = read_hdf()                   # read HDFS to get a dataframe every 5 minutes
>         my_dict[timestamp] = df           # put the dataframe into a dict
>         delete_old_dataframe(my_dict)     # delete old dataframes (timestamp more than 24 hours ago)
>         big_df = merge(my_dict)           # merge the recent 24 hours of dataframes
>
> To explain:
> I have new files coming in every 5 minutes, but I need to generate a
> report on the most recent 24 hours of data. The "24 hours" concept means
> I need to delete the oldest dataframe every time I put a new one in. So I
> maintain a dict (my_dict in the code above) that maps timestamp ->
> dataframe. Every time I put a dataframe into the dict, I go through it
> and delete the old dataframes whose timestamps are more than 24 hours
> ago. After the delete and insert, I merge the dataframes in the dict into
> one big dataframe and run SQL on it to get my report.
>
> I want to know if anything is wrong with this model, because it becomes
> very slow after running for a while and hits OutOfMemory. I know that my
> memory is enough, and the file sizes are very small for test purposes, so
> there should not be a memory problem.
>
> I am wondering if there is a lineage issue, but I am not sure.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-having-OutOfMemory-Exception-tp26743.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

-- 
Best Regards

Jeff Zhang
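For reference, the rolling-window bookkeeping described in the original question can be sketched in plain Python. This is a minimal sketch, not the poster's actual code: `evict_old`, the timestamps, and the string values standing in for dataframes are all illustrative.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)

def evict_old(frames: dict, now: datetime) -> None:
    """Drop every entry whose timestamp falls outside the 24-hour window."""
    cutoff = now - WINDOW
    for ts in [ts for ts in frames if ts < cutoff]:
        del frames[ts]

# Stand-in values; in the real job these would be Spark dataframes.
frames = {}
now = datetime(2016, 4, 19, 16, 0)
frames[now - timedelta(hours=25)] = "too old"
frames[now - timedelta(hours=23)] = "recent"
frames[now] = "new"

evict_old(frames, now)
print(sorted(frames.values()))  # the 25-hour-old entry is gone
```

Note that with real Spark dataframes, deleting the dict entry only drops the Python reference; any cached data would also need to be released (e.g. unpersisted) for the eviction to free cluster memory.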