Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)
Thanks, that makes sense. So it must be that this queue, which is kept because of the UDF, is the one running out of memory: without the UDF field there is no out-of-memory error, and the UDF field itself is pretty small, so it is unlikely that it alone would take us above the memory limit. In any case, thanks for your help; I think I now understand how the UDFs and the other fields, together with the number of rows, can produce our out-of-memory scenario.

On Tue, Aug 9, 2016 at 5:06 PM, Davies Liu <dav...@databricks.com> wrote:
> When you have a Python UDF, only the inputs to the UDF are passed into the
> Python process, but all other fields that are used together with the result
> of the UDF are kept in a queue and then joined with the result from Python.
> The length of this queue depends on the number of rows being processed by
> Python (or sitting in the buffer of the Python process). The amount of
> memory required also depends on how many fields are used in the results.
>
> On Tue, Aug 9, 2016 at 11:09 AM, Zoltan Fedor <zoltan.1.fe...@gmail.com> wrote:
> >>> Does this mean you only have 1.6G memory for the executor (the rest
> >>> left for Python)? The cached table could take 1.5G, which means almost
> >>> nothing is left for other things.
> >> True. I have also tried with memoryOverhead set to 800 (10% of the 8Gb
> >> memory), but no difference. The "GC overhead limit exceeded" error is
> >> still the same.
> >>
> >>> Python UDFs do require some buffering in the JVM; the size of the
> >>> buffering depends on how many rows are being processed by the Python
> >>> process.
> >> I did some more testing in the meantime. Leaving the UDFs as-is, but
> >> removing some other, static columns from the above SELECT statement has
> >> stopped the memoryOverhead error from occurring. I have plenty of memory
> >> to store the results with all the static columns, and when the UDFs are
> >> removed and only the static columns remain, it runs fine. This makes me
> >> believe that having UDFs and many columns together causes the issue.
> >> Maybe when you have UDFs, the memory usage somehow depends on the amount
> >> of data in the whole row, including fields that are not actually used by
> >> the UDF. Maybe the UDF serialization to Python serializes the whole row
> >> instead of just the arguments of the UDF?
> >>
> >> On Mon, Aug 8, 2016 at 5:59 PM, Davies Liu <dav...@databricks.com> wrote:
> >>> On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor <zoltan.1.fe...@gmail.com> wrote:
> >>> > Hi all,
> >>> >
> >>> > I have an interesting issue trying to use UDFs from SparkSQL in Spark
> >>> > 2.0.0 using pyspark.
> >>> >
> >>> > There is a big table (5.6 billion rows, 450Gb in memory) loaded into
> >>> > 300 executors' memory in SparkSQL, on which we do some calculations
> >>> > using UDFs in pyspark. If I run my SQL on only a portion of the data
> >>> > (filtering by one of the attributes), let's say 800 million records,
> >>> > then all works well. But when I run the same SQL on all the data, I
> >>> > receive "java.lang.OutOfMemoryError: GC overhead limit exceeded" from
> >>> > basically all of the executors.
> >>> >
> >>> > It seems to me that pyspark UDFs in SparkSQL might have a memory
> >>> > leak, causing this "GC overhead limit exceeded".
> >>> >
> >>> > Details:
> >>> >
> >>> > - using Spark 2.0.0 on a Hadoop YARN cluster
> >>> >
> >>> > - 300 executors, each with 2 CPU cores and 8Gb memory
> >>> >   ( spark.yarn.executor.memoryOverhead=6400 )
> >>>
> >>> Does this mean you only have 1.6G memory for the executor (the rest
> >>> left for Python)? The cached table could take 1.5G, which means almost
> >>> nothing is left for other things.
> >>>
> >>> Python UDFs do require some buffering in the JVM; the size of the
> >>> buffering depends on how many rows are being processed by the Python
> >>> process.
> >>>
> >>> > - a table of 5.6 billion rows loaded into the memory of the executors
> >>> >   (taking up 450Gb of memory), partitioned evenly across the executors
> >>> >
> >>> > - creating even the simplest UDF in SparkSQL causes the 'GC overhead
> >>> >   limit exceeded' error if running on all records. Running the same
> >>> >   on a smaller dataset (~800 million rows) does succeed.
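For illustration, a minimal sketch of the workaround discussed above: apply the Python UDF in a projection that carries only the columns actually needed, so fewer fields sit in the JVM-side queue for each row that is out in the Python worker. The table and column names are the ones from this thread; treat this as a sketch of the idea, not a verified fix.

from pyspark.sql.types import StringType

def test_udf(var):
    """Test UDF that always returns 'a' (as in the thread)."""
    return "a"

sqlContext.registerFunction("test_udf", test_udf, StringType())

# Keep only the UDF input plus the one key needed downstream; the other
# static columns are left out of the projection, so the rows buffered in
# the JVM while Python is busy stay small.
slim_df = sqlContext.sql("""SELECT SOURCE,
    test_udf(STANDARD_ACCOUNT_STREET_SRC) AS TEST_UDF_OP
    FROM ma""")

If the remaining static columns are needed in the final result, they could be joined back on a key afterwards, trading the buffering cost for a join.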
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)
When you have a Python UDF, only the inputs to the UDF are passed into the Python process, but all other fields that are used together with the result of the UDF are kept in a queue and then joined with the result from Python. The length of this queue depends on the number of rows being processed by Python (or sitting in the buffer of the Python process). The amount of memory required also depends on how many fields are used in the results.

On Tue, Aug 9, 2016 at 11:09 AM, Zoltan Fedor <zoltan.1.fe...@gmail.com> wrote:
>>> Does this mean you only have 1.6G memory for the executor (the rest left
>>> for Python)? The cached table could take 1.5G, which means almost
>>> nothing is left for other things.
>> True. I have also tried with memoryOverhead set to 800 (10% of the 8Gb
>> memory), but no difference. The "GC overhead limit exceeded" error is
>> still the same.
>>
>>> Python UDFs do require some buffering in the JVM; the size of the
>>> buffering depends on how many rows are being processed by the Python
>>> process.
>> I did some more testing in the meantime. Leaving the UDFs as-is, but
>> removing some other, static columns from the above SELECT statement has
>> stopped the memoryOverhead error from occurring. I have plenty of memory
>> to store the results with all the static columns, and when the UDFs are
>> removed and only the static columns remain, it runs fine. This makes me
>> believe that having UDFs and many columns together causes the issue.
>> Maybe when you have UDFs, the memory usage somehow depends on the amount
>> of data in the whole row, including fields that are not actually used by
>> the UDF. Maybe the UDF serialization to Python serializes the whole row
>> instead of just the arguments of the UDF?
>>
>> On Mon, Aug 8, 2016 at 5:59 PM, Davies Liu <dav...@databricks.com> wrote:
>>> On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor <zoltan.1.fe...@gmail.com> wrote:
>>> > Hi all,
>>> >
>>> > I have an interesting issue trying to use UDFs from SparkSQL in Spark
>>> > 2.0.0 using pyspark.
>>> >
>>> > There is a big table (5.6 billion rows, 450Gb in memory) loaded into
>>> > 300 executors' memory in SparkSQL, on which we do some calculations
>>> > using UDFs in pyspark. If I run my SQL on only a portion of the data
>>> > (filtering by one of the attributes), let's say 800 million records,
>>> > then all works well. But when I run the same SQL on all the data, I
>>> > receive "java.lang.OutOfMemoryError: GC overhead limit exceeded" from
>>> > basically all of the executors.
>>> >
>>> > It seems to me that pyspark UDFs in SparkSQL might have a memory leak,
>>> > causing this "GC overhead limit exceeded".
>>> >
>>> > Details:
>>> >
>>> > - using Spark 2.0.0 on a Hadoop YARN cluster
>>> >
>>> > - 300 executors, each with 2 CPU cores and 8Gb memory
>>> >   ( spark.yarn.executor.memoryOverhead=6400 )
>>>
>>> Does this mean you only have 1.6G memory for the executor (the rest left
>>> for Python)? The cached table could take 1.5G, which means almost
>>> nothing is left for other things.
>>>
>>> Python UDFs do require some buffering in the JVM; the size of the
>>> buffering depends on how many rows are being processed by the Python
>>> process.
>>>
>>> > - a table of 5.6 billion rows loaded into the memory of the executors
>>> >   (taking up 450Gb of memory), partitioned evenly across the executors
>>> >
>>> > - creating even the simplest UDF in SparkSQL causes the 'GC overhead
>>> >   limit exceeded' error if running on all records. Running the same on
>>> >   a smaller dataset (~800 million rows) does succeed. If there is no
>>> >   UDF, the query succeeds on the whole dataset.
>>> >
>>> > - simplified pyspark code:
>>> >
>>> > from pyspark.sql.types import StringType
>>> >
>>> > def test_udf(var):
>>> >     """test udf that will always return a"""
>>> >     return "a"
>>> >
>>> > sqlContext.registerFunction("test_udf", test_udf, StringType())
>>> >
>>> > sqlContext.sql("""CACHE TABLE ma""")
>>> >
>>> > results_df = sqlContext.sql("""SELECT SOURCE, SOURCE_SYSTEM,
>>> >     test_udf(STANDARD_ACCOUNT_STREET_SRC) AS TEST_UDF_OP,
>>> >     ROUND(1.0 - (levenshtein(STANDARD_ACCOUNT_CITY_SRC, STANDARD_ACCOUNT_CITY_SRC)
>>> >         / CASE WHEN LENGTH(STANDARD_ACCOUNT_CITY_SRC) > LENGTH(STANDARD_ACCOUNT_CITY_SRC)
>>> >                THEN LENGTH(STANDARD_ACCOUNT_CITY_SRC)
>>> >                ELSE LENGTH(STANDARD_ACCOUNT_CITY_SRC)
>>> >           END), 2) AS SCORE_ED_STANDARD_ACCOUNT_CITY,
>>> >     STANDARD_ACCOUNT_STATE_SRC, STANDARD_ACCOUNT_STATE_UNIV
>>> >     FROM ma""")
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)
> Does this mean you only have 1.6G memory for the executor (the rest left
> for Python)? The cached table could take 1.5G, which means almost nothing
> is left for other things.

True. I have also tried with memoryOverhead set to 800 (10% of the 8Gb memory), but no difference. The "GC overhead limit exceeded" error is still the same.

> Python UDFs do require some buffering in the JVM; the size of the
> buffering depends on how many rows are being processed by the Python
> process.

I did some more testing in the meantime. Leaving the UDFs as-is, but removing some other, static columns from the above SELECT statement has stopped the memoryOverhead error from occurring. I have plenty of memory to store the results with all the static columns, and when the UDFs are removed and only the static columns remain, it runs fine. This makes me believe that having UDFs and many columns together causes the issue. Maybe when you have UDFs, the memory usage somehow depends on the amount of data in the whole row, including fields that are not actually used by the UDF. Maybe the UDF serialization to Python serializes the whole row instead of just the arguments of the UDF?

On Mon, Aug 8, 2016 at 5:59 PM, Davies Liu <dav...@databricks.com> wrote:
> On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor <zoltan.1.fe...@gmail.com> wrote:
> > Hi all,
> >
> > I have an interesting issue trying to use UDFs from SparkSQL in Spark
> > 2.0.0 using pyspark.
> >
> > There is a big table (5.6 billion rows, 450Gb in memory) loaded into 300
> > executors' memory in SparkSQL, on which we do some calculations using
> > UDFs in pyspark. If I run my SQL on only a portion of the data
> > (filtering by one of the attributes), let's say 800 million records,
> > then all works well. But when I run the same SQL on all the data, I
> > receive "java.lang.OutOfMemoryError: GC overhead limit exceeded" from
> > basically all of the executors.
> >
> > It seems to me that pyspark UDFs in SparkSQL might have a memory leak,
> > causing this "GC overhead limit exceeded".
> >
> > Details:
> >
> > - using Spark 2.0.0 on a Hadoop YARN cluster
> >
> > - 300 executors, each with 2 CPU cores and 8Gb memory
> >   ( spark.yarn.executor.memoryOverhead=6400 )
>
> Does this mean you only have 1.6G memory for the executor (the rest left
> for Python)? The cached table could take 1.5G, which means almost nothing
> is left for other things.
>
> Python UDFs do require some buffering in the JVM; the size of the
> buffering depends on how many rows are being processed by the Python
> process.
>
> > - a table of 5.6 billion rows loaded into the memory of the executors
> >   (taking up 450Gb of memory), partitioned evenly across the executors
> >
> > - creating even the simplest UDF in SparkSQL causes the 'GC overhead
> >   limit exceeded' error if running on all records. Running the same on a
> >   smaller dataset (~800 million rows) does succeed. If there is no UDF,
> >   the query succeeds on the whole dataset.
> >
> > - simplified pyspark code:
> >
> > from pyspark.sql.types import StringType
> >
> > def test_udf(var):
> >     """test udf that will always return a"""
> >     return "a"
> >
> > sqlContext.registerFunction("test_udf", test_udf, StringType())
> >
> > sqlContext.sql("""CACHE TABLE ma""")
> >
> > results_df = sqlContext.sql("""SELECT SOURCE, SOURCE_SYSTEM,
> >     test_udf(STANDARD_ACCOUNT_STREET_SRC) AS TEST_UDF_OP,
> >     ROUND(1.0 - (levenshtein(STANDARD_ACCOUNT_CITY_SRC, STANDARD_ACCOUNT_CITY_SRC)
> >         / CASE WHEN LENGTH(STANDARD_ACCOUNT_CITY_SRC) > LENGTH(STANDARD_ACCOUNT_CITY_SRC)
> >                THEN LENGTH(STANDARD_ACCOUNT_CITY_SRC)
> >                ELSE LENGTH(STANDARD_ACCOUNT_CITY_SRC)
> >           END), 2) AS SCORE_ED_STANDARD_ACCOUNT_CITY,
> >     STANDARD_ACCOUNT_STATE_SRC, STANDARD_ACCOUNT_STATE_UNIV
> >     FROM ma""")
> >
> > results_df.registerTempTable("m")
> > sqlContext.cacheTable("m")
> >
> > results_df = sqlContext.sql("""SELECT COUNT(*) FROM m""")
> > print(results_df.take(1))
> >
> > - the error thrown on the executors:
> >
> > 16/08/08 15:38:17 ERROR util.Utils: Uncaught exception in thread stdout
> > writer for /hadoop/cloudera/parcels/Anaconda/bin/python
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> >     at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:503)
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)
On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor <zoltan.1.fe...@gmail.com> wrote:
> Hi all,
>
> I have an interesting issue trying to use UDFs from SparkSQL in Spark
> 2.0.0 using pyspark.
>
> There is a big table (5.6 billion rows, 450Gb in memory) loaded into 300
> executors' memory in SparkSQL, on which we do some calculations using UDFs
> in pyspark. If I run my SQL on only a portion of the data (filtering by
> one of the attributes), let's say 800 million records, then all works
> well. But when I run the same SQL on all the data, I receive
> "java.lang.OutOfMemoryError: GC overhead limit exceeded" from basically
> all of the executors.
>
> It seems to me that pyspark UDFs in SparkSQL might have a memory leak,
> causing this "GC overhead limit exceeded".
>
> Details:
>
> - using Spark 2.0.0 on a Hadoop YARN cluster
>
> - 300 executors, each with 2 CPU cores and 8Gb memory
>   ( spark.yarn.executor.memoryOverhead=6400 )

Does this mean you only have 1.6G memory for the executor (the rest left for Python)? The cached table could take 1.5G, which means almost nothing is left for other things.

Python UDFs do require some buffering in the JVM; the size of the buffering depends on how many rows are being processed by the Python process.

> - a table of 5.6 billion rows loaded into the memory of the executors
>   (taking up 450Gb of memory), partitioned evenly across the executors
>
> - creating even the simplest UDF in SparkSQL causes the 'GC overhead
>   limit exceeded' error if running on all records. Running the same on a
>   smaller dataset (~800 million rows) does succeed. If there is no UDF,
>   the query succeeds on the whole dataset.
>
> - simplified pyspark code:
>
> from pyspark.sql.types import StringType
>
> def test_udf(var):
>     """test udf that will always return a"""
>     return "a"
>
> sqlContext.registerFunction("test_udf", test_udf, StringType())
>
> sqlContext.sql("""CACHE TABLE ma""")
>
> results_df = sqlContext.sql("""SELECT SOURCE, SOURCE_SYSTEM,
>     test_udf(STANDARD_ACCOUNT_STREET_SRC) AS TEST_UDF_OP,
>     ROUND(1.0 - (levenshtein(STANDARD_ACCOUNT_CITY_SRC, STANDARD_ACCOUNT_CITY_SRC)
>         / CASE WHEN LENGTH(STANDARD_ACCOUNT_CITY_SRC) > LENGTH(STANDARD_ACCOUNT_CITY_SRC)
>                THEN LENGTH(STANDARD_ACCOUNT_CITY_SRC)
>                ELSE LENGTH(STANDARD_ACCOUNT_CITY_SRC)
>           END), 2) AS SCORE_ED_STANDARD_ACCOUNT_CITY,
>     STANDARD_ACCOUNT_STATE_SRC, STANDARD_ACCOUNT_STATE_UNIV
>     FROM ma""")
>
> results_df.registerTempTable("m")
> sqlContext.cacheTable("m")
>
> results_df = sqlContext.sql("""SELECT COUNT(*) FROM m""")
> print(results_df.take(1))
>
> - the error thrown on the executors:
>
> 16/08/08 15:38:17 ERROR util.Utils: Uncaught exception in thread stdout
> writer for /hadoop/cloudera/parcels/Anaconda/bin/python
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>     at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:503)
>     at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:61)
>     at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$1.apply(BatchEvalPythonExec.scala:64)
>     at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$1.apply(BatchEvalPythonExec.scala:64)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>     at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1076)
>     at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1091)
>     at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1129)
>     at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>     at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
>     at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
>     at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
>     at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
> 16/08/08 15:38:17 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
>
> Has anybody experienced these "GC overhead limit exceeded" errors with
> pyspark UDFs before?
>
> Thanks,
> Zoltan
java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)
Hi all,

I have an interesting issue trying to use UDFs from SparkSQL in Spark 2.0.0 using pyspark.

There is a big table (5.6 billion rows, 450Gb in memory) loaded into 300 executors' memory in SparkSQL, on which we do some calculations using UDFs in pyspark. If I run my SQL on only a portion of the data (filtering by one of the attributes), let's say 800 million records, then all works well. But when I run the same SQL on all the data, I receive "java.lang.OutOfMemoryError: GC overhead limit exceeded" from basically all of the executors.

It seems to me that pyspark UDFs in SparkSQL might have a memory leak, causing this "GC overhead limit exceeded".

Details:

- using Spark 2.0.0 on a Hadoop YARN cluster

- 300 executors, each with 2 CPU cores and 8Gb memory
  ( spark.yarn.executor.memoryOverhead=6400 )

- a table of 5.6 billion rows loaded into the memory of the executors (taking up 450Gb of memory), partitioned evenly across the executors

- creating even the simplest UDF in SparkSQL causes the 'GC overhead limit exceeded' error if running on all records. Running the same on a smaller dataset (~800 million rows) does succeed. If there is no UDF, the query succeeds on the whole dataset.

- simplified pyspark code:

from pyspark.sql.types import StringType

def test_udf(var):
    """test udf that will always return a"""
    return "a"

sqlContext.registerFunction("test_udf", test_udf, StringType())

sqlContext.sql("""CACHE TABLE ma""")

results_df = sqlContext.sql("""SELECT SOURCE, SOURCE_SYSTEM,
    test_udf(STANDARD_ACCOUNT_STREET_SRC) AS TEST_UDF_OP,
    ROUND(1.0 - (levenshtein(STANDARD_ACCOUNT_CITY_SRC, STANDARD_ACCOUNT_CITY_SRC)
        / CASE WHEN LENGTH(STANDARD_ACCOUNT_CITY_SRC) > LENGTH(STANDARD_ACCOUNT_CITY_SRC)
               THEN LENGTH(STANDARD_ACCOUNT_CITY_SRC)
               ELSE LENGTH(STANDARD_ACCOUNT_CITY_SRC)
          END), 2) AS SCORE_ED_STANDARD_ACCOUNT_CITY,
    STANDARD_ACCOUNT_STATE_SRC, STANDARD_ACCOUNT_STATE_UNIV
    FROM ma""")

results_df.registerTempTable("m")
sqlContext.cacheTable("m")

results_df = sqlContext.sql("""SELECT COUNT(*) FROM m""")
print(results_df.take(1))

- the error thrown on the executors:

16/08/08 15:38:17 ERROR util.Utils: Uncaught exception in thread stdout writer for /hadoop/cloudera/parcels/Anaconda/bin/python
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:503)
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:61)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$1.apply(BatchEvalPythonExec.scala:64)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$1.apply(BatchEvalPythonExec.scala:64)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1076)
    at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1091)
    at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1129)
    at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
    at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
    at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
16/08/08 15:38:17 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

Has anybody experienced these "GC overhead limit exceeded" errors with pyspark UDFs before?

Thanks,
Zoltan
Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded
You may try to set the Hadoop conf "parquet.enable.summary-metadata" to false to disable writing Parquet summary files (_metadata and _common_metadata).

By default Parquet writes the summary files by collecting the footers of all part-files in the dataset while committing the job, and Spark follows this convention. However, it has turned out that the summary files aren't very useful in practice nowadays, unless you have other downstream tools that strictly depend on them. For example, if you don't need schema merging, Spark simply picks a random part-file to discover the dataset schema. If you do need schema merging, Spark has to read the footers of all part-files anyway (but in a distributed, parallel way).

Cheng

On 12/3/15 6:11 AM, Don Drake wrote:

Does anyone have any suggestions on creating a large number of parquet files? Especially in regards to the last phase, where it creates the _metadata.

Thanks.

-Don

On Sat, Nov 28, 2015 at 9:02 AM, Don Drake <dondr...@gmail.com> wrote:

I have a 2TB dataset in a DataFrame that I am attempting to partition by 2 fields, and my YARN job seems to write the partitioned dataset successfully. I can see the output in HDFS once all Spark tasks are done.

After the Spark tasks are done, the job appears to keep running for over an hour, until I get the following (full stack trace below):

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:238)

I had set the driver memory to 20GB.

I attempted to read the partitioned dataset back in and got another error saying the /_metadata directory was not a parquet file. I removed the _metadata directory and was able to query the data, but it appeared not to use the partitioned directories when I attempted to filter the data (it read every directory).

This is Spark 1.5.2, and I saw the same problem when running the code in both Scala and Python.

Any suggestions are appreciated.

-Don

15/11/25 00:00:19 ERROR datasources.InsertIntoHadoopFsRelation: Aborting job.
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:238)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:167)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:79)
    at org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:405)
    at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:433)
    at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:423)
    at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
    at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
    at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:208)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
    at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
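For anyone wanting to try Cheng's suggestion from pyspark, a hedged sketch: sc._jsc exposes the underlying JavaSparkContext, which is one way to reach the Hadoop Configuration from Python.

# Disable Parquet summary files; the committer then skips collecting
# every part-file footer on the driver at job-commit time.
sc._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")

In Scala the equivalent would be sc.hadoopConfiguration.set(...) before the write.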
Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded
Very interested in that topic too, thanks Cheng for the direction! We'll give it a try as well.

On 3 December 2015 at 01:40, Cheng Lian <lian.cs@gmail.com> wrote:
> You may try to set the Hadoop conf "parquet.enable.summary-metadata" to
> false to disable writing Parquet summary files (_metadata and
> _common_metadata).
>
> By default Parquet writes the summary files by collecting the footers of
> all part-files in the dataset while committing the job, and Spark follows
> this convention. However, it has turned out that the summary files aren't
> very useful in practice nowadays, unless you have other downstream tools
> that strictly depend on them. For example, if you don't need schema
> merging, Spark simply picks a random part-file to discover the dataset
> schema. If you do need schema merging, Spark has to read the footers of
> all part-files anyway (but in a distributed, parallel way).
>
> Cheng
>
> On 12/3/15 6:11 AM, Don Drake wrote:
>
> Does anyone have any suggestions on creating a large number of parquet
> files? Especially in regards to the last phase, where it creates the
> _metadata.
>
> Thanks.
>
> -Don
>
> On Sat, Nov 28, 2015 at 9:02 AM, Don Drake <dondr...@gmail.com> wrote:
>
>> I have a 2TB dataset in a DataFrame that I am attempting to partition by
>> 2 fields, and my YARN job seems to write the partitioned dataset
>> successfully. I can see the output in HDFS once all Spark tasks are done.
>>
>> After the Spark tasks are done, the job appears to keep running for over
>> an hour, until I get the following (full stack trace below):
>>
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>     at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:238)
>>
>> I had set the driver memory to 20GB.
>>
>> I attempted to read the partitioned dataset back in and got another error
>> saying the /_metadata directory was not a parquet file. I removed the
>> _metadata directory and was able to query the data, but it appeared not
>> to use the partitioned directories when I attempted to filter the data
>> (it read every directory).
>>
>> This is Spark 1.5.2, and I saw the same problem when running the code in
>> both Scala and Python.
>>
>> Any suggestions are appreciated.
>>
>> -Don
>>
>> 15/11/25 00:00:19 ERROR datasources.InsertIntoHadoopFsRelation: Aborting job.
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>     at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:238)
>>     at org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:167)
>>     at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:79)
>>     at org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:405)
>>     at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:433)
>>     at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:423)
>>     at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
>>     at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
>>     at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:208)
>>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
>>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>>     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>>     at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>>     at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>>     at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded
I have a 2TB dataset in a DataFrame that I am attempting to partition by 2 fields, and my YARN job seems to write the partitioned dataset successfully. I can see the output in HDFS once all Spark tasks are done.

After the Spark tasks are done, the job appears to keep running for over an hour, until I get the following (full stack trace below):

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:238)

I had set the driver memory to 20GB.

I attempted to read the partitioned dataset back in and got another error saying the /_metadata directory was not a parquet file. I removed the _metadata directory and was able to query the data, but it appeared not to use the partitioned directories when I attempted to filter the data (it read every directory).

This is Spark 1.5.2, and I saw the same problem when running the code in both Scala and Python.

Any suggestions are appreciated.

-Don

15/11/25 00:00:19 ERROR datasources.InsertIntoHadoopFsRelation: Aborting job.
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:238)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:167)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:79)
    at org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:405)
    at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:433)
    at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:423)
    at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
    at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
    at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:208)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
    at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
    at com.dondrake.qra.ScalaApp$.main(ScalaApp.scala:53)
    at com.dondrake.qra.ScalaApp.main(ScalaApp.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
15/11/25 00:00:20 ERROR actor.ActorSystemImpl: exception on LARS' timer thread
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:409)
    at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
    at java.lang.Thread.run(Thread.java:745)
15/11/25 00:00:20 ERROR akka.ErrorMonitor: exception on LARS' timer thread
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:409)
    at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
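A sketch of the two sides of this job under stated assumptions (the output path and partition column names are invented stand-ins for Don's two fields):

# Write, partitioned by two columns:
df.write.partitionBy("field1", "field2").parquet("hdfs:///output/partitioned")

# Read back; filtering on a partition column should let Spark prune the
# other directories, though Don reports it did not in his 1.5.2 run once
# the _metadata directory was removed, so verify against the input metrics.
df2 = sqlContext.read.parquet("hdfs:///output/partitioned")
subset = df2.where(df2.field1 == "a")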
newbie simple app, small data set: Py4JJavaError java.lang.OutOfMemoryError: GC overhead limit exceeded
Hi,

I am working on a Spark POC. I created an EC2 cluster on AWS using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2. Below is a simple Python program. I am running it from an IPython notebook; the notebook server is running on my Spark master. If I run my program more than once using my large data set, I get the GC out-of-memory error. I run it each time by "re-running the notebook cell". I can run my smaller set a couple of times without problems.

I launch using: pyspark --master $MASTER_URL --total-executor-cores 2

Any idea how I can debug this?

Kind regards

Andy

Using the master console I see:
* there is only one app running (this is what I expect)
* there are 2 workers, each on a different slave (this is what I expect)
* each worker is using 1 core (this is what I expect)
* each worker's memory usage is 6154 (seems reasonable)
* Alive Workers: 3
* Cores in use: 6 Total, 2 Used
* Memory in use: 18.8 GB Total, 12.0 GB Used
* Applications: 1 Running, 5 Completed
* Drivers: 0 Running, 0 Completed
* Status: ALIVE

The data file I am working with is small. I collected this data using the Spark Streaming Twitter utilities; all I do is capture tweets, convert them to JSON, and store them as strings to HDFS:

$ hadoop fs -count hdfs:///largeSample hdfs:///smallSample
13226   240098   3839228100   hdfs:///largeSample
    1   228156     39689877   hdfs:///smallSample

My Python code (I am using Python 3.4):

import json
import datetime

startTime = datetime.datetime.now()

#dataURL = "hdfs:///largeSample"
dataURL = "hdfs:///smallSample"
tweetStrings = sc.textFile(dataURL)

t2 = tweetStrings.take(2)
print(t2[1])
print("\n\nelapsed time:%s" % (datetime.datetime.now() - startTime))

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input> in <module>()
      8 tweetStrings = sc.textFile(dataURL)
      9
---> 10 t2 = tweetStrings.take(2)
     11 print(t2[1])
     12 print("\n\nelapsed time:%s" % (datetime.datetime.now() - startTime))

/root/spark/python/pyspark/rdd.py in take(self, num)
   1267         """
   1268         items = []
-> 1269         totalParts = self.getNumPartitions()
   1270         partsScanned = 0
   1271

/root/spark/python/pyspark/rdd.py in getNumPartitions(self)
    354         2
    355         """
--> 356         return self._jrdd.partitions().size()
    357
    358     def filter(self, f):

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538             self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     34     def deco(*a, **kw):
     35         try:
---> 36             return f(*a, **kw)
     37         except py4j.protocol.Py4JJavaError as e:
     38             s = e.java_exception.toString()

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298             raise Py4JJavaError(
    299                 'An error occurred while calling {0}{1}{2}.\n'.
--> 300                 format(target_id, '.', name), value)
    301         else:
    302             raise Py4JError(

Py4JJavaError: An error occurred while calling o65.partitions.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.String.<init>(String.java:207)
    at java.lang.String.substring(String.java:1969)
    at java.net.URI$Parser.substring(URI.java:2869)
    at java.net.URI$Parser.parseHierarchical(URI.java:3106)
    at java.net.URI$Parser.parse(URI.java:3053)
    at java.net.URI.<init>(URI.java:746)
    at org.apache.hadoop.fs.Path.initialize(Path.java:145)
    at org.apache.hadoop.fs.Path.<init>(Path.java:71)
    at org.apache.hadoop.fs.Path.<init>(Path.java:50)
    at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.getFullPath(HdfsFileStatus.java:215)
    at org.apache.hadoop.hdfs.DistributedFileSystem.makeQualified(DistributedFileSystem.java:293)
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:352)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:862)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:887)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:185)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
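One thing worth checking, stated as an assumption: the hadoop fs -count output above suggests the samples are spread over hundreds of thousands of small files, and the traceback dies on the driver inside partitions() while listing them (one input split per file), so the driver heap rather than the executors may be the bottleneck. If that holds, launching with more driver heap and reducing the partition count downstream could be tried; the flag and numbers below are illustrative, not a verified fix.

#   pyspark --master $MASTER_URL --total-executor-cores 2 --driver-memory 8g

tweetStrings = sc.textFile(dataURL)
print(tweetStrings.getNumPartitions())  # roughly one partition per small file
slimmer = tweetStrings.coalesce(64)     # fewer, larger partitions for later stages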
RE: [sparkR] Any insight on java.lang.OutOfMemoryError: GC overhead limit exceeded
This is probably because your config options actually do not take effect. Please refer to the email thread titled "How to set memory for SparkR with master="local[*]"", which may answer your question.

I recommend you try SparkR built from the master branch, which contains two fixes that may help in your use case:
https://issues.apache.org/jira/browse/SPARK-11340
https://issues.apache.org/jira/browse/SPARK-11258

BTW, it seems that there is a config conflict in your settings: spark.driver.memory="30g" vs. spark.driver.extraJavaOptions="-Xms5g -Xmx5g".

From: Dhaval Patel [mailto:dhaval1...@gmail.com]
Sent: Saturday, November 7, 2015 12:26 AM
To: Spark User Group
Subject: [sparkR] Any insight on java.lang.OutOfMemoryError: GC overhead limit exceeded

I have been struggling with this error for the past 3 days and have tried all possible ways/suggestions people have provided on Stack Overflow and here in this group.

I am trying to read a parquet file using sparkR and convert it into an R dataframe for further usage. The file size is not that big, ~4G and 250 million records. My standalone cluster has more than enough memory and processing power: 24 cores, 128 GB RAM. Here is the configuration to give an idea. I tried this on both Spark 1.4.1 and 1.5.1 (I have attached both stack traces/logs). The parquet file has 24 partitions.

spark.default.confs = list(spark.cores.max = "24",
                           spark.executor.memory = "50g",
                           spark.driver.memory = "30g",
                           spark.driver.extraJavaOptions = "-Xms5g -Xmx5g -XX:MaxPermSize=1024M")
sc <- sparkR.init(master = "local[24]", sparkEnvir = spark.default.confs)

... reading the parquet file and storing it in an R dataframe:

med.Rdf <- collect(mednew.DF)

--- Begin Message ---

Hi, Matej,

For the convenience of SparkR users who start SparkR without using bin/sparkR (for example, in RStudio), https://issues.apache.org/jira/browse/SPARK-11340 enables setting "spark.driver.memory" (and other similar options: spark.driver.extraClassPath, spark.driver.extraJavaOptions, spark.driver.extraLibraryPath) in the sparkEnvir parameter of sparkR.init() so that it takes effect. Would you like to give it a try? Note the change is on the master branch; you have to build Spark from source before using it.

From: Sun, Rui [mailto:rui@intel.com]
Sent: Monday, October 26, 2015 10:24 AM
To: Dirceu Semighini Filho
Cc: user
Subject: RE: How to set memory for SparkR with master="local[*]"

As documented in http://spark.apache.org/docs/latest/configuration.html#available-properties, note for "spark.driver.memory":

Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file.

If you start a SparkR shell using bin/sparkR, you can use bin/sparkR --driver-memory. You have no chance to set the driver memory size after the R shell has been launched via bin/sparkR.

But if you start a SparkR shell manually without using bin/sparkR (for example, in RStudio), you can:

library(SparkR)
Sys.setenv("SPARKR_SUBMIT_ARGS" = "--conf spark.driver.memory=2g sparkr-shell")
sc <- sparkR.init()

From: Dirceu Semighini Filho [mailto:dirceu.semigh...@gmail.com]
Sent: Friday, October 23, 2015 7:53 PM
Cc: user
Subject: Re: How to set memory for SparkR with master="local[*]"

Hi Matej,
I'm also using this and I'm seeing the same behavior here; my driver has only 530mb, which is the default value. Maybe this is a bug.

2015-10-23 9:43 GMT-02:00 Matej Holec <hol...@gmail.com>:

Hello!

How to adjust the memory settings properly for SparkR with master="local[*]" in R?

*When running from R -- SparkR doesn't accept memory settings :(*

I use the following commands:

R> library(SparkR)
R> sc <- sparkR.init(master = "local[*]",
       sparkEnvir = list(spark.driver.memory = "5g"))

Despite the variable spark.driver.memory being correctly set (checked in http://node:4040/environment/), the driver has only the default amount of memory allocated (Storage Memory 530.3 MB).

*But when running from spark-1.5.1-bin-hadoop2.6/bin/sparkR -- OK*

The following command:

]$ spark-1.5.1-bin-hadoop2.6/bin/sparkR --driver-memory 5g

creates a SparkR session with properly adjusted driver memory (Storage Memory 2.6 GB).

Any suggestions?

Thanks
Matej
[sparkR] Any insight on java.lang.OutOfMemoryError: GC overhead limit exceeded
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 7 ms. row count = 121970
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 7 ms. row count = 121969
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 7 ms. row count = 121969
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 8 ms. row count = 121962
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 10 ms. row count = 121966
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 10 ms. row count = 121972
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 9 ms. row count = 121963
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 10 ms. row count = 121969
15/11/06 10:45:20 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:39562 in memory (size: 2.4 KB, free: 530.0 MB)
15/11/06 10:45:20 INFO ContextCleaner: Cleaned accumulator 2
15/11/06 10:45:53 WARN ServletHandler: Error for /static/timeline-view.css
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.zip.ZipCoder.toString(ZipCoder.java:49)
    at java.util.zip.ZipFile.getZipEntry(ZipFile.java:567)
    at java.util.zip.ZipFile.access$900(ZipFile.java:61)
    at java.util.zip.ZipFile$ZipEntryIterator.next(ZipFile.java:525)
    at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:500)
    at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:481)
    at java.util.jar.JarFile$JarEntryIterator.next(JarFile.java:257)
    at java.util.jar.JarFile$JarEntryIterator.nextElement(JarFile.java:266)
    at java.util.jar.JarFile$JarEntryIterator.nextElement(JarFile.java:247)
    at org.spark-project.jetty.util.resource.JarFileResource.exists(JarFileResource.java:189)
    at org.spark-project.jetty.servlet.DefaultServlet.getResource(DefaultServlet.java:398)
    at org.spark-project.jetty.servlet.DefaultServlet.doGet(DefaultServlet.java:476)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
    at org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
    at org.spark-project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
    at org.spark-project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
    at org.spark-project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
    at org.spark-project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
    at org.spark-project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.spark-project.jetty.server.handler.GzipHandler.handle(GzipHandler.java:264)
    at org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.spark-project.jetty.server.Server.handle(Server.java:370)
    at org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
    at org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
    at org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:644)
    at org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
    at org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
15/11/06 10:45:57 WARN ServletHandler: Error for /static/bootstrap.min.css
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at sun.util.calendar.Gregorian.newCalendarDate(Gregorian.java:85)
    at sun.util.calendar.Gregorian.newCalendarDate(Gregorian.java:37)
    at java.util.Date.<init>(Date.java:254)
    at java.util.zip.ZipUtils.dosToJavaTime(ZipUtils.java:74)
    at java.util.zip.ZipFile.getZipEntry(ZipFile.java:570)
    at java.util.zip.ZipFile.access$900(ZipFile.java:61)
java.lang.OutOfMemoryError: GC overhead limit exceeded
I get java.lang.OutOfMemoryError: GC overhead limit exceeded when trying a count action on a file. The file is a CSV file, 217GB in size.

I'm using 10 r3.8xlarge (Ubuntu) machines, CDH 5.3.6 and Spark 1.2.0.

Configuration:

spark.app.id: local-1443956477103
spark.app.name: Spark shell
spark.cores.max: 100
spark.driver.cores: 24
spark.driver.extraLibraryPath: /opt/cloudera/parcels/CDH-5.3.6-1.cdh5.3.6.p0.11/lib/hadoop/lib/native
spark.driver.host: ip-172-31-34-242.us-west-2.compute.internal
spark.driver.maxResultSize: 300g
spark.driver.port: 55123
spark.eventLog.dir: hdfs://ip-172-31-34-242.us-west-2.compute.internal:8020/user/spark/applicationHistory
spark.eventLog.enabled: true
spark.executor.extraLibraryPath: /opt/cloudera/parcels/CDH-5.3.6-1.cdh5.3.6.p0.11/lib/hadoop/lib/native
spark.executor.id: driver
spark.executor.memory: 200g
spark.fileserver.uri: http://172.31.34.242:51424
spark.jars:
spark.master: local[*]
spark.repl.class.uri: http://172.31.34.242:58244
spark.scheduler.mode: FIFO
spark.serializer: org.apache.spark.serializer.KryoSerializer
spark.storage.memoryFraction: 0.9
spark.tachyonStore.folderName: spark-88bd9c44-d626-4ad2-8df3-f89df4cb30de
spark.yarn.historyServer.address: http://ip-172-31-34-242.us-west-2.compute.internal:18088

Here is what I ran:

val testrdd = sc.textFile("hdfs://ip-172-31-34-242.us-west-2.compute.internal:8020/user/jethro/tables/edw_fact_lsx_detail/edw_fact_lsx_detail.csv")
testrdd.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
testrdd.count()

If I don't force it into memory it works fine, but considering the cluster I'm running on, it should fit in memory properly.

Any ideas?
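A hedged variant of the snippet above, shown in pyspark for illustration (the original ran in the Scala shell): MEMORY_AND_DISK_SER spills the partitions that do not fit instead of failing the job, and whether the 217GB fits at all also depends on spark.storage.memoryFraction (0.9 here, which leaves very little heap for anything but the cache).

from pyspark import StorageLevel

testrdd = sc.textFile("hdfs://ip-172-31-34-242.us-west-2.compute.internal:8020"
                      "/user/jethro/tables/edw_fact_lsx_detail/edw_fact_lsx_detail.csv")
# Serialized in memory where a partition fits, spilled to local disk where it doesn't.
testrdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
testrdd.count()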
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
1.2.0 is quite old. You may want to try 1.5.1, which was released in the past week.

Cheers

> On Oct 4, 2015, at 4:26 AM, t_ras <marti...@netvision.net.il> wrote:
>
> I get java.lang.OutOfMemoryError: GC overhead limit exceeded when trying a
> count action on a file. The file is a CSV file, 217GB in size.
>
> I'm using 10 r3.8xlarge (Ubuntu) machines, CDH 5.3.6 and Spark 1.2.0.
>
> Configuration:
>
> spark.app.id: local-1443956477103
> spark.app.name: Spark shell
> spark.cores.max: 100
> spark.driver.cores: 24
> spark.driver.extraLibraryPath: /opt/cloudera/parcels/CDH-5.3.6-1.cdh5.3.6.p0.11/lib/hadoop/lib/native
> spark.driver.host: ip-172-31-34-242.us-west-2.compute.internal
> spark.driver.maxResultSize: 300g
> spark.driver.port: 55123
> spark.eventLog.dir: hdfs://ip-172-31-34-242.us-west-2.compute.internal:8020/user/spark/applicationHistory
> spark.eventLog.enabled: true
> spark.executor.extraLibraryPath: /opt/cloudera/parcels/CDH-5.3.6-1.cdh5.3.6.p0.11/lib/hadoop/lib/native
> spark.executor.id: driver
> spark.executor.memory: 200g
> spark.fileserver.uri: http://172.31.34.242:51424
> spark.jars:
> spark.master: local[*]
> spark.repl.class.uri: http://172.31.34.242:58244
> spark.scheduler.mode: FIFO
> spark.serializer: org.apache.spark.serializer.KryoSerializer
> spark.storage.memoryFraction: 0.9
> spark.tachyonStore.folderName: spark-88bd9c44-d626-4ad2-8df3-f89df4cb30de
> spark.yarn.historyServer.address: http://ip-172-31-34-242.us-west-2.compute.internal:18088
>
> Here is what I ran:
>
> val testrdd = sc.textFile("hdfs://ip-172-31-34-242.us-west-2.compute.internal:8020/user/jethro/tables/edw_fact_lsx_detail/edw_fact_lsx_detail.csv")
> testrdd.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
> testrdd.count()
>
> If I don't force it into memory it works fine, but considering the cluster
> I'm running on, it should fit in memory properly.
>
> Any ideas?
RE: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded
Hi – I'd like to follow up on this, as I am running into very similar issues (with a much bigger data set, though – 10^5 nodes, 10^7 edges). So let me repost the question: any ideas on how to estimate GraphX memory requirements?

Cheers!

From: Roman Sokolov [mailto:ole...@gmail.com]
Sent: Saturday, July 11, 2015 03:58
To: Ted Yu; Robin East; user
Subject: Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

Hello again. So I could compute triangle numbers when running the code from the spark shell without workers (with the --driver-memory 15g option), but with workers I get errors. So I run the spark shell:

./bin/spark-shell --master spark://192.168.0.31:7077 --executor-memory 6900m --driver-memory 15g

and the workers (by hand):

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://192.168.0.31:7077

(2 workers, each with 8Gb RAM; the master has 32Gb RAM). The code now is:

import org.apache.spark._
import org.apache.spark.graphx._

val graph = GraphLoader.edgeListFile(sc, "/home/data/graph.txt").partitionBy(PartitionStrategy.RandomVertexCut)
val newgraph = graph.convertToCanonicalEdges()
val triangleNum = newgraph.triangleCount().vertices.map(x => x._2.toLong).reduce(_ + _) / 3

So how can I tell what amount of memory is needed? And why do I need so much? The dataset is only 1.1Gb small...

Error:

[Stage 7: (0 + 8) / 32]15/07/11 01:48:45 WARN TaskSetManager: Lost task 2.0 in stage 7.0 (TID 130, 192.168.0.28): io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError
    at sun.misc.Unsafe.allocateMemory(Native Method)
    at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:127)
    at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
    at io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:440)
    at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:187)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:165)
    at io.netty.buffer.PoolArena.reallocate(PoolArena.java:277)
    at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:108)
    at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:146)
    ... 10 more

On 26 June 2015 at 14:06, Roman Sokolov <ole...@gmail.com> wrote:

Yep, I already found it. So I added 1 line:

val graph = GraphLoader.edgeListFile(sc, ...)
val newgraph = graph.convertToCanonicalEdges()

and could successfully count triangles on newgraph. Next I will test it on bigger (several Gb) networks.

I am using Spark 1.3 and 1.4 but haven't seen this function in https://spark.apache.org/docs/latest/graphx-programming-guide.html

Thanks a lot guys!

On 26.06.2015 at 13:50, Ted Yu <yuzhih...@gmail.com> wrote:

See SPARK-4917, which went into Spark 1.3.0.

On Fri, Jun 26, 2015 at 2:27 AM, Robin East <robin.e...@xense.co.uk> wrote:

You'll get this issue if you just take the first 2000 lines of that file. The problem is triangleCount() expects srcId < dstId, which is not the case in the file (e.g. vertex 28). You can get round this by calling graph.convertToCanonicalEdges(), which removes bi-directional edges and ensures srcId < dstId. Which version of Spark are you using?
Re: Strange Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
Yes there is. But the RDD is more than 10 TB and compression does not help. On Wed, Jul 15, 2015 at 8:36 PM, Ted Yu <yuzhih...@gmail.com> wrote: bq. serializeUncompressed() Is there a method which enables compression? Just wondering if that would reduce the memory footprint. Cheers On Wed, Jul 15, 2015 at 8:06 AM, Saeed Shahrivari <saeed.shahriv...@gmail.com> wrote:
Re: Strange Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
bq. serializeUncompressed() Is there a method which enables compression? Just wondering if that would reduce the memory footprint. Cheers On Wed, Jul 15, 2015 at 8:06 AM, Saeed Shahrivari <saeed.shahriv...@gmail.com> wrote:
Strange Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
I use a simple map/reduce step in a Java/Spark program to remove duplicated documents from a large (10 TB compressed) sequence file containing some HTML pages. Here is the partial code:

    JavaPairRDD<BytesWritable, NullWritable> inputRecords =
        sc.sequenceFile(args[0], BytesWritable.class, NullWritable.class).coalesce(numMaps);
    JavaPairRDD<String, AteshProtos.CacheDoc> hashDocs =
        inputRecords.mapToPair(t -> new Tuple2<>(BaseEncoding.base64()
            .encode(Hashing.sha1().hashString(doc.getUrl(), Charset.defaultCharset()).asBytes()), doc));
    JavaPairRDD<BytesWritable, NullWritable> byteArrays =
        hashDocs.reduceByKey((a, b) -> a.getUrl().length() < b.getUrl().length() ? a : b, numReds)
            .mapToPair(t -> new Tuple2<>(
                new BytesWritable(PentV3.buildFromMessage(t._2).serializeUncompressed()),
                NullWritable.get()));

The logic is simple. The map generates a SHA-1 signature from the HTML, and in the reduce phase we keep the HTML that has the shortest URL. However, after running for 2-3 hours the application crashes due to a memory issue. Here is the exception: 15/07/15 18:24:05 WARN scheduler.TaskSetManager: Lost task 267.0 in stage 0.0 (TID 267, psh-11.nse.ir): java.lang.OutOfMemoryError: GC overhead limit exceeded It seems that the map function keeps the hashDocs RDD in memory, and when the memory of an executor fills up, the application crashes. Persisting the map output to disk solves the problem; adding the following line between map and reduce fixes the issue:

    hashDocs.persist(StorageLevel.DISK_ONLY());

Is this a bug in Spark? How can I tell Spark not to keep even a bit of an RDD in memory? Thanks
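For reference, a minimal Scala sketch of the same shape (hash in the map, persist to disk, then reduce), assuming an available SparkContext sc; the Doc case class, the sample records, and all names here are illustrative, not taken from the post above:

    import java.security.MessageDigest
    import org.apache.spark.SparkContext._ // for reduceByKey on Spark < 1.3
    import org.apache.spark.storage.StorageLevel

    case class Doc(url: String, html: String) {
      // hex-encoded SHA-1 of the page content, used as the dedup key
      def contentHash: String =
        MessageDigest.getInstance("SHA-1")
          .digest(html.getBytes("UTF-8"))
          .map("%02x".format(_)).mkString
    }

    val docs = sc.parallelize(Seq(
      Doc("http://example.com/a/very/long/mirror/path", "<html>same page</html>"),
      Doc("http://example.com/a", "<html>same page</html>")))

    val hashDocs = docs.map(d => (d.contentHash, d))
      // spill the map output to disk so the shuffle input never has to fit in heap
      .persist(StorageLevel.DISK_ONLY)

    // keep the document with the shortest URL for each content hash
    val deduped = hashDocs.reduceByKey((a, b) => if (a.url.length < b.url.length) a else b)
    deduped.values.collect().foreach(d => println(d.url)) // prints http://example.com/a

DISK_ONLY trades CPU and I/O for heap: every record is serialized to local disk and read back for the reduce, which is usually the right trade at the 10 TB scale described here.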
Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded
Hello again. So I could compute triangle numbers when I ran the code from the Spark shell without workers (with the --driver-memory 15g option), but with workers I get errors. So I run the Spark shell:

    ./bin/spark-shell --master spark://192.168.0.31:7077 --executor-memory 6900m --driver-memory 15g

and the workers (by hand):

    ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://192.168.0.31:7077

(2 workers, each with 8GB RAM; the master has 32GB RAM). The code now is:

    import org.apache.spark._
    import org.apache.spark.graphx._
    val graph = GraphLoader.edgeListFile(sc, "/home/data/graph.txt").partitionBy(PartitionStrategy.RandomVertexCut)
    val newgraph = graph.convertToCanonicalEdges()
    val triangleNum = newgraph.triangleCount().vertices.map(x => x._2.toLong).reduce(_ + _) / 3

So how do I work out what amount of memory is needed? And why do I need so much? The dataset is only 1.1GB small... Error: [Stage 7: (0 + 8) / 32]15/07/11 01:48:45 WARN TaskSetManager: Lost task 2.0 in stage 7.0 (TID 130, 192.168.0.28): io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError at sun.misc.Unsafe.allocateMemory(Native Method) at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:127) at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) at io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:440) at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:187) at io.netty.buffer.PoolArena.allocate(PoolArena.java:165) at io.netty.buffer.PoolArena.reallocate(PoolArena.java:277) at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:108) at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:146) ... 10 more On 26 June 2015 at 14:06, Roman Sokolov <ole...@gmail.com> wrote:
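A rough, hedged intuition for the question above: a 1.1GB plain-text edge list expands several-fold once parsed into JVM objects (object headers, boxed vertex ids, hash-table overhead), and triangleCount() additionally materializes a set of neighbour ids for every vertex and makes those sets available at both ends of each edge during the shuffle, so peak usage an order of magnitude above the raw file size is plausible rather than surprising.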
Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded
Ok, but what does it mean? I did not change the core files of Spark, so is it a bug there? PS: on small datasets (500 MB) I have no problem. On 25.06.2015 18:02, Ted Yu <yuzhih...@gmail.com> wrote:
Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded
You’ll get this issue if you just take the first 2000 lines of that file. The problem is triangleCount() expects srcId < dstId, which is not the case in the file (e.g. vertex 28). You can get round this by calling graph.convertToCanonicalEdges(), which removes bi-directional edges and ensures srcId < dstId. Which version of Spark are you on? Can’t remember what version that method was introduced in. Robin On 26 Jun 2015, at 09:44, Roman Sokolov <ole...@gmail.com> wrote:
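To make the fix concrete, a small hedged sketch (a toy edge list, run in the shell against an existing sc) of what convertToCanonicalEdges() does before triangleCount():

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    // toy graph: one edge stored in both directions, one edge pointing "backwards"
    val edges: RDD[Edge[Int]] = sc.parallelize(Seq(
      Edge(1L, 2L, 1), Edge(2L, 1L, 1), // the same edge, bi-directional
      Edge(3L, 2L, 1)                   // srcId > dstId
    ))
    val graph = Graph.fromEdges(edges, defaultValue = 0)

    // rewrites every edge so that srcId < dstId and merges the duplicates,
    // which is exactly the canonical form triangleCount() assumes
    val canonical = graph.convertToCanonicalEdges()
    canonical.edges.collect().foreach(println) // Edge(1,2,1) and Edge(2,3,1)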
Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded
Yep, I already found it. So I added 1 line:

    val graph = GraphLoader.edgeListFile(sc, ..., ...)
    val newgraph = graph.convertToCanonicalEdges()

and could successfully count triangles on newgraph. Next I will test it on bigger (several GB) networks. I am using Spark 1.3 and 1.4 but haven't seen this function in https://spark.apache.org/docs/latest/graphx-programming-guide.html Thanks a lot guys! On 26.06.2015 13:50, Ted Yu <yuzhih...@gmail.com> wrote: See SPARK-4917 which went into Spark 1.3.0 On Fri, Jun 26, 2015 at 2:27 AM, Robin East <robin.e...@xense.co.uk> wrote:
Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded
Hello! I am trying to compute the number of triangles with GraphX, but I get a memory/heap-size error even though the dataset is very small (1GB). I run the code in spark-shell on a 16GB RAM machine (I also tried with 2 workers on separate machines, 8GB RAM each). So I have 15x more memory than the dataset size, but it is not enough. What should I do with terabyte-sized datasets? How do people process them? I have read a lot of documentation and 2 Spark books, and still have no clue :( I tried to run with these options, with no effect:

    ./bin/spark-shell --executor-memory 6g --driver-memory 9g --total-executor-cores 100

The code is simple:

    val graph = GraphLoader.edgeListFile(sc, "/home/ubuntu/data/soc-LiveJournal1/lj.stdout", edgeStorageLevel = StorageLevel.MEMORY_AND_DISK_SER, vertexStorageLevel = StorageLevel.MEMORY_AND_DISK_SER).partitionBy(PartitionStrategy.RandomVertexCut)
    println(graph.numEdges)
    println(graph.numVertices)
    val triangleNum = graph.triangleCount().vertices.map(x => x._2).reduce(_ + _) / 3

(The dataset is from here: http://konect.uni-koblenz.de/downloads/tsv/soc-LiveJournal1.tar.bz2 ; the first two lines contain % characters, so they have to be removed.) UPD: today I tried on a 32GB machine (from the spark shell again) and now got another error: [Stage 8: (0 + 4) / 32]15/06/25 13:03:05 WARN ShippableVertexPartitionOps: Joining two VertexPartitions with different indexes is slow. 15/06/25 13:03:05 ERROR Executor: Exception in task 3.0 in stage 8.0 (TID 227) java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:90) at org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:87) at org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:140) at org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:133) at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:159) at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:156) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.graphx.VertexRDD.compute(VertexRDD.scala:71) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- Best regards, Roman Sokolov
Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded
The assertion failure from TriangleCount.scala corresponds with the following lines:

    g.outerJoinVertices(counters) { (vid, _, optCounter: Option[Int]) =>
      val dblCount = optCounter.getOrElse(0)
      // double count should be even (divisible by two)
      assert((dblCount & 1) == 0)

Cheers On Thu, Jun 25, 2015 at 6:20 AM, Roman Sokolov <ole...@gmail.com> wrote:
Re: Spark job throwing “java.lang.OutOfMemoryError: GC overhead limit exceeded”
Hi Raj, Since the number of executor cores is the number of tasks that can execute in parallel in one executor, the 6G of executor memory you configured is in effect shared by 6 concurrent tasks, on top of the memory reserved for caching. I would suggest increasing executor-memory, and adjusting it again whenever you increase the number of executor cores. You might also want to adjust the split between caching and task execution via the spark.storage.memoryFraction config. By default it is 0.6 (60% of the memory is allocated to the cache); lowering it to a smaller fraction, say 0.4 or 0.3, would give you more available memory for task execution. Hope this helps! Thanks, Deng On Tue, Jun 16, 2015 at 3:09 AM, diplomatic Guru <diplomaticg...@gmail.com> wrote:
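To make that concrete, a hedged Spark 1.x sketch of the knobs Deng mentions (the values are illustrative starting points, not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("tuning-sketch")
      // more heap per executor, fewer concurrent tasks sharing it
      .set("spark.executor.memory", "8g")
      .set("spark.executor.cores", "4")
      // shrink the cache fraction (default 0.6 in Spark 1.x) to leave
      // more of the heap for task execution
      .set("spark.storage.memoryFraction", "0.3")
    val sc = new SparkContext(conf)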
Spark job throwing “java.lang.OutOfMemoryError: GC overhead limit exceeded”
Hello All, I have a Spark job that throws java.lang.OutOfMemoryError: GC overhead limit exceeded. The job is trying to process a file of about 4.5G. I've tried the following Spark configuration: --num-executors 6 --executor-memory 6G --executor-cores 6 --driver-memory 3G I tried increasing cores and executors, which sometimes works, but then it takes over 20 minutes to process the file. Could I do something to improve the performance, or stop the Java heap issue? Thank you. Best regards, Raj.
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Hi Antony, Did you get past this error by repartitioning your job into smaller tasks, as Sven Krasser pointed out? From: Antony Mayi <antonym...@yahoo.com> Reply-To: Antony Mayi <antonym...@yahoo.com> Date: Tuesday, January 27, 2015 at 5:24 PM To: Guru Medasani <gdm...@outlook.com>, Sven Krasser <kras...@gmail.com> Cc: Sandy Ryza <sandy.r...@cloudera.com>, user@spark.apache.org Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Hi Antony, If you look in the YARN NodeManager logs, do you see that it's killing the executors? Or are they crashing for a different reason? -Sandy On Tue, Jan 27, 2015 at 12:43 PM, Antony Mayi <antonym...@yahoo.com.invalid> wrote:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Hi, I am using spark.yarn.executor.memoryOverhead=8192 yet getting executors crashed with this error. Does that mean I genuinely do not have enough RAM, or is this a matter of config tuning? Other config options used: spark.storage.memoryFraction=0.3 SPARK_EXECUTOR_MEMORY=14G Running Spark 1.2.0 as yarn-client on a cluster of 10 nodes (the workload is ALS trainImplicit on a ~15GB dataset). Thanks for any ideas, Antony.
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Hi Antony, What is the total amount of memory in MB that can be allocated to containers on your NodeManagers? yarn.nodemanager.resource.memory-mb Can you check this configuration in the yarn-site.xml used by the NodeManager process? -Guru Medasani From: Sandy Ryza <sandy.r...@cloudera.com> Date: Tuesday, January 27, 2015 at 3:33 PM To: Antony Mayi <antonym...@yahoo.com> Cc: user@spark.apache.org Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
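For context, the arithmetic behind that question, based only on the settings quoted in this thread: with SPARK_EXECUTOR_MEMORY=14G and spark.yarn.executor.memoryOverhead=8192, each executor container asks YARN for roughly 14G + 8G = 22G, so yarn.nodemanager.resource.memory-mb needs to be at least about 22528 for even a single executor to fit on a node; if it is lower, YARN will either refuse to schedule the container or kill it.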
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Since it's an executor running OOM, it doesn't look like a container being killed by YARN to me. As a starting point, can you repartition your job into smaller tasks? -Sven On Tue, Jan 27, 2015 at 2:34 PM, Guru Medasani <gdm...@outlook.com> wrote:
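A hedged sketch of Sven's suggestion for this ALS workload, assuming ratings is the input RDD of MLlib Rating objects (the partition count and ALS parameters are illustrative, not tuned values):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    def trainWithSmallerTasks(ratings: RDD[Rating]) = {
      // more, smaller partitions means a smaller per-task working set,
      // which lowers the peak heap pressure on any single executor
      val repartitioned = ratings.repartition(400)
      ALS.trainImplicit(repartitioned, rank = 10, iterations = 10, lambda = 0.01, alpha = 40.0)
    }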
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Can you attach the logs where this is failing? From: Sven Krasser <kras...@gmail.com> Date: Tuesday, January 27, 2015 at 4:50 PM To: Guru Medasani <gdm...@outlook.com> Cc: Sandy Ryza <sandy.r...@cloudera.com>, Antony Mayi <antonym...@yahoo.com>, user@spark.apache.org Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
I have yarn configured with yarn.nodemanager.vmem-check-enabled=false and yarn.nodemanager.pmem-check-enabled=false to avoid yarn killing the containers. The stack trace is below. Thanks, Antony. 15/01/27 17:02:53 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM 15/01/27 17:02:53 ERROR executor.Executor: Exception in task 21.0 in stage 12.0 (TID 1312) java.lang.OutOfMemoryError: GC overhead limit exceeded at java.lang.Integer.valueOf(Integer.java:642) at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70) at scala.collection.mutable.ArrayOps$ofInt.apply(ArrayOps.scala:156) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:156) at scala.collection.SeqLike$class.distinct(SeqLike.scala:493) at scala.collection.mutable.ArrayOps$ofInt.distinct(ArrayOps.scala:156) at org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$makeOutLinkBlock(ALS.scala:404) at org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:459) at org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:456) at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614) at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:130) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:127) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:127) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) 15/01/27 17:02:53 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-8,5,main] java.lang.OutOfMemoryError: GC overhead limit exceeded at java.lang.Integer.valueOf(Integer.java:642) at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70) at scala.collection.mutable.ArrayOps$ofInt.apply(ArrayOps.scala:156) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:156) at scala.collection.SeqLike$class.distinct(SeqLike.scala:493) at scala.collection.mutable.ArrayOps$ofInt.distinct(ArrayOps.scala:156) at org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$makeOutLinkBlock(ALS.scala:404) at
org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:459) at org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:456) at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614) at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:130) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:127) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772
[Spark Streaming] java.lang.OutOfMemoryError: GC overhead limit exceeded
Hi guys, My Spark Streaming application hits this java.lang.OutOfMemoryError: GC overhead limit exceeded error in the driver program. I have done the following to debug it:
1. Increased the driver memory from 1GB to 2GB; the error then came after 22 hrs instead of 10 hrs, so I think it is a memory leak problem.
2. A few hours after starting the application, I killed all workers. The driver program kept running and kept filling up memory. I had been thinking it was because too many batches were in the queue, but obviously it is not; otherwise, after killing the workers (and, of course, the receiver), the memory usage should have gone down.
3. Ran a heap dump and the Leak Suspects report of Memory Analysis in Eclipse, and found that one instance of org.apache.spark.storage.BlockManager loaded by sun.misc.Launcher$AppClassLoader @ 0x6c002fb90 occupies 1,477,177,296 (72.70%) bytes. The memory is accumulated in one instance of java.util.LinkedHashMap loaded by the system class loader. Keywords: sun.misc.Launcher$AppClassLoader @ 0x6c002fb90, java.util.LinkedHashMap, org.apache.spark.storage.BlockManager
What my application mainly does is:
1. calculate the sum/count in a batch
2. get the average in the batch
3. store the result in DB
4. calculate the sum/count in a window
5. get the average/min/max in the window
6. store the result in DB
7. compare the current batch value with the previous batch value using updateStateByKey
Any hint what causes this leakage? Thank you. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108
Spark app throwing java.lang.OutOfMemoryError: GC overhead limit exceeded
I got a 40 node CDH 5.1 cluster and am attempting to run a simple Spark app that processes about 10-15GB of raw data, but I keep running into this error: java.lang.OutOfMemoryError: GC overhead limit exceeded Each node has 8 cores and 2GB memory. I notice the heap size on the executors is set to 512MB, while the total heap size available on each worker is 2GB. Wanted to know what the heap size needs to be set to for such data sizes, and whether anyone has input on other config changes that would help as well. Thanks for the input!
Re: Spark app throwing java.lang.OutOfMemoryError: GC overhead limit exceeded
(- incubator list, + user list) (Answer copied from original posting at http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-app-throwing-java-lang-OutOfMemoryError-GC-overhead-limit/m-p/16396#U16396 -- let's follow up in one place. If it's not specific to CDH, this is a good place to record a solution.) Where does the out of memory exception occur: in your driver, or an executor? I assume it is an executor. Yes, you are using the default of 512MB per executor. You can raise that with properties like spark.executor.memory, or flags like --executor-memory if using spark-shell. It sounds like your workers are allocating 2GB for executors, so you could potentially use up to 2GB per executor, and your 1 executor per machine would consume all of your Spark cluster memory. But more memory doesn't necessarily help if you're performing some operation that inherently allocates a great deal of memory. I'm not sure what your operations are. Keep in mind too that if you are caching RDDs in memory, this takes memory away from what's available for computations. On Mon, Aug 4, 2014 at 7:03 PM, buntu <buntu...@gmail.com> wrote:
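For the cluster described above, that could look like the following (a sketch; 2g simply matches the 2GB each worker is said to allocate):

    ./bin/spark-shell --executor-memory 2g

or, equivalently, setting the spark.executor.memory property to 2g in the application's configuration before the SparkContext is created.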
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Hi Yifan This works for me:

    export SPARK_JAVA_OPTS="-Xms10g -Xmx40g -XX:MaxPermSize=10g"
    export ADD_JARS=/home/abel/spark/MLI/target/MLI-assembly-1.0.jar
    export SPARK_MEM=40g
    ./spark-shell

Regards On Mon, Jul 21, 2014 at 7:48 AM, Yifan LI <iamyifa...@gmail.com> wrote: Hi, I am trying to load the GraphX example dataset (LiveJournal, 1.08GB) through the Scala shell on my standalone multicore machine (8 CPUs, 16GB mem), but an OutOfMemory error was returned when the code below was running: val graph = GraphLoader.edgeListFile(sc, path, minEdgePartitions = 16).partitionBy(PartitionStrategy.RandomVertexCut) I guess I should set some parameters for the JVM, like -Xmx5120m. But how do I do this in the Scala shell? I directly used bin/spark-shell to start Spark, and everything seems to work correctly in the WebUI. Or should I do the parameter setting somewhere else (spark-1.0.1)? Best, Yifan LI
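For what it's worth, a hedged alternative to a global SPARK_JAVA_OPTS, assuming your spark-shell release accepts the submit-style memory flags (they are used with spark-shell elsewhere on this list):

    ./bin/spark-shell --driver-memory 12g

In a local shell with no separate workers, the loaded graph lives in the driver JVM, so the driver flag is the one that matters; 12g is purely illustrative and must still fit within the machine's 16GB of physical RAM.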
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Thanks, Abel. Best, Yifan LI On Jul 21, 2014, at 4:16 PM, Abel Coronado Iruegas <acoronadoirue...@gmail.com> wrote:
java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)
Hi all, I faced the following exception during the map step: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded) java.lang.reflect.Array.newInstance(Array.java:70) com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:325) com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) I'm using Spark 1.0. In map I create a new object each time; as I understand it, I can't reuse objects the way I can in MapReduce development? I wonder if you could point me to how to avoid the GC overhead... thank you in advance. Thank you, Konstantin Kudryavtsev
Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)
There is a difference between actual GC overhead, which can be reduced by reusing objects, and this error, which effectively means you ran out of memory. This error can probably be relieved by increasing your executor heap size, unless your data is corrupt and it is allocating huge arrays, or you are otherwise keeping too much memory around.

For your other question: you can reuse objects similarly to MapReduce (HadoopRDD does this by actually using Hadoop's Writables, for instance), but the general Spark APIs don't support this because mutable objects are not friendly to caching or serializing.

On Tue, Jul 8, 2014 at 9:27 AM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote:
> Hi all, I ran into the following exception during the map step:
> java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)
> [...]
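[Editor's note: to make Aaron's point concrete, here is a minimal sketch of the one place per-record allocation can be safely amortized in Spark: inside a single partition's iterator, provided only immutable values escape it. The input path, object name, and the use of a StringBuilder are illustrative assumptions, not from the thread.

  import org.apache.spark.{SparkConf, SparkContext}

  object ReuseSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("reuse-sketch"))
      val lines = sc.textFile("hdfs:///path/to/input")  // illustrative path

      val lengths = lines.mapPartitions { iter =>
        val buffer = new StringBuilder               // one allocation per partition
        iter.map { line =>
          buffer.clear()                             // reuse instead of reallocating
          buffer.append(line.trim)
          buffer.length                              // emit an Int, never `buffer` itself
        }
      }

      println(lengths.reduce(_ + _))                 // total trimmed length
      sc.stop()
    }
  }

If the mutable buffer itself were emitted, caching or Kryo serialization downstream would observe a single shared, constantly overwritten object, which is exactly why the general Spark APIs avoid this pattern.]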
Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)
Hi Konstantin,

I just ran into the same problem. I mitigated it by reducing the number of cores when I executed the job, which otherwise would not finish. Contrary to what many people believe, this error does not necessarily mean you were running out of memory. A better answer can be found here: http://stackoverflow.com/questions/4371505/gc-overhead-limit-exceeded and is copied here as a reference:

Excessive GC Time and OutOfMemoryError

The concurrent collector will throw an OutOfMemoryError if too much time is being spent in garbage collection: if more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered, an OutOfMemoryError will be thrown. This feature is designed to prevent applications from running for an extended period of time while making little or no progress because the heap is too small. If necessary, this feature can be disabled by adding the option -XX:-UseGCOverheadLimit to the command line. The policy is the same as that in the parallel collector, except that time spent performing concurrent collections is not counted toward the 98% time limit. In other words, only collections performed while the application is stopped count toward excessive GC time. Such collections are typically due to a concurrent mode failure or an explicit collection request (e.g., a call to System.gc()).

It could be that there are many tasks running on the same node, all competing to run GCs, which slows things down and triggers the error you saw. By reducing the number of cores, more CPU resources are available to each task, so the GC can finish before the error gets thrown.

HTH,
Jerry

On Tue, Jul 8, 2014 at 1:35 PM, Aaron Davidson ilike...@gmail.com wrote:
> There is a difference between actual GC overhead, which can be reduced by reusing objects, and this error, which effectively means you ran out of memory.
> [...]
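[Editor's note: a hedged sketch of how Jerry's two mitigations might be expressed as Spark configuration. The property names are standard Spark ones, but check your deployment mode (on YARN, the --executor-cores submit flag is the usual way to cap cores), and the values are illustrative. Disabling the 98%-GC-time check only hides the symptom; the job will still crawl if the heap is genuinely too small.

  import org.apache.spark.{SparkConf, SparkContext}

  object GcMitigation {
    def main(args: Array[String]): Unit = {
      // Fewer cores per executor => fewer concurrent tasks, so fewer
      // per-task buffers compete for the same heap and for GC time.
      val conf = new SparkConf()
        .setAppName("gc-overhead-mitigation")
        .set("spark.executor.cores", "2")  // illustrative value
        // Optionally disable the GC-overhead check the quoted docs describe.
        .set("spark.executor.extraJavaOptions", "-XX:-UseGCOverheadLimit")

      val sc = new SparkContext(conf)
      // ... job ...
      sc.stop()
    }
  }
]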
Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)
This seems almost equivalent to a heap-size error -- since GCs are stop-the-world events, the fact that we were unable to release more than 2% of the heap suggests that almost all the memory is *currently in use* (i.e., live).

Decreasing the number of cores is another solution because it decreases memory pressure: each core requires its own set of buffers (for instance, each Kryo serializer has a certain buffer allocated to it) and has its own working set of data (some subset of a partition). Thus, decreasing the number of cores in use decreases memory contention.

On Tue, Jul 8, 2014 at 10:44 AM, Jerry Lam chiling...@gmail.com wrote:
> Hi Konstantin, I just ran into the same problem. I mitigated it by reducing the number of cores when I executed the job, which otherwise would not finish.
> [...]
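[Editor's note: a back-of-envelope illustration of this additive per-core pressure, runnable in the Scala shell. Every figure below is an assumption for illustration, not a measurement from this thread.

  val coresPerExecutor = 8
  val kryoBufferMb     = 64    // e.g. a per-task Kryo serializer buffer
  val workingSetMb     = 256   // live subset of a partition held by each task

  val transientMb = coresPerExecutor * (kryoBufferMb + workingSetMb)
  println(s"~$transientMb MB of transient heap demand")  // 8 cores -> ~2560 MB
  // Halving the cores roughly halves this demand: 4 cores -> ~1280 MB,
  // leaving more headroom before the GC overhead limit trips.
]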