Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Zoltan Fedor
Thanks, that makes sense.
So it must be this queue - kept because of the UDF - that is running out of
memory, because without the UDF field there is no out-of-memory error, and
the UDF field is pretty small, so it is unlikely that it alone would take
us above the memory limit.

In either case, thanks for your help. I think I now understand how the
UDFs and the fields, together with the number of rows, can produce our
out-of-memory scenario.

On Tue, Aug 9, 2016 at 5:06 PM, Davies Liu <dav...@databricks.com> wrote:

> When you have a Python UDF, only the inputs to the UDF are passed into
> the Python process, but all other fields that are used together with the
> result of the UDF are kept in a queue and then joined with the results
> from Python. The length of this queue depends on the number of rows being
> processed by Python (or sitting in the Python process's buffer).
> The amount of memory required also depends on how many fields are used
> in the results.
>
> On Tue, Aug 9, 2016 at 11:09 AM, Zoltan Fedor <zoltan.1.fe...@gmail.com>
> wrote:
> >> Does this mean you only have 1.6G of memory for the executor (the rest
> >> left for Python)?
> >> The cached table could take 1.5G, which means almost nothing is left
> >> for other things.
> > True. I have also tried with memoryOverhead set to 800 (10% of the 8Gb
> > memory), but it made no difference. The "GC overhead limit exceeded"
> > error is still the same.
> >
> >> Python UDFs do require some buffering in the JVM; the size of the
> >> buffer depends on how many rows are being processed by the Python process.
> > I did some more testing in the meantime.
> > Leaving the UDFs as-is, but removing some other, static columns from the
> > above SELECT FROM command, has stopped the memoryOverhead error from
> > occurring. I have more than enough memory to store the results with all
> > static columns, and when the UDFs are removed and only the static
> > columns remain, it runs fine. This makes me believe that having UDFs
> > and many columns together causes the issue. Maybe when you have UDFs,
> > the memory usage somehow depends on the amount of data in the record
> > (the whole row), which includes other fields that are not actually used
> > by the UDF. Maybe the UDF serialization to Python serializes the whole
> > row instead of just the attributes of the UDF?
> >
> > On Mon, Aug 8, 2016 at 5:59 PM, Davies Liu <dav...@databricks.com>
> wrote:
> >>
> >> On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor <zoltan.1.fe...@gmail.com>
> >> wrote:
> >> > Hi all,
> >> >
> >> > I have an interesting issue trying to use UDFs from SparkSQL in Spark
> >> > 2.0.0 using pyspark.
> >> >
> >> > There is a big table (5.6 billion rows, 450 GB in memory) loaded into
> >> > 300 executors' memory in SparkSQL, on which we would do some
> >> > calculation using UDFs in pyspark.
> >> > If I run my SQL on only a portion of the data (filtering by one of the
> >> > attributes), let's say 800 million records, then all works well. But
> >> > when I run the same SQL on all the data, then I receive
> >> > "java.lang.OutOfMemoryError: GC overhead limit exceeded" from
> >> > basically all of the executors.
> >> >
> >> > It seems to me that pyspark UDFs in SparkSQL might have a memory
> >> > leak, causing this "GC overhead limit exceeded" error.
> >> >
> >> > Details:
> >> >
> >> > - using Spark 2.0.0 on a Hadoop YARN cluster
> >> >
> >> > - 300 executors, each with 2 CPU cores and 8Gb memory (
> >> > spark.yarn.executor.memoryOverhead=6400 )
> >>
> >> Does this mean you only have 1.6G of memory for the executor (the rest
> >> left for Python)?
> >> The cached table could take 1.5G, which means almost nothing is left
> >> for other things.
> >>
> >> Python UDFs do require some buffering in the JVM; the size of the
> >> buffer depends on how many rows are being processed by the Python
> >> process.
> >>
> >> > - a table of 5.6 billion rows loaded into the memory of the executors
> >> > (taking up 450 GB of memory), partitioned evenly across the executors
> >> >
> >> > - creating even the simplest UDF in SparkSQL causes a 'GC overhead
> >> > limit exceeded' error if run on all records. Running the same on a
> >> > smaller
> >&

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Davies Liu
When you have a Python UDF, only the inputs to the UDF are passed into the
Python process, but all other fields that are used together with the result
of the UDF are kept in a queue and then joined with the results from Python.
The length of this queue depends on the number of rows being processed by
Python (or sitting in the Python process's buffer).
The amount of memory required also depends on how many fields are used in
the results.
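[Editor's note] Davies's description can be sketched in plain Python. This is a
simplified model of the mechanism, not Spark's actual implementation: while a
batch of rows is in flight to the Python worker, the non-UDF fields of every
in-flight row sit in a queue until the UDF results come back.

```python
from collections import deque

def eval_python_udf(rows, udf, udf_col):
    """Toy model of Python UDF evaluation in Spark SQL: only the UDF's
    input column is shipped to the Python worker; every *other* selected
    field of each in-flight row waits in a JVM-side queue until the
    worker's results come back and can be joined row by row."""
    pending = deque()   # non-UDF fields of rows currently in flight
    shipped = []        # values sent to the Python worker
    for row in rows:
        shipped.append(row[udf_col])
        pending.append({k: v for k, v in row.items() if k != udf_col})
    # the Python worker evaluates the whole in-flight batch...
    results = [udf(v) for v in shipped]
    # ...and each result is joined back with its buffered fields
    return [dict(pending.popleft(), UDF_RESULT=r) for r in results]

rows = [{"CITY": "NYC", "STATE": "NY"}, {"CITY": "LA", "STATE": "CA"}]
out = eval_python_udf(rows, lambda v: "a", "CITY")
```

In this model the peak size of `pending` grows with both the number of
in-flight rows and the width of each row's remaining fields, which matches
the observation in this thread that a wide SELECT combined with a UDF runs
out of memory while either alone does not.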

On Tue, Aug 9, 2016 at 11:09 AM, Zoltan Fedor <zoltan.1.fe...@gmail.com> wrote:
>> Does this mean you only have 1.6G of memory for the executor (the rest
>> left for Python)?
>> The cached table could take 1.5G, which means almost nothing is left for
>> other things.
> True. I have also tried with memoryOverhead set to 800 (10% of the 8Gb
> memory), but it made no difference. The "GC overhead limit exceeded" error
> is still the same.
>
>> Python UDFs do require some buffering in the JVM; the size of the buffer
>> depends on how many rows are being processed by the Python process.
> I did some more testing in the meantime.
> Leaving the UDFs as-is, but removing some other, static columns from the
> above SELECT FROM command, has stopped the memoryOverhead error from
> occurring. I have more than enough memory to store the results with all
> static columns, and when the UDFs are removed and only the static columns
> remain, it runs fine. This makes me believe that having UDFs and many
> columns together causes the issue. Maybe when you have UDFs, the memory
> usage somehow depends on the amount of data in the record (the whole row),
> which includes other fields that are not actually used by the UDF. Maybe
> the UDF serialization to Python serializes the whole row instead of just
> the attributes of the UDF?
>
> On Mon, Aug 8, 2016 at 5:59 PM, Davies Liu <dav...@databricks.com> wrote:
>>
>> On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor <zoltan.1.fe...@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > I have an interesting issue trying to use UDFs from SparkSQL in Spark
>> > 2.0.0 using pyspark.
>> >
>> > There is a big table (5.6 billion rows, 450 GB in memory) loaded into
>> > 300 executors' memory in SparkSQL, on which we would do some calculation
>> > using UDFs in pyspark.
>> > If I run my SQL on only a portion of the data (filtering by one of the
>> > attributes), let's say 800 million records, then all works well. But
>> > when I run the same SQL on all the data, then I receive
>> > "java.lang.OutOfMemoryError: GC overhead limit exceeded" from basically
>> > all of the executors.
>> >
>> > It seems to me that pyspark UDFs in SparkSQL might have a memory leak,
>> > causing this "GC overhead limit exceeded" error.
>> >
>> > Details:
>> >
>> > - using Spark 2.0.0 on a Hadoop YARN cluster
>> >
>> > - 300 executors, each with 2 CPU cores and 8Gb memory (
>> > spark.yarn.executor.memoryOverhead=6400 )
>>
>> Does this mean you only have 1.6G of memory for the executor (the rest
>> left for Python)?
>> The cached table could take 1.5G, which means almost nothing is left for
>> other things.
>>
>> Python UDFs do require some buffering in the JVM; the size of the buffer
>> depends on how many rows are being processed by the Python process.
>>
>> > - a table of 5.6 billion rows loaded into the memory of the executors
>> > (taking up 450 GB of memory), partitioned evenly across the executors
>> >
>> > - creating even the simplest UDF in SparkSQL causes a 'GC overhead limit
>> > exceeded' error if run on all records. Running the same on a smaller
>> > dataset (~800 million rows) does succeed. With no UDF, the query
>> > succeeds on the whole dataset.
>> >
>> > - simplified pyspark code:
>> >
>> > from pyspark.sql.types import StringType
>> >
>> > def test_udf(var):
>> >     """test udf that will always return a"""
>> >     return "a"
>> >
>> > sqlContext.registerFunction("test_udf", test_udf, StringType())
>> >
>> > sqlContext.sql("""CACHE TABLE ma""")
>> >
>> > results_df = sqlContext.sql("""SELECT SOURCE, SOURCE_SYSTEM,
>> > test_udf(STANDARD_ACCOUNT_STREET_SRC) AS TEST_UDF_OP,
>> > ROUND(1.0 - (levenshtein(STANDARD_ACCOUNT_CITY_SRC,
>> > STANDARD_ACCOUNT_CITY_SRC)
>> >  /
>> > CASE WHEN LENGTH (STANDARD_ACCOUNT_CITY

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Zoltan Fedor
> Does this mean you only have 1.6G of memory for the executor (the rest
left for Python)?
> The cached table could take 1.5G, which means almost nothing is left for
other things.
True. I have also tried with memoryOverhead set to 800 (10% of the 8Gb
memory), but it made no difference. The "GC overhead limit exceeded" error
is still the same.

> Python UDFs do require some buffering in the JVM; the size of the buffer
depends on how many rows are being processed by the Python process.
I did some more testing in the meantime.
Leaving the UDFs as-is, but removing some other, static columns from the
above SELECT FROM command, has stopped the memoryOverhead error from
occurring. I have more than enough memory to store the results with all
static columns, and when the UDFs are removed and only the static columns
remain, it runs fine. This makes me believe that having UDFs and many
columns together causes the issue. Maybe when you have UDFs, the memory
usage somehow depends on the amount of data in the record (the whole row),
which includes other fields that are not actually used by the UDF. Maybe
the UDF serialization to Python serializes the whole row instead of just
the attributes of the UDF?
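[Editor's note] If the buffering really does scale with the full projected
row, one possible workaround is to evaluate the UDF over a narrow projection
and join the small result back to the wide static columns. This is an
untested sketch in the same pyspark style as this thread; `ROW_ID` is a
hypothetical unique key column, not one from the original post.

```python
# Hypothetical sketch: keep the rows buffered alongside the Python UDF
# narrow (key + UDF input only), then join the UDF output back to the
# wide static columns. ROW_ID is an assumed unique key column.
narrow_df = sqlContext.sql("""
    SELECT ROW_ID, test_udf(STANDARD_ACCOUNT_STREET_SRC) AS TEST_UDF_OP
    FROM ma""")
narrow_df.registerTempTable("narrow")

results_df = sqlContext.sql("""
    SELECT m.*, n.TEST_UDF_OP
    FROM ma m JOIN narrow n ON m.ROW_ID = n.ROW_ID""")
```

The join adds a shuffle, but the rows held in the UDF queue now carry only
the key and the UDF input instead of the whole projection.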

On Mon, Aug 8, 2016 at 5:59 PM, Davies Liu <dav...@databricks.com> wrote:

> On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor <zoltan.1.fe...@gmail.com>
> wrote:
> > Hi all,
> >
> > I have an interesting issue trying to use UDFs from SparkSQL in Spark
> > 2.0.0 using pyspark.
> >
> > There is a big table (5.6 billion rows, 450 GB in memory) loaded into 300
> > executors' memory in SparkSQL, on which we would do some calculation
> > using UDFs in pyspark.
> > If I run my SQL on only a portion of the data (filtering by one of the
> > attributes), let's say 800 million records, then all works well. But
> > when I run the same SQL on all the data, then I receive
> > "java.lang.OutOfMemoryError: GC overhead limit exceeded" from basically
> > all of the executors.
> >
> > It seems to me that pyspark UDFs in SparkSQL might have a memory leak,
> > causing this "GC overhead limit exceeded" error.
> >
> > Details:
> >
> > - using Spark 2.0.0 on a Hadoop YARN cluster
> >
> > - 300 executors, each with 2 CPU cores and 8Gb memory (
> > spark.yarn.executor.memoryOverhead=6400 )
>
> Does this mean you only have 1.6G of memory for the executor (the rest
> left for Python)?
> The cached table could take 1.5G, which means almost nothing is left for
> other things.
>
> Python UDFs do require some buffering in the JVM; the size of the buffer
> depends on how many rows are being processed by the Python process.
>
> > - a table of 5.6 billion rows loaded into the memory of the executors
> > (taking up 450 GB of memory), partitioned evenly across the executors
> >
> > - creating even the simplest UDF in SparkSQL causes a 'GC overhead limit
> > exceeded' error if run on all records. Running the same on a smaller
> > dataset (~800 million rows) does succeed. With no UDF, the query succeeds
> > on the whole dataset.
> >
> > - simplified pyspark code:
> >
> > from pyspark.sql.types import StringType
> >
> > def test_udf(var):
> >     """test udf that will always return a"""
> >     return "a"
> >
> > sqlContext.registerFunction("test_udf", test_udf, StringType())
> >
> > sqlContext.sql("""CACHE TABLE ma""")
> >
> > results_df = sqlContext.sql("""SELECT SOURCE, SOURCE_SYSTEM,
> > test_udf(STANDARD_ACCOUNT_STREET_SRC) AS TEST_UDF_OP,
> > ROUND(1.0 - (levenshtein(STANDARD_ACCOUNT_CITY_SRC,
> > STANDARD_ACCOUNT_CITY_SRC)
> >  /
> > CASE WHEN LENGTH (STANDARD_ACCOUNT_CITY_SRC)>LENGTH
> > (STANDARD_ACCOUNT_CITY_SRC)
> > THEN LENGTH (STANDARD_ACCOUNT_CITY_SRC)
> > ELSE LENGTH (STANDARD_ACCOUNT_CITY_SRC)
> >    END),2) AS SCORE_ED_STANDARD_ACCOUNT_CITY,
> > STANDARD_ACCOUNT_STATE_SRC, STANDARD_ACCOUNT_STATE_UNIV
> > FROM ma""")
> >
> > results_df.registerTempTable("m")
> > sqlContext.cacheTable("m")
> >
> > results_df = sqlContext.sql("""SELECT COUNT(*) FROM m""")
> > print(results_df.take(1))
> >
> >
> > - the error thrown on the executors:
> >
> > 16/08/08 15:38:17 ERROR util.Utils: Uncaught exception in thread stdout
> > writer for /hadoop/cloudera/parcels/Anaconda/bin/python
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > at
> > org.apache.spark.sql.catalyst.expre

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-08 Thread Davies Liu
On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor <zoltan.1.fe...@gmail.com> wrote:
> Hi all,
>
> I have an interesting issue trying to use UDFs from SparkSQL in Spark
> 2.0.0 using pyspark.
>
> There is a big table (5.6 billion rows, 450 GB in memory) loaded into 300
> executors' memory in SparkSQL, on which we would do some calculation using
> UDFs in pyspark.
> If I run my SQL on only a portion of the data (filtering by one of the
> attributes), let's say 800 million records, then all works well. But when I
> run the same SQL on all the data, then I receive
> "java.lang.OutOfMemoryError: GC overhead limit exceeded" from basically all
> of the executors.
>
> It seems to me that pyspark UDFs in SparkSQL might have a memory leak,
> causing this "GC overhead limit exceeded" error.
>
> Details:
>
> - using Spark 2.0.0 on a Hadoop YARN cluster
>
> - 300 executors, each with 2 CPU cores and 8Gb memory (
> spark.yarn.executor.memoryOverhead=6400 )

Does this mean you only have 1.6G of memory for the executor (the rest left
for Python)?
The cached table could take 1.5G, which means almost nothing is left for
other things.

Python UDFs do require some buffering in the JVM; the size of the buffer
depends on how many rows are being processed by the Python process.

> - a table of 5.6 billion rows loaded into the memory of the executors
> (taking up 450 GB of memory), partitioned evenly across the executors
>
> - creating even the simplest UDF in SparkSQL causes a 'GC overhead limit
> exceeded' error if run on all records. Running the same on a smaller
> dataset (~800 million rows) does succeed. With no UDF, the query succeeds
> on the whole dataset.
>
> - simplified pyspark code:
>
> from pyspark.sql.types import StringType
>
> def test_udf(var):
>     """test udf that will always return a"""
>     return "a"
>
> sqlContext.registerFunction("test_udf", test_udf, StringType())
>
> sqlContext.sql("""CACHE TABLE ma""")
>
> results_df = sqlContext.sql("""SELECT SOURCE, SOURCE_SYSTEM,
> test_udf(STANDARD_ACCOUNT_STREET_SRC) AS TEST_UDF_OP,
> ROUND(1.0 - (levenshtein(STANDARD_ACCOUNT_CITY_SRC,
> STANDARD_ACCOUNT_CITY_SRC)
>  /
> CASE WHEN LENGTH (STANDARD_ACCOUNT_CITY_SRC)>LENGTH
> (STANDARD_ACCOUNT_CITY_SRC)
> THEN LENGTH (STANDARD_ACCOUNT_CITY_SRC)
> ELSE LENGTH (STANDARD_ACCOUNT_CITY_SRC)
>END),2) AS SCORE_ED_STANDARD_ACCOUNT_CITY,
> STANDARD_ACCOUNT_STATE_SRC, STANDARD_ACCOUNT_STATE_UNIV
> FROM ma""")
>
> results_df.registerTempTable("m")
> sqlContext.cacheTable("m")
>
> results_df = sqlContext.sql("""SELECT COUNT(*) FROM m""")
> print(results_df.take(1))
>
>
> - the error thrown on the executors:
>
> 16/08/08 15:38:17 ERROR util.Utils: Uncaught exception in thread stdout
> writer for /hadoop/cloudera/parcels/Anaconda/bin/python
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> at
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:503)
> at
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:61)
> at
> org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$1.apply(BatchEvalPythonExec.scala:64)
> at
> org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$1.apply(BatchEvalPythonExec.scala:64)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at
> scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1076)
> at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1091)
> at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1129)
> at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
> at
> org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
> at
> org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
> 16/08/08 15:38:17 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
> SIGNAL TERM
>
>
> Has anybody experienced these "GC overhead limit exceeded" errors with
> pyspark UDFs before?
>
> Thanks,
> Zoltan
>




java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-08 Thread Zoltan Fedor
Hi all,

I have an interesting issue trying to use UDFs from SparkSQL in Spark 2.0.0
using pyspark.

There is a big table (5.6 billion rows, 450 GB in memory) loaded into 300
executors' memory in SparkSQL, on which we would do some calculation using
UDFs in pyspark.
If I run my SQL on only a portion of the data (filtering by one of the
attributes), let's say 800 million records, then all works well. But when I
run the same SQL on all the data, then I receive "java.lang.OutOfMemoryError:
GC overhead limit exceeded" from basically all of the executors.

It seems to me that pyspark UDFs in SparkSQL might have a memory leak,
causing this "GC overhead limit exceeded" error.

Details:

- using Spark 2.0.0 on a Hadoop YARN cluster

- 300 executors, each with 2 CPU cores and 8Gb memory (
spark.yarn.executor.memoryOverhead=6400 )

- a table of 5.6 billion rows loaded into the memory of the executors
(taking up 450 GB of memory), partitioned evenly across the executors

- creating even the simplest UDF in SparkSQL causes a 'GC overhead limit
exceeded' error if run on all records. Running the same on a smaller
dataset (~800 million rows) does succeed. With no UDF, the query succeeds
on the whole dataset.

- simplified pyspark code:

from pyspark.sql.types import StringType

def test_udf(var):
    """test udf that will always return a"""
    return "a"

sqlContext.registerFunction("test_udf", test_udf, StringType())

sqlContext.sql("""CACHE TABLE ma""")

results_df = sqlContext.sql("""SELECT SOURCE, SOURCE_SYSTEM,
    test_udf(STANDARD_ACCOUNT_STREET_SRC) AS TEST_UDF_OP,
    ROUND(1.0 - (levenshtein(STANDARD_ACCOUNT_CITY_SRC,
                             STANDARD_ACCOUNT_CITY_SRC)
                 /
                 CASE WHEN LENGTH(STANDARD_ACCOUNT_CITY_SRC) > LENGTH(STANDARD_ACCOUNT_CITY_SRC)
                      THEN LENGTH(STANDARD_ACCOUNT_CITY_SRC)
                      ELSE LENGTH(STANDARD_ACCOUNT_CITY_SRC)
                 END), 2) AS SCORE_ED_STANDARD_ACCOUNT_CITY,
    STANDARD_ACCOUNT_STATE_SRC, STANDARD_ACCOUNT_STATE_UNIV
    FROM ma""")

results_df.registerTempTable("m")
sqlContext.cacheTable("m")

results_df = sqlContext.sql("""SELECT COUNT(*) FROM m""")
print(results_df.take(1))


- the error thrown on the executors:

16/08/08 15:38:17 ERROR util.Utils: Uncaught exception in thread stdout
writer for /hadoop/cloudera/parcels/Anaconda/bin/python
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:503)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:61)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$1.apply(BatchEvalPythonExec.scala:64)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$1.apply(BatchEvalPythonExec.scala:64)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1076)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1091)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1129)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
16/08/08 15:38:17 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
SIGNAL TERM


Has anybody experienced these "GC overhead limit exceeded" errors with
pyspark UDFs before?

Thanks,
Zoltan


Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-12-02 Thread Cheng Lian
You may try setting the Hadoop conf "parquet.enable.summary-metadata" to
false to disable writing Parquet summary files (_metadata and
_common_metadata).

By default, Parquet writes the summary files by collecting the footers of
all part-files in the dataset while committing the job, and Spark follows
this convention. However, it has turned out that the summary files aren't
very useful in practice nowadays, unless you have other downstream tools
that strictly depend on them. For example, if you don't need schema merging,
Spark simply picks a random part-file to discover the dataset schema. If
you do need schema merging, Spark has to read the footers of all part-files
anyway (but in a distributed, parallel way).


Cheng
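[Editor's note] The setting Cheng describes can be applied from pyspark
before writing. This is a sketch against the Spark 1.5-era API used in this
thread; `sc`, `df`, and the partition columns `field1`/`field2` are assumed
to exist, and `_jsc` is an internal handle to the JavaSparkContext.

```python
# Disable Parquet summary files so the committer never gathers all
# part-file footers on the driver (sketch; assumes an existing
# SparkContext `sc` and DataFrame `df`).
sc._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")
df.write.partitionBy("field1", "field2").parquet("hdfs:///path/to/output")
```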

On 12/3/15 6:11 AM, Don Drake wrote:
Does anyone have any suggestions on creating a large number of parquet
files? Especially with regard to the last phase, where it creates the
_metadata.


Thanks.

-Don

On Sat, Nov 28, 2015 at 9:02 AM, Don Drake <dondr...@gmail.com> wrote:


I have a 2TB dataset in a DataFrame that I am attempting to
partition by 2 fields, and my YARN job seems to write the
partitioned dataset successfully.  I can see the output in
HDFS once all Spark tasks are done.

After the spark tasks are done, the job appears to be running for
over an hour, until I get the following (full stack trace below):

java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:238)

I had set the driver memory to be 20GB.

I attempted to read in the partitioned dataset and got another
error saying the /_metadata directory was not a parquet file.  I
removed the _metadata directory and was able to query the data,
but it appeared to not use the partitioned directory when I
attempted to filter the data (it read every directory).

This is Spark 1.5.2 and I saw the same problem when running the
code in both Scala and Python.

Any suggestions are appreciated.

-Don

15/11/25 00:00:19 ERROR datasources.InsertIntoHadoopFsRelation: Aborting job.
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:238)
at org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:167)
at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:79)
at org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:405)
at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:433)
at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:423)
at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:208)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at org.apache.spark.sql.SQLContext$QueryExecution.toRd

Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-12-02 Thread Adrien Mogenet
Very interested in that topic too, thanks Cheng for the direction!

We'll give it a try as well.

On 3 December 2015 at 01:40, Cheng Lian <lian.cs@gmail.com> wrote:

> You may try setting the Hadoop conf "parquet.enable.summary-metadata" to
> false to disable writing Parquet summary files (_metadata and
> _common_metadata).
>
> By default, Parquet writes the summary files by collecting the footers of
> all part-files in the dataset while committing the job, and Spark follows
> this convention. However, it has turned out that the summary files aren't
> very useful in practice nowadays, unless you have other downstream tools
> that strictly depend on them. For example, if you don't need schema
> merging, Spark simply picks a random part-file to discover the dataset
> schema. If you do need schema merging, Spark has to read the footers of
> all part-files anyway (but in a distributed, parallel way).
>
> Cheng
>
> On 12/3/15 6:11 AM, Don Drake wrote:
>
> Does anyone have any suggestions on creating a large number of parquet
> files? Especially with regard to the last phase, where it creates the
> _metadata.
>
> Thanks.
>
> -Don
>
> On Sat, Nov 28, 2015 at 9:02 AM, Don Drake <dondr...@gmail.com> wrote:
>
>> I have a 2TB dataset in a DataFrame that I am attempting to partition by
>> 2 fields, and my YARN job seems to write the partitioned dataset
>> successfully.  I can see the output in HDFS once all Spark tasks
>> are done.
>>
>> After the spark tasks are done, the job appears to be running for over an
>> hour, until I get the following (full stack trace below):
>>
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>> at
>> org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:238)
>>
>> I had set the driver memory to be 20GB.
>>
>> I attempted to read in the partitioned dataset and got another error
>> saying the /_metadata directory was not a parquet file.  I removed the
>> _metadata directory and was able to query the data, but it appeared to not
>> use the partitioned directory when I attempted to filter the data (it read
>> every directory).
>>
>> This is Spark 1.5.2 and I saw the same problem when running the code in
>> both Scala and Python.
>>
>> Any suggestions are appreciated.
>>
>> -Don
>>
>> 15/11/25 00:00:19 ERROR datasources.InsertIntoHadoopFsRelation: Aborting
>> job.
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>> at
>> org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:238)
>> at
>> org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:167)
>> at
>> org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:79)
>> at
>> org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:405)
>> at
>> org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:433)
>> at
>> org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:423)
>> at
>> org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
>> at
>> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
>> at
>> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:208)
>> at
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
>> at
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>> at
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>> at
>> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>> at
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>> at
>> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>> at
>> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>> at
>> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>>

df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-28 Thread Don Drake
I have a 2TB dataset in a DataFrame that I am attempting to
partition by 2 fields, and my YARN job seems to write the partitioned
dataset successfully.  I can see the output in HDFS once all Spark tasks
are done.

After the spark tasks are done, the job appears to be running for over an
hour, until I get the following (full stack trace below):

java.lang.OutOfMemoryError: GC overhead limit exceeded
at
org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:238)

I had set the driver memory to be 20GB.

I attempted to read in the partitioned dataset and got another error saying
the /_metadata directory was not a parquet file.  I removed the _metadata
directory and was able to query the data, but it appeared to not use the
partitioned directory when I attempted to filter the data (it read every
directory).

This is Spark 1.5.2 and I saw the same problem when running the code in
both Scala and Python.

Any suggestions are appreciated.

-Don

15/11/25 00:00:19 ERROR datasources.InsertIntoHadoopFsRelation: Aborting
job.
java.lang.OutOfMemoryError: GC overhead limit exceeded
at
org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:238)
at
org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:167)
at
org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:79)
at
org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:405)
at
org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:433)
at
org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:423)
at
org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
at
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
at
org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:208)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
at
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
at com.dondrake.qra.ScalaApp$.main(ScalaApp.scala:53)
at com.dondrake.qra.ScalaApp.main(ScalaApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
15/11/25 00:00:20 ERROR actor.ActorSystemImpl: exception on LARS’ timer
thread
java.lang.OutOfMemoryError: GC overhead limit exceeded
at
akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:409)
at
akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
at java.lang.Thread.run(Thread.java:745)
15/11/25 00:00:20 ERROR akka.ErrorMonitor: exception on LARS’ timer thread
java.lang.OutOfMemoryError: GC overhead limit exceeded
at
akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:409)
at
akka.actor.LightArrayRevolverScheduler$$anon$8

newbie simple app, small data set: Py4JJavaError java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-18 Thread Andy Davidson
Hi

I am working on a spark POC. I created a ec2 cluster on AWS using
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2

Below is a simple Python program. I am running it using an IPython notebook. The
notebook server is running on my Spark master. If I run my program more than
once using my large data set, I get the GC OutOfMemory error. I run it
each time by "re-running the notebook cell". I can run my smaller set a
couple of times without problems.

I launch using

pyspark --master $MASTER_URL --total-executor-cores 2
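Since the failure reported below is thrown in the driver while listing
partitions (o65.partitions), one thing worth trying -- my guess, not something
verified in this thread -- is to give the driver an explicit, larger heap at
launch:

```shell
# Same launch line with an explicit driver heap; 8g is a placeholder to tune.
pyspark --master $MASTER_URL --total-executor-cores 2 --driver-memory 8g
```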


Any idea how I can debug this?

Kind regards

Andy

Using the master console I see
* there is only one app running (this is what I expect)
* There are 2 workers, each on a different slave (this is what I expect)
* Each worker is using 1 core (this is what I expect)
* Each worker's memory usage is 6154 MB (seems reasonable)

* Alive Workers: 3
* Cores in use: 6 Total, 2 Used
* Memory in use: 18.8 GB Total, 12.0 GB Used
* Applications: 1 Running, 5 Completed
* Drivers: 0 Running, 0 Completed
* Status: ALIVE

The data files I am working with are small.

I collected this data using the Spark Streaming Twitter utilities. All I do is
capture tweets, convert them to JSON and store them as strings in HDFS.

$ hadoop fs -count  hdfs:///largeSample hdfs:///smallSample

   13226   240098   3839228100   hdfs:///largeSample

       1   228156     39689877   hdfs:///smallSample
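For what it's worth, reading the -count output above with the standard column
order (DIR_COUNT, FILE_COUNT, CONTENT_SIZE -- an assumption about which column
is which) implies both samples consist of very many tiny files, which makes the
driver-side split listing in getPartitions() expensive:

```python
# hadoop fs -count columns assumed: DIR_COUNT FILE_COUNT CONTENT_SIZE
large_files, large_bytes = 240098, 3839228100
small_files, small_bytes = 228156, 39689877

avg_large = large_bytes / large_files  # average bytes per file, large sample
avg_small = small_bytes / small_files  # average bytes per file, small sample
print(round(avg_large), round(avg_small))  # prints: 15990 174
```

Roughly 16 KB per file in the large sample and under 200 bytes in the small
one; consolidating the tweets into fewer, larger files would shrink the
driver's per-split bookkeeping.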


My python code. I am using python 3.4.

import json
import datetime

startTime = datetime.datetime.now()

#dataURL = "hdfs:///largeSample"
dataURL = "hdfs:///smallSample"
tweetStrings = sc.textFile(dataURL)

t2 = tweetStrings.take(2)
print (t2[1])
print("\n\nelapsed time:%s" % (datetime.datetime.now() - startTime))

---
Py4JJavaError Traceback (most recent call last)
 in ()
  8 tweetStrings = sc.textFile(dataURL)
  9 
---> 10 t2 = tweetStrings.take(2)
 11 print (t2[1])
 12 print("\n\nelapsed time:%s" % (datetime.datetime.now() - startTime))

/root/spark/python/pyspark/rdd.py in take(self, num)
   1267 """
   1268 items = []
-> 1269 totalParts = self.getNumPartitions()
   1270 partsScanned = 0
   1271 

/root/spark/python/pyspark/rdd.py in getNumPartitions(self)
354 2
355 """
--> 356 return self._jrdd.partitions().size()
357 
358 def filter(self, f):

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
__call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539 
540 for temp_arg in temp_args:

/root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 34 def deco(*a, **kw):
 35 try:
---> 36 return f(*a, **kw)
 37 except py4j.protocol.Py4JJavaError as e:
 38 s = e.java_exception.toString()

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in
get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(

Py4JJavaError: An error occurred while calling o65.partitions.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.String.substring(String.java:1969)
at java.net.URI$Parser.substring(URI.java:2869)
at java.net.URI$Parser.parseHierarchical(URI.java:3106)
at java.net.URI$Parser.parse(URI.java:3053)
at java.net.URI.<init>(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:145)
at org.apache.hadoop.fs.Path.<init>(Path.java:71)
at org.apache.hadoop.fs.Path.<init>(Path.java:50)
at 
org.apache.hadoop.hdfs.protocol.HdfsFileStatus.getFullPath(HdfsFileStatus.java:215)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.makeQualified(DistributedFileSystem.java:293)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:352)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:862)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:887)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:185)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anon

RE: [sparkR] Any insight on java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-07 Thread Sun, Rui
This is probably because your config options do not actually take effect. Please 
refer to the email thread titled “How to set memory for SparkR with 
master="local[*]"”, which may answer your question.

I recommend trying SparkR built from the master branch, which 
contains two fixes that may help in your use case:
https://issues.apache.org/jira/browse/SPARK-11340
https://issues.apache.org/jira/browse/SPARK-11258

BTW, it seems there is a config conflict in your settings:
spark.driver.memory="30g",
spark.driver.extraJavaOptions="-Xms5g -Xmx5g


From: Dhaval Patel [mailto:dhaval1...@gmail.com]
Sent: Saturday, November 7, 2015 12:26 AM
To: Spark User Group
Subject: [sparkR] Any insight on java.lang.OutOfMemoryError: GC overhead limit 
exceeded

I have been struggling with this error for the past 3 days and have tried all 
the suggestions people have provided on Stack Overflow and here in this group.

I am trying to read a parquet file using SparkR and convert it into an R 
dataframe for further use. The file is not that big: ~4 GB and 250 million 
records.

My standalone cluster has more than enough memory and processing power: 24 
cores, 128 GB RAM. Here is the configuration to give an idea:

I tried this on both Spark 1.4.1 and 1.5.1 and have attached both stack 
traces/logs. The parquet file has 24 partitions.

spark.default.confs=list(spark.cores.max="24",
 spark.executor.memory="50g",
 spark.driver.memory="30g",
 spark.driver.extraJavaOptions="-Xms5g -Xmx5g 
-XX:MaxPermSize=1024M")
sc <- sparkR.init(master="local[24]",sparkEnvir = spark.default.confs)
...
# read the parquet file and collect it into an R data frame
med.Rdf <- collect(mednew.DF)




--- Begin Message ---
Hi, Matej,

For the convenience of SparkR users who start SparkR without using 
bin/sparkR (for example, in RStudio), 
https://issues.apache.org/jira/browse/SPARK-11340 enables setting 
“spark.driver.memory” (and other similar options, like 
spark.driver.extraClassPath, spark.driver.extraJavaOptions and 
spark.driver.extraLibraryPath) in the sparkEnvir parameter of sparkR.init() so 
that they take effect.

Would you like to give it a try? Note the change is on the master branch; you 
have to build Spark from source before using it.


From: Sun, Rui [mailto:rui@intel.com]
Sent: Monday, October 26, 2015 10:24 AM
To: Dirceu Semighini Filho
Cc: user
Subject: RE: How to set memory for SparkR with master="local[*]"

As documented in 
http://spark.apache.org/docs/latest/configuration.html#available-properties,
Note for “spark.driver.memory”:
Note: In client mode, this config must not be set through the SparkConf 
directly in your application, because the driver JVM has already started at 
that point. Instead, please set this through the --driver-memory command line 
option or in your default properties file.

If you are starting a SparkR shell using bin/sparkR, then you can use 
bin/sparkR --driver-memory. You cannot set the driver memory size 
after the R shell has been launched via bin/sparkR.

But if you are starting a SparkR shell manually without using bin/sparkR (for 
example, in RStudio), you can:
library(SparkR)
Sys.setenv("SPARKR_SUBMIT_ARGS" = "--conf spark.driver.memory=2g sparkr-shell")
sc <- sparkR.init()

From: Dirceu Semighini Filho [mailto:dirceu.semigh...@gmail.com]
Sent: Friday, October 23, 2015 7:53 PM
Cc: user
Subject: Re: How to set memory for SparkR with master="local[*]"

Hi Matej,
I'm also using this and I'm seeing the same behavior here; my driver has only 
530 MB, which is the default value.

Maybe this is a bug.

2015-10-23 9:43 GMT-02:00 Matej Holec 
<hol...@gmail.com<mailto:hol...@gmail.com>>:
Hello!

How to adjust the memory settings properly for SparkR with master="local[*]"
in R?


*When running from  R -- SparkR doesn't accept memory settings :(*

I use the following commands:

R>  library(SparkR)
R>  sc <- sparkR.init(master = "local[*]", sparkEnvir =
list(spark.driver.memory = "5g"))

Although the variable spark.driver.memory is correctly set (checked at
http://node:4040/environment/), the driver has only the default amount of
memory allocated (Storage Memory 530.3 MB).

*But when running from  spark-1.5.1-bin-hadoop2.6/bin/sparkR -- OK*

The following command:

]$ spark-1.5.1-bin-hadoop2.6/bin/sparkR --driver-memory 5g

creates a SparkR session with properly adjusted driver memory (Storage Memory
2.6 GB).


Any suggestion?

Thanks
Matej



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-memory-for-SparkR-with-master-local-tp25178.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-

[sparkR] Any insight on java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-06 Thread Dhaval Patel
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 7 
ms. row count = 121970
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 7 
ms. row count = 121969
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO CodecPool: Got brand-new decompressor [.gz]
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 7 
ms. row count = 121969
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 8 
ms. row count = 121962
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 10 
ms. row count = 121966
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 10 
ms. row count = 121972
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 9 
ms. row count = 121963
15/11/06 10:45:18 INFO InternalParquetRecordReader: block read in memory in 10 
ms. row count = 121969
15/11/06 10:45:20 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
localhost:39562 in memory (size: 2.4 KB, free: 530.0 MB)
15/11/06 10:45:20 INFO ContextCleaner: Cleaned accumulator 2
15/11/06 10:45:53 WARN ServletHandler: Error for /static/timeline-view.css
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.zip.ZipCoder.toString(ZipCoder.java:49)
at java.util.zip.ZipFile.getZipEntry(ZipFile.java:567)
at java.util.zip.ZipFile.access$900(ZipFile.java:61)
at java.util.zip.ZipFile$ZipEntryIterator.next(ZipFile.java:525)
at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:500)
at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:481)
at java.util.jar.JarFile$JarEntryIterator.next(JarFile.java:257)
at java.util.jar.JarFile$JarEntryIterator.nextElement(JarFile.java:266)
at java.util.jar.JarFile$JarEntryIterator.nextElement(JarFile.java:247)
at 
org.spark-project.jetty.util.resource.JarFileResource.exists(JarFileResource.java:189)
at 
org.spark-project.jetty.servlet.DefaultServlet.getResource(DefaultServlet.java:398)
at 
org.spark-project.jetty.servlet.DefaultServlet.doGet(DefaultServlet.java:476)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
at 
org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
at 
org.spark-project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
at 
org.spark-project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at 
org.spark-project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at 
org.spark-project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at 
org.spark-project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.spark-project.jetty.server.handler.GzipHandler.handle(GzipHandler.java:264)
at 
org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at 
org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.spark-project.jetty.server.Server.handle(Server.java:370)
at 
org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at 
org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at 
org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at 
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:644)
at 
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at 
org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at 
org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
15/11/06 10:45:57 WARN ServletHandler: Error for /static/bootstrap.min.css
java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.util.calendar.Gregorian.newCalendarDate(Gregorian.java:85)
at sun.util.calendar.Gregorian.newCalendarDate(Gregorian.java:37)
at java.util.Date.(Date.java:254)
at java.util.zip.ZipUtils.dosToJavaTime(ZipUtils.java:74)
at java.util.zip.ZipFile.getZipEntry(ZipFile.java:570)
at java.

java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-10-04 Thread t_ras
I get java.lang.OutOfMemoryError: GC overhead limit exceeded when trying a
count action on a file.

The file is a CSV file, 217GB in size.

I'm using 10 r3.8xlarge (Ubuntu) machines, CDH 5.3.6 and Spark 1.2.0.

configuration:

spark.app.id:local-1443956477103

spark.app.name:Spark shell

spark.cores.max:100

spark.driver.cores:24

spark.driver.extraLibraryPath:/opt/cloudera/parcels/CDH-5.3.6-1.cdh5.3.6.p0.11/lib/hadoop/lib/native
spark.driver.host:ip-172-31-34-242.us-west-2.compute.internal

spark.driver.maxResultSize:300g

spark.driver.port:55123

spark.eventLog.dir:hdfs://ip-172-31-34-242.us-west-2.compute.internal:8020/user/spark/applicationHistory
spark.eventLog.enabled:true

spark.executor.extraLibraryPath:/opt/cloudera/parcels/CDH-5.3.6-1.cdh5.3.6.p0.11/lib/hadoop/lib/native

spark.executor.id:driver spark.executor.memory:200g

spark.fileserver.uri:http://172.31.34.242:51424

spark.jars: spark.master:local[*]

spark.repl.class.uri:http://172.31.34.242:58244

spark.scheduler.mode:FIFO

spark.serializer:org.apache.spark.serializer.KryoSerializer

spark.storage.memoryFraction:0.9

spark.tachyonStore.folderName:spark-88bd9c44-d626-4ad2-8df3-f89df4cb30de

spark.yarn.historyServer.address:http://ip-172-31-34-242.us-west-2.compute.internal:18088

here is what I ran:

val testrdd =
sc.textFile("hdfs://ip-172-31-34-242.us-west-2.compute.internal:8020/user/jethro/tables/edw_fact_lsx_detail/edw_fact_lsx_detail.csv")

testrdd.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)

testrdd.count()

If I don't force it into memory it works fine, but considering the cluster I'm
running on, it should fit in memory properly.

Any ideas?
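One thing that jumps out of the config above is spark.master:local[*]: despite
the 10-machine cluster, the shell is running a single local JVM, so the
serialized cache has to fit into one heap. A back-of-the-envelope sketch
(taking the 200g setting at face value, and ignoring serialization overhead,
which only makes it worse):

```python
csv_gb = 217                   # raw CSV size from the post
heap_gb = 200                  # spark.executor.memory, one JVM under local[*]
storage_fraction = 0.9         # spark.storage.memoryFraction

cache_capacity_gb = heap_gb * storage_fraction  # GB usable for the cache
print(cache_capacity_gb, csv_gb > cache_capacity_gb)  # prints: 180.0 True
```

Even before GC pressure, the raw bytes alone exceed what one heap can cache;
pointing spark.master at the actual cluster, or using MEMORY_AND_DISK_SER,
avoids forcing everything into a single JVM.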



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-OutOfMemoryError-GC-overhead-limit-exceeded-tp24918.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-10-04 Thread Ted Yu
1.2.0 is quite old. 

You may want to try 1.5.1 which was released in the past week. 

Cheers

> On Oct 4, 2015, at 4:26 AM, t_ras <marti...@netvision.net.il> wrote:
> 
> I get java.lang.OutOfMemoryError: GC overhead limit exceeded when trying a
> count action on a file.
> 
> The file is a CSV file, 217GB in size.
> 
> I'm using 10 r3.8xlarge (Ubuntu) machines, CDH 5.3.6 and Spark 1.2.0.
> 
> configuration:
> 
> spark.app.id:local-1443956477103
> 
> spark.app.name:Spark shell
> 
> spark.cores.max:100
> 
> spark.driver.cores:24
> 
> spark.driver.extraLibraryPath:/opt/cloudera/parcels/CDH-5.3.6-1.cdh5.3.6.p0.11/lib/hadoop/lib/native
> spark.driver.host:ip-172-31-34-242.us-west-2.compute.internal
> 
> spark.driver.maxResultSize:300g
> 
> spark.driver.port:55123
> 
> spark.eventLog.dir:hdfs://ip-172-31-34-242.us-west-2.compute.internal:8020/user/spark/applicationHistory
> spark.eventLog.enabled:true
> 
> spark.executor.extraLibraryPath:/opt/cloudera/parcels/CDH-5.3.6-1.cdh5.3.6.p0.11/lib/hadoop/lib/native
> 
> spark.executor.id:driver spark.executor.memory:200g
> 
> spark.fileserver.uri:http://172.31.34.242:51424
> 
> spark.jars: spark.master:local[*]
> 
> spark.repl.class.uri:http://172.31.34.242:58244
> 
> spark.scheduler.mode:FIFO
> 
> spark.serializer:org.apache.spark.serializer.KryoSerializer
> 
> spark.storage.memoryFraction:0.9
> 
> spark.tachyonStore.folderName:spark-88bd9c44-d626-4ad2-8df3-f89df4cb30de
> 
> spark.yarn.historyServer.address:http://ip-172-31-34-242.us-west-2.compute.internal:18088
> 
> here is what I ran:
> 
> val testrdd =
> sc.textFile("hdfs://ip-172-31-34-242.us-west-2.compute.internal:8020/user/jethro/tables/edw_fact_lsx_detail/edw_fact_lsx_detail.csv")
> 
> testrdd.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
> 
> testrdd.count()
> 
> If I don't force it into memory it works fine, but considering the cluster I'm
> running on, it should fit in memory properly.
> 
> Any ideas?
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-OutOfMemoryError-GC-overhead-limit-exceeded-tp24918.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



AW: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-08-11 Thread rene.pfitzner
Hi –

I'd like to follow up on this, as I am running into very similar issues (with a 
much bigger data set, though – 10^5 nodes, 10^7 edges).

So let me repost the question: Any ideas on how to estimate graphx memory 
requirements?

Cheers!

From: Roman Sokolov [mailto:ole...@gmail.com]
Sent: Saturday, 11 July 2015 03:58
To: Ted Yu; Robin East; user
Subject: Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC 
overhead limit exceeded

Hello again.
So I could compute triangle counts when running the code from the Spark shell 
without workers (with the --driver-memory 15g option), but with workers I get 
errors. I run the Spark shell with:
./bin/spark-shell --master spark://192.168.0.31:7077 
--executor-memory 6900m --driver-memory 15g
and workers (by hands):
./bin/spark-class org.apache.spark.deploy.worker.Worker 
spark://192.168.0.31:7077
(2 workers, each has 8Gb RAM; master has 32 Gb RAM).

The code now is:
import org.apache.spark._
import org.apache.spark.graphx._
val graph = GraphLoader.edgeListFile(sc, 
"/home/data/graph.txt").partitionBy(PartitionStrategy.RandomVertexCut)
val newgraph = graph.convertToCanonicalEdges()
val triangleNum = newgraph.triangleCount().vertices.map(x => 
x._2.toLong).reduce(_ + _)/3

So how can I estimate how much memory is needed? And why do I need so much? 
The dataset is only 1.1 GB...
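A crude way to ball-park it (every constant below is my guess, not a
measurement): a 1.1 GB text edge list at roughly 20 bytes per "src dst" line is
about 60 million edges, and triangleCount() materializes neighbor sets on both
ends of each edge, so the in-memory working set ends up several times the file
size, before shuffle buffers:

```python
file_bytes = 1.1 * 1024**3
bytes_per_line = 20                   # assumed average "src dst\n" text line
edges = file_bytes / bytes_per_line   # roughly 59 million edges

# assumed in-memory cost: two 8-byte vertex ids per edge, times ~4x
# overhead for the hash sets built during the neighbor-set join
per_edge_bytes = 2 * 8 * 4
working_set_gb = edges * per_edge_bytes / 1024**3
print(round(edges / 1e6), round(working_set_gb, 1))  # prints: 59 3.5
```

That covers only the steady-state structures; partitioning and the triangle
join shuffle copies of the neighbor sets around, so multiples of this being
tight across two 8 GB workers is plausible.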

Error:
[Stage 7: (0 + 8) / 
32]15/07/11 01:48:45 WARN TaskSetManager: Lost task 2.0 in stage 7.0 (TID 130, 
192.168.0.28): io.netty.handler.codec.DecoderException: 
java.lang.OutOfMemoryError
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError
at sun.misc.Unsafe.allocateMemory(Native Method)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:127)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
at 
io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:440)
at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:187)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:165)
at io.netty.buffer.PoolArena.reallocate(PoolArena.java:277)
at 
io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:108)
at 
io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251)
at 
io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849)
at 
io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841)
at 
io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:146)
... 10 more


On 26 June 2015 at 14:06, Roman Sokolov 
<ole...@gmail.com> wrote:

Yep, I already found it. So I added 1 line:

val graph = GraphLoader.edgeListFile(sc, , ...)
val newgraph = graph.convertToCanonicalEdges()

and could successfully count triangles on newgraph. Next I will test it on 
bigger (several GB) networks.

I am using Spark 1.3 and 1.4 but haven't seen this function in 
https://spark.apache.org/docs/latest/graphx-programming-guide.html

Thanks a lot guys!
On 26.06.2015 at 13:50, Ted Yu 
<yuzhih...@gmail.com> wrote:
See SPARK-4917 which went into Spark 1.3.0

On Fri, Jun 26, 2015 at 2:27 AM, Robin East 
<robin.e...@xense.co.uk> wrote:
You’ll get this issue if you just take the first 2000 lines of that file. The 
problem is triangleCount() expects srcId < dstId, which is not the case in the 
file (e.g. vertex 28). You can get round this by calling 
graph.convertToCanonicalEdges(), which removes bi-directional edges and ensures 
srcId < dstId. Which version of Spark are you

Re: Strange Error: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-07-15 Thread Saeed Shahrivari
Yes there is.
But the RDD is more than 10 TB and compression does not help.

On Wed, Jul 15, 2015 at 8:36 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. serializeUncompressed()

 Is there a method which enables compression ?

 Just wondering if that would reduce the memory footprint.

 Cheers

 On Wed, Jul 15, 2015 at 8:06 AM, Saeed Shahrivari 
 saeed.shahriv...@gmail.com wrote:

 I use a simple map/reduce step in a Java/Spark program to remove
 duplicated documents from a large (10 TB compressed) sequence file
 containing some html pages. Here is the partial code:

 JavaPairRDD<BytesWritable, NullWritable> inputRecords =
 sc.sequenceFile(args[0], BytesWritable.class, 
 NullWritable.class).coalesce(numMaps);


 JavaPairRDD<String, AteshProtos.CacheDoc> hashDocs = 
 inputRecords.mapToPair(t ->
 cacheDocs.add(new Tuple2(BaseEncoding.base64()
 .encode(Hashing.sha1().hashString(doc.getUrl(), 
 Charset.defaultCharset()).asBytes()), doc));
 });


 JavaPairRDD<BytesWritable, NullWritable> byteArrays =
 hashDocs.reduceByKey((a, b) -> a.getUrl() < b.getUrl() ? a : b, numReds).
 mapToPair(t -> new Tuple2(new 
 BytesWritable(PentV3.buildFromMessage(t._2).serializeUncompressed()),
 NullWritable.get()));


 The logic is simple. The map generates a sha-1 signature from the html and 
 in the reduce phase we keep the html that has the shortest URL. However, 
 after running for 2-3 hours the application crashes due to memory issue. 
 Here is the exception:

 15/07/15 18:24:05 WARN scheduler.TaskSetManager: Lost task 267.0 in stage 
 0.0 (TID 267, psh-11.nse.ir): java.lang.OutOfMemoryError: GC overhead limit 
 exceeded


 It seems that the map function keeps the hashDocs RDD in the memory and when 
 the memory is filled in an executor, the application crashes. Persisting the 
 map output to disk solves the problem. Adding the following line between map 
 and reduce solve the issue:

 hashDocs.persist(StorageLevel.DISK_ONLY());


 Is this a bug of Spark?

 How can I tell Spark not to keep even a bit of RDD in the memory?


 Thanks






Re: Strange Error: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-07-15 Thread Ted Yu
bq. serializeUncompressed()

Is there a method which enables compression ?

Just wondering if that would reduce the memory footprint.

Cheers

On Wed, Jul 15, 2015 at 8:06 AM, Saeed Shahrivari 
saeed.shahriv...@gmail.com wrote:

 I use a simple map/reduce step in a Java/Spark program to remove
 duplicated documents from a large (10 TB compressed) sequence file
 containing some html pages. Here is the partial code:

 JavaPairRDD<BytesWritable, NullWritable> inputRecords =
 sc.sequenceFile(args[0], BytesWritable.class, 
 NullWritable.class).coalesce(numMaps);


 JavaPairRDD<String, AteshProtos.CacheDoc> hashDocs = inputRecords.mapToPair(t 
 ->
 cacheDocs.add(new Tuple2(BaseEncoding.base64()
 .encode(Hashing.sha1().hashString(doc.getUrl(), 
 Charset.defaultCharset()).asBytes()), doc));
 });


 JavaPairRDD<BytesWritable, NullWritable> byteArrays =
 hashDocs.reduceByKey((a, b) -> a.getUrl() < b.getUrl() ? a : b, numReds).
 mapToPair(t -> new Tuple2(new 
 BytesWritable(PentV3.buildFromMessage(t._2).serializeUncompressed()),
 NullWritable.get()));


 The logic is simple. The map generates a sha-1 signature from the html and in 
 the reduce phase we keep the html that has the shortest URL. However, after 
 running for 2-3 hours the application crashes due to memory issue. Here is 
 the exception:

 15/07/15 18:24:05 WARN scheduler.TaskSetManager: Lost task 267.0 in stage 0.0 
 (TID 267, psh-11.nse.ir): java.lang.OutOfMemoryError: GC overhead limit 
 exceeded


 It seems that the map function keeps the hashDocs RDD in the memory and when 
 the memory is filled in an executor, the application crashes. Persisting the 
 map output to disk solves the problem. Adding the following line between map 
 and reduce solve the issue:

 hashDocs.persist(StorageLevel.DISK_ONLY());


 Is this a bug of Spark?

 How can I tell Spark not to keep even a bit of RDD in the memory?


 Thanks





Strange Error: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-07-15 Thread Saeed Shahrivari
I use a simple map/reduce step in a Java/Spark program to remove duplicated
documents from a large (10 TB compressed) sequence file containing some
html pages. Here is the partial code:

JavaPairRDD<BytesWritable, NullWritable> inputRecords =
sc.sequenceFile(args[0], BytesWritable.class,
NullWritable.class).coalesce(numMaps);


JavaPairRDD<String, AteshProtos.CacheDoc> hashDocs =
inputRecords.mapToPair(t ->
cacheDocs.add(new Tuple2(BaseEncoding.base64()
.encode(Hashing.sha1().hashString(doc.getUrl(),
Charset.defaultCharset()).asBytes()), doc));
});


JavaPairRDD<BytesWritable, NullWritable> byteArrays =
hashDocs.reduceByKey((a, b) -> a.getUrl() < b.getUrl() ? a : b, numReds).
mapToPair(t -> new Tuple2(new
BytesWritable(PentV3.buildFromMessage(t._2).serializeUncompressed()),
NullWritable.get()));


The logic is simple. The map generates a sha-1 signature from the html,
and in the reduce phase we keep the html that has the shortest URL.
However, after running for 2-3 hours the application crashes due to a
memory issue. Here is the exception:

15/07/15 18:24:05 WARN scheduler.TaskSetManager: Lost task 267.0 in
stage 0.0 (TID 267, psh-11.nse.ir): java.lang.OutOfMemoryError: GC
overhead limit exceeded


It seems that the map function keeps the hashDocs RDD in memory,
and when an executor's memory fills up, the application crashes.
Persisting the map output to disk solves the problem. Adding the
following line between the map and the reduce fixes the issue:

hashDocs.persist(StorageLevel.DISK_ONLY());


Is this a bug in Spark?

How can I tell Spark not to keep even a bit of an RDD in memory?


Thanks
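For reference, the dedup logic described above (SHA-1 of the page body as the key, shortest URL wins) can be sketched without Spark. This is a hypothetical, illustrative Python version, not the original Java job; all names here are made up:

```python
import hashlib

def dedup_by_content(docs):
    """Keep, for each distinct page body, the record with the shortest URL.

    docs: iterable of (url, html) pairs. Mirrors the map phase (SHA-1 of
    the html as the key) and the reduce phase (shortest URL wins).
    """
    best = {}
    for url, html in docs:
        key = hashlib.sha1(html.encode("utf-8")).hexdigest()
        # Reduce step: keep the candidate with the shorter URL per hash key.
        if key not in best or len(url) < len(best[key][0]):
            best[key] = (url, html)
    return list(best.values())

docs = [
    ("http://example.com/a/very/long/mirror/path", "<html>same</html>"),
    ("http://example.com/a", "<html>same</html>"),
    ("http://example.com/b", "<html>other</html>"),
]
print(sorted(url for url, _ in dedup_by_content(docs)))
# → ['http://example.com/a', 'http://example.com/b']
```

In Spark the reduce runs per key across partitions, so the buffered map output (the role the `best` dict plays here) is what grows; persisting the map output to disk between the two stages bounds that growth.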


Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-07-10 Thread Roman Sokolov
Hello again.
So I could compute triangle counts when running the code from the Spark shell
without workers (with the --driver-memory 15g option), but with workers I get
errors. So I run the Spark shell:
./bin/spark-shell --master spark://192.168.0.31:7077 --executor-memory
6900m --driver-memory 15g
and the workers (by hand):
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://
192.168.0.31:7077
(2 workers, each has 8Gb RAM; master has 32 Gb RAM).

The code now is:
import org.apache.spark._
import org.apache.spark.graphx._
val graph = GraphLoader.edgeListFile(sc,
"/home/data/graph.txt").partitionBy(PartitionStrategy.RandomVertexCut)
val newgraph = graph.convertToCanonicalEdges()
val triangleNum = newgraph.triangleCount().vertices.map(x =>
x._2.toLong).reduce(_ + _)/3

So how can I work out how much memory is needed? And why do I need so
much? The dataset is only 1.1 GB small...

Error:
[Stage 7: (0 + 8)
/ 32]15/07/11 01:48:45 WARN TaskSetManager: Lost task 2.0 in stage 7.0 (TID
130, 192.168.0.28): io.netty.handler.codec.DecoderException:
java.lang.OutOfMemoryError
at
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError
at sun.misc.Unsafe.allocateMemory(Native Method)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:127)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
at
io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:440)
at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:187)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:165)
at io.netty.buffer.PoolArena.reallocate(PoolArena.java:277)
at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:108)
at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831)
at
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:146)
... 10 more


On 26 June 2015 at 14:06, Roman Sokolov ole...@gmail.com wrote:

 Yep, I already found it. So I added 1 line:

 val graph = GraphLoader.edgeListFile(sc, , ...)
 val newgraph = graph.convertToCanonicalEdges()

 and could successfully count triangles on newgraph. Next will test it on
 bigger (several Gb) networks.

 I am using Spark 1.3 and 1.4 but haven't seen this function in
 https://spark.apache.org/docs/latest/graphx-programming-guide.html

 Thanks a lot guys!
 On 26.06.2015 at 13:50, Ted Yu yuzhih...@gmail.com wrote:

 See SPARK-4917 which went into Spark 1.3.0

 On Fri, Jun 26, 2015 at 2:27 AM, Robin East robin.e...@xense.co.uk
 wrote:

 You’ll get this issue if you just take the first 2000 lines of that
 file. The problem is triangleCount() expects srcId < dstId, which is not the
 case in the file (e.g. vertex 28). You can get round this by calling
 graph.convertToCanonicalEdges(), which removes bi-directional edges and
 ensures srcId < dstId. Which version of Spark are you on? Can’t remember
 what version that method was introduced in.

 Robin

 On 26 Jun 2015, at 09:44, Roman Sokolov ole...@gmail.com wrote:

 Ok, but what does it mean? I did not change the core files of Spark, so
 is it a bug there?
 PS: on small datasets (500 MB) I have no problem.
 On 25.06.2015 at 18:02, Ted Yu yuzhih...@gmail.com wrote:

 The assertion failure from TriangleCount.scala corresponds with the
 following lines:

 g.outerJoinVertices(counters) {
   (vid, _, optCounter: Option[Int]) =>
 val dblCount = optCounter.getOrElse(0)
 // double count should be even (divisible by two)
 assert((dblCount & 1) == 0)

 Cheers

 On Thu, Jun 25, 2015 at 6:20 AM, Roman Sokolov ole...@gmail.com
 wrote:

 Hello!
 I am trying to compute the number of triangles with GraphX, but I get a memory
 or heap-size error even though the dataset is very small (1 GB). I run the
 code in spark-shell on a 16 GB RAM machine (also tried with 2 workers on
 separate machines with 8 GB RAM each). So 

Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-26 Thread Roman Sokolov
Ok, but what does it mean? I did not change the core files of Spark, so is
it a bug there?
PS: on small datasets (500 MB) I have no problem.
On 25.06.2015 at 18:02, Ted Yu yuzhih...@gmail.com wrote:

 The assertion failure from TriangleCount.scala corresponds with the
 following lines:

 g.outerJoinVertices(counters) {
   (vid, _, optCounter: Option[Int]) =
 val dblCount = optCounter.getOrElse(0)
 // double count should be even (divisible by two)
 assert((dblCount  1) == 0)

 Cheers

 On Thu, Jun 25, 2015 at 6:20 AM, Roman Sokolov ole...@gmail.com wrote:

 Hello!
 I am trying to compute the number of triangles with GraphX, but I get a memory
 or heap-size error even though the dataset is very small (1 GB). I run the
 code in spark-shell on a 16 GB RAM machine (also tried with 2 workers on
 separate machines with 8 GB RAM each). So I have 15x more memory than the
 dataset size, but it is not enough. What should I do with terabyte-sized
 datasets? How do people process them? I have read a lot of documentation and
 2 Spark books, and still have no clue :(

 Tried to run with the options, no effect:
 ./bin/spark-shell --executor-memory 6g --driver-memory 9g
 --total-executor-cores 100

 The code is simple:

 val graph = GraphLoader.edgeListFile(sc,
 "/home/ubuntu/data/soc-LiveJournal1/lj.stdout",
 edgeStorageLevel = StorageLevel.MEMORY_AND_DISK_SER,
 vertexStorageLevel =
 StorageLevel.MEMORY_AND_DISK_SER).partitionBy(PartitionStrategy.RandomVertexCut)

 println(graph.numEdges)
 println(graph.numVertices)

 val triangleNum = graph.triangleCount().vertices.map(x => x._2).reduce(_
 + _)/3

 (dataset is from here:
 http://konect.uni-koblenz.de/downloads/tsv/soc-LiveJournal1.tar.bz2 first
 two lines contain % characters, so have to be removed).


 UPD: today tried on 32Gb machine (from spark shell again), now got
 another error:

 [Stage 8: (0 +
 4) / 32]15/06/25 13:03:05 WARN ShippableVertexPartitionOps: Joining two
 VertexPartitions with different indexes is slow.
 15/06/25 13:03:05 ERROR Executor: Exception in task 3.0 in stage 8.0 (TID
 227)
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at
 org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:90)
 at
 org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:87)
 at
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:140)
 at
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:133)
 at
 org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:159)
 at
 org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:156)
 at
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.graphx.VertexRDD.compute(VertexRDD.scala:71)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)






 --
 Best regards, Roman Sokolov






Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-26 Thread Robin East
You’ll get this issue if you just take the first 2000 lines of that file. The 
problem is triangleCount() expects srcId < dstId, which is not the case in the 
file (e.g. vertex 28). You can get round this by calling 
graph.convertToCanonicalEdges(), which removes bi-directional edges and ensures 
srcId < dstId. Which version of Spark are you on? Can’t remember what version 
that method was introduced in.

Robin
 On 26 Jun 2015, at 09:44, Roman Sokolov ole...@gmail.com wrote:
 
 Ok, but what does it mean? I did not change the core files of Spark, so is 
 it a bug there?
 PS: on small datasets (500 MB) I have no problem.
 
 On 25.06.2015 at 18:02, Ted Yu yuzhih...@gmail.com wrote:
 The assertion failure from TriangleCount.scala corresponds with the following 
 lines:
 
 g.outerJoinVertices(counters) {
   (vid, _, optCounter: Option[Int]) =>
 val dblCount = optCounter.getOrElse(0)
 // double count should be even (divisible by two)
 assert((dblCount & 1) == 0)
 
 Cheers
 
 On Thu, Jun 25, 2015 at 6:20 AM, Roman Sokolov ole...@gmail.com wrote:
 Hello!
 I am trying to compute the number of triangles with GraphX, but I get a memory
 or heap-size error even though the dataset is very small (1 GB). I run the
 code in spark-shell on a 16 GB RAM machine (also tried with 2 workers on
 separate machines with 8 GB RAM each). So I have 15x more memory than the
 dataset size, but it is not enough. What should I do with terabyte-sized
 datasets? How do people process them? I have read a lot of documentation and
 2 Spark books, and still have no clue :(
 
 Tried to run with the options, no effect:
 ./bin/spark-shell --executor-memory 6g --driver-memory 9g 
 --total-executor-cores 100
 
 The code is simple:
 
 val graph = GraphLoader.edgeListFile(sc,
 "/home/ubuntu/data/soc-LiveJournal1/lj.stdout",
 edgeStorageLevel = StorageLevel.MEMORY_AND_DISK_SER,
 vertexStorageLevel =
 StorageLevel.MEMORY_AND_DISK_SER).partitionBy(PartitionStrategy.RandomVertexCut)
 
 println(graph.numEdges)
 println(graph.numVertices)
 
 val triangleNum = graph.triangleCount().vertices.map(x => x._2).reduce(_ + 
 _)/3
 
 (dataset is from here: 
 http://konect.uni-koblenz.de/downloads/tsv/soc-LiveJournal1.tar.bz2 first 
 two lines contain % characters, so have to be removed).
 
 
 UPD: today tried on 32Gb machine (from spark shell again), now got another 
 error:
 
 [Stage 8: (0 + 4) / 
 32]15/06/25 13:03:05 WARN ShippableVertexPartitionOps: Joining two 
 VertexPartitions with different indexes is slow.
 15/06/25 13:03:05 ERROR Executor: Exception in task 3.0 in stage 8.0 (TID 227)
 java.lang.AssertionError: assertion failed
   at scala.Predef$.assert(Predef.scala:165)
   at 
 org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:90)
   at 
 org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:87)
   at 
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:140)
   at 
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:133)
   at 
 org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:159)
   at 
 org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:156)
   at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at org.apache.spark.graphx.VertexRDD.compute(VertexRDD.scala:71)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 
 
 
 
 
 
 -- 
 Best regards, Roman Sokolov
 
 
 



Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-26 Thread Roman Sokolov
Yep, I already found it. So I added 1 line:

val graph = GraphLoader.edgeListFile(sc, , ...)
val newgraph = graph.convertToCanonicalEdges()

and could successfully count triangles on newgraph. Next will test it on
bigger (several Gb) networks.

I am using Spark 1.3 and 1.4 but haven't seen this function in
https://spark.apache.org/docs/latest/graphx-programming-guide.html

Thanks a lot guys!
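For intuition, what convertToCanonicalEdges() achieves can be modelled in plain, Spark-free Python. This is an illustrative sketch, not the GraphX implementation; in particular, dropping self-loops is a safety choice of this sketch, not a documented GraphX behaviour:

```python
def convert_to_canonical_edges(edges):
    """Model of the canonicalisation step: orient each edge so that
    src < dst and collapse duplicates, so every undirected edge appears
    exactly once -- the srcId < dstId precondition triangleCount()
    asserts on."""
    canonical = set()
    for src, dst in edges:
        if src == dst:
            continue  # skip self-loops (a choice of this sketch)
        canonical.add((min(src, dst), max(src, dst)))
    return sorted(canonical)

# (1, 2) and (2, 1) collapse into the single canonical edge (1, 2):
print(convert_to_canonical_edges([(1, 2), (2, 1), (3, 3), (2, 3)]))
# → [(1, 2), (2, 3)]
```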
On 26.06.2015 at 13:50, Ted Yu yuzhih...@gmail.com wrote:

 See SPARK-4917 which went into Spark 1.3.0

 On Fri, Jun 26, 2015 at 2:27 AM, Robin East robin.e...@xense.co.uk
 wrote:

 You’ll get this issue if you just take the first 2000 lines of that file.
 The problem is triangleCount() expects srcId < dstId, which is not the case
 in the file (e.g. vertex 28). You can get round this by calling
 graph.convertToCanonicalEdges(), which removes bi-directional edges and
 ensures srcId < dstId. Which version of Spark are you on? Can’t remember
 what version that method was introduced in.

 Robin

 On 26 Jun 2015, at 09:44, Roman Sokolov ole...@gmail.com wrote:

 Ok, but what does it mean? I did not change the core files of Spark, so
 is it a bug there?
 PS: on small datasets (500 MB) I have no problem.
 On 25.06.2015 at 18:02, Ted Yu yuzhih...@gmail.com wrote:

 The assertion failure from TriangleCount.scala corresponds with the
 following lines:

 g.outerJoinVertices(counters) {
   (vid, _, optCounter: Option[Int]) =>
 val dblCount = optCounter.getOrElse(0)
 // double count should be even (divisible by two)
 assert((dblCount & 1) == 0)

 Cheers

 On Thu, Jun 25, 2015 at 6:20 AM, Roman Sokolov ole...@gmail.com wrote:

 Hello!
 I am trying to compute the number of triangles with GraphX, but I get a memory
 or heap-size error even though the dataset is very small (1 GB). I run the
 code in spark-shell on a 16 GB RAM machine (also tried with 2 workers on
 separate machines with 8 GB RAM each). So I have 15x more memory than the
 dataset size, but it is not enough. What should I do with terabyte-sized
 datasets? How do people process them? I have read a lot of documentation and
 2 Spark books, and still have no clue :(

 Tried to run with the options, no effect:
 ./bin/spark-shell --executor-memory 6g --driver-memory 9g
 --total-executor-cores 100

 The code is simple:

 val graph = GraphLoader.edgeListFile(sc,
 "/home/ubuntu/data/soc-LiveJournal1/lj.stdout",
 edgeStorageLevel = StorageLevel.MEMORY_AND_DISK_SER,
 vertexStorageLevel =
 StorageLevel.MEMORY_AND_DISK_SER).partitionBy(PartitionStrategy.RandomVertexCut)

 println(graph.numEdges)
 println(graph.numVertices)

 val triangleNum = graph.triangleCount().vertices.map(x =>
 x._2).reduce(_ + _)/3

 (dataset is from here:
 http://konect.uni-koblenz.de/downloads/tsv/soc-LiveJournal1.tar.bz2 first
 two lines contain % characters, so have to be removed).


 UPD: today tried on 32Gb machine (from spark shell again), now got
 another error:

 [Stage 8: (0 +
 4) / 32]15/06/25 13:03:05 WARN ShippableVertexPartitionOps: Joining two
 VertexPartitions with different indexes is slow.
 15/06/25 13:03:05 ERROR Executor: Exception in task 3.0 in stage 8.0
 (TID 227)
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at
 org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:90)
 at
 org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:87)
 at
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:140)
 at
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:133)
 at
 org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:159)
 at
 org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:156)
 at
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.graphx.VertexRDD.compute(VertexRDD.scala:71)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)






 --
 Best regards, Roman Sokolov








Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-25 Thread Roman Sokolov
Hello!
I am trying to compute the number of triangles with GraphX, but I get a memory
or heap-size error even though the dataset is very small (1 GB). I run the
code in spark-shell on a 16 GB RAM machine (also tried with 2 workers on
separate machines with 8 GB RAM each). So I have 15x more memory than the
dataset size, but it is not enough. What should I do with terabyte-sized
datasets? How do people process them? I have read a lot of documentation and
2 Spark books, and still have no clue :(

Tried to run with the options, no effect:
./bin/spark-shell --executor-memory 6g --driver-memory 9g
--total-executor-cores 100

The code is simple:

val graph = GraphLoader.edgeListFile(sc,
"/home/ubuntu/data/soc-LiveJournal1/lj.stdout",
edgeStorageLevel = StorageLevel.MEMORY_AND_DISK_SER,
vertexStorageLevel =
StorageLevel.MEMORY_AND_DISK_SER).partitionBy(PartitionStrategy.RandomVertexCut)

println(graph.numEdges)
println(graph.numVertices)

val triangleNum = graph.triangleCount().vertices.map(x => x._2).reduce(_ +
_)/3

(dataset is from here:
http://konect.uni-koblenz.de/downloads/tsv/soc-LiveJournal1.tar.bz2 first
two lines contain % characters, so have to be removed).
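As a sanity check on the sum-and-divide-by-3 step, here is a tiny Spark-free Python model of per-vertex triangle counts (illustrative only; the function name and toy graph are made up for this sketch):

```python
from itertools import combinations

def per_vertex_triangles(edges):
    """Per-vertex triangle counts for an undirected graph: for each vertex,
    count pairs of its neighbours that are themselves connected. Summing
    over all vertices counts each triangle three times, hence the /3."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return {v: sum(1 for u, w in combinations(sorted(nbrs), 2) if w in adj[u])
            for v, nbrs in adj.items()}

# One triangle (1, 2, 3) plus a pendant edge (3, 4):
counts = per_vertex_triangles([(1, 2), (2, 3), (1, 3), (3, 4)])
print(counts, sum(counts.values()) // 3)
# → {1: 1, 2: 1, 3: 1, 4: 0} 1
```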


UPD: today tried on 32Gb machine (from spark shell again), now got another
error:

[Stage 8: (0 + 4)
/ 32]15/06/25 13:03:05 WARN ShippableVertexPartitionOps: Joining two
VertexPartitions with different indexes is slow.
15/06/25 13:03:05 ERROR Executor: Exception in task 3.0 in stage 8.0 (TID
227)
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at
org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:90)
at
org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:87)
at
org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:140)
at
org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:133)
at
org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:159)
at
org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:156)
at
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.graphx.VertexRDD.compute(VertexRDD.scala:71)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)






-- 
Best regards, Roman Sokolov


Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-25 Thread Ted Yu
The assertion failure from TriangleCount.scala corresponds with the
following lines:

g.outerJoinVertices(counters) {
  (vid, _, optCounter: Option[Int]) =>
    val dblCount = optCounter.getOrElse(0)
    // double count should be even (divisible by two)
    assert((dblCount & 1) == 0)

Cheers

On Thu, Jun 25, 2015 at 6:20 AM, Roman Sokolov ole...@gmail.com wrote:

 Hello!
 I am trying to compute the number of triangles with GraphX, but I get a memory
 or heap-size error even though the dataset is very small (1 GB). I run the
 code in spark-shell on a 16 GB RAM machine (also tried with 2 workers on
 separate machines with 8 GB RAM each). So I have 15x more memory than the
 dataset size, but it is not enough. What should I do with terabyte-sized
 datasets? How do people process them? I have read a lot of documentation and
 2 Spark books, and still have no clue :(

 Tried to run with the options, no effect:
 ./bin/spark-shell --executor-memory 6g --driver-memory 9g
 --total-executor-cores 100

 The code is simple:

 val graph = GraphLoader.edgeListFile(sc,
 "/home/ubuntu/data/soc-LiveJournal1/lj.stdout",
 edgeStorageLevel = StorageLevel.MEMORY_AND_DISK_SER,
 vertexStorageLevel =
 StorageLevel.MEMORY_AND_DISK_SER).partitionBy(PartitionStrategy.RandomVertexCut)

 println(graph.numEdges)
 println(graph.numVertices)

 val triangleNum = graph.triangleCount().vertices.map(x => x._2).reduce(_ +
 _)/3

 (dataset is from here:
 http://konect.uni-koblenz.de/downloads/tsv/soc-LiveJournal1.tar.bz2 first
 two lines contain % characters, so have to be removed).


 UPD: today tried on 32Gb machine (from spark shell again), now got another
 error:

 [Stage 8: (0 + 4)
 / 32]15/06/25 13:03:05 WARN ShippableVertexPartitionOps: Joining two
 VertexPartitions with different indexes is slow.
 15/06/25 13:03:05 ERROR Executor: Exception in task 3.0 in stage 8.0 (TID
 227)
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at
 org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:90)
 at
 org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:87)
 at
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:140)
 at
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:133)
 at
 org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:159)
 at
 org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:156)
 at
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.graphx.VertexRDD.compute(VertexRDD.scala:71)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)






 --
 Best regards, Roman Sokolov





Re: Spark job throwing “java.lang.OutOfMemoryError: GC overhead limit exceeded”

2015-06-15 Thread Deng Ching-Mallete
Hi Raj,

Since the number of executor cores equals the number of tasks that can be
executed in parallel in an executor, the 6G of executor memory configured
for an executor is in effect shared by 6 tasks, and it also has to cover the
memory allocated for caching & task execution. I would suggest increasing the
executor-memory, and adjusting it again if you're going to increase the
number of executor cores.

You might also want to adjust the memory allocation for caching and task
execution, via the spark.storage.memoryFraction config. By default, it's
configured to 0.6 (60% of the memory is allocated for the cache). Lowering
it to a smaller fraction, say 0.4 or 0.3, would give you more available
memory for task executions.
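As a rough illustration of that arithmetic, here is a hypothetical helper (not a Spark API); it follows the static memory model of Spark 1.x as described above and ignores safety fractions and off-heap overheads:

```python
def per_task_execution_mb(executor_memory_mb, executor_cores,
                          storage_fraction=0.6):
    """Back-of-the-envelope estimate: spark.storage.memoryFraction reserves
    a slice of executor memory for the cache, and the concurrently running
    tasks (one per executor core) share what is left."""
    execution_pool_mb = executor_memory_mb * (1 - storage_fraction)
    return round(execution_pool_mb / executor_cores, 1)

# 6 GB executor with 6 cores and the default 0.6 storage fraction:
print(per_task_execution_mb(6144, 6))       # → 409.6 MB per task
# Lowering the storage fraction to 0.3 frees up considerably more:
print(per_task_execution_mb(6144, 6, 0.3))  # → 716.8 MB per task
```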

Hope this helps!

Thanks,
Deng

On Tue, Jun 16, 2015 at 3:09 AM, diplomatic Guru diplomaticg...@gmail.com
wrote:

 Hello All,


 I have a Spark job that throws java.lang.OutOfMemoryError: GC overhead
 limit exceeded.

 The job is trying to process a file of size 4.5 GB.

 I've tried following spark configuration:

 --num-executors 6  --executor-memory 6G --executor-cores 6 --driver-memory 3G

 I tried increasing the number of cores and executors, which sometimes works,
 but then it takes over 20 minutes to process the file.

 Could I do something to improve the performance or stop the Java heap
 issue?


 Thank you.


 Best regards,


 Raj.



Spark job throwing “java.lang.OutOfMemoryError: GC overhead limit exceeded”

2015-06-15 Thread diplomatic Guru
Hello All,


I have a Spark job that throws java.lang.OutOfMemoryError: GC overhead
limit exceeded.

The job is trying to process a file of size 4.5 GB.

I've tried following spark configuration:

--num-executors 6  --executor-memory 6G --executor-cores 6 --driver-memory 3G

I tried increasing the number of cores and executors, which sometimes works,
but then it takes over 20 minutes to process the file.

Could I do something to improve the performance or stop the Java heap
issue?


Thank you.


Best regards,


Raj.


Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-28 Thread Guru Medasani
Hi Antony, Did you get past this error by repartitioning your job into 
smaller tasks, as Sven Krasser pointed out?

From:  Antony Mayi antonym...@yahoo.com
Reply-To:  Antony Mayi antonym...@yahoo.com
Date:  Tuesday, January 27, 2015 at 5:24 PM
To:  Guru Medasani gdm...@outlook.com, Sven Krasser kras...@gmail.com
Cc:  Sandy Ryza sandy.r...@cloudera.com, user@spark.apache.org 
user@spark.apache.org
Subject:  Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

I have yarn configured with yarn.nodemanager.vmem-check-enabled=false and 
yarn.nodemanager.pmem-check-enabled=false to avoid yarn killing the 
containers.

the stack trace is below.

thanks,
Antony.

15/01/27 17:02:53 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
15/01/27 17:02:53 ERROR executor.Executor: Exception in task 21.0 in stage 12.0 (TID 1312)
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.Integer.valueOf(Integer.java:642)
        at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70)
        at scala.collection.mutable.ArrayOps$ofInt.apply(ArrayOps.scala:156)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:156)
        at scala.collection.SeqLike$class.distinct(SeqLike.scala:493)
        at scala.collection.mutable.ArrayOps$ofInt.distinct(ArrayOps.scala:156)
        at org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$makeOutLinkBlock(ALS.scala:404)
        at org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:459)
        at org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:456)
        at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
        at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
        at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:130)
        at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:127)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:127)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
15/01/27 17:02:53 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-8,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.Integer.valueOf(Integer.java:642)
        at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70)
        at scala.collection.mutable.ArrayOps$ofInt.apply(ArrayOps.scala:156)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:156)
        at scala.collection.SeqLike$class.distinct(SeqLike.scala:493)
        at scala.collection.mutable.ArrayOps$ofInt.distinct(ArrayOps.scala:156)
        at org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$makeOutLinkBlock(ALS.scala:404)
        at org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:459)
        at org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:456)
        at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
        at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Sandy Ryza
Hi Antony,

If you look in the YARN NodeManager logs, do you see that it's killing the
executors?  Or are they crashing for a different reason?

-Sandy

On Tue, Jan 27, 2015 at 12:43 PM, Antony Mayi antonym...@yahoo.com.invalid
wrote:

 Hi,

 I am using spark.yarn.executor.memoryOverhead=8192 yet getting executors
 crashed with this error.

 does that mean I have genuinely not enough RAM or is this matter of config
 tuning?

 other config options used:
 spark.storage.memoryFraction=0.3
 SPARK_EXECUTOR_MEMORY=14G

 running spark 1.2.0 as yarn-client on cluster of 10 nodes (the workload is
 ALS trainImplicit on ~15GB dataset)

 thanks for any ideas,
 Antony.



java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Antony Mayi
Hi,

I am using spark.yarn.executor.memoryOverhead=8192, yet executors still
crash with this error.

Does that mean I genuinely don't have enough RAM, or is this a matter of
config tuning?

Other config options used:
spark.storage.memoryFraction=0.3
SPARK_EXECUTOR_MEMORY=14G

Running Spark 1.2.0 as yarn-client on a cluster of 10 nodes (the workload is
ALS trainImplicit on a ~15GB dataset).

Thanks for any ideas,
Antony.

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Guru Medasani
Hi Anthony,

What is the setting of the total amount of memory in MB that can be 
allocated to containers on your NodeManagers?

yarn.nodemanager.resource.memory-mb

Can you check this above configuration in yarn-site.xml used by the node 
manager process?

-Guru Medasani

From:  Sandy Ryza sandy.r...@cloudera.com
Date:  Tuesday, January 27, 2015 at 3:33 PM
To:  Antony Mayi antonym...@yahoo.com
Cc:  user@spark.apache.org user@spark.apache.org
Subject:  Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

Hi Antony,

If you look in the YARN NodeManager logs, do you see that it's killing the 
executors?  Or are they crashing for a different reason?

-Sandy

On Tue, Jan 27, 2015 at 12:43 PM, Antony Mayi 
antonym...@yahoo.com.invalid wrote:
Hi,

I am using spark.yarn.executor.memoryOverhead=8192 yet getting executors 
crashed with this error.

does that mean I have genuinely not enough RAM or is this matter of config 
tuning?

other config options used:
spark.storage.memoryFraction=0.3
SPARK_EXECUTOR_MEMORY=14G

running spark 1.2.0 as yarn-client on cluster of 10 nodes (the workload is 
ALS trainImplicit on ~15GB dataset)

thanks for any ideas,
Antony.
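Guru's container-sizing question can be made concrete with a quick budget
check (a rough illustrative sketch; the 14g heap and 8192 MB overhead come
from Antony's settings, while the NodeManager sizes used below are assumed
example values, not anything reported in this thread):

```python
# Rough YARN container-sizing check. An executor container must fit
# heap + memoryOverhead within yarn.nodemanager.resource.memory-mb,
# otherwise YARN cannot grant (or will kill) the container.

def container_fits(executor_memory_mb, memory_overhead_mb, nodemanager_mb):
    """True if one executor container fits on the NodeManager."""
    return executor_memory_mb + memory_overhead_mb <= nodemanager_mb

# Antony's settings: 14g heap + 8192 MB overhead per container.
needed = 14 * 1024 + 8192
print(needed)  # 22528 MB required per executor container

print(container_fits(14 * 1024, 8192, 24 * 1024))  # True
print(container_fits(14 * 1024, 8192, 16 * 1024))  # False
```

If yarn.nodemanager.resource.memory-mb is below roughly 22.5 GB here, the
containers cannot be granted the requested memory in the first place.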




Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Sven Krasser
Since it's an executor running OOM it doesn't look like a container being
killed by YARN to me. As a starting point, can you repartition your job
into smaller tasks?
-Sven
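The repartitioning suggestion can be illustrated with a pure-Python sketch
of the idea (this is not Spark's implementation; `partition` is a
hypothetical helper showing why more, smaller partitions mean less data
held per task):

```python
# More partitions => fewer records (and less memory) held per task.
def partition(records, num_partitions):
    """Hash-partition records, roughly what repartition() does conceptually."""
    parts = [[] for _ in range(num_partitions)]
    for r in records:
        parts[hash(r) % num_partitions].append(r)
    return parts

records = list(range(1000))
coarse = partition(records, 4)   # ~250 records per task
fine = partition(records, 40)    # ~25 records per task
print(max(len(p) for p in coarse))  # 250
print(max(len(p) for p in fine))    # 25
```

Since a task materializes buffers roughly proportional to its partition's
size, raising the partition count is often the cheapest way to relieve
executor heap pressure.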

On Tue, Jan 27, 2015 at 2:34 PM, Guru Medasani gdm...@outlook.com wrote:

 Hi Anthony,

 What is the setting of the total amount of memory in MB that can be
 allocated to containers on your NodeManagers?

 yarn.nodemanager.resource.memory-mb

 Can you check this above configuration in yarn-site.xml used by the node
 manager process?

 -Guru Medasani

 From: Sandy Ryza sandy.r...@cloudera.com
 Date: Tuesday, January 27, 2015 at 3:33 PM
 To: Antony Mayi antonym...@yahoo.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

 Hi Antony,

 If you look in the YARN NodeManager logs, do you see that it's killing the
 executors?  Or are they crashing for a different reason?

 -Sandy

 On Tue, Jan 27, 2015 at 12:43 PM, Antony Mayi 
 antonym...@yahoo.com.invalid wrote:

 Hi,

 I am using spark.yarn.executor.memoryOverhead=8192 yet getting executors
 crashed with this error.

 does that mean I have genuinely not enough RAM or is this matter of
 config tuning?

 other config options used:
 spark.storage.memoryFraction=0.3
 SPARK_EXECUTOR_MEMORY=14G

 running spark 1.2.0 as yarn-client on cluster of 10 nodes (the workload
 is ALS trainImplicit on ~15GB dataset)

 thanks for any ideas,
 Antony.





-- 
http://sites.google.com/site/krasser/?utm_source=sig


Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Guru Medasani
Can you attach the logs where this is failing?

From:  Sven Krasser kras...@gmail.com
Date:  Tuesday, January 27, 2015 at 4:50 PM
To:  Guru Medasani gdm...@outlook.com
Cc:  Sandy Ryza sandy.r...@cloudera.com, Antony Mayi 
antonym...@yahoo.com, user@spark.apache.org user@spark.apache.org
Subject:  Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

Since it's an executor running OOM it doesn't look like a container being 
killed by YARN to me. As a starting point, can you repartition your job 
into smaller tasks?
-Sven

On Tue, Jan 27, 2015 at 2:34 PM, Guru Medasani gdm...@outlook.com wrote:
Hi Anthony,

What is the setting of the total amount of memory in MB that can be 
allocated to containers on your NodeManagers?

yarn.nodemanager.resource.memory-mb

Can you check this above configuration in yarn-site.xml used by the node 
manager process?

-Guru Medasani

From:  Sandy Ryza sandy.r...@cloudera.com
Date:  Tuesday, January 27, 2015 at 3:33 PM
To:  Antony Mayi antonym...@yahoo.com
Cc:  user@spark.apache.org user@spark.apache.org
Subject:  Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

Hi Antony,

If you look in the YARN NodeManager logs, do you see that it's killing the 
executors?  Or are they crashing for a different reason?

-Sandy

On Tue, Jan 27, 2015 at 12:43 PM, Antony Mayi 
antonym...@yahoo.com.invalid wrote:
Hi,

I am using spark.yarn.executor.memoryOverhead=8192 yet getting executors 
crashed with this error.

does that mean I have genuinely not enough RAM or is this matter of config 
tuning?

other config options used:
spark.storage.memoryFraction=0.3
SPARK_EXECUTOR_MEMORY=14G

running spark 1.2.0 as yarn-client on cluster of 10 nodes (the workload is 
ALS trainImplicit on ~15GB dataset)

thanks for any ideas,
Antony.




-- 
http://sites.google.com/site/krasser/?utm_source=sig



Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Antony Mayi
I have yarn configured with yarn.nodemanager.vmem-check-enabled=false and 
yarn.nodemanager.pmem-check-enabled=false to avoid yarn killing the containers.
The stack trace is below.

Thanks,
Antony.
15/01/27 17:02:53 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
15/01/27 17:02:53 ERROR executor.Executor: Exception in task 21.0 in stage 12.0 (TID 1312)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Integer.valueOf(Integer.java:642)
at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70)
at scala.collection.mutable.ArrayOps$ofInt.apply(ArrayOps.scala:156)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:156)
at scala.collection.SeqLike$class.distinct(SeqLike.scala:493)
at scala.collection.mutable.ArrayOps$ofInt.distinct(ArrayOps.scala:156)
at org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$makeOutLinkBlock(ALS.scala:404)
at org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:459)
at org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:456)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:130)
at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:127)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:127)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
15/01/27 17:02:53 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-8,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Integer.valueOf(Integer.java:642)
at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70)
at scala.collection.mutable.ArrayOps$ofInt.apply(ArrayOps.scala:156)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:156)
at scala.collection.SeqLike$class.distinct(SeqLike.scala:493)
at scala.collection.mutable.ArrayOps$ofInt.distinct(ArrayOps.scala:156)
at org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$makeOutLinkBlock(ALS.scala:404)
at org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:459)
at org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:456)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:130)
at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:127)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

[Spark Streaming] java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-09-08 Thread Yan Fang
Hi guys,

My Spark Streaming application hits this java.lang.OutOfMemoryError: GC
overhead limit exceeded error in the Spark Streaming driver program. I have
done the following to debug it:

1. Increased the driver memory from 1GB to 2GB; the error then came after 22
hrs. When the memory was 1GB, it came after 10 hrs. So I think it is a
memory leak problem.

2. A few hours after starting the application, I killed all workers. The
driver program kept running and also kept filling up its memory. I was
thinking it was because of too many batches in the queue, but obviously it
is not: otherwise, after killing the workers (and of course the receiver),
the memory usage should have gone down.

3. Ran a heap dump and the Leak Suspects analysis in Eclipse Memory
Analyzer, which found that

*One instance of org.apache.spark.storage.BlockManager loaded by
sun.misc.Launcher$AppClassLoader @ 0x6c002fb90 occupies 1,477,177,296
(72.70%) bytes. The memory is accumulated in one instance of
java.util.LinkedHashMap loaded by system class loader.*

*Keywords*
*sun.misc.Launcher$AppClassLoader @ 0x6c002fb90**java.util.LinkedHashMap*
*org.apache.spark.storage.BlockManager *



What my application mainly does is :

1. calculate the sum/count in a batch
2. get the average in the batch
3. store the result in DB

4. calculate the sum/count in a window
5. get the average/min/max in the window
6. store the result in DB

7. compare the current batch value with previous batch value using
updateStateByKey.


Any hint what causes this leakage? Thank you.

Cheers,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108
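The batch/window arithmetic in steps 1-6 above can be sketched in plain
Python (this illustrates only the math, not the Spark Streaming API; the
three-batch window and the sample data are arbitrary example values):

```python
from collections import deque

def batch_stats(batch):
    """Steps 1-2: sum/count in a batch, then the batch average."""
    total, count = sum(batch), len(batch)
    return total, count, total / count

window = deque(maxlen=3)  # keep (sum, count) pairs for the last 3 batches

for batch in ([1, 2, 3], [4, 5, 6], [10, 20]):
    total, count, avg = batch_stats(batch)
    window.append((total, count))           # steps 4-5: window aggregates
    w_total = sum(t for t, _ in window)
    w_count = sum(c for _, c in window)
    print(avg, w_total / w_count)           # batch avg, window avg
```

Keeping only (sum, count) pairs per batch, rather than the raw values, is
what keeps windowed state small; unbounded state (for example, keys that are
never dropped in updateStateByKey) is a common source of exactly this kind
of steady memory growth.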


Spark app throwing java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-08-04 Thread buntu
I've got a 40-node CDH 5.1 cluster and am attempting to run a simple Spark
app that processes about 10-15GB of raw data, but I keep running into this
error:

  java.lang.OutOfMemoryError: GC overhead limit exceeded

Each node has 8 cores and 2GB of memory. I notice the heap size on the
executors is set to 512MB, while the total heap available to each executor
is 2GB. I wanted to know what the heap size needs to be set to for such
data sizes, and whether anyone has input on other config changes that would
help as well.

Thanks for the input!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-app-throwing-java-lang-OutOfMemoryError-GC-overhead-limit-exceeded-tp11350.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark app throwing java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-08-04 Thread Sean Owen
(- incubator list, + user list)
(Answer copied from original posting at
http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-app-throwing-java-lang-OutOfMemoryError-GC-overhead-limit/m-p/16396#U16396
-- let's follow up one place. If it's not specific to CDH, this is a
good place to record a solution.)

Where does the out of memory exception occur? in your driver, or an
executor? I assume it is an executor. Yes, you are using the default
of 512MB per executor. You can raise that with properties like
spark.executor.memory, or flags like --executor-memory if using
spark-shell.

It sounds like your workers are allocating 2GB for executors, so you
could potentially use up to 2GB per executor and your 1 executor per
machine would consume all of your Spark cluster memory.

But more memory doesn't necessarily help if you're performing some
operation that inherently allocates a great deal of memory. I'm not
sure what your operations are. Keep in mind too that if you are
caching RDDs in memory, this is taking memory away from what's
available for computations.
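Sean's last point, that cached RDDs take memory away from computation,
follows from the static heap split controlled by spark.storage.memoryFraction
in Spark of this era; a minimal sketch of the arithmetic, assuming the
then-default fraction of 0.6:

```python
def compute_budget_mb(executor_heap_mb, storage_fraction):
    """Heap left for computation after reserving the RDD-cache region
    (models the old static spark.storage.memoryFraction split)."""
    return executor_heap_mb * (1 - storage_fraction)

# Default 512 MB executor heap vs. raising it to the full 2 GB the
# workers offer, with an assumed storage fraction of 0.6:
print(compute_budget_mb(512, 0.6))   # roughly 205 MB left for computation
print(compute_budget_mb(2048, 0.6))  # roughly 819 MB left for computation
```

So raising --executor-memory helps twice over: the cache region grows, and
so does the slice left for shuffle buffers and task working sets.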

On Mon, Aug 4, 2014 at 7:03 PM, buntu buntu...@gmail.com wrote:
 I got a 40 node cdh 5.1 cluster and attempting to run a simple spark app that
 processes about 10-15GB raw data but I keep running into this error:

   java.lang.OutOfMemoryError: GC overhead limit exceeded

 Each node has 8 cores and 2GB memory. I notice the heap size on the
 executors is set to 512MB with total heap size on each executor is set to
 2GB. Wanted to know whats the heap size needs to be set to for such data
 sizes and if anyone had input on other config changes that will help as
 well.

 Thanks for the input!



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-app-throwing-java-lang-OutOfMemoryError-GC-overhead-limit-exceeded-tp11350.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-07-21 Thread Abel Coronado Iruegas
Hi Yifan

This works for me:

export SPARK_JAVA_OPTS="-Xms10g -Xmx40g -XX:MaxPermSize=10g"
export ADD_JARS=/home/abel/spark/MLI/target/MLI-assembly-1.0.jar
export SPARK_MEM=40g
./spark-shell

Regards


On Mon, Jul 21, 2014 at 7:48 AM, Yifan LI iamyifa...@gmail.com wrote:

 Hi,

 I am trying to load the Graphx example dataset(LiveJournal, 1.08GB)
 through *Scala Shell* on my standalone multicore machine(8 cpus, 16GB
 mem), but an OutOfMemory error was returned when below code was running,

 val graph = GraphLoader.edgeListFile(sc, path, minEdgePartitions =
 16).partitionBy(PartitionStrategy.RandomVertexCut)

 I guess I should set some parameters to JVM? like -Xmx5120m
 But how to do this in Scala Shell?
 I directly used the bin/spark-shell to start spark and seems everything
 works correctly in WebUI.

 Or, I should do parameters setting at somewhere(spark-1.0.1)?



 Best,
 Yifan LI



Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-07-21 Thread Yifan LI
Thanks, Abel.


Best,
Yifan LI
On Jul 21, 2014, at 4:16 PM, Abel Coronado Iruegas acoronadoirue...@gmail.com 
wrote:

 Hi Yifan
 
 This works for me:
 
 export SPARK_JAVA_OPTS=-Xms10g -Xmx40g -XX:MaxPermSize=10g
 export ADD_JARS=/home/abel/spark/MLI/target/MLI-assembly-1.0.jar
 export SPARK_MEM=40g
 ./spark-shell 
 
 
 Regards
 
 
 On Mon, Jul 21, 2014 at 7:48 AM, Yifan LI iamyifa...@gmail.com wrote:
 Hi,
 
 I am trying to load the Graphx example dataset(LiveJournal, 1.08GB) through 
 Scala Shell on my standalone multicore machine(8 cpus, 16GB mem), but an 
 OutOfMemory error was returned when below code was running,
 
 val graph = GraphLoader.edgeListFile(sc, path, minEdgePartitions = 
 16).partitionBy(PartitionStrategy.RandomVertexCut)
 
 I guess I should set some parameters to JVM? like -Xmx5120m
 But how to do this in Scala Shell? 
 I directly used the bin/spark-shell to start spark and seems everything 
 works correctly in WebUI.
 
 Or, I should do parameters setting at somewhere(spark-1.0.1)?
 
 
 
 Best,
 Yifan LI
 



java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Konstantin Kudryavtsev
Hi all,

I ran into the following exception during the map step:
java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit
exceeded)
java.lang.reflect.Array.newInstance(Array.java:70)
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:325)
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
I'm using Spark 1.0. In map I create a new object each time; as I
understand it, I can't reuse objects the way MapReduce development allows?
I wondered if you could point me to how to avoid this GC overhead... thank
you in advance.

Thank you,
Konstantin Kudryavtsev


Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Aaron Davidson
There is a difference from actual GC overhead, which can be reduced by
reusing objects, versus this error, which actually means you ran out of
memory. This error can probably be relieved by increasing your executor
heap size, unless your data is corrupt and it is allocating huge arrays, or
you are otherwise keeping too much memory around.

For your other question, you can reuse objects similar to MapReduce
(HadoopRDD does this by actually using Hadoop's Writables, for instance),
but the general Spark APIs don't support this because mutable objects are
not friendly to caching or serializing.
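The MapReduce-style reuse Aaron contrasts with can be sketched as follows
(a hypothetical Python analogue of Hadoop's Writable pattern; `IntHolder`
is an invented illustration, not a real API):

```python
# One mutable holder refilled per record, instead of allocating a new
# object per record -- the Hadoop Writable reuse pattern in spirit.
class IntHolder:
    def __init__(self):
        self.value = 0

    def set(self, v):
        self.value = v

holder = IntHolder()
total = 0
for v in range(5):
    holder.set(v)          # reuse: no per-record allocation
    total += holder.value
print(total)  # 10
```

As Aaron notes, this pattern is unsafe wherever the framework might cache
or serialize references to the object later, which is why the general Spark
APIs hand out fresh values instead.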


On Tue, Jul 8, 2014 at 9:27 AM, Konstantin Kudryavtsev 
kudryavtsev.konstan...@gmail.com wrote:

 Hi all,

 I faced with the next exception during map step:
 java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit
 exceeded)
 java.lang.reflect.Array.newInstance(Array.java:70)
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:325)
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
 com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 I'm using Spark 1.0In map I create new object each time, as I understand
 I can't reuse object similar to MapReduce development? I wondered, if you
 could point me how is it possible to avoid GC overhead...thank you in
 advance

 Thank you,
 Konstantin Kudryavtsev



Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Jerry Lam
Hi Konstantin,

I just ran into the same problem. I mitigated the issue by reducing the
number of cores when I executed the job, which otherwise would not be able
to finish.

Contrary to what many people believe, it does not necessarily mean that you
were running out of memory. A better answer can be found here:
http://stackoverflow.com/questions/4371505/gc-overhead-limit-exceeded and is
copied here as a reference:

Excessive GC Time and OutOfMemoryError

The concurrent collector will throw an OutOfMemoryError if too much time is
being spent in garbage collection: if more than 98% of the total time is
spent in garbage collection and less than 2% of the heap is recovered, an
OutOfMemoryError will be thrown. This feature is designed to prevent
applications from running for an extended period of time while making
little or no progress because the heap is too small. If necessary, this
feature can be disabled by adding the option -XX:-UseGCOverheadLimit to the
command line.

The policy is the same as that in the parallel collector, except that time
spent performing concurrent collections is not counted toward the 98% time
limit. In other words, only collections performed while the application is
stopped count toward excessive GC time. Such collections are typically due
to a concurrent mode failure or an explicit collection request (e.g., a
call to System.gc()).

It could be that there are many tasks running on the same node and they all
compete for running GCs, which slows things down and triggers the error you
saw. By reducing the number of cores, more CPU resources are available to
each task, so the GC can finish before the error gets thrown.

HTH,

Jerry
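The 98%/2% rule quoted above reduces to a simple predicate (an illustrative
sketch of the documented condition, not the JVM's actual bookkeeping):

```python
# The JVM throws "GC overhead limit exceeded" when more than 98% of
# total time goes to garbage collection while less than 2% of the heap
# is recovered.
def gc_overhead_limit_exceeded(gc_time_fraction, heap_recovered_fraction):
    return gc_time_fraction > 0.98 and heap_recovered_fraction < 0.02

print(gc_overhead_limit_exceeded(0.99, 0.01))  # True: thrashing, no progress
print(gc_overhead_limit_exceeded(0.50, 0.30))  # False: normal GC activity
```

As the quoted text says, the check can be disabled with
-XX:-UseGCOverheadLimit, though that tends to convert the error into a hang
or a plain heap-space OOM rather than fix the underlying pressure.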


On Tue, Jul 8, 2014 at 1:35 PM, Aaron Davidson ilike...@gmail.com wrote:

 There is a difference from actual GC overhead, which can be reduced by
 reusing objects, versus this error, which actually means you ran out of
 memory. This error can probably be relieved by increasing your executor
 heap size, unless your data is corrupt and it is allocating huge arrays, or
 you are otherwise keeping too much memory around.

 For your other question, you can reuse objects similar to MapReduce
 (HadoopRDD does this by actually using Hadoop's Writables, for instance),
 but the general Spark APIs don't support this because mutable objects are
 not friendly to caching or serializing.


 On Tue, Jul 8, 2014 at 9:27 AM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:

 Hi all,

 I faced with the next exception during map step:
 java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit
 exceeded)
 java.lang.reflect.Array.newInstance(Array.java:70)
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:325)
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
 com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply

Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Aaron Davidson
This seems almost equivalent to a heap size error -- since GCs are
stop-the-world events, the fact that we were unable to release more than 2%
of the heap suggests that almost all the memory is *currently in use* (i.e.,
live).

Decreasing the number of cores is another solution which decreases memory
pressure, because each core requires its own set of buffers (for instance,
each kryo serializer has a certain buffer allocated to it), and has its own
working set of data (some subset of a partition). Thus, decreasing the
number of used cores decreases memory contention.
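Aaron's per-core argument is just multiplication (a sketch; the 64 MB
per-core buffer figure is an assumed example value, not a Spark default):

```python
# Fixed per-task buffers (e.g. a Kryo serializer buffer per running
# task) scale with the number of concurrently used cores.
def buffer_memory_mb(cores, per_core_buffer_mb):
    return cores * per_core_buffer_mb

print(buffer_memory_mb(8, 64))  # 512 -- 8 concurrent tasks
print(buffer_memory_mb(4, 64))  # 256 -- halving cores halves this pressure
```

This is why cutting cores can rescue a job even when total executor memory
stays unchanged: the same heap is shared among fewer concurrent working
sets.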


On Tue, Jul 8, 2014 at 10:44 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi Konstantin,

 I just ran into the same problem. I mitigated the issue by reducing the
 number of cores when I executed the job which otherwise it won't be able to
 finish.

 Unlike many people believes, it might not means that you were running out
 of memory. A better answer can be found here:
 http://stackoverflow.com/questions/4371505/gc-overhead-limit-exceeded and
 copied here as a reference:

 Excessive GC Time and OutOfMemoryError

 The concurrent collector will throw an OutOfMemoryError if too much time
 is being spent in garbage collection: if more than 98% of the total time is
 spent in garbage collection and less than 2% of the heap is recovered, an
 OutOfMemoryError will be thrown. This feature is designed to prevent
 applications from running for an extended period of time while making
 little or no progress because the heap is too small. If necessary, this
 feature can be disabled by adding the option -XX:-UseGCOverheadLimit to the
 command line.

 The policy is the same as that in the parallel collector, except that time
 spent performing concurrent collections is not counted toward the 98% time
 limit. In other words, only collections performed while the application is
 stopped count toward excessive GC time. Such collections are typically due
 to a concurrent mode failure or an explicit collection request (e.g., a
 call to System.gc()).
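
 In Spark, that JVM flag would be passed to the executors rather than set on
 the driver's command line. A sketch, assuming a spark-defaults.conf and a
 Spark version that supports spark.executor.extraJavaOptions (note that
 disabling the limit usually just trades this error for a longer stall before
 a plain OutOfMemoryError):

 ```
 spark.executor.extraJavaOptions  -XX:-UseGCOverheadLimit -verbose:gc -XX:+PrintGCDetails
 ```

 The GC logging flags are optional, but they help confirm whether the heap
 really is almost entirely live before the error fires.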

 It could be that many tasks are running on the same node and they all
 compete for CPU while running GCs, which slows things down and triggers the
 error you saw. By reducing the number of cores, more CPU resources are
 available to each task, so the GC can finish before the error gets thrown.

 HTH,

 Jerry


 On Tue, Jul 8, 2014 at 1:35 PM, Aaron Davidson ilike...@gmail.com wrote:

 There is a difference between actual GC overhead, which can be reduced by
 reusing objects, and this error, which actually means you ran out of
 memory. This error can probably be relieved by increasing your executor
 heap size, unless your data is corrupt and it is allocating huge arrays, or
 you are otherwise keeping too much memory around.

 For your other question, you can reuse objects similarly to MapReduce
 (HadoopRDD does this by actually using Hadoop's Writables, for instance),
 but the general Spark APIs don't support this, because mutable objects are
 not friendly to caching or serialization.
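
 The caching hazard behind that last point can be shown in a few lines of
 plain Java. This is an illustrative sketch only -- MutableRecord and
 readWithReuse are hypothetical names, not Spark or Hadoop APIs:

 ```java
 import java.util.ArrayList;
 import java.util.List;

 // Why reusing one mutable record (as HadoopRDD does internally with
 // Hadoop Writables) breaks callers that cache references: every cached
 // entry aliases the same object and sees only the last value written.
 public class ReuseDemo {
     static class MutableRecord { int value; }

     // Simulates an iterator that "reads" n records into a single reused
     // buffer while the caller naively caches each reference.
     static List<MutableRecord> readWithReuse(int n) {
         MutableRecord shared = new MutableRecord();
         List<MutableRecord> cached = new ArrayList<>();
         for (int i = 0; i < n; i++) {
             shared.value = i;    // overwrite the shared buffer in place
             cached.add(shared);  // caches the reference, not a copy
         }
         return cached;
     }

     public static void main(String[] args) {
         // All three entries alias the same object: prints "2 2 2"
         for (MutableRecord r : readWithReuse(3)) {
             System.out.print(r.value + " ");
         }
         System.out.println();
     }
 }
 ```

 This is why callers of hadoopFile() in Spark are told to copy each
 Writable before caching or shuffling it.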


 On Tue, Jul 8, 2014 at 9:27 AM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:

 Hi all,

 I ran into the following exception during the map step:
 java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead
 limit exceeded)
 java.lang.reflect.Array.newInstance(Array.java:70)
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:325)
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
 com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34