OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread SLiZn Liu
Hi,

I am using *Spark SQL* to query my *Hive cluster*, following the Spark SQL
and DataFrame Guide
(https://spark.apache.org/docs/latest/sql-programming-guide.html) step by
step. However, my HiveQL query via sqlContext.sql() fails with
java.lang.OutOfMemoryError. The expected result of the query is small,
since it includes a limit 1000 clause. My code is shown below:

scala> import sqlContext.implicits._
scala> val df = sqlContext.sql("select * from some_table where logdate = '2015-03-24' limit 1000")

and the error message:

[ERROR] [03/25/2015 16:08:22.379] [sparkDriver-scheduler-27]
[ActorSystem(sparkDriver)] Uncaught fatal error from thread
[sparkDriver-scheduler-27] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded

The master heap is set with -Xms512m -Xmx512m, while the workers are set with
-Xms4096M -Xmx4096M, which I presume is sufficient for this trivial query.
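
For reference, a minimal sketch of how these heap sizes are usually supplied
on Spark 1.3 through spark-shell options rather than raw JVM flags; the
values below simply mirror the settings described above and are illustrative
only:

# driver heap (the JVM that raised this OOM) and per-executor heap on the workers
./bin/spark-shell --driver-memory 512m --executor-memory 4g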

Additionally, after restarting the spark-shell and re-running the query with
limit 5, the df object is returned and can be printed by df.show(), but other
APIs fail with OutOfMemoryError, namely df.count(),
df.select("some_field").show(), and so forth.

I understand that an RDD can be collected to the master so that further
transformations can be applied. Since DataFrames offer “richer optimizations
under the hood” and a familiar convention for an R/Julia user, I really hope
this error can be tackled, and that DataFrame is robust enough to depend on.
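
As a minimal Scala sketch of that distinction (assuming the df defined above;
take and show bring back only a few rows, while collect materializes the
whole result on the driver):

scala> df.take(5)                       // returns at most 5 rows to the driver
scala> df.select("some_field").show()   // runs on the cluster, prints the first 20 rows
scala> val rows = df.collect()          // pulls every row of the result into driver memory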

Thanks in advance!

REGARDS,
Todd

Re: OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread Ted Yu
Can you try giving the Spark driver more heap?
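
For example, a minimal sketch (assuming spark-shell on Spark 1.3; 2g is only
an illustrative value):

./bin/spark-shell --driver-memory 2g

or, equivalently, in conf/spark-defaults.conf:

spark.driver.memory  2g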

Cheers



Re: OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread Michael Armbrust
You should also try increasing the perm gen size: -XX:MaxPermSize=512m
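
For example, a minimal sketch of passing that flag to the driver JVM when
launching spark-shell (assuming Spark 1.3; the value is illustrative):

./bin/spark-shell --driver-java-options "-XX:MaxPermSize=512m"

or via conf/spark-defaults.conf:

spark.driver.extraJavaOptions  -XX:MaxPermSize=512m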
