Can you try giving the Spark driver more heap? The stack trace shows the OutOfMemoryError coming from the driver JVM (ActorSystem [sparkDriver]), so the 512m you give the master/driver is the limiting setting here, not the 4g on the workers.
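For example, either of the following should raise the driver heap (a sketch; the 2g figure is arbitrary, and I'm assuming you launch through bin/spark-shell):

  # pass the driver heap size at launch; 2g is just an illustration,
  # size it to roughly what you pull back to the driver
  ./bin/spark-shell --driver-memory 2g

  # or set it once in conf/spark-defaults.conf
  spark.driver.memory  2g

One caveat: in client mode the driver JVM is already running by the time your code executes, so setting spark.driver.memory on a SparkConf inside the shell has no effect; it has to be supplied at launch or in spark-defaults.conf.

Cheers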
> On Mar 25, 2015, at 2:14 AM, Todd Leo <sliznmail...@gmail.com> wrote:
>
> Hi,
>
> I am using Spark SQL to query my Hive cluster, following the Spark SQL and
> DataFrame Guide step by step. However, my HiveQL via sqlContext.sql() fails
> and java.lang.OutOfMemoryError is raised. The expected result of the query
> is small (I add a limit 1000 clause). My code is shown below:
>
> scala> import sqlContext.implicits._
>
> scala> val df = sqlContext.sql("""select * from some_table where logdate="2015-03-24" limit 1000""")
>
> and the error message:
>
> [ERROR] [03/25/2015 16:08:22.379] [sparkDriver-scheduler-27]
> [ActorSystem(sparkDriver)] Uncaught fatal error from thread
> [sparkDriver-scheduler-27] shutting down ActorSystem [sparkDriver]
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> The master's heap is set with -Xms512m -Xmx512m, while the workers' is set
> with -Xms4096M -Xmx4096M, which I presumed was sufficient for this trivial
> query.
>
> Additionally, after restarting the spark-shell and re-running the query with
> limit 5, the df object is returned and can be printed by df.show(), but other
> APIs fail with OutOfMemoryError, namely df.count(),
> df.select("some_field").show() and so forth.
>
> I understand that the RDD can be collected to the master so that further
> transformations can be applied. Since DataFrame has "richer optimizations
> under the hood" and a familiar convention for an R/Julia user, I really hope
> this error can be tackled and that DataFrame is robust enough to depend on.
>
> Thanks in advance!
>
> REGARDS,
> Todd