Hi, I am using *Spark SQL* to query my *Hive cluster*, following the Spark SQL and DataFrame Guide <https://spark.apache.org/docs/latest/sql-programming-guide.html> step by step. However, my HiveQL query via sqlContext.sql() fails with a java.lang.OutOfMemoryError. The expected result of the query should be small, since I added a limit 1000 clause. My code is shown below:
    scala> import sqlContext.implicits._
    scala> val df = sqlContext.sql("""select * from some_table where logdate="2015-03-24" limit 1000""")

and the error message:

    [ERROR] [03/25/2015 16:08:22.379] [sparkDriver-scheduler-27] [ActorSystem(sparkDriver)]
    Uncaught fatal error from thread [sparkDriver-scheduler-27] shutting down ActorSystem [sparkDriver]
    java.lang.OutOfMemoryError: GC overhead limit exceeded

The master heap is set with -Xms512m -Xmx512m, and the workers with -Xms4096M -Xmx4096M, which I presume is sufficient for this trivial query.

Additionally, after restarting the spark-shell and re-running the query with limit 5, the df object is returned and can be printed by df.show(), but other APIs fail with an OutOfMemoryError, namely df.count(), df.select("some_field").show(), and so forth.

I understand that an RDD can be collected to the master so that further transformations can be applied. Given that DataFrames promise "richer optimizations under the hood" and a familiar convention for an R/Julia user, I really hope this error can be tackled and that the DataFrame API is robust enough to depend on. Thanks in advance!

Regards,
Todd
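
P.S. In case it helps, here is the exact sequence that reproduces the behaviour on a fresh shell (a minimal repro of what I describe above; some_table and some_field are placeholders for my actual Hive table and column):

    scala> import sqlContext.implicits._
    scala> val df = sqlContext.sql("""select * from some_table where logdate="2015-03-24" limit 5""")
    scala> df.show()                       // works: prints the 5 rows
    scala> df.count()                      // fails: java.lang.OutOfMemoryError
    scala> df.select("some_field").show()  // fails: java.lang.OutOfMemoryError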
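
P.P.S. Assuming the -Xms/-Xmx flags above end up on the driver and executor JVMs respectively, I believe the equivalent shell launch would be roughly:

    ./bin/spark-shell --driver-memory 512m --executor-memory 4g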