Same error with the new code:

import org.apache.spark.sql.hive.HiveContext

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

val df = ctx.jsonFile("file:///Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz")
df.registerTempTable("training")

val dfCount = ctx.sql("select count(*) as cnt from training")
println(dfCount.first.getLong(0))

/Sim

Simeon Simeonov, Founder & CTO, Swoop <http://swoop.com/>
@simeons <http://twitter.com/simeons> | blog.simeonov.com <http://blog.simeonov.com/> | 617.299.6746

From: Yin Huai <yh...@databricks.com>
Date: Thursday, July 2, 2015 at 4:34 PM
To: Simeon Simeonov <s...@swoop.com>
Cc: user <user@spark.apache.org>
Subject: Re: 1.4.0 regression: out-of-memory errors on small data

Hi Sim,

It seems you have already set the PermGen size to 256m, right? I noticed that in your shell you created a new HiveContext, which further increased PermGen memory consumption. However, spark-shell has already created a HiveContext for you (sqlContext; you can use asInstanceOf to access HiveContext's methods). Can you use the sqlContext created by the shell and try again?

Thanks,
Yin

On Thu, Jul 2, 2015 at 12:50 PM, Yin Huai <yh...@databricks.com> wrote:

Hi Sim,

Spark 1.4.0's memory consumption on PermGen is higher than Spark 1.3's (explained in https://issues.apache.org/jira/browse/SPARK-8776). Can you add --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=256m" to the command you use to launch spark-shell? This will increase the PermGen size from 128m (our default) to 256m.

Thanks,
Yin

On Thu, Jul 2, 2015 at 12:40 PM, sim <s...@swoop.com> wrote:

A very simple Spark SQL COUNT operation succeeds in spark-shell on 1.3.1 but fails with a series of out-of-memory errors on 1.4.0.
This gist <https://gist.github.com/ssimeonov/a49b75dc086c3ac6f3c4> includes the code and the full output from the 1.3.1 and 1.4.0 runs, including the command line showing how spark-shell is started. Should the 1.4.0 spark-shell be started with different options to avoid this problem?

Thanks,
Sim

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
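For reference, putting the two suggestions in the thread together, a launch command would look roughly like the sketch below. The SPARK_HOME location and the local master are illustrative assumptions, not taken from the thread; the --conf flag is the one Yin suggests above.

```shell
# Launch spark-shell with the driver's PermGen raised from the 128m
# default to 256m, per Yin's suggestion (JVM 7 and earlier; PermGen
# was removed in JVM 8).
$SPARK_HOME/bin/spark-shell \
  --master "local[*]" \
  --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=256m"
```

Inside the shell, reuse the sqlContext the shell already created (via asInstanceOf[HiveContext], as in the code above) rather than constructing a second HiveContext, since each HiveContext adds to PermGen consumption.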