Same error with the new code:

import org.apache.spark.sql.hive.HiveContext

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

val df = ctx.jsonFile("file:///Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz")
df.registerTempTable("training")

val dfCount = ctx.sql("select count(*) as cnt from training")
println(dfCount.first.getLong(0))

/Sim

Simeon Simeonov, Founder & CTO, Swoop <http://swoop.com/>
@simeons <http://twitter.com/simeons> | blog.simeonov.com <http://blog.simeonov.com/> | 617.299.6746

From: Yin Huai <yh...@databricks.com>
Date: Thursday, July 2, 2015 at 4:34 PM
To: Simeon Simeonov <s...@swoop.com>
Cc: user <user@spark.apache.org>
Subject: Re: 1.4.0 regression: out-of-memory errors on small data

Hi Sim,

It seems you have already set the PermGen size to 256m, right? I noticed that in your shell you created a new HiveContext, which further increased PermGen memory consumption. However, spark-shell has already created a HiveContext for you (sqlContext; you can use asInstanceOf to access HiveContext's methods). Can you use the sqlContext created by the shell and try again?

Thanks,
Yin

On Thu, Jul 2, 2015 at 12:50 PM, Yin Huai <yh...@databricks.com> wrote:

Hi Sim,

Spark 1.4.0's memory consumption on PermGen is higher than Spark 1.3's (explained in https://issues.apache.org/jira/browse/SPARK-8776). Can you add --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=256m" to the command you use to launch spark-shell? This will increase the PermGen size from 128m (our default) to 256m.

Thanks,
Yin

On Thu, Jul 2, 2015 at 12:40 PM, sim <s...@swoop.com> wrote:

A very simple Spark SQL COUNT operation succeeds in spark-shell on 1.3.1 but fails with a series of out-of-memory errors on 1.4.0.
This gist <https://gist.github.com/ssimeonov/a49b75dc086c3ac6f3c4> includes the code and the full output from the 1.3.1 and 1.4.0 runs, including the command line showing how spark-shell is started. Should the 1.4.0 spark-shell be started with different options to avoid this problem?

Thanks,
Sim

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
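For reference, putting the two suggestions in the thread together, a launch command would look roughly like the sketch below. The SPARK_HOME location and the local master are illustrative assumptions, not taken from the thread; the --conf flag is the one Yin suggests above.

```shell
# Launch spark-shell with the driver's PermGen raised from the 128m
# default to 256m, per Yin's suggestion (JVM 7 and earlier; PermGen
# was removed in JVM 8).
$SPARK_HOME/bin/spark-shell \
  --master "local[*]" \
  --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=256m"
```

Inside the shell, reuse the sqlContext the shell already created (via asInstanceOf[HiveContext], as in the code above) rather than constructing a second HiveContext, since each HiveContext adds to PermGen consumption.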