testdf.persist(pyspark.storagelevel.StorageLevel.MEMORY_ONLY_SER): maybe the StorageLevel should be changed. Also check your config "spark.memory.storageFraction", whose default value is 0.5.
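For example, a minimal, untested sketch (assuming Spark 2.x and the same "tablename" table from your snippet; MEMORY_AND_DISK is only a suggestion for a safer StorageLevel, not something I have verified against your data):

=================================================================
from pyspark import StorageLevel
from pyspark.sql import SparkSession

sparkSession = (
    SparkSession.builder
    .config("spark.rdd.compress", "true")
    .config("spark.serializer",
            "org.apache.spark.serializer.KryoSerializer")
    # 0.5 is already the default: the share of unified memory
    # (spark.memory.fraction) reserved for cached blocks vs. execution
    .config("spark.memory.storageFraction", "0.5")
    .appName("test")
    .enableHiveSupport()
    .getOrCreate()
)

testdf = sparkSession.sql("select * from tablename")
# blocks that do not fit in executor memory spill to disk
# instead of triggering an OOM
testdf.persist(StorageLevel.MEMORY_AND_DISK)
testdf.count()  # materializes the cache
=================================================================

You could also try raising spark.executor.memory on the small cluster, since 2.5 GB per executor leaves Spark far less unified memory than the EMR cluster where you measured the 62 MB cached size.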
2017-07-28 3:04 GMT+08:00 Gourav Sengupta <gourav.sengu...@gmail.com>:
> Hi,
>
> I cached a table in a large EMR cluster and it has a size of 62 MB.
> Therefore I know the size of the table while cached.
>
> But when I try to cache the table in a smaller cluster, which still
> has a total of 3 GB driver memory and two executors with close to 2.5 GB
> memory each, the job keeps failing with JVM out of memory errors.
>
> Is there something that I am missing?
>
> CODE:
> =================================================================
> sparkSession = spark.builder \
>     .config("spark.rdd.compress", "true") \
>     .config("spark.serializer",
>             "org.apache.spark.serializer.KryoSerializer") \
>     .config("spark.executor.extraJavaOptions",
>             "-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps") \
>     .appName("test").enableHiveSupport().getOrCreate()
>
> testdf = sparkSession.sql("select * from tablename")
> testdf.persist(pyspark.storagelevel.StorageLevel.MEMORY_ONLY_SER)
> =================================================================
>
> This causes a JVM out of memory error.
>
>
> Regards,
> Gourav Sengupta