Hi,
I cached in a table in a large EMR cluster and it has a size of 62 MB.
Therefore I know the size of the table while cached.
But when I am trying to cache in the table in smaller cluster which still
has a total of 3 GB Driver memory and two executors with close to 2.5 GB
memory the job still keeps on failing giving JVM out of memory errors.
Is there something that I am missing?
CODE:
=================================================================
sparkSession = spark.builder \
.config("spark.rdd.compress", "true") \
.config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
.config("spark.executor.extraJavaOptions","-XX:+UseCompressedOops
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps") \
.appName("test").enableHiveSupport().getOrCreate()
testdf = sparkSession.sql("select * from tablename")
testdf.persist(pyspark.storagelevel.StorageLevel.MEMORY_ONLY_SER)
=================================================================
This causes JVM out of memory error.
Regards,
Gourav Sengupta