All right, I missed the point, sorry about that. You can take a snapshot of the heap and then analyze the heap dump with MAT (Eclipse Memory Analyzer) or a similar tool. From the code alone I cannot find any clue.
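
In case it helps, here is a minimal sketch of one way to get such a heap dump from the executors: ask the JVM to write a dump when it hits the OutOfMemoryError. The two -XX flags are standard HotSpot options; the dump directory and the helper names (heap_dump_java_options, build_session) are only assumptions for illustration, and the directory has to exist and be writable on every executor node.

```python
# Sketch only: the dump directory and helper names are hypothetical.

def heap_dump_java_options(dump_dir="/tmp/spark-heap-dumps"):
    """Build JVM options that write an .hprof file when the JVM runs out of memory."""
    return " ".join([
        "-XX:+HeapDumpOnOutOfMemoryError",       # dump the heap on OutOfMemoryError
        "-XX:HeapDumpPath={}".format(dump_dir),  # where the .hprof file is written
    ])


def build_session(app_name="test", dump_dir="/tmp/spark-heap-dumps"):
    """Create a SparkSession whose executors dump their heap on OOM."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.executor.extraJavaOptions",
                heap_dump_java_options(dump_dir))
        .enableHiveSupport()
        .getOrCreate()
    )


if __name__ == "__main__":
    print(heap_dump_java_options())
```

After the job fails, the .hprof file from the failing executor can be opened in MAT. If the driver is the one running out of memory, the same flags would have to be passed at launch time (for example with --driver-java-options on spark-submit), since in client mode the driver JVM is already running when this configuration is read.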

2017-07-28 17:09 GMT+08:00 Gourav Sengupta <[email protected]>:

> Hi,
>
> I have done all of that, but my question is "why should 62 MB of data give
> a memory error when we have over 2 GB of memory available".
>
> Therefore all that is mentioned by Zhoukang is not pertinent at all.
>
>
> Regards,
> Gourav Sengupta
>
> On Fri, Jul 28, 2017 at 4:43 AM, 周康 <[email protected]> wrote:
>
>> testdf.persist(pyspark.storagelevel.StorageLevel.MEMORY_ONLY_SER)
>> Maybe the StorageLevel should change. Also check your config
>> "spark.memory.storageFraction", whose default value is 0.5.
>>
>> 2017-07-28 3:04 GMT+08:00 Gourav Sengupta <[email protected]>:
>>
>>> Hi,
>>>
>>> I cached a table in a large EMR cluster and it has a size of 62 MB.
>>> Therefore I know the size of the table while cached.
>>>
>>> But when I try to cache the table in a smaller cluster, which
>>> still has a total of 3 GB driver memory and two executors with close to
>>> 2.5 GB memory, the job still keeps failing with JVM out of memory errors.
>>>
>>> Is there something that I am missing?
>>>
>>> CODE:
>>> =================================================================
>>> sparkSession = spark.builder \
>>>     .config("spark.rdd.compress", "true") \
>>>     .config("spark.serializer",
>>>             "org.apache.spark.serializer.KryoSerializer") \
>>>     .config("spark.executor.extraJavaOptions",
>>>             "-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps") \
>>>     .appName("test").enableHiveSupport().getOrCreate()
>>>
>>> testdf = sparkSession.sql("select * from tablename")
>>> testdf.persist(pyspark.storagelevel.StorageLevel.MEMORY_ONLY_SER)
>>> =================================================================
>>>
>>> This causes a JVM out of memory error.
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>
>>
>
