Hi all,

I'm running the teraSort benchmark with a relatively small input set: 5GB. During profiling, I can see I am using a total of 68GB. I've got a terabyte of memory in my system, and set

  spark.executor.memory 900g
  spark.driver.memory 900g

I use the defaults for

  spark.shuffle.memoryFraction
  spark.storage.memoryFraction

I believe that I now have 0.2*900=180GB for shuffle and 0.6*900=540GB for storage.
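Spelled out, the estimate above is just the following arithmetic. This is only a sketch: it assumes each fraction applies directly to the configured 900g heap, which may not be exactly how Spark derives its limits. The object and value names are made up for illustration and are not Spark APIs.

```scala
// Sketch of the memory estimate from this message.
// Assumption: each fraction is applied directly to the configured
// executor heap size, with no further factor in between.
object MemoryEstimate {
  val executorMemoryGb: Double = 900.0 // spark.executor.memory 900g
  val shuffleFraction: Double  = 0.2   // default spark.shuffle.memoryFraction
  val storageFraction: Double  = 0.6   // default spark.storage.memoryFraction

  // Estimated memory available for shuffle, in GB.
  def shuffleGb: Double = executorMemoryGb * shuffleFraction

  // Estimated memory available for storage, in GB.
  def storageGb: Double = executorMemoryGb * storageFraction

  def main(args: Array[String]): Unit = {
    println(f"shuffle: $shuffleGb%.0f GB")
    println(f"storage: $storageGb%.0f GB")
  }
}
```

Either figure is well above the 68GB seen during profiling, which is why the spilling is unexpected.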
I noticed a lot of variation in runtime (under the same load), and tracked this down to this function in core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala:

  private def spillToPartitionFiles(collection: SizeTrackingPairCollection[(Int, K), C]): Unit = {
    spillToPartitionFiles(collection.iterator)
  }

In a slow run, it would loop through this function 12000 times; in a fast run, only 700 times, even though the settings in both runs are the same and there are no other users on the system. When I look at the function calling this (insertAll, also in ExternalSorter), I see that spillToPartitionFiles is only called 700 times in both fast and slow runs, meaning that the function recursively calls itself very often.

Because of the function name, I assume the system is spilling to disk. As I have sufficient memory, I assume that I forgot to set a certain memory setting. Does anybody have an idea which other setting I have to set in order to not spill data in this scenario?

Thanks,
Tom