Hi all,
I'm running the teraSort benchmark with a relatively small input set: 5 GB.
During profiling, I can see that I am using a total of 68 GB. I have a terabyte
of memory in my system, and have set
spark.executor.memory 900g
spark.driver.memory 900g
I use the defaults for
spark.shuffle.memoryFraction
spark.storage.memoryFraction
I believe that I now have 0.2 * 900 = 180 GB for shuffle and 0.6 * 900 = 540 GB
for storage.
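For reference, the same configuration written out as a SparkConf, with the
fraction defaults stated explicitly (a sketch only: the values are the ones from
this mail, and the fraction properties are the pre-1.6 names):

    import org.apache.spark.SparkConf

    // Sketch of the configuration described above. In practice
    // spark.driver.memory has to be known before the driver JVM starts, so it
    // is normally set in spark-defaults.conf or via --driver-memory rather
    // than programmatically.
    val conf = new SparkConf()
      .setAppName("teraSort")
      .set("spark.executor.memory", "900g")
      .set("spark.driver.memory", "900g")
      .set("spark.shuffle.memoryFraction", "0.2") // default: 0.2 * 900 = 180 GB for shuffle
      .set("spark.storage.memoryFraction", "0.6") // default: 0.6 * 900 = 540 GB for storage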
I noticed a lot of variation in runtime (under the same load), and tracked it
down to this function in
core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala:
    private def spillToPartitionFiles(collection: SizeTrackingPairCollection[(Int, K), C]): Unit = {
      spillToPartitionFiles(collection.iterator)
    }
In a slow run it passes through this function 12,000 times, in a fast run only
700 times, even though the settings in both runs are the same and there are no
other users on the system. When I look at the function calling it (insertAll,
also in ExternalSorter), I see that spillToPartitionFiles is only called 700
times in both the fast and the slow runs, meaning that the function recursively
calls itself very often. Because of the function name, I assume the system is
spilling to disk. Since I have sufficient memory, I assume that I forgot to set
a certain memory setting. Does anybody have an idea which other setting I have
to change, so that no data is spilled in this scenario?
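To make the question concrete: would something like the following be the
intended kind of change? This is only a sketch, assuming the pre-1.6 shuffle
settings apply here, and I have not verified that either property is the right
knob:

    // Hypothetical experiment, not a verified fix: either disable spilling
    // outright, or give the shuffle a larger share of the heap than the
    // default 0.2.
    val noSpill = conf
      .set("spark.shuffle.spill", "false")        // skip spilling; risks OOM if size estimates are off
      .set("spark.shuffle.memoryFraction", "0.4") // alternatively, raise the shuffle fraction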
Thanks,
Tom