I'm trying to understand Spark's disk I/O patterns -- specifically, I'd like to reduce the number of files written during shuffle operations. A couple of questions:
* Is the amount of file I/O performed during a shuffle independent of the memory I allocate for shuffles?
* If it is, what is this memory actually used for, and is there any way to see how much of it is being used?
* How can I minimize the number of files written during a shuffle? With 24 cores per node, the filesystem can't handle that much simultaneous I/O very well, which effectively limits the number of cores I can use.

Thanks for any insight you might have!
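P.S. For context, here is a minimal sketch of the shuffle-related settings I've been experimenting with. The config keys are from the Spark 1.x configuration docs; the values are placeholders, not tuned settings:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-io-test")
      // Fraction of executor heap used for shuffle buffers (default 0.2).
      .set("spark.shuffle.memoryFraction", "0.3")
      // With the hash shuffle manager, reuse one output file per reduce
      // partition per core instead of one per map task (default false).
      .set("spark.shuffle.consolidateFiles", "true")
      // The sort-based shuffle manager writes a single file per map task.
      .set("spark.shuffle.manager", "sort")

    val sc = new SparkContext(conf)

My (possibly wrong) understanding is that consolidateFiles and the sort-based manager are the two knobs that directly affect file counts, while memoryFraction only affects how often buffers spill to disk -- which is partly what I'm asking about above.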