Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14726#discussion_r75573049

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -22,15 +22,21 @@
 import com.google.common.io.ByteStreams;
 import com.google.common.io.Closeables;
+import org.apache.spark.SparkEnv;
 import org.apache.spark.serializer.SerializerManager;
 import org.apache.spark.storage.BlockId;
 import org.apache.spark.unsafe.Platform;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;

 /**
  * Reads spill files written by {@link UnsafeSorterSpillWriter} (see that class for a description
  * of the file format).
  */
 public final class UnsafeSorterSpillReader extends UnsafeSorterIterator implements Closeable {
+  private static final Logger logger = LoggerFactory.getLogger(UnsafeSorterSpillReader.class);
+  private static final int DEFAULT_BUFFER_SIZE_BYTES = 1024 * 1024; // 1 MB
--- End diff --

@rxin: In response to [0], I have changed the default to 1 MB. In my experiments, 1 MB gave good performance, and we use it as the default for all our production jobs.

One concern / proposal: with this change, UnsafeSorterSpillReader consumes more memory than before, since the read buffer grows from 8 KB to 1 MB. The overall per-object footprint of UnsafeSorterSpillReader grows from about 2.5 MB to 3.6 MB (I profiled this; see [1]). For jobs that spill heavily, many of these spill readers get created; in the screenshot there were 400+ readers, which works out to roughly 400 x 3.6 MB ~ 1.4 GB of buffer memory.

The current merging approach opens all the spill files at the same time and merges them in a single pass using a priority queue. Having many of these objects in memory can lead to OOMs, because there is no memory accounting for the buffers allocated inside UnsafeSorterSpillReader (even without this change, Snappy already kept its own buffers for compressed and uncompressed data). From the disk's point of view, having many files open at the same time also leads to random seeks and doesn't play well with the OS's cache for disk reads.

One might say that users should tune their jobs so that they spill less, but that may not be obvious to people who don't understand the system internals. Also, for pipelines the data changes every day, and one setting may not work every time.

Should we add some kind of hierarchical merging, wherein spill files are iteratively merged in batches? It could be turned on when there are, say, more than 100 spill files to be merged. AFAIK, Hadoop has this. (A toy sketch of the idea follows the references below.)

[0] : https://github.com/apache/spark/pull/14475#discussion_r75440822
[1] : https://postimg.org/image/cs5zr6lyx/
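To make the batched-merge proposal concrete, here is a minimal, self-contained Java sketch. This is not Spark's actual API: the class, method names, and thresholds are made up for illustration, and sorted in-memory runs stand in for spill files. The point is only that merging in bounded batches caps how many "readers" (and their ~3.6 MB buffers) are open at any one time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Toy illustration of hierarchical merging (hypothetical, not Spark code).
 *  Sorted runs stand in for spill files. */
public class HierarchicalMergeSketch {

  // k-way merge of sorted runs via a priority queue, analogous to the
  // current single-pass merge over all spill readers.
  static List<Long> mergeBatch(List<List<Long>> runs) {
    // each queue entry: {headValue, runIndex, positionInRun}
    PriorityQueue<long[]> pq = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
    for (int r = 0; r < runs.size(); r++) {
      if (!runs.get(r).isEmpty()) pq.add(new long[]{runs.get(r).get(0), r, 0});
    }
    List<Long> out = new ArrayList<>();
    while (!pq.isEmpty()) {
      long[] e = pq.poll();
      out.add(e[0]);
      int r = (int) e[1], next = (int) e[2] + 1;
      if (next < runs.get(r).size()) pq.add(new long[]{runs.get(r).get(next), r, next});
    }
    return out;
  }

  // Hierarchical variant: while more than `threshold` runs remain, merge them
  // in batches of `batchSize`, so at most `batchSize` runs are open per pass.
  static List<Long> hierarchicalMerge(List<List<Long>> runs, int threshold, int batchSize) {
    while (runs.size() > threshold) {
      List<List<Long>> next = new ArrayList<>();
      for (int i = 0; i < runs.size(); i += batchSize) {
        next.add(mergeBatch(runs.subList(i, Math.min(i + batchSize, runs.size()))));
      }
      runs = next;
    }
    return mergeBatch(runs); // final pass opens at most `threshold` runs
  }

  public static void main(String[] args) {
    List<List<Long>> runs = List.of(
        List.of(1L, 4L, 9L), List.of(2L, 3L, 8L), List.of(5L, 6L, 7L));
    // threshold=2, batchSize=2 forces one intermediate pass before the final merge
    System.out.println(hierarchicalMerge(runs, 2, 2)); // [1, 2, 3, 4, 5, 6, 7, 8, 9]
  }
}
```

In Spark terms, each intermediate pass would presumably write a new spill file (via something like UnsafeSorterSpillWriter) and close that batch's readers before the next batch opens, so only a bounded number of buffers are live at once. The trade-off is extra disk I/O for the intermediate passes, which is why it should only kick in above a spill-file threshold.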