[GitHub] spark pull request #14726: [SPARK-16862] Configurable buffer size in `Unsafe...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14726

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/14726#discussion_r75960177

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -22,15 +22,21 @@
 import com.google.common.io.ByteStreams;
 import com.google.common.io.Closeables;
+import org.apache.spark.SparkEnv;
 import org.apache.spark.serializer.SerializerManager;
 import org.apache.spark.storage.BlockId;
 import org.apache.spark.unsafe.Platform;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;

 /**
  * Reads spill files written by {@link UnsafeSorterSpillWriter} (see that class for a description
  * of the file format).
  */
 public final class UnsafeSorterSpillReader extends UnsafeSorterIterator implements Closeable {
+  private static final Logger logger = LoggerFactory.getLogger(UnsafeSorterSpillReader.class);
+  private static final int DEFAULT_BUFFER_SIZE_BYTES = 1024 * 1024;  // 1 MB
--- End diff --

@rxin : Ok. I will do that change separately.
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14726#discussion_r75574332

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java (same hunk as quoted above) ---

Yes, hierarchical merging sounds good. Actually, we might even want to manage the memory for these buffers more explicitly (i.e. ask the memory manager for spark.unsafe.sorter.spill.reader.buffer.size bytes for every spill file we open).
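A minimal sketch of the idea rxin raises here: acquire each reader's buffer from a memory manager before opening the spill file, and release it on close, so that the buffers are accounted for instead of being allocated untracked. `SimpleMemoryManager` and `AccountedSpillBuffer` are illustrative names for this sketch, not Spark's actual TaskMemoryManager API.

```java
// Sketch: account for each spill reader's buffer via a memory manager.
import java.io.Closeable;

class SimpleMemoryManager {
    private final long capacity;
    private long used = 0;

    SimpleMemoryManager(long capacity) { this.capacity = capacity; }

    // Grant up to `requested` bytes, bounded by the remaining capacity.
    synchronized long acquire(long requested) {
        long granted = Math.min(requested, capacity - used);
        used += granted;
        return granted;
    }

    synchronized void release(long bytes) { used -= bytes; }

    synchronized long used() { return used; }
}

class AccountedSpillBuffer implements Closeable {
    private final SimpleMemoryManager mm;
    private final byte[] buffer;

    // Ask for the configured buffer size; shrink to whatever was granted.
    AccountedSpillBuffer(SimpleMemoryManager mm, int requestedSize) {
        this.mm = mm;
        this.buffer = new byte[(int) mm.acquire(requestedSize)];
    }

    int size() { return buffer.length; }

    @Override
    public void close() { mm.release(buffer.length); }
}
```

With this shape, a job that opens hundreds of spill readers would degrade gracefully (smaller buffers) once the budget is exhausted, instead of risking an OOM from untracked 1 MB allocations.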
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/14726#discussion_r75573049

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java (same hunk as quoted above) ---

@rxin : In response to [0], I have changed the default to 1 MB. In my experiments, 1 MB gave good performance, and we are using it as the default for all prod jobs.

One concern / proposal: with this change, UnsafeSorterSpillReader would consume more memory than before, since the buffer grows from 8 KB to 1 MB. The overall footprint per UnsafeSorterSpillReader object would grow from 2.5 MB to 3.6 MB (I profiled to get these numbers; see [1]). Jobs that spill a lot create many of these spill readers (in the screenshot, there were 400+ readers). The current merging approach is to open all the spill files at the same time and merge them all at once using a priority queue. Having many of these objects in memory can lead to OOMs, since there is no accounting for the buffers allocated inside UnsafeSorterSpillReader (even without this change, Snappy already had its own buffers for compressed and uncompressed data).

Also, from the disk's point of view, having many files open at the same time would lead to random seeks and won't play well with the OS's cache for disk reads. One might say that users should tune the job so that it spills less, but that might not be obvious to people who do not understand the system internals. Also, for pipelines the data changes every day, so one setting may not work every time. Should we add some kind of hierarchical merging, wherein spill files are iteratively merged in batches? It could be turned on when there are, say, more than 100 spill files to be merged. AFAIK, Hadoop has this.

[0] : https://github.com/apache/spark/pull/14475#discussion_r75440822
[1] : https://postimg.org/image/cs5zr6lyx/
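The hierarchical merging proposed above can be sketched as follows: rather than merging all sorted runs at once, merge them in batches of at most `fanIn` runs, producing intermediate runs until one remains, so at most `fanIn` inputs (and their read buffers) are open at any time. Plain integer lists stand in for spill files here; this is an illustration of the technique, not Spark's (or Hadoop's) actual merge code.

```java
// Sketch of batched (hierarchical) k-way merging of sorted runs.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class BatchedMerger {
    // Standard k-way merge of already-sorted lists via a priority queue.
    // Each heap entry is {value, runIndex, positionInRun}.
    static List<Integer> mergeOnce(List<List<Integer>> runs) {
        PriorityQueue<int[]> pq =
            new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) pq.add(new int[]{runs.get(i).get(0), i, 0});
        }
        List<Integer> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] e = pq.poll();
            out.add(e[0]);
            List<Integer> run = runs.get(e[1]);
            int pos = e[2] + 1;
            if (pos < run.size()) pq.add(new int[]{run.get(pos), e[1], pos});
        }
        return out;
    }

    // Merge runs in batches of at most `fanIn` until a single run remains,
    // bounding how many inputs are open simultaneously.
    static List<Integer> hierarchicalMerge(List<List<Integer>> runs, int fanIn) {
        while (runs.size() > 1) {
            List<List<Integer>> next = new ArrayList<>();
            for (int i = 0; i < runs.size(); i += fanIn) {
                next.add(mergeOnce(runs.subList(i, Math.min(i + fanIn, runs.size()))));
            }
            runs = next;
        }
        return runs.isEmpty() ? new ArrayList<>() : runs.get(0);
    }
}
```

The trade-off is the classic external-sort one: a smaller fan-in caps memory and open file handles but re-reads and re-writes intermediate runs, adding I/O passes.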
GitHub user tejasapatil opened a pull request: https://github.com/apache/spark/pull/14726

[SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

## What changes were proposed in this pull request?

Jira: https://issues.apache.org/jira/browse/SPARK-16862

The `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8 KB buffer to read data off disk. This PR makes the buffer size configurable to improve disk reads. I have made the default value 1 MB, as I observed improved performance with that value.

## How was this patch tested?

I am relying on the existing unit tests.

## Performance

After deploying this change to prod and setting the config to 1 MB, there was a 12% reduction in CPU time and a 19.5% reduction in CPU reservation time.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark spill_buffer_2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14726.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14726

commit c4f37b6c8d3f1a8a565b1f215f55a501edece778
Author: Tejas Patil
Date: 2016-08-20T05:06:03Z

    [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`
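The core of the change described in this PR can be sketched as: read the buffer size from configuration (with a 1 MB default and a sanity bound), then pass it to the `BufferedInputStream` constructor instead of relying on the JDK's 8 KB default. The config key mirrors the one discussed in review; this sketch uses plain `java.util.Properties` for the lookup rather than Spark's `SparkConf`, and the helper names are illustrative.

```java
// Sketch: configurable BufferedInputStream buffer size for spill reads.
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

class SpillReaderBufferConfig {
    static final int DEFAULT_BUFFER_SIZE_BYTES = 1024 * 1024;      // 1 MB
    static final int MAX_BUFFER_SIZE_BYTES = 16 * 1024 * 1024;     // sanity cap

    // Resolve the buffer size from config, falling back to the default
    // when the key is absent or the value is out of range.
    static int bufferSize(Properties conf) {
        String v = conf.getProperty("spark.unsafe.sorter.spill.reader.buffer.size");
        if (v == null) return DEFAULT_BUFFER_SIZE_BYTES;
        int size = Integer.parseInt(v.trim());
        if (size <= 0 || size > MAX_BUFFER_SIZE_BYTES) return DEFAULT_BUFFER_SIZE_BYTES;
        return size;
    }

    // Open a spill file with the configured read-ahead buffer.
    static BufferedInputStream open(File spillFile, Properties conf) throws IOException {
        return new BufferedInputStream(new FileInputStream(spillFile), bufferSize(conf));
    }
}
```

A larger buffer turns many small 8 KB reads into fewer sequential 1 MB reads, which is where the reported CPU and reservation-time wins plausibly come from; the cost, as the review thread notes, is 1 MB of heap per open spill reader.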