[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `Unsafe...
Github user tejasapatil closed the pull request at: https://github.com/apache/spark/pull/14475

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `Unsafe...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14475#discussion_r75440822

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -31,6 +34,9 @@
  * of the file format).
  */
 public final class UnsafeSorterSpillReader extends UnsafeSorterIterator implements Closeable {
+  private static final Logger logger = LoggerFactory.getLogger(UnsafeSorterSpillReader.class);
+  private static final int DEFAULT_BUFFER_SIZE_BYTES = 8192; // 8 kb
--- End diff ---

should we bump the default to something higher too?
[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `Unsafe...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14475#discussion_r75440799

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -50,7 +56,21 @@
   public UnsafeSorterSpillReader(
       File file,
       BlockId blockId) throws IOException {
     assert (file.length() > 0);
-    final BufferedInputStream bs = new BufferedInputStream(new FileInputStream(file));
+    long bufferSizeBytes =
+        SparkEnv.get() == null ?
+            DEFAULT_BUFFER_SIZE_BYTES :
+            SparkEnv.get().conf().getSizeAsBytes("spark.unsafe.sorter.spill.reader.buffer.size",
+                DEFAULT_BUFFER_SIZE_BYTES);
+    if (bufferSizeBytes > MAX_BUFFER_SIZE_BYTES || bufferSizeBytes < DEFAULT_BUFFER_SIZE_BYTES) {
+      // fall back to a sane default value
+      logger.warn("Value of config \"spark.unsafe.sorter.spill.reader.buffer.size\" = {} not in " +
+          "allowed range [{}, {}). Falling to default value : {} bytes", bufferSizeBytes,
--- End diff ---

"Falling back to"
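The range-check-and-fall-back logic under review above can be sketched as a small standalone method. This is an illustrative reconstruction, not the PR's actual code: the class name and method name are mine, and the 16 MB cap is taken from the author's comment later in this thread.

```java
// Sketch of the fallback discussed above: accept a configured buffer size only
// when it falls in [DEFAULT_BUFFER_SIZE_BYTES, MAX_BUFFER_SIZE_BYTES],
// otherwise fall back to the default (the PR also logs a warning at this point).
public class SpillReaderBufferSize {
  static final int DEFAULT_BUFFER_SIZE_BYTES = 8192;         // 8 KB, as in the diff
  static final int MAX_BUFFER_SIZE_BYTES = 16 * 1024 * 1024; // 16 MB cap from the review thread

  // Returns a safe buffer size: the configured value if in range, else the default.
  static int chooseBufferSize(long configuredBytes) {
    if (configuredBytes > MAX_BUFFER_SIZE_BYTES || configuredBytes < DEFAULT_BUFFER_SIZE_BYTES) {
      return DEFAULT_BUFFER_SIZE_BYTES;
    }
    return (int) configuredBytes;
  }

  public static void main(String[] args) {
    System.out.println(chooseBufferSize(1024 * 1024));             // in range: 1048576
    System.out.println(chooseBufferSize(4L * 1024 * 1024 * 1024)); // too large: 8192
    System.out.println(chooseBufferSize(0));                       // too small: 8192
  }
}
```

Working in `long` until the range check has passed avoids overflow for values above `Integer.MAX_VALUE`, which is what the earlier revision of the diff guarded against.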
[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `Unsafe...
GitHub user tejasapatil reopened a pull request: https://github.com/apache/spark/pull/14475

[SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

## What changes were proposed in this pull request?

Jira: https://issues.apache.org/jira/browse/SPARK-16862

`BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k buffer to read data off disk. This PR makes that buffer size configurable to improve disk reads. The default value is kept the same as before (8k), so there is no change in current behavior.

## How was this patch tested?

I am relying on the existing unit tests.

## Performance

After deploying this change to prod and setting the config to 1 MB, there was a 12% reduction in CPU time and a 19.5% reduction in CPU reservation time.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark spill_buffer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14475.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14475

commit 6d6feef965d07cb18b6806b588c649637f52eb56
Author: Tejas Patil
Date: 2016-08-03T01:00:46Z

    [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

commit cac4ebe1fa31f79e8c460bc9d86995a686659b65
Author: Tejas Patil
Date: 2016-08-03T05:35:16Z

    review comments #1

commit 6ce70a9995287aa19a042a80c77ecdbb1f56fe4f
Author: Tejas Patil
Date: 2016-08-03T07:35:55Z

    Handle test case failure due to SparkEnv being null

```
[error] Test org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite.spillInIterator failed: java.lang.NullPointerException: null, took 0.008 sec
[error]     at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:60)
[error]     at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
[error]     at org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.advanceToNextPage(BytesToBytesMap.java:291)
[error]     at org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.next(BytesToBytesMap.java:320)
[error]     at org.apache.spark.unsafe.map.AbstractBytesToBytesMapSuite.spillInIterator(AbstractBytesToBytesMapSuite.java:577)
[error]     ...
```
[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `Unsafe...
Github user tejasapatil closed the pull request at: https://github.com/apache/spark/pull/14475
[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `Unsafe...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/14475#discussion_r73283038

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -50,7 +55,19 @@
   public UnsafeSorterSpillReader(
       File file,
       BlockId blockId) throws IOException {
     assert (file.length() > 0);
-    final BufferedInputStream bs = new BufferedInputStream(new FileInputStream(file));
+    long bufferSizeBytes =
+        SparkEnv.get().conf().getSizeAsBytes("spark.unsafe.sorter.spill.reader.buffer.size",
+            DEFAULT_BUFFER_SIZE_BYTES);
+    if (bufferSizeBytes > Integer.MAX_VALUE) {
+      // fall back to a sane default value
+      logger.error("Value of config \"spark.unsafe.sorter.spill.reader.buffer.size\" exceeds " +
--- End diff ---

1. Changed to warn.
2. Falling back won't kill the job and lets it move on, offering more resilience.
3. Yeah, a large buffer (2 GB) is a bad idea. On some level this is the user shooting themselves in the foot; in that case the job would theoretically hit OOM and fail. I can define a conservative upper bound, but I'm not sure what that number should be, since it depends on the system. I am picking 16 MB as the max allowed value (no science here), but I can't see any case where anyone would want to exceed that. Thoughts?
4. Zero and negative values will cause `BufferedInputStream`'s constructor to fail: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/io/BufferedInputStream.java#198 . I have changed the code to handle that within Spark itself, so there is better resilience plus a warn message.

Meta question: Do you think it would be valuable to have some form of config validation (e.g. range checks) at the start of the job? I think it would be better to do such checks right at the start of the job (on the driver side) rather than somewhere in the middle of execution when Spark actually reads the config. This would be helpful for Spark in general. Jobs with bad config values should not end up reserving resources on a cluster and wasting CPU. Plus, it's immediate feedback for the user, rather than waiting until the job hits the issue and then going through logs to figure out the problem.
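Point 4 above can be verified in a few lines against the JDK: `BufferedInputStream`'s two-argument constructor rejects non-positive buffer sizes with an `IllegalArgumentException`, which is why the PR validates the value in Spark code rather than letting the constructor fail mid-job. A minimal standalone check (my own snippet, not code from the PR):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;

public class BufferSizeCheck {
  public static void main(String[] args) {
    ByteArrayInputStream in = new ByteArrayInputStream(new byte[]{1, 2, 3});
    try {
      // BufferedInputStream(InputStream, int) throws for size <= 0
      new BufferedInputStream(in, 0);
      System.out.println("no exception");
    } catch (IllegalArgumentException e) {
      // The JDK's message for this case is "Buffer size <= 0"
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```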
[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `Unsafe...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/14475#discussion_r73283047

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -31,6 +34,8 @@
  * of the file format).
  */
 public final class UnsafeSorterSpillReader extends UnsafeSorterIterator implements Closeable {
+  private final static Logger logger = LoggerFactory.getLogger(UnsafeSorterSpillReader.class);
+  private final static int DEFAULT_BUFFER_SIZE_BYTES = 8192;
--- End diff ---

changed
[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `Unsafe...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14475#discussion_r73278229

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -50,7 +55,19 @@
   public UnsafeSorterSpillReader(
       File file,
       BlockId blockId) throws IOException {
     assert (file.length() > 0);
-    final BufferedInputStream bs = new BufferedInputStream(new FileInputStream(file));
+    long bufferSizeBytes =
+        SparkEnv.get().conf().getSizeAsBytes("spark.unsafe.sorter.spill.reader.buffer.size",
+            DEFAULT_BUFFER_SIZE_BYTES);
+    if (bufferSizeBytes > Integer.MAX_VALUE) {
+      // fall back to a sane default value
+      logger.error("Value of config \"spark.unsafe.sorter.spill.reader.buffer.size\" exceeds " +
--- End diff ---

Warning? Because this isn't a fatal error. Is it reasonable to fall back? I'd imagine that specifying an incredibly large buffer is just an error. What about 0 or negative values?
[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `Unsafe...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14475#discussion_r73278175

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -31,6 +34,8 @@
  * of the file format).
  */
 public final class UnsafeSorterSpillReader extends UnsafeSorterIterator implements Closeable {
+  private final static Logger logger = LoggerFactory.getLogger(UnsafeSorterSpillReader.class);
+  private final static int DEFAULT_BUFFER_SIZE_BYTES = 8192;
--- End diff ---

Tiny nit: `private static final` is canonical ordering in Java
[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `Unsafe...
GitHub user tejasapatil opened a pull request: https://github.com/apache/spark/pull/14475

[SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

## What changes were proposed in this pull request?

Jira: https://issues.apache.org/jira/browse/SPARK-16862

`BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k buffer to read data off disk. This PR makes that buffer size configurable to improve disk reads. The default value is kept the same as before (8k), so there is no change in current behavior.

## How was this patch tested?

I am relying on the existing unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark spill_buffer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14475.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14475

commit 6d6feef965d07cb18b6806b588c649637f52eb56
Author: Tejas Patil
Date: 2016-08-03T01:00:46Z

    [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`
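The new config is read through `SparkConf.getSizeAsBytes`, which accepts size strings with unit suffixes (for example the `1m` value used in the author's production test earlier in this thread). A simplified sketch of that suffix-to-bytes mapping, assuming only plain byte counts and k/m/g suffixes (this is not Spark's implementation):

```java
// Illustrative parser in the spirit of SparkConf.getSizeAsBytes: maps "8k",
// "1m", "2g", or a plain byte count to a number of bytes. Spark's real parser
// supports more suffixes and defaults; this sketch covers only the basics.
public class SizeStrings {
  static long sizeAsBytes(String s) {
    String v = s.trim().toLowerCase();
    long multiplier = 1;
    char last = v.charAt(v.length() - 1);
    if (last == 'k') multiplier = 1024L;
    else if (last == 'm') multiplier = 1024L * 1024;
    else if (last == 'g') multiplier = 1024L * 1024 * 1024;
    if (multiplier != 1) v = v.substring(0, v.length() - 1);
    return Long.parseLong(v) * multiplier;
  }

  public static void main(String[] args) {
    System.out.println(sizeAsBytes("8k")); // 8192
    System.out.println(sizeAsBytes("1m")); // 1048576
  }
}
```

With a setting such as `spark.unsafe.sorter.spill.reader.buffer.size=1m`, the spill reader would then wrap its `FileInputStream` in a `BufferedInputStream` of 1048576 bytes instead of the 8192-byte default.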