Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/14475#discussion_r73283038

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -50,7 +55,19 @@ public UnsafeSorterSpillReader(
      File file,
      BlockId blockId) throws IOException {
    assert (file.length() > 0);
-    final BufferedInputStream bs = new BufferedInputStream(new FileInputStream(file));
+    long bufferSizeBytes =
+        SparkEnv.get().conf().getSizeAsBytes("spark.unsafe.sorter.spill.reader.buffer.size",
+            DEFAULT_BUFFER_SIZE_BYTES);
+    if (bufferSizeBytes > Integer.MAX_VALUE) {
+      // fall back to a sane default value
+      logger.error("Value of config \"spark.unsafe.sorter.spill.reader.buffer.size\" exceeds " +
--- End diff --

1. Changed to warn.
2. The fallback won't kill the job; it just moves on, offering more resilience.
3. Yeah, a large buffer (2 GB) is a bad idea. On some level this is the user shooting themselves in the foot; in that case the job would theoretically hit OOM and fail. I can define a conservative upper bound, but I'm not sure what that number should be as it would depend on the system. I am picking 16 MB as the max allowed value (no science here), but I can't see any case where anyone would want to exceed that. Thoughts?
4. 0 and negative values will cause the `BufferedInputStream` constructor to fail: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/io/BufferedInputStream.java#198 . I have changed that so it is handled within Spark code itself, giving better resilience plus a warn message.

Meta question: do you think it would be valuable to have some form of config validation (e.g. range checks) at the start of the job? I think it would be better to do such checks right at the start of the job (on the driver side) rather than somewhere in the middle of execution when Spark actually reads the config. I think this would be helpful for Spark in general.
Jobs with bad config values should not end up reserving resources on a cluster and wasting CPU. Plus, it's immediate feedback for the user rather than waiting until the job hits the issue and then digging through logs to figure out the problem.
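To make the fallback behaviour in points 2-4 concrete, here is a minimal sketch of the clamping logic being discussed. The class and method names (`BufferSizeClamp`, `clampBufferSize`) and the constant values are illustrative assumptions, not the actual patch:

```java
// Hypothetical sketch of the fallback logic discussed above; the names
// BufferSizeClamp / clampBufferSize and the constants are assumptions.
public class BufferSizeClamp {
    static final long DEFAULT_BUFFER_SIZE_BYTES = 1024 * 1024;   // assumed 1 MB default
    static final long MAX_BUFFER_SIZE_BYTES = 16L * 1024 * 1024; // proposed 16 MB cap

    // Clamp a user-supplied size: zero, negative, or oversized values fall
    // back to the default instead of letting BufferedInputStream's
    // constructor throw IllegalArgumentException mid-job.
    static int clampBufferSize(long requestedBytes) {
        if (requestedBytes <= 0 || requestedBytes > MAX_BUFFER_SIZE_BYTES) {
            // in the real patch this would be logger.warn(...)
            System.err.println("Invalid buffer size " + requestedBytes
                + " bytes; falling back to " + DEFAULT_BUFFER_SIZE_BYTES);
            return (int) DEFAULT_BUFFER_SIZE_BYTES;
        }
        return (int) requestedBytes;
    }
}
```

With this shape, both the >2 GB case and the <=0 case are handled in one place, and the job keeps running on the default buffer size rather than failing.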
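The meta question about driver-side validation could look something like the sketch below: register an allowed range per config key and check everything up front, before any cluster resources are reserved. This is purely hypothetical API, not anything that exists in Spark:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical driver-side config range validator; the ConfigValidator
// class and its methods are assumptions for illustration only.
public class ConfigValidator {
    // config key -> allowed [min, max] range (inclusive)
    private final Map<String, long[]> ranges = new HashMap<>();

    public void addRange(String key, long min, long max) {
        ranges.put(key, new long[]{min, max});
    }

    // Returns one error message per out-of-range value; an empty list
    // means all registered configs are valid and the job may proceed.
    public List<String> validate(Map<String, Long> conf) {
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, long[]> e : ranges.entrySet()) {
            Long value = conf.get(e.getKey());
            if (value != null
                    && (value < e.getValue()[0] || value > e.getValue()[1])) {
                errors.add(e.getKey() + "=" + value + " is outside ["
                    + e.getValue()[0] + ", " + e.getValue()[1] + "]");
            }
        }
        return errors;
    }
}
```

Running this once at job submission gives the user the immediate feedback described above, instead of a failure buried in executor logs mid-execution.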