Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/14475#discussion_r73283038

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -50,7 +55,19 @@ public UnsafeSorterSpillReader(
      File file,
      BlockId blockId) throws IOException {
    assert (file.length() > 0);
-    final BufferedInputStream bs = new BufferedInputStream(new FileInputStream(file));
+    long bufferSizeBytes =
+        SparkEnv.get().conf().getSizeAsBytes("spark.unsafe.sorter.spill.reader.buffer.size",
+            DEFAULT_BUFFER_SIZE_BYTES);
+    if (bufferSizeBytes > Integer.MAX_VALUE) {
+      // fall back to a sane default value
+      logger.error("Value of config \"spark.unsafe.sorter.spill.reader.buffer.size\" exceeds " +
--- End diff --

1. Changed to warn.
2. The fallback won't kill the job; it just moves on, offering more resilience.
3. Yeah, a large buffer (2 GB) is a bad idea. On some level this is the user shooting themselves in the foot; in that case the job would theoretically hit OOM and fail. I can define a conservative upper bound, but I'm not sure what that number should be as it would depend on the system. I am picking 16 MB as the max allowed value (no science here), but I can't see any case where anyone would want to exceed that. Thoughts?
4. 0 and negative values will cause the `BufferedInputStream` constructor to fail: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/io/BufferedInputStream.java#198 . I have changed that so it is handled within Spark code itself, giving better resilience plus a warn message.

Meta question: do you think it would be valuable to have some form of config validation (e.g. range checks) at the start of the job? I think it would be better to do such checks right at the start of the job (on the driver side) rather than somewhere in the middle of execution when Spark actually reads the config. I think this would be helpful for Spark in general.
Jobs with bad config values should not end up reserving resources on a cluster and wasting CPU. Plus, it's immediate feedback for the user rather than waiting until the job hits the issue and then digging through logs to figure out the problem.
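To make the fallback behaviour in points 2-4 concrete, here is a minimal sketch of the clamping logic being discussed. The class and method names (`BufferSizeClamp`, `clampBufferSize`) and the constant values are illustrative assumptions, not the actual patch:

```java
// Hypothetical sketch of the fallback logic discussed above; the names
// BufferSizeClamp / clampBufferSize and the constants are assumptions.
public class BufferSizeClamp {
    static final long DEFAULT_BUFFER_SIZE_BYTES = 1024 * 1024;   // assumed 1 MB default
    static final long MAX_BUFFER_SIZE_BYTES = 16L * 1024 * 1024; // proposed 16 MB cap

    // Clamp a user-supplied size: zero, negative, or oversized values fall
    // back to the default instead of letting BufferedInputStream's
    // constructor throw IllegalArgumentException mid-job.
    static int clampBufferSize(long requestedBytes) {
        if (requestedBytes <= 0 || requestedBytes > MAX_BUFFER_SIZE_BYTES) {
            // in the real patch this would be logger.warn(...)
            System.err.println("Invalid buffer size " + requestedBytes
                + " bytes; falling back to " + DEFAULT_BUFFER_SIZE_BYTES);
            return (int) DEFAULT_BUFFER_SIZE_BYTES;
        }
        return (int) requestedBytes;
    }
}
```

With this shape, both the >2 GB case and the <=0 case are handled in one place, and the job keeps running on the default buffer size rather than failing.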
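The meta question about driver-side validation could look something like the sketch below: register an allowed range per config key and check everything up front, before any cluster resources are reserved. This is purely hypothetical API, not anything that exists in Spark:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical driver-side config range validator; the ConfigValidator
// class and its methods are assumptions for illustration only.
public class ConfigValidator {
    // config key -> allowed [min, max] range (inclusive)
    private final Map<String, long[]> ranges = new HashMap<>();

    public void addRange(String key, long min, long max) {
        ranges.put(key, new long[]{min, max});
    }

    // Returns one error message per out-of-range value; an empty list
    // means all registered configs are valid and the job may proceed.
    public List<String> validate(Map<String, Long> conf) {
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, long[]> e : ranges.entrySet()) {
            Long value = conf.get(e.getKey());
            if (value != null
                    && (value < e.getValue()[0] || value > e.getValue()[1])) {
                errors.add(e.getKey() + "=" + value + " is outside ["
                    + e.getValue()[0] + ", " + e.getValue()[1] + "]");
            }
        }
        return errors;
    }
}
```

Running this once at job submission gives the user the immediate feedback described above, instead of a failure buried in executor logs mid-execution.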