[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-19 Thread tejasapatil
Github user tejasapatil closed the pull request at:

https://github.com/apache/spark/pull/14475



[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-19 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14475#discussion_r75440822
  
--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -31,6 +34,9 @@
  * of the file format).
  */
 public final class UnsafeSorterSpillReader extends UnsafeSorterIterator implements Closeable {
+  private static final Logger logger = LoggerFactory.getLogger(UnsafeSorterSpillReader.class);
+  private static final int DEFAULT_BUFFER_SIZE_BYTES = 8192; // 8 kb
--- End diff --

should we bump the default to something higher too? 



[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-19 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14475#discussion_r75440799
  
--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -50,7 +56,21 @@ public UnsafeSorterSpillReader(
       File file,
       BlockId blockId) throws IOException {
     assert (file.length() > 0);
-    final BufferedInputStream bs = new BufferedInputStream(new FileInputStream(file));
+    long bufferSizeBytes =
+        SparkEnv.get() == null ?
+            DEFAULT_BUFFER_SIZE_BYTES :
+            SparkEnv.get().conf().getSizeAsBytes("spark.unsafe.sorter.spill.reader.buffer.size",
+                DEFAULT_BUFFER_SIZE_BYTES);
+    if (bufferSizeBytes > MAX_BUFFER_SIZE_BYTES || bufferSizeBytes < DEFAULT_BUFFER_SIZE_BYTES) {
+      // fall back to a sane default value
+      logger.warn("Value of config \"spark.unsafe.sorter.spill.reader.buffer.size\" = {} not in " +
+          "allowed range [{}, {}). Falling to default value : {} bytes", bufferSizeBytes,
--- End diff --

"Falling back to"



[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-09 Thread tejasapatil
GitHub user tejasapatil reopened a pull request:

https://github.com/apache/spark/pull/14475

[SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

## What changes were proposed in this pull request?

Jira: https://issues.apache.org/jira/browse/SPARK-16862

The `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k buffer to read data off disk. This PR makes the buffer size configurable to improve on-disk reads. I have kept the default value the same as before (8k) so there is no change in current behavior.

## How was this patch tested?

I am relying on the existing unit tests.

## Performance

After deploying this change to prod and setting the config to 1 MB, there was a 12% reduction in CPU time and a 19.5% reduction in CPU reservation time.
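
For reference, a minimal sketch of how a user might set this config (the key name comes from this PR; the class name and the `1m` value are illustrative, mirroring the prod setting above):

```java
import org.apache.spark.SparkConf;

public class SpillBufferConfigExample {
  public static void main(String[] args) {
    // getSizeAsBytes understands suffixed values, so "1m" means 1 megabyte.
    // UnsafeSorterSpillReader reads this key via SparkEnv.get().conf().
    SparkConf conf = new SparkConf()
        .setAppName("spill-buffer-demo")
        .set("spark.unsafe.sorter.spill.reader.buffer.size", "1m");

    System.out.println(conf.get("spark.unsafe.sorter.spill.reader.buffer.size"));
  }
}
```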

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tejasapatil/spark spill_buffer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14475.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14475


commit 6d6feef965d07cb18b6806b588c649637f52eb56
Author: Tejas Patil 
Date:   2016-08-03T01:00:46Z

[SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

commit cac4ebe1fa31f79e8c460bc9d86995a686659b65
Author: Tejas Patil 
Date:   2016-08-03T05:35:16Z

review comments #1

commit 6ce70a9995287aa19a042a80c77ecdbb1f56fe4f
Author: Tejas Patil 
Date:   2016-08-03T07:35:55Z

Handle test case failure due to Sparkenv being null

```
[error] Test org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite.spillInIterator failed: java.lang.NullPointerException: null, took 0.008 sec
[error] at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:60)
[error] at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
[error] at org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.advanceToNextPage(BytesToBytesMap.java:291)
[error] at org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.next(BytesToBytesMap.java:320)
[error] at org.apache.spark.unsafe.map.AbstractBytesToBytesMapSuite.spillInIterator(AbstractBytesToBytesMapSuite.java:577)
[error] ...
```
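
The fix, visible in the updated diff earlier in this thread, is to guard against a null `SparkEnv` and fall back to the default. A minimal standalone sketch of that guard (the wrapper class and method are mine, for illustration):

```java
import org.apache.spark.SparkEnv;

final class SpillReaderBufferSize {
  private static final int DEFAULT_BUFFER_SIZE_BYTES = 8192; // 8 kb

  /** Null-guarded config read: tests may run without a live SparkEnv. */
  static long bufferSizeBytes() {
    return SparkEnv.get() == null
        ? DEFAULT_BUFFER_SIZE_BYTES
        : SparkEnv.get().conf().getSizeAsBytes(
            "spark.unsafe.sorter.spill.reader.buffer.size",
            DEFAULT_BUFFER_SIZE_BYTES);
  }
}
```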





[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-09 Thread tejasapatil
Github user tejasapatil closed the pull request at:

https://github.com/apache/spark/pull/14475



[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-02 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/14475#discussion_r73283038
  
--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -50,7 +55,19 @@ public UnsafeSorterSpillReader(
       File file,
       BlockId blockId) throws IOException {
     assert (file.length() > 0);
-    final BufferedInputStream bs = new BufferedInputStream(new FileInputStream(file));
+    long bufferSizeBytes =
+        SparkEnv.get().conf().getSizeAsBytes("spark.unsafe.sorter.spill.reader.buffer.size",
+            DEFAULT_BUFFER_SIZE_BYTES);
+    if (bufferSizeBytes > Integer.MAX_VALUE) {
+      // fall back to a sane default value
+      logger.error("Value of config \"spark.unsafe.sorter.spill.reader.buffer.size\" exceeds " +
--- End diff --

1. Changed to warn.
2. Falling back won't kill the job; it moves on, which offers more resilience.
3. Yeah, a large buffer (2 GB) is a bad idea. On some level this is the user shooting themselves in the foot; in that case the job would theoretically hit OOM and fail. I can define a conservative upper bound, but I'm not sure what that number should be, since it would depend on the system. I am picking 16 MB as the max allowed value (no science here), but I can't see any case where anyone would want to exceed that. Thoughts?
4. 0 and negative values will cause the `BufferedInputStream` constructor to fail: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/io/BufferedInputStream.java#198 . I have changed it so that this is handled within Spark code itself, which gives better resilience plus a warning message (see the sketch below).
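
To make points 3 and 4 concrete, here is a minimal standalone sketch of the fallback logic, assuming the constants from the diff (the class and method names are mine; the actual patch inlines this in the constructor):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class BufferSizeCheck {
  private static final Logger logger = LoggerFactory.getLogger(BufferSizeCheck.class);
  private static final int DEFAULT_BUFFER_SIZE_BYTES = 8192;         // 8 kb
  private static final int MAX_BUFFER_SIZE_BYTES = 16 * 1024 * 1024; // proposed 16 MB cap

  /** Clamp an out-of-range requested size back to the default, with a warning. */
  static long sanitize(long requestedBytes) {
    if (requestedBytes > MAX_BUFFER_SIZE_BYTES || requestedBytes < DEFAULT_BUFFER_SIZE_BYTES) {
      // This also covers 0 and negative values, which would otherwise make
      // BufferedInputStream's constructor throw IllegalArgumentException.
      logger.warn("Requested spill reader buffer size {} not in [{}, {}]; falling back to default",
          requestedBytes, DEFAULT_BUFFER_SIZE_BYTES, MAX_BUFFER_SIZE_BYTES);
      return DEFAULT_BUFFER_SIZE_BYTES;
    }
    return requestedBytes;
  }
}
```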

Meta question: Do you think it would be valuable to have some form of config validation (e.g. range checks) at the start of the job? I think it would be better to do such checks right at the start of the job (on the driver side) rather than somewhere in the middle of execution, when Spark actually reads the config. I think this would be helpful for Spark in general: jobs with bad config values should not end up reserving resources on a cluster and wasting CPU. Plus, it's immediate feedback for the user, rather than waiting until the job hits the issue and then digging through logs to figure out the problem. A hypothetical sketch of such a check follows.
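
As an illustration of what such driver-side validation might look like (entirely hypothetical; nothing like this exists in Spark, and all names here are made up):

```java
import org.apache.spark.SparkConf;

/** Hypothetical driver-side range check, run before the job reserves cluster resources. */
final class ConfValidator {
  static void checkSizeInRange(SparkConf conf, String key, long minBytes, long maxBytes) {
    if (!conf.contains(key)) {
      return; // unset: the built-in default applies
    }
    long value = conf.getSizeAsBytes(key);
    if (value < minBytes || value > maxBytes) {
      // Fail fast on the driver instead of mid-execution on an executor.
      throw new IllegalArgumentException(String.format(
          "%s = %d is outside the allowed range [%d, %d]", key, value, minBytes, maxBytes));
    }
  }
}
```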



[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-02 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/14475#discussion_r73283047
  
--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -31,6 +34,8 @@
  * of the file format).
  */
 public final class UnsafeSorterSpillReader extends UnsafeSorterIterator implements Closeable {
+  private final static Logger logger = LoggerFactory.getLogger(UnsafeSorterSpillReader.class);
+  private final static int DEFAULT_BUFFER_SIZE_BYTES = 8192;
--- End diff --

changed



[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-02 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14475#discussion_r73278229
  
--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -50,7 +55,19 @@ public UnsafeSorterSpillReader(
       File file,
       BlockId blockId) throws IOException {
     assert (file.length() > 0);
-    final BufferedInputStream bs = new BufferedInputStream(new FileInputStream(file));
+    long bufferSizeBytes =
+        SparkEnv.get().conf().getSizeAsBytes("spark.unsafe.sorter.spill.reader.buffer.size",
+            DEFAULT_BUFFER_SIZE_BYTES);
+    if (bufferSizeBytes > Integer.MAX_VALUE) {
+      // fall back to a sane default value
+      logger.error("Value of config \"spark.unsafe.sorter.spill.reader.buffer.size\" exceeds " +
--- End diff --

warning? because this isn't a fatal error.
Is it reasonable to fall back? I'd imagine that specifying an incredibly 
large buffer is just an error. What about 0 or negative values?



[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-02 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14475#discussion_r73278175
  
--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -31,6 +34,8 @@
  * of the file format).
  */
 public final class UnsafeSorterSpillReader extends UnsafeSorterIterator implements Closeable {
+  private final static Logger logger = LoggerFactory.getLogger(UnsafeSorterSpillReader.class);
+  private final static int DEFAULT_BUFFER_SIZE_BYTES = 8192;
--- End diff --

Tiny nit: `private static final` is canonical ordering in Java



[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-02 Thread tejasapatil
GitHub user tejasapatil opened a pull request:

https://github.com/apache/spark/pull/14475

[SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

## What changes were proposed in this pull request?

Jira: https://issues.apache.org/jira/browse/SPARK-16862

The `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k buffer to read data off disk. This PR makes the buffer size configurable to improve on-disk reads. I have kept the default value the same as before (8k) so there is no change in current behavior.

## How was this patch tested?

I am relying on the existing unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tejasapatil/spark spill_buffer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14475.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14475


commit 6d6feef965d07cb18b6806b588c649637f52eb56
Author: Tejas Patil 
Date:   2016-08-03T01:00:46Z

[SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`



