[GitHub] spark pull request #15056: [SPARK-17503][Core] Fix memory leak in Memory sto...

2016-09-12 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15056


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15056: [SPARK-17503][Core] Fix memory leak in Memory sto...

2016-09-12 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15056#discussion_r78426904
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -663,31 +663,43 @@ private[spark] class MemoryStore(
 private[storage] class PartiallyUnrolledIterator[T](
 memoryStore: MemoryStore,
 unrollMemory: Long,
-unrolled: Iterator[T],
+private[this] var unrolled: Iterator[T],
 rest: Iterator[T])
   extends Iterator[T] {
 
-  private[this] var unrolledIteratorIsConsumed: Boolean = false
-  private[this] var iter: Iterator[T] = {
-val completionIterator = CompletionIterator[T, Iterator[T]](unrolled, {
-  unrolledIteratorIsConsumed = true
-  memoryStore.releaseUnrollMemoryForThisTask(MemoryMode.ON_HEAP, unrollMemory)
-})
-completionIterator ++ rest
--- End diff --

oh, yea, you are right.





[GitHub] spark pull request #15056: [SPARK-17503][Core] Fix memory leak in Memory sto...

2016-09-12 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15056#discussion_r78425128
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -663,31 +663,43 @@ private[spark] class MemoryStore(
 private[storage] class PartiallyUnrolledIterator[T](
 memoryStore: MemoryStore,
 unrollMemory: Long,
-unrolled: Iterator[T],
+private[this] var unrolled: Iterator[T],
 rest: Iterator[T])
   extends Iterator[T] {
 
-  private[this] var unrolledIteratorIsConsumed: Boolean = false
-  private[this] var iter: Iterator[T] = {
-val completionIterator = CompletionIterator[T, Iterator[T]](unrolled, {
-  unrolledIteratorIsConsumed = true
-  memoryStore.releaseUnrollMemoryForThisTask(MemoryMode.ON_HEAP, unrollMemory)
-})
-completionIterator ++ rest
--- End diff --

I think the problem here is that the completion iterator is releasing the 
bookkeeping memory for the iterator as soon as the iterator is fully iterated, 
but the on-heap objects are being retained by the reference in the `unrolled` 
field. 
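A minimal sketch of the idea, assuming a simplified stand-in for the real class: `ReleasingIterator`, `onUnrolledConsumed`, and `holdsUnrolled` are hypothetical names used only for illustration, and the callback stands in for the unroll-memory release. The point it tries to show is that the reference to `unrolled` is cleared at the same moment the bookkeeping is released, so the buffered objects become eligible for garbage collection.

```scala
// Hypothetical sketch, not Spark's implementation: an iterator that chains
// `unrolled` and `rest`, and clears its reference to `unrolled` once that
// part is exhausted so the buffered objects can be garbage collected.
class ReleasingIterator[T](
    private[this] var unrolled: Iterator[T], // set to null after consumption
    rest: Iterator[T],
    onUnrolledConsumed: () => Unit)          // stand-in for releasing unroll memory
  extends Iterator[T] {

  private def releaseIfConsumed(): Unit = {
    if (unrolled != null && !unrolled.hasNext) {
      unrolled = null        // drop the strong reference to the buffered values
      onUnrolledConsumed()   // release the bookkeeping exactly once
    }
  }

  override def hasNext: Boolean = {
    releaseIfConsumed()
    // If `unrolled` survived the check above, it still has elements.
    if (unrolled != null) true else rest.hasNext
  }

  override def next(): T = {
    releaseIfConsumed()
    if (unrolled != null) unrolled.next() else rest.next()
  }

  // Exposed only so the sketch can be inspected; not part of any real API.
  def holdsUnrolled: Boolean = unrolled != null
}
```

In this sketch the release happens lazily, on the first `hasNext` or `next()` call after the last unrolled element has been returned; an explicit `close` method could perform the same two steps for callers that stop iterating early.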





[GitHub] spark pull request #15056: [SPARK-17503][Core] Fix memory leak in Memory sto...

2016-09-12 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15056#discussion_r78424222
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -663,31 +663,43 @@ private[spark] class MemoryStore(
 private[storage] class PartiallyUnrolledIterator[T](
 memoryStore: MemoryStore,
 unrollMemory: Long,
-unrolled: Iterator[T],
+private[this] var unrolled: Iterator[T],
 rest: Iterator[T])
   extends Iterator[T] {
 
-  private[this] var unrolledIteratorIsConsumed: Boolean = false
-  private[this] var iter: Iterator[T] = {
-val completionIterator = CompletionIterator[T, Iterator[T]](unrolled, {
-  unrolledIteratorIsConsumed = true
-  memoryStore.releaseUnrollMemoryForThisTask(MemoryMode.ON_HEAP, unrollMemory)
-})
-completionIterator ++ rest
--- End diff --

Let's see if I understand the problem.

Because BlockManager may call `close` early, we cannot rely on
`CompletionIterator` to free the memory, since we may never actually consume
all of the elements of `unrolled`, right?





[GitHub] spark pull request #15056: [SPARK-17503][Core] Fix memory leak in Memory sto...

2016-09-12 Thread clockfly
GitHub user clockfly opened a pull request:

https://github.com/apache/spark/pull/15056

[SPARK-17503][Core] Fix memory leak in Memory store when unable to cache 
the whole RDD

## What changes were proposed in this pull request?

   MemoryStore may throw an OutOfMemoryError when trying to cache a very
large RDD that cannot fit in memory.
   ```
   scala> sc.parallelize(1 to 1000, 5).map(_ => new Array[Long](1000)).cache().count

   java.lang.OutOfMemoryError: Java heap space
     at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:24)
     at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:23)
     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
     at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
     at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
     at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
     at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
     at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
     at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
     at org.apache.spark.scheduler.Task.run(Task.scala:86)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     at java.lang.Thread.run(Thread.java:745)
   ```

Spark's MemoryStore uses a SizeTrackingVector as a temporary unrolling buffer
that holds all input values read so far before they are transferred to the
cache. The problem is that when the input RDD is too big to cache, the
SizeTrackingVector used as the unrolling buffer is not garbage collected in
time. Because the SizeTrackingVector can occupy all available storage memory,
the executor JVM can quickly run out of memory.
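
As a rough illustration of the unrolling step described above, here is a toy model. It is a sketch under stated assumptions, not Spark's MemoryStore code: `UnrollSketch`, `unroll`, `budgetBytes`, and `estimateSize` are hypothetical names, and a plain `ArrayBuffer` stands in for SizeTrackingVector. Values are buffered until a memory budget is exceeded; if the input does not fit, the caller receives the buffered values together with the remaining input.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical toy model of unrolling; the names and the size estimate are
// assumptions for illustration, not Spark's MemoryStore API.
object UnrollSketch {
  // Left:  the budget was exceeded; returns (buffered values, remaining input).
  // Right: everything fit; returns the fully unrolled values.
  def unroll[T](
      input: Iterator[T],
      budgetBytes: Long,
      estimateSize: T => Long): Either[(Vector[T], Iterator[T]), Vector[T]] = {
    val buffer = ArrayBuffer.empty[T] // stand-in for the unrolling buffer
    var used = 0L
    while (input.hasNext && used <= budgetBytes) {
      val value = input.next()
      buffer += value
      used += estimateSize(value)
    }
    if (used <= budgetBytes) Right(buffer.toVector)
    else Left((buffer.toVector, input))
  }
}
```

In the `Left` case the buffered values stay reachable for as long as the caller keeps a reference to them, which is the retention this pull request is concerned with.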

## How was this patch tested?

Unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/clockfly/spark memory_store_leak

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15056.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15056


commit a9a4a8b23afc64d7e2d7426b92013442308a8ea3
Author: Sean Zhong 
Date:   2016-09-12T07:12:48Z

SPARK-17503: Fix memory leak in Memory store when unable to cache the whole 
RDD



