JoshRosen commented on a change in pull request #25953: [SPARK-29244][Core] Prevent freed page in BytesToBytesMap free again
URL: https://github.com/apache/spark/pull/25953#discussion_r329323330
 
 

 ##########
 File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
 ##########
 @@ -787,8 +783,16 @@ private void allocate(int capacity) {
     assert (capacity >= 0);
     capacity = Math.max((int) Math.min(MAX_CAPACITY, ByteArrayMethods.nextPowerOf2(capacity)), 64);
     assert (capacity <= MAX_CAPACITY);
-    longArray = allocateArray(capacity * 2L);
-    longArray.zeroOut();
+    try {
+      longArray = allocateArray(capacity * 2L);
+      longArray.zeroOut();
+    } catch (SparkOutOfMemoryError e) {
+      // When OOM, allocated page was already freed by `TaskMemoryManager`.
+      // We should not keep it in `longArray`. Otherwise it might be freed again in task
+      // complete listeners and cause unnecessary error.
+      longArray = null;
 
 Review comment:
   If this call throws `SparkOutOfMemoryError`, then I don't think the assignment here will change `longArray`; instead, I think that `longArray` will continue to point to the old array. If we simply `null` out `longArray` here, then I think we may lose our reference to the old array, but the Javadoc of this `allocate()` method says:
   
   ```java
     /**
      * Allocate new data structures for this map. When calling this outside of the constructor,
      * make sure to keep references to the old data structures so that you can free them.
      *
      * @param capacity the new map capacity
      */
     private void allocate(int capacity) {
   ```
   
   and it looks like the callers are responsible for keeping an old reference.
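   
   (To make the first point above concrete: in Java, if the right-hand side of an assignment throws, the assignment itself never executes and the variable keeps its previous value. A minimal standalone illustration, not Spark code, with made-up names:)
   
   ```java
   // Minimal illustration (made-up names, not Spark code): when the RHS of an
   // assignment throws, the LHS variable is left untouched.
   static Object failingAllocate() {
     throw new RuntimeException("simulated allocation failure");
   }
   
   static void demo() {
     Object ref = "old array";
     try {
       ref = failingAllocate();  // throws before any value is assigned to `ref`
     } catch (RuntimeException e) {
       // `ref` still points to "old array": the old reference survives the
       // exception, which is why nulling it out in a catch block can lose it.
     }
   }
   ```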
   
   There are three calls to `allocate()` in this file:
   
   - In the constructor
   - In `growAndRehash()`
   - In `reset()`
   
   Let's look at each in turn and see whether this patch's changes modify those 
call sites' behavior in the case where a `SparkOutOfMemoryError` exception is 
thrown:
   
   The constructor call is unaffected because `longArray` is initially `null`.
   
   In the `growAndRehash()` case, a `SparkOutOfMemoryError` thrown by its `allocate()` call will result in us setting `longArray = null` here and then doing a `longArray = oldLongArray` [assignment](https://github.com/apache/spark/pull/25953/files#diff-976d2d63175b5830e120d3f3b873bc76R919) to restore the old value. Given this, I think this patch's changes are a no-op w.r.t. the end state of `longArray` after an exception is thrown here. In this patch, we're setting `canGrowArray = false` to prevent continuing through to the "re-mask" step of `growAndRehash()`, but the old code would never have reached that step on OOM either, because the `SparkOutOfMemoryError` would have been allowed to bubble up.
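   
   (Roughly, the control flow I'm describing looks like this; this is a simplified sketch of `growAndRehash()`'s shape under this patch, not the exact source, and names like `oldCapacity` are illustrative:)
   
   ```java
   // Simplified sketch (not the exact Spark source): growAndRehash() keeps a
   // reference to the old array before calling allocate(), per the Javadoc
   // contract, and restores it if allocation fails.
   void growAndRehash() {
     final LongArray oldLongArray = longArray;   // caller keeps the old reference
     final int oldCapacity = (int) oldLongArray.size() / 2;
     try {
       allocate(oldCapacity * 2);                // may throw SparkOutOfMemoryError
     } catch (SparkOutOfMemoryError e) {
       canGrowArray = false;                     // this patch: disable further growth
       longArray = oldLongArray;                 // restore, so the null-out in allocate() is a no-op
       return;
     }
     // "Re-mask" step: re-insert entries from oldLongArray into the new longArray,
     // then free the old array. Never reached when allocate() throws.
     // ...
     freeArray(oldLongArray);
   }
   ```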
   
   Finally, the `reset()` method has a `freeArray(longArray)` call, followed by 
a call to `allocate(initialCapacity)`. **I think this is the source of the 
original bug**:
   
   In the `free()` method we have
   ```java
   public void free() {
     updatePeakMemoryUsed();
     if (longArray != null) {
       freeArray(longArray);
       longArray = null;
     }
     [...]
   ```
   
   where the caller of `freeArray()` is responsible for setting `longArray = null` after freeing the array. However, this `longArray = null` assignment appears to be missing in `reset()`: there, we have
   
   ```java
     /**
      * Reset this map to initialized state.
      */
     public void reset() {
       updatePeakMemoryUsed();
       numKeys = 0;
       numValues = 0;
       freeArray(longArray);
       while (dataPages.size() > 0) {
         MemoryBlock dataPage = dataPages.removeLast();
         freePage(dataPage);
       }
       allocate(initialCapacity);
       canGrowArray = true;
       currentPage = null;
       pageCursor = 0;
     }
   ```
   
   Here, if `allocate(initialCapacity)` throws, then `longArray` will continue to point to an already-freed array. With this patch's changes to `allocate()`'s behavior, `longArray` will be set to `null` instead, which prevents the dangling reference to freed memory.
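   
   (Concretely, the dangling reference bites when cleanup later calls `free()`. A sketch of the failure sequence, narrated in comments based on the code quoted above; `map` is just an illustrative `BytesToBytesMap` instance:)
   
   ```java
   // Sketch of the failure sequence implied by the quoted code (not exact code):
   map.reset();
   //   -> freeArray(longArray);        // the array is freed, but the field is not nulled
   //   -> allocate(initialCapacity);   // throws SparkOutOfMemoryError
   //   => longArray still points at the already-freed array
   
   map.free();                          // e.g. invoked from a task-completion listener
   //   -> if (longArray != null)       // true: the stale reference survived
   //        freeArray(longArray);      // frees the same array a second time
   ```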
   
   That seems like a somewhat indirect fix, though.
   
   Given this (if I've understood this code correctly), what do you think about 
simplifying this patch to instead update `reset()` to do
   
   ```java
       freeArray(longArray);
       longArray = null;  // <--- added line
   ```
   
   ? I tried this one-line fix and the test case you added seems to pass.
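   
   For context, `reset()` with that one line added would look like this (the same code quoted above, plus the new line):
   
   ```java
   public void reset() {
     updatePeakMemoryUsed();
     numKeys = 0;
     numValues = 0;
     freeArray(longArray);
     longArray = null;  // <--- added: drop the stale reference before reallocating
     while (dataPages.size() > 0) {
       MemoryBlock dataPage = dataPages.removeLast();
       freePage(dataPage);
     }
     allocate(initialCapacity);
     canGrowArray = true;
     currentPage = null;
     pageCursor = 0;
   }
   ```
   
   With that in place, even if `allocate(initialCapacity)` throws, `free()`'s `longArray != null` guard skips the second `freeArray()` call, so the double free can't happen.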
