JoshRosen commented on a change in pull request #25953: [SPARK-29244][Core] Prevent freed page in BytesToBytesMap free again URL: https://github.com/apache/spark/pull/25953#discussion_r329323330
########## File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java ##########

```diff
@@ -787,8 +783,16 @@ private void allocate(int capacity) {
     assert (capacity >= 0);
     capacity = Math.max((int) Math.min(MAX_CAPACITY, ByteArrayMethods.nextPowerOf2(capacity)), 64);
     assert (capacity <= MAX_CAPACITY);
-    longArray = allocateArray(capacity * 2L);
-    longArray.zeroOut();
+    try {
+      longArray = allocateArray(capacity * 2L);
+      longArray.zeroOut();
+    } catch (SparkOutOfMemoryError e) {
+      // When OOM, allocated page was already freed by `TaskMemoryManager`.
+      // We should not keep it in `longArray`. Otherwise it might be freed again in task
+      // complete listeners and cause unnecessary error.
+      longArray = null;
```

Review comment: If the `allocate()` call throws `SparkOutOfMemoryError` then I don't think the original `longArray = <new array>` here would change `longArray`; instead, I think that `longArray` would continue to point to the old array. If we simply `null` out `longArray` here then I think we may lose our reference to the old array, but the Javadoc of this `allocate()` method says:

```java
/**
 * Allocate new data structures for this map. When calling this outside of the constructor,
 * make sure to keep references to the old data structures so that you can free them.
 *
 * @param capacity the new map capacity
 */
private void allocate(int capacity) {
```

and it looks like the callers are responsible for keeping an old reference and managing cleanup. With this in mind, I'd like to dig into how the old code worked to see if we can gain a clearer understanding of the root cause of the bug. There are three calls to `allocate()` in this file:

- in the constructor,
- in `growAndRehash()`, and
- in `reset()`.

Let's look at each in turn and see whether this patch's changes modify those call sites' behavior in the case where a `SparkOutOfMemoryError` is thrown.

The constructor call is unaffected because `longArray` is initially `null`.
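To make that caller-managed cleanup contract concrete, here is a minimal, self-contained sketch of the save-and-restore pattern. All names are illustrative (a plain `long[]` stands in for `LongArray`, and a thrown `OutOfMemoryError` simulates `SparkOutOfMemoryError`); this is not Spark's actual API.

```java
// Hypothetical sketch of the caller-managed cleanup contract from the
// Javadoc: the caller snapshots the old reference before re-allocating,
// and restores it if allocation fails.
public class CallerManagedCleanup {
    static long[] table = new long[4];

    // Simulates allocate(): replaces `table`, or throws to model OOM.
    static void allocate(int capacity, boolean simulateOom) {
        if (simulateOom) {
            throw new OutOfMemoryError("simulated allocation failure");
        }
        table = new long[capacity];
    }

    // The caller keeps the old reference so it can restore it on failure,
    // mirroring what growAndRehash() does with oldLongArray.
    static boolean grow(int capacity, boolean simulateOom) {
        long[] old = table;
        try {
            allocate(capacity, simulateOom);
        } catch (OutOfMemoryError e) {
            table = old; // restore: the old data stays reachable and freeable
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        if (!grow(8, false) || table.length != 8) {
            throw new IllegalStateException("grow should have succeeded");
        }
        long[] before = table;
        if (grow(16, true)) {
            throw new IllegalStateException("grow should have failed");
        }
        if (table != before) {
            throw new IllegalStateException("old array reference was lost");
        }
        System.out.println("ok");
    }
}
```

Under this contract, `allocate()` itself never has to clean up: whichever reference was live before the call remains the caller's responsibility.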
In `growAndRehash()`, a `SparkOutOfMemoryError` thrown by its `allocate()` call will result in us setting `longArray = null` here and then doing a `longArray = oldLongArray` [assignment](https://github.com/apache/spark/pull/25953/files#diff-976d2d63175b5830e120d3f3b873bc76R919) to restore the old value. Given this, I think this patch's changes are a no-op w.r.t. the end state of `longArray` after an exception is thrown here. In this patch, we're setting `canGrowArray = false` to prevent continuing through to the "re-mask" step of `growAndRehash()`, but the old code never would have reached that step in the case of OOM because the `SparkOutOfMemoryError` would have been allowed to bubble up.

Finally, the `reset()` method has a `freeArray(longArray)` call, followed by a call to `allocate(initialCapacity)`. **I think this is the source of the original bug**: in the `free()` method we have

```java
public void free() {
  updatePeakMemoryUsed();
  if (longArray != null) {
    freeArray(longArray);
    longArray = null;
  }
  [...]
```

where the caller is responsible for setting `longArray = null` after freeing it. However, we appear to be missing this `longArray = null` in `reset()`:

```java
/**
 * Reset this map to initialized state.
 */
public void reset() {
  updatePeakMemoryUsed();
  numKeys = 0;
  numValues = 0;
  freeArray(longArray);
  while (dataPages.size() > 0) {
    MemoryBlock dataPage = dataPages.removeLast();
    freePage(dataPage);
  }
  allocate(initialCapacity);
  canGrowArray = true;
  currentPage = null;
  pageCursor = 0;
}
```

Here, if we throw an exception in `allocate(initialCapacity)` then `longArray` will continue to point to an already-freed array. With this patch's changes to `allocate()`'s behavior, `longArray` will be set to `null` and that will prevent the dangling pointer to freed memory. That seems like a somewhat indirect fix, though.
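The double-free scenario above can be reproduced in miniature. The sketch below is a hypothetical toy model, not Spark's actual classes: `Manager` stands in for `TaskMemoryManager` and fails loudly on a second free, `ToyMap.reset()` mirrors the free-then-allocate sequence in `reset()`, and `ToyMap.free()` mirrors the task-completion-listener path through `free()`. The `longArray = null` line is the proposed one-line fix; without it, the `free()` call after a failed `reset()` would free the same page twice.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the double-free hazard (illustrative names, not Spark's API).
public class DoubleFreeDemo {
    // Tracks live "pages" and throws if one is freed twice.
    static class Manager {
        private final Set<Long> live = new HashSet<>();
        private long nextId = 0;

        long allocate() {
            long id = nextId++;
            live.add(id);
            return id;
        }

        void free(long id) {
            if (!live.remove(id)) {
                throw new IllegalStateException("page " + id + " freed twice");
            }
        }
    }

    static class ToyMap {
        private final Manager mm;
        private Long longArray; // current array's page id, or null

        ToyMap(Manager mm) {
            this.mm = mm;
            this.longArray = mm.allocate();
        }

        // Mirrors reset(): free the old array, then try to re-allocate.
        // `simulateOom` models allocate(initialCapacity) throwing.
        void reset(boolean simulateOom) {
            mm.free(longArray);
            longArray = null; // the proposed one-line fix
            if (simulateOom) {
                throw new RuntimeException("simulated SparkOutOfMemoryError");
            }
            longArray = mm.allocate();
        }

        // Mirrors free(): without the null-out above, a stale longArray
        // would be freed a second time here.
        void free() {
            if (longArray != null) {
                mm.free(longArray);
                longArray = null;
            }
        }
    }

    public static void main(String[] args) {
        ToyMap map = new ToyMap(new Manager());
        try {
            map.reset(true); // allocation "fails" after the old array is freed
        } catch (RuntimeException ignored) { }
        map.free(); // safe: longArray was nulled, so no double free
        System.out.println("no double free");
    }
}
```

Commenting out the `longArray = null;` line in `reset()` makes the final `map.free()` call throw, which is the toy analogue of the error the task-completion listeners hit in SPARK-29244.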
Given this (if I've understood this code correctly), what do you think about simplifying this patch to instead update `reset()` to do

```java
freeArray(longArray);
longArray = null; // <--- added line
```

? I tried this one-line fix and the test case you added seems to pass.