Github user squito commented on the pull request: https://github.com/apache/spark/pull/10024#issuecomment-211424963

Hi @lianhuiwang, thanks for submitting this patch. I just have a really high-level question first. If I understand how this works correctly, the idea is that:

1) As data is added to `Spillable`s, they work exactly the same as before at first, so after all records have been inserted, they've still got a bunch of data in memory.
2) When an iterator is requested, as before it's still an iterator that merges data that has been spilled to disk with data that is still in memory.
3) If the `TaskMemoryManager` requests more memory while that iterator is in flight, then the `Spillable`s look at the position of the current iterator over the in-memory data, and spill only the _remaining_ data to disk.
4) The `Spillable`s then free the memory to the `TaskMemoryManager`, and have the in-flight iterator switch to using the new spill file.

Is this a correct understanding? If so, this seems to hinge on one key assumption: that the `Spillable`s are never iterated over multiple times. If they were iterated multiple times, then the second iteration would be incorrect -- some of the initial data in the in-memory structure would be lost on the second iteration. I think this assumption is sound -- it is implied by the `destructiveSortedIterator` in the internals, though I think the actual wrapping `Spillable`s might allow multiple iteration now. I've been trying to think of a case where this assumption would be violated but can't come up with anything. (If the same shuffle data is read multiple times in different tasks, the work on the shuffle-read side is simply repeated each time; I'm pretty sure there isn't any sharing.) But nonetheless, if that is the case, I think this deserves a lot of comments explaining how it works, assertions which make sure this assumption is not violated, and a number of tests.
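To make sure we're talking about the same thing, here is a toy sketch of the spill-only-the-remaining-data idea as I understand it. All the names (`SpillableSketch`, `forceSpill`, `destructiveIterator`) are illustrative, not Spark's actual API, and "disk" is just a `Vector` standing in for a spill file:

```scala
// Hypothetical sketch of steps 3) and 4): a mid-flight spill writes only the
// records the iterator has NOT yet returned, frees the in-memory structure,
// and the in-flight iterator transparently switches to the "spill file".
class SpillableSketch(records: Seq[Int]) {
  private var inMemory: Vector[Int] = records.toVector // in-memory data
  private var pos = 0                                  // iterator position over inMemory
  private var spillFile: Option[Vector[Int]] = None    // simulated spill file
  private var spillPos = 0                             // iterator position over spill file
  var freedBytes = 0L                                  // reported back to the memory manager

  // Called by the memory manager while the iterator is in flight.
  def forceSpill(): Long = {
    if (spillFile.isEmpty && pos < inMemory.length) {
      spillFile = Some(inMemory.drop(pos))  // spill only the remaining records
      freedBytes = inMemory.length.toLong * 4 // rough: whole structure is released
      inMemory = Vector.empty               // free the in-memory structure
      freedBytes
    } else 0L
  }

  // Destructive iterator: correct for a single pass only. A second pass would
  // miss everything consumed before the spill -- hence the key assumption.
  def destructiveIterator(): Iterator[Int] = new Iterator[Int] {
    def hasNext: Boolean = spillFile match {
      case Some(f) => spillPos < f.length
      case None    => pos < inMemory.length
    }
    def next(): Int = spillFile match {
      case Some(f) => val v = f(spillPos); spillPos += 1; v
      case None    => val v = inMemory(pos); pos += 1; v
    }
  }
}
```

Consuming two records, forcing a spill, and then draining the iterator should yield the same overall sequence as an uninterrupted pass, which is the invariant the tests would need to pin down.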
FWIW, I started down the path of writing something similar *without* that assumption -- when a spill was requested on an in-flight iterator, the *entire* in-memory structure would get spilled to disk, the in-flight iterator would switch to the spilled data, and it would advance to the same location in the spilled data that it had reached in the in-memory data. This was pretty convoluted, and as I started writing tests I realized there were corner cases that needed work. So I decided to submit the simpler change instead. It seems much easier to do it your way. I do have some tests which I think I can add as well -- lemme dig those up and send them later today.
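For contrast, the core of that alternative (spill everything, then fast-forward) can be sketched in a few lines -- again purely illustrative names, with a `Vector` standing in for the spill file. The fiddly part in practice is keeping the consumed-count exact across merged spill and in-memory sources, which is where the corner cases came from:

```scala
// Alternative approach: on a mid-flight spill, write the ENTIRE in-memory
// structure out, then re-advance a fresh disk iterator past the records the
// old in-memory iterator had already returned.
def spillAllAndResume[T](inMemory: Vector[T], consumed: Int): Iterator[T] = {
  val onDisk = inMemory           // simulate spilling the whole structure
  onDisk.iterator.drop(consumed)  // fast-forward to the old iterator's position
}
```

Note this re-writes records that will never be read again, whereas the approach in this patch spills only what the iterator still has to produce.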