Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21102#discussion_r203322056
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
    @@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
        *                 to a new position (in the new data array).
        */
       def rehashIfNeeded(k: T, allocateFunc: (Int) => Unit, moveFunc: (Int, Int) => Unit) {
    -    if (_size > _growThreshold) {
    +    if (_occupied > _growThreshold) {
    --- End diff ---
    
    There is no explicit entry here - it is simply an unoccupied slot in an array.
    The slot is free; it can be reused by some other (new) entry when insert is called.
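    
    To illustrate the point (a minimal sketch with a plain java.util.BitSet standing in for OpenHashSet's internal slot bookkeeping - not the PR's actual code): once remove clears a slot's occupied bit, a later insert probing to that slot can simply claim it again, so there is nothing left to count.
    ```
    import java.util.BitSet

    // Toy slot bookkeeping: an unset bit means the slot is free.
    val occupiedBits = new BitSet(8)
    val slot = 3                     // hypothetical probe position

    occupiedBits.set(slot)           // insert claims the slot
    occupiedBits.clear(slot)         // remove frees it - no tombstone left
    assert(!occupiedBits.get(slot))  // a new entry may reuse the slot
    ```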
    
    It should be trivial to see how very bad behavior can occur even while the actual size of the set stays very small - a series of add/remove calls results in unending growth of the set.
    
    Something like this, for example, is enough to cause the set to blow up to 2B entries:
    ```
    // Assumes the remove support under review in this PR; OpenHashSet is
    // private[spark], so this only compiles inside the Spark source tree.
    import org.apache.spark.util.collection.OpenHashSet

    val set = new OpenHashSet[Int]
    var i = 0
    while (i < Int.MaxValue) {
      set.add(1)
      set.remove(1)
      assert(0 == set.size)
      i += 1
    }
    ```
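    
    To make the mechanism concrete, here is a standalone toy model (the field names `_size`, `_occupied`, and `_growThreshold` mirror the diff; the class itself is an illustrative assumption, not OpenHashSet's actual code). If remove decrements `_size` but never `_occupied`, every add/remove cycle consumes a fresh slot, so the `_occupied > _growThreshold` check keeps firing however small the live set is:
    ```
    // Toy model of growth driven by _occupied rather than _size,
    // assuming removed slots are never reclaimed or recounted.
    class ToyOpenSet {
      private var capacity = 8
      private var _size = 0        // live entries
      private var _occupied = 0    // live entries plus removed slots
      private def _growThreshold: Int = (capacity * 0.7).toInt

      def add(): Unit = {
        _size += 1
        _occupied += 1             // a fresh slot is consumed
        if (_occupied > _growThreshold) {
          capacity *= 2            // grows even though _size may be 0 or 1
        }
      }

      def remove(): Unit = {
        _size -= 1
        // _occupied is deliberately not decremented: this models the
        // removed slot still counting toward the growth check.
      }

      def size: Int = _size
      def cap: Int = capacity
    }

    val toy = new ToyOpenSet
    (1 to 1000).foreach { _ => toy.add(); toy.remove() }
    assert(toy.size == 0)              // no live entries...
    println(s"capacity = ${toy.cap}")  // ...yet capacity has grown to 2048
    ```
    Growth is monotonic here because nothing ever brings `_occupied` back down, which is exactly the blow-up the loop above triggers against the real set.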


