Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21102#discussion_r203322056

    --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
    @@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
        * to a new position (in the new data array).
        */
      def rehashIfNeeded(k: T, allocateFunc: (Int) => Unit, moveFunc: (Int, Int) => Unit) {
    -    if (_size > _growThreshold) {
    +    if (_occupied > _growThreshold) {
    --- End diff --

There is no explicit entry here - it is simply an unoccupied slot in the array. The slot is free; it can be reused by some other (new) entry when insert is called.

It is easy to see how very bad behavior can occur even while the actual size of the set stays very small: a series of add/remove's results in unending growth of the set. Something like this, for example, is enough to cause the set to blow up to 2B entries:

```scala
var i = 0
while (i < Int.MaxValue) {
  set.add(1)
  set.remove(1)
  assert(0 == set.size)
  i += 1
}
```
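To make the distinction concrete, here is a minimal counter-only sketch (not Spark's actual `OpenHashSet`; the names `size`, `occupied`, and `growThreshold` are illustrative stand-ins for `_size`, `_occupied`, and `_growThreshold`). It models the add/remove loop above: each remove leaves a tombstoned slot behind, so a growth check on live `size` alone would never fire, while a check on `occupied` periodically compacts the tombstones away.

```scala
// Sketch only: counters modeling live entries vs. occupied slots in an
// open-addressing set with tombstone deletion.
final class SlotCounters(growThreshold: Int) {
  var size = 0      // live entries (analogous to _size)
  var occupied = 0  // live entries plus tombstoned slots (analogous to _occupied)
  var rehashes = 0

  def add(): Unit = {
    size += 1
    occupied += 1   // each insert consumes a fresh slot in this model
    if (occupied > growThreshold) { // the fixed check from the diff
      occupied = size // rehash copies only live entries, dropping tombstones
      rehashes += 1
    }
  }

  def remove(): Unit = {
    size -= 1       // the slot becomes a tombstone; occupied is unchanged
  }
}

// The add/remove pattern from the comment: size returns to 0 every
// iteration, yet every add consumes a slot. Checking `size > growThreshold`
// would never rehash, and the backing array would fill with tombstones.
val c = new SlotCounters(growThreshold = 8)
var i = 0
while (i < 100) {
  c.add()
  c.remove()
  assert(0 == c.size)
  i += 1
}
// With the occupied-based check, tombstones are compacted periodically,
// so occupied stays bounded by the threshold instead of growing forever.
println(s"size=${c.size} occupied=${c.occupied} rehashes=${c.rehashes}")
```

The point of the model is only the invariant: `occupied` is what actually drives slot consumption, so it is the quantity the growth check must compare against the threshold.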