Github user superbobry commented on the issue:

    https://github.com/apache/spark/pull/19369

I've looked at the allocation profile of the sample: it does indeed contain strings from other sources, so the heap histogram is not representative. However, the allocations from `BlockManagerMasterEndpoint` are gone.

The test failures are due to an invalid test, "deterministic node selection", in `BlockManagerReplicationBehavior`. The test checks that any `BlockReplicationPolicy` is deterministic and strictly monotonic, meaning that the locations chosen for 4x replication of some block are a strict superset of the 3x and 2x locations. This holds for neither `BasicBlockReplicationPolicy` nor `RandomBlockReplicationPolicy`. Both call `getSampleIds`, and for them to be monotonic given a fixed random seed `r`, the result of `getSampleIds(n, m, r)` would have to be a subset of `getSampleIds(n, m + 1, r)`. This would be true if the method implemented a partial Knuth shuffle, which makes the same sequence of `m` `nextInt` calls in both cases. For Floyd's algorithm, however, it is false, because the sequence of calls to `nextInt` depends on `m`. Specifically, `getSampleIds(n, m, r)` starts with

```
r.nextInt(n - m + 1)
...
```

while for `getSampleIds(n, m + 1, r)` the first call is

```
r.nextInt(n - (m + 1) + 1)
...
```

Magically, hashing `"test_" + id` happened to produce a case where the monotonicity property was satisfied. The switch to the auto-generated `hashCode` has broken the magic. I replaced the implementation of `getSampleIds` with

```scala
private def getSampleIds(n: Int, m: Int, r: Random): List[Int] = {
  r.shuffle(List.range(0, n)).take(m)
}
```
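To make the divergence concrete, here is a minimal, self-contained Scala sketch contrasting the two strategies. `floydSample` and `shuffleSample` are hypothetical stand-ins written for illustration, not Spark's actual `getSampleIds`: with a fixed seed, the shuffle-based sample for `m` is by construction a subset of the sample for `m + 1`, while Floyd's algorithm consumes `nextInt` with bounds that depend on `m`, so its samples for `m` and `m + 1` generally are not nested.

```scala
import scala.util.Random

object SamplingDemo {
  // Floyd's algorithm (illustrative sketch, not Spark's code):
  // exactly m distinct ids from [0, n), but the very first call is
  // r.nextInt(n - m + 1), so the RNG stream depends on m.
  def floydSample(n: Int, m: Int, seed: Long): Set[Int] = {
    val r = new Random(seed)
    var selected = Set.empty[Int]
    for (i <- (n - m) until n) {
      val t = r.nextInt(i + 1)
      selected = if (selected.contains(t)) selected + i else selected + t
    }
    selected
  }

  // Shuffle-based sampling: the permutation depends only on the seed,
  // so take(m) is always a prefix of take(m + 1).
  def shuffleSample(n: Int, m: Int, seed: Long): Set[Int] = {
    val r = new Random(seed)
    r.shuffle(List.range(0, n)).take(m).toSet
  }
}
```

With this sketch, `shuffleSample(10, 2, seed)` is a subset of `shuffleSample(10, 3, seed)` for every seed, whereas scanning a handful of seeds quickly turns up cases where `floydSample(10, 3, seed)` is not contained in `floydSample(10, 4, seed)`, which is exactly why the "deterministic node selection" test breaks.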