Github user superbobry commented on the issue:

    https://github.com/apache/spark/pull/19369
  
    I've looked at the allocation profile of the sample; it does indeed contain 
strings from other sources, so the heap histogram is not representative. 
However, the allocations from `BlockManagerMasterEndpoint` are gone.
    
    The test failures are due to an invalid test, "deterministic node selection", 
in `BlockManagerReplicationBehavior`. The test checks that any 
`BlockReplicationPolicy` is deterministic and strictly monotonic, meaning that 
the locations chosen for 4x replication of some block would be a strict 
superset of the 3x and 2x locations. This is not the case for either 
`BasicBlockReplicationPolicy` or `RandomBlockReplicationPolicy`. Both call 
`getSampleIds`, and, for them to be deterministic given a fixed random seed 
`r`, the result of `getSampleIds(n, m, r)` should be a subset of 
`getSampleIds(n, m + 1, r)`. This would be true if the method implemented a 
partial Knuth shuffle, which makes the same sequence of `m` `nextInt` calls in 
both cases. For Floyd's algorithm, however, this is false, because the sequence 
of calls to `nextInt` depends on `m` (see the sketch after the two call 
sequences below). Specifically, for `getSampleIds(n, m, r)` it would start with
    
    ```
    r.nextInt(n - m + 1)
    ...
    ```
    
    while for `getSampleIds(n, m + 1, r)` the first call would be
    
    ```
    r.nextInt(n - (m + 1) + 1)
    ...
    ```
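    
    For reference, here is a minimal sketch contrasting the two approaches 
(hypothetical standalone helpers, not the exact Spark implementations): with 
Floyd's algorithm the very first draw already depends on `m`, while a partial 
Knuth (Fisher-Yates) shuffle makes the same first `m` draws regardless of how 
many elements are requested.
    
    ```scala
    import scala.util.Random
    
    // Floyd's algorithm: sample m distinct ints from [0, n) in exactly m draws.
    // The first draw is r.nextInt(n - m + 1), so the whole call sequence
    // shifts whenever m changes.
    def floydSample(n: Int, m: Int, r: Random): Set[Int] = {
      var selected = Set.empty[Int]
      for (i <- (n - m) until n) {
        val t = r.nextInt(i + 1) // uniform over [0, i]
        selected += (if (selected.contains(t)) i else t)
      }
      selected
    }
    
    // Partial Knuth shuffle: the first m draws are r.nextInt(n),
    // r.nextInt(n - 1), ..., independent of m, so the m-sample is always
    // a prefix of the (m + 1)-sample.
    def partialShuffleSample(n: Int, m: Int, r: Random): List[Int] = {
      val a = Array.range(0, n)
      for (i <- 0 until m) {
        val j = i + r.nextInt(n - i) // uniform over [i, n)
        val tmp = a(i); a(i) = a(j); a(j) = tmp
      }
      a.take(m).toList
    }
    ```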
    
    Magically, hashing `"test_" + id` happened to produce a case where the 
monotonicity property is satisfied. The switch to the auto-generated `hashCode` 
has broken the magic. 
    
    I replaced the implementation of `getSampleIds` with 
    
    ```scala
    private def getSampleIds(n: Int, m: Int, r: Random): List[Int] = {
      // Shuffle the full [0, n) range once; with a fixed seed the m-prefix
      // is always contained in the (m + 1)-prefix, restoring monotonicity.
      r.shuffle(List.range(0, n)).take(m)
    }
    ```
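    
    With a fixed seed this makes the property the test relies on hold by 
construction: the full shuffle is identical for any `m`, so the `m`-prefix is 
contained in the `(m + 1)`-prefix. A quick sanity check (hypothetical values, 
and assuming the method is reachable from the test):
    
    ```scala
    import scala.util.Random
    
    val ids2 = getSampleIds(10, 2, new Random(42))
    val ids3 = getSampleIds(10, 3, new Random(42))
    // The 2-sample nests inside the 3-sample for any seed.
    assert(ids2.toSet.subsetOf(ids3.toSet))
    ```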

