GitHub user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Consider the following scenario. Let `docs` be an RDD containing 1000 empty 
documents and 1000 non-empty documents, and let `miniBatchFraction = 0.05`.
    
    Assume we use `filter(...).sample(...)`. Then the resulting RDD will have 
around `50` elements. 
    
    If we use `sample(...).filter(...)` instead, `sample` returns around 
`100` elements, and the number of elements in the RDD returned by `filter` is 
approximately normally distributed. The expectation is `50` again, though. 
    Am I missing something?
    
    However, for larger samples this shouldn't make any difference. 
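    The effect of the two orderings can be sketched with a small simulation. This is plain Python standing in for Spark's Bernoulli-style `sample` (which keeps each element independently with the given probability, so the output size is random); the corpus sizes and the fraction are the hypothetical ones from the scenario above:

```python
import random

random.seed(0)
FRACTION = 0.05
# Hypothetical corpus from the scenario: 1000 empty and 1000 non-empty docs.
# We only track emptiness, so 1 marks a non-empty document and 0 an empty one.
docs = [0] * 1000 + [1] * 1000

def bernoulli_sample(xs, fraction):
    # Mimics RDD.sample(withReplacement=False, fraction): each element is
    # kept independently with probability `fraction`, so the result size
    # is random (binomial), not exactly len(xs) * fraction.
    return [x for x in xs if random.random() < fraction]

# Ordering 1: filter out empty docs first, then sample -> ~50 elements.
filter_then_sample = bernoulli_sample([d for d in docs if d == 1], FRACTION)

# Ordering 2: sample first (~100 elements), then filter -> ~50 elements again.
sample_then_filter = [d for d in bernoulli_sample(docs, FRACTION) if d == 1]

# Both batch sizes are random with expectation 1000 * 0.05 = 50.
print(len(filter_then_sample), len(sample_then_filter))
```

    In both orderings the expected number of non-empty documents in the batch is `50`; only the exact count varies from run to run.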
    
    On the purpose of the issue: there were two different variables, 
`batchSize` and `nonEmptyDocsN`, which could not be used interchangeably. The 
purpose is to submit a batch containing no empty docs, which makes the two 
variables refer to the same value. 

