Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    @hhbyyh 
    
    Yes, in this way we don't change semantics of `miniBatchFraction`. But is 
the way it is defined now actually correct? As I mentioned above, in the 
`upstram/master` the number of non-empty documents in the mini-batch is 
asymptotically normally distributed. Hence, the size of RDD fed to 
`treeAggregate` differs from one batch to another. While in this PR (filtering 
before sampling) all the batches have the same length.
    
    But then again, for large datasets this should make no difference. If 
nobody thinks this disparity of batch sizes is an issue, I won't object against 
sampling before filtering.
    
    @WeichenXu123, I believe, you were advocating for filter before sample. Do 
you still have the preference?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to