Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh Yes, in this way we don't change semantics of `miniBatchFraction`. But is the way it is defined now actually correct? As I mentioned above, in the `upstram/master` the number of non-empty documents in the mini-batch is asymptotically normally distributed. Hence, the size of RDD fed to `treeAggregate` differs from one batch to another. While in this PR (filtering before sampling) all the batches have the same length. But then again, for large datasets this should make no difference. If nobody thinks this disparity of batch sizes is an issue, I won't object against sampling before filtering. @WeichenXu123, I believe, you were advocating for filter before sample. Do you still have the preference?
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org