GitHub user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    @WeichenXu123, yes, there is indeed a difference in logic. Ultimately it 
boils down to the semantics of `miniBatchFraction`. If it is the fraction of 
non-empty documents to sample, the version with `filter` going first is 
correct. If it is the fraction of all documents (empty and non-empty) to 
sample, then the version with `sample` going first is correct. To me the first 
version seems more reasonable (who cares about empty docs anyway?). @srowen, if 
I understand correctly, you would prefer the second option. Why? 
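
    To make the two readings concrete, here is a minimal sketch in plain 
Python (not Spark code). It assumes `sample` without replacement keeps each 
document independently with probability `miniBatchFraction`; the helper names 
and the toy corpus are hypothetical and only count documents, so they say 
nothing about the rest of the optimizer.

```python
import random


def bernoulli_sample(items, fraction, rng):
    """Keep each item independently with probability `fraction`."""
    return [x for x in items if rng.random() < fraction]


def is_non_empty(doc):
    return len(doc) > 0


def filter_then_sample(docs, fraction, rng):
    # The fraction applies to non-empty documents only.
    non_empty = [d for d in docs if is_non_empty(d)]
    return bernoulli_sample(non_empty, fraction, rng)


def sample_then_filter(docs, fraction, rng):
    # The fraction applies to all documents; empties are dropped afterwards.
    sampled = bernoulli_sample(docs, fraction, rng)
    return sampled, [d for d in sampled if is_non_empty(d)]


# Hypothetical corpus: 600 non-empty and 400 empty documents.
docs = [["word"]] * 600 + [[]] * 400

drawn_a = filter_then_sample(docs, 0.1, random.Random(0))
drawn_b, kept_b = sample_then_filter(docs, 0.1, random.Random(0))

# Number of documents drawn by each ordering, and the non-empty survivors
# of the second ordering.
print(len(drawn_a), len(drawn_b), len(kept_b))
```

    Under this assumption, `filter` first draws about 0.1 * 600 = 60 
documents on average, while `sample` first draws about 0.1 * 1000 = 100, of 
which about 60 are non-empty. The orderings therefore differ in which 
population the fraction refers to and in how many documents count as sampled.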
    
    @WeichenXu123, I agree with you: filtering introduces only minimal overhead. 
    
    @srowen, regarding performance: I don't think the order makes any 
difference unless the cost of `sample` depends on the size of the parent 
RDD. In all subsequent computations, empty documents can be handled 
efficiently. 
    



