Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/2313#issuecomment-55529273
  
    @mattf, one useful question would be: do the two scenarios (with and 
without numpy) generate equivalent output distributions? The basic methodology 
would be to collect output in both scenarios and run Kolmogorov-Smirnov tests 
to assess whether the sampling is statistically equivalent.
    
    I did this recently for testing my upcoming proposal for gap sampling:
    https://gist.github.com/erikerlandson/05db1f15c8d623448ff6
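
    As a rough illustration of that methodology (not the code from the gist), 
here is a minimal sketch of a two-sample Kolmogorov-Smirnov comparison in plain 
Python. The names `ks_statistic`, `ks_pvalue` and `demo` are hypothetical, the 
two seeded generators are only stand-ins for the two scenarios, and the p-value 
uses the asymptotic Kolmogorov series, so it is an approximation that assumes 
reasonably large samples. If SciPy is available, `scipy.stats.ks_2samp` does 
the same job.

```python
import bisect
import math
import random


def ks_statistic(xs, ys):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    # The gap can only change at observed values, so check each of them.
    for v in xs + ys:
        fx = bisect.bisect_right(xs, v) / len(xs)
        fy = bisect.bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


def ks_pvalue(d, n, m):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series converges poorly here and the true value is ~1 anyway.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, 101))
    return max(0.0, min(1.0, 2.0 * total))


def demo(n=2000):
    # Hypothetical stand-ins for the two scenarios being compared (for
    # example, sampling with and without numpy); not the actual PySpark code.
    rng_a = random.Random(1)
    rng_b = random.Random(2)
    a = [rng_a.random() for _ in range(n)]
    b = [rng_b.random() for _ in range(n)]
    d = ks_statistic(a, b)
    print("D = %.4f, approx p = %.4f" % (d, ks_pvalue(d, n, n)))


if __name__ == "__main__":
    demo()
```

    A small p-value would suggest the two outputs do not come from the same 
distribution; a large one only means the test found no evidence of a 
difference at that sample size.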
    
    That doesn't cover the question of *exactly* reproducible results. I'm not 
sure whether that would be feasible. In general, I only consider *exactly* 
reproducible results relevant for things like unit testing applications, so if 
that's important my answer would be "make sure your environment is set up to 
either use numpy or not, consistently".



