GitHub user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/2313#issuecomment-55529273

@mattf, one useful question would be: do the two cases generate equivalent output distributions? The basic methodology would be to collect output in both scenarios (with and without numpy) and run Kolmogorov-Smirnov tests to assess whether the sampling is statistically equivalent. I did this recently to test my upcoming proposal for gap sampling:

    https://gist.github.com/erikerlandson/05db1f15c8d623448ff6

That doesn't cover the question of *exactly* reproducible results. I'm not sure whether that would be feasible. In general, I only consider *exactly* reproducible results relevant for things like unit testing applications, so if that's important, my answer would be "make sure your environment is set up to either use numpy or not, consistently".
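
A minimal sketch of that methodology, for illustration only. It is not the code from the gist or from the pull request. The helper names `bernoulli_sample` and `ks_2samp` are hypothetical, the two seeded `random.Random` instances merely stand in for the numpy and non-numpy code paths, and the p-value uses the standard asymptotic Kolmogorov approximation rather than an exact computation.

```python
import math
import random
from bisect import bisect_right


def bernoulli_sample(data, fraction, rng):
    """Keep each element independently with probability `fraction` (hypothetical helper)."""
    return [x for x in data if rng.random() < fraction]


def ks_2samp(xs, ys):
    """Two-sample Kolmogorov-Smirnov test; returns (D, approximate p-value)."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    # D is the largest gap between the two empirical CDFs,
    # evaluated at every observed value.
    d = max(abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
            for v in xs + ys)
    # Asymptotic p-value from the Kolmogorov distribution.
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        # The series below converges poorly here; the true value is ~1.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))


# Assumption: two independently seeded generators play the role of the
# two environments (for example, with and without numpy).
data = range(100000)
sample_a = bernoulli_sample(data, 0.1, random.Random(1))
sample_b = bernoulli_sample(data, 0.1, random.Random(2))

d_stat, p_value = ks_2samp(sample_a, sample_b)
print("D = %.4f, p = %.4f" % (d_stat, p_value))
```

A small D (and a p-value that is not close to zero) is consistent with the two samplers drawing from the same distribution. It says nothing about whether the two runs select exactly the same elements.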