Github user josepablocam commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7430#discussion_r34832557

    --- Diff: python/pyspark/mllib/stat/_statistics.py ---
    @@ -238,6 +242,54 @@ def chiSqTest(observed, expected=None):
                 jmodel = callMLlibFunc("chiSqTest", _convert_to_vector(observed), expected)
             return ChiSqTestResult(jmodel)
    +    @staticmethod
    +    @ignore_unicode_prefix
    +    def kolmogorovSmirnovTest(data, distName="norm", *params):
    +        """
    +        .. note:: Experimental
    +
    +        Performs the Kolmogorov Smirnov (KS) test for data sampled from a continuous
    +        distribution. It tests the null hypothesis that the data is generated from a
    +        particular distribution.
    +
    +        The given data is sorted, the Empirical Cumulative Distribution Function (ECDF)
    +        is calculated which is the number of points having a CDF value lesser than a given point
    +        divided by the total number of points. Since the data is sorted, this is a step function
    +        that rises by (1 / length of data) for every ordered point.
    +
    +        The KS statistic gives us the maximum distance between the ECDF and the CDF. Intuitively
    +        if this value is large, the probabilty that the null hypothesis is true becomes small.
    +        For specific details of the implementation, please have a look at the Scala documentation.
    +
    +        :param data: RDD, samples from the data
    +        :param distName: string, currently only "norm" is suuported. (Normal distribution)
    --- End diff --

    suuported -> supported
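
    For readers following the docstring under review: the ECDF and KS statistic it describes can be sketched locally in plain Python 3. This is only an illustration under stated assumptions, not the PR's implementation (which delegates to the Scala side through `callMLlibFunc`). The names `normal_cdf` and `ks_statistic` are hypothetical helpers made up for this sketch, and the data is an in-memory list rather than an RDD.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of a normal distribution, computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic(samples, cdf):
    """Largest gap between the ECDF of `samples` and the theoretical `cdf`.

    Hypothetical helper for illustration only; it is not part of pyspark.
    """
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The ECDF steps from i/n up to (i+1)/n at x, so the gap to the
        # theoretical CDF has to be checked on both sides of the step.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


if __name__ == "__main__":
    # Three points placed symmetrically around the mean of a standard normal.
    print(ks_statistic([-1.0, 0.0, 1.0], normal_cdf))
```

    A larger returned value means the sample sits further from the reference distribution, which is the intuition the docstring gives for rejecting the null hypothesis. Computing a p-value from the statistic is left out of this sketch.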