Re: Is this a bug in MLlib.stat.test ? About the mapPartitions API used in Chi-Squared test

2015-03-12 Thread Joseph Bradley
The checks against maxCategories are not for statistical purposes; they are there to make sure communication does not blow up. There are currently no checks to make sure that there are enough entries for statistically significant results. That is up to the user. I do like the idea of adding a warning

Is this a bug in MLlib.stat.test ? About the mapPartitions API used in Chi-Squared test

2015-03-12 Thread Chunnan Yao
Hi everyone! I am currently digging into MLlib in Spark 1.2.1. While reading the code of MLlib.stat.test, in the file ChiSqTest.scala under /spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test, I am confused by the usage of the mapPartitions API in the function def chiSquaredFeatures(data: RDD[La