The checks against maxCategories are not for statistical purposes; they are
there to make sure communication does not blow up. There are currently no
checks to ensure that there are enough entries for statistically
significant results; that is up to the user.
I do like the idea of adding a warning.
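To make the point above concrete, here is a minimal sketch (not the actual ChiSqTest code; the function name and shape are hypothetical) of a maxCategories-style guard: it fails fast as soon as a feature exceeds the cap on distinct values, rather than letting a huge per-feature contingency table be built and shipped across the cluster.

```scala
// Hypothetical illustration of a maxCategories-style guard.
// Counts occurrences of each distinct value in one feature column and
// aborts early if the number of distinct values exceeds the cap.
def checkCategories(column: Seq[Double], maxCategories: Int): Map[Double, Long] = {
  val counts = scala.collection.mutable.Map.empty[Double, Long]
  for (v <- column) {
    counts(v) = counts.getOrElse(v, 0L) + 1L
    require(counts.size <= maxCategories,
      s"Feature has more than $maxCategories distinct values; " +
        "aggregating its contingency table would be too expensive.")
  }
  counts.toMap
}
```

The early `require` inside the loop is the key detail: the check fires during counting, so a pathological feature (e.g. a continuous column passed in by mistake) is rejected before any large structure is materialized or communicated.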
Hi everyone!
I am currently digging into MLlib in Spark 1.2.1. While reading the code of
mllib.stat.test, in the file ChiSqTest.scala under
/spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test, I am confused
by the usage of the mapPartitions API in the function
def chiSquaredFeatures(data: RDD[La
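For readers puzzling over the same question: the usual reason for mapPartitions in aggregation code like this is that it hands you a whole partition's iterator at once, so you can fold everything into one local mutable map and emit only that small summary, instead of emitting a record per element. Below is a plain-Scala sketch of that pattern (an illustration only, not the actual ChiSqTest implementation; it simulates partitions as a Seq of iterators rather than using an RDD).

```scala
// Illustrative sketch of the mapPartitions aggregation pattern:
// each "partition" is reduced to ONE local counts map, and only those
// small maps are merged afterwards.
def countsPerPartition(partitions: Seq[Iterator[Double]]): Map[Double, Long] =
  partitions
    .map { it =>
      // Per-partition step (what the mapPartitions closure would do):
      // fold the iterator into a single mutable map of value -> count.
      val local = scala.collection.mutable.Map.empty[Double, Long]
      it.foreach(v => local(v) = local.getOrElse(v, 0L) + 1L)
      local.toMap
    }
    // Merge step (what a subsequent reduce/collect would do in Spark):
    // combine the per-partition maps by summing counts per key.
    .foldLeft(Map.empty[Double, Long]) { (acc, m) =>
      m.foldLeft(acc)((a, kv) => a + (kv._1 -> (a.getOrElse(kv._1, 0L) + kv._2)))
    }
```

With map() instead, the closure would run once per record and could not share a mutable accumulator across records of the same partition; mapPartitions trades that per-record overhead for a single pass per partition.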