[ https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080175#comment-17080175 ]
zhengruifeng commented on SPARK-31301:
--------------------------------------

(One small question: the new method takes a Dataset[_] rather than DataFrame; should it be DataFrame too?)

I think so; we may need to change it to DataFrame.

(What is different about what they do then?)

I think the two existing methods provide similar functionality: with the first method, all we can do is collect the result back to the driver, so it is effectively the same as the second one. If we add another method (or change the return type of the second method), we can then perform more operations on the result.

Another point: if we return a DataFrame containing one row per feature, we can avoid the bottleneck on the driver, because we no longer need to collect the test results of all features back to the driver.

I will test it and maybe send a PR to show it.

> flatten the result dataframe of tests in stat
> ---------------------------------------------
>
>                 Key: SPARK-31301
>                 URL: https://issues.apache.org/jira/browse/SPARK-31301
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.1.0
>            Reporter: zhengruifeng
>            Priority: Major
>
> {code:java}
> scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.ml.linalg.{Vector, Vectors}
>
> scala> import org.apache.spark.ml.stat.ChiSquareTest
> import org.apache.spark.ml.stat.ChiSquareTest
>
> scala> val data = Seq(
>      |   (0.0, Vectors.dense(0.5, 10.0)),
>      |   (0.0, Vectors.dense(1.5, 20.0)),
>      |   (1.0, Vectors.dense(1.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 40.0)),
>      |   (1.0, Vectors.dense(3.5, 40.0))
>      | )
> data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))
>
> scala> val df = data.toDF("label", "features")
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
>
> scala> val chi = ChiSquareTest.test(df, "features", "label")
> chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: array<int> ... 1 more field]
>
> scala> chi.show
> +--------------------+----------------+----------+
> |             pValues|degreesOfFreedom|statistics|
> +--------------------+----------------+----------+
> |[0.68728927879097...|          [2, 3]|[0.75,1.5]|
> +--------------------+----------------+----------+
> {code}
>
> Current implementations of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, and {{Correlation}} all return a df containing only one row.
>
> I think this is quite hard to use: suppose we have a dataset with dim=1000, the only operation we can perform on the test result is to collect it via {{head()}} or {{first()}} and then use it on the driver.
>
> While what I really want to do is filter the df, e.g. {{pValue > 0.1}} or {{corr < 0.5}}. *So I suggest flattening the output df in those tests.*
>
> Note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but {{ChiSquareTest}} and {{Correlation}} have been here for a long time.
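
For illustration, here is a minimal sketch of the proposed flattened shape, written against the spark-shell session from the quoted description (so {{spark}} and {{df}} are assumed to be in scope). The column names ({{featureIndex}}, {{pValue}}, {{degreesOfFreedom}}, {{statistic}}) are hypothetical, not a committed API; note the sketch still collects today's one-row result first, whereas a real implementation would produce the per-feature rows directly and so avoid the driver bottleneck:

{code:java}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.ChiSquareTest
import spark.implicits._

val chi = ChiSquareTest.test(df, "features", "label")

// Today's result is a single row of vector/array columns, so the only
// practical first step is to collect it to the driver:
val row     = chi.head
val pValues = row.getAs[Vector]("pValues")
val dof     = row.getAs[Seq[Int]]("degreesOfFreedom")
val stats   = row.getAs[Vector]("statistics")

// Reshape into the proposed layout: one row per feature
// (column names are illustrative only).
val flat = (0 until pValues.size)
  .map(i => (i, pValues(i), dof(i), stats(i)))
  .toDF("featureIndex", "pValue", "degreesOfFreedom", "statistic")

// The flattened result supports ordinary DataFrame operations,
// e.g. the filter mentioned in the description:
flat.where($"pValue" > 0.1).show()
{code}

If the tests themselves emitted this shape, the filter above would run as a normal distributed operation, and no per-feature result would ever need to be gathered on the driver.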