[ https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-12325.
-------------------------------
    Resolution: Invalid

[~Narine] I'm going to push back on this, since it's inappropriate to open a "Critical Bug" on JIRA just to get attention for your question. This is, at best, suggesting a very minor change to an error message. Please first read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Then consider whether you clearly explained your problem and proposed change -- I think it's a lot simpler than this. When you're ready to open a pull request with a change to the message, *then* make a "Minor Improvement" / "Documentation" JIRA which your PR references.

> Inappropriate error messages in DataFrame StatFunctions
> --------------------------------------------------------
>
>                 Key: SPARK-12325
>                 URL: https://issues.apache.org/jira/browse/SPARK-12325
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Narine Kokhlikyan
>            Priority: Critical
>
> Hi there,
> I have mentioned this issue earlier in one of my pull requests for the SQL
> component, but I've never received feedback on any of them.
> https://github.com/apache/spark/pull/9366#issuecomment-155171975
> Although this has been very frustrating, I'll try to list certain facts again:
> 1. I call the DataFrame correlation method and it says that covariance is wrong.
> I do not think that this is an appropriate message to show here.
> scala> df.stat.corr("rating", "income")
> java.lang.IllegalArgumentException: requirement failed: Covariance
> calculation for columns with dataType StringType not supported.
>       at scala.Predef$.require(Predef.scala:233)
>       at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)
> 2. The biggest issue here is not the message shown, but the design.
> A class called CovarianceCounter does the computations both for correlation
> and covariance. This might be convenient from a certain perspective, but
> something like this is harder to understand and extend, especially if you
> want to add another algorithm, e.g. Spearman correlation, or something else.
> There are many possible solutions here, starting from:
>    1. just fixing the message
>    2. fixing the message and renaming CovarianceCounter and the corresponding
>       methods
>    3. creating CorrelationCounter and splitting the computations for
>       correlation and covariance
> and many more ....
> Since I'm not getting any response and according to github all five of you
> have been working on this, I'll try again:
> [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]
> Can any of you please explain this behavior of the stat functions, or
> communicate more about this?
> In case you are planning to remove it or something else, we'd truly
> appreciate if you communicate.
> In fact, I would like to do a pull request on this, but since my pull
> requests in SQL/ML components are just staying there without any response,
> I'll wait for your response first.
> cc: [~shivaram], [~mengxr]
> Thank you,
> Narine

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
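
For reference, the first option listed in the report ("just fixing the message") could look roughly like the sketch below. It is a standalone illustration, not Spark's actual StatFunctions code: the object name `StatMessageSketch`, the helper `requireNumeric`, and the simplified `DataType` stand-ins are all hypothetical. The only idea it shows is passing the name of the requested statistic into the `require` message, so that a `corr` call no longer reports a "Covariance" failure.

```scala
// Hypothetical sketch; none of these names come from Spark itself.
object StatMessageSketch {
  // Simplified stand-ins for Spark SQL data types (assumption for illustration).
  sealed trait DataType
  case object DoubleType extends DataType
  case object StringType extends DataType

  // Checks that every column is numeric, naming the statistic the caller
  // actually requested instead of hard-coding "Covariance".
  def requireNumeric(functionName: String, columnTypes: Seq[DataType]): Unit =
    columnTypes.foreach { dt =>
      require(
        dt == DoubleType,
        s"$functionName calculation for columns with dataType $dt not supported.")
    }

  def main(args: Array[String]): Unit = {
    // Simulates a correlation request over a string column and a double column.
    try requireNumeric("Correlation", Seq(StringType, DoubleType))
    catch {
      case e: IllegalArgumentException => println(e.getMessage)
    }
  }
}
```

With this shape, a shared counter class could still serve both statistics while each public entry point passes its own name to the check; whether that is preferable to splitting the class, as the report's third option suggests, is a separate design question.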