[ 
https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12325.
-------------------------------
    Resolution: Invalid

[~Narine] I'm going to push back on this, since it's inappropriate to open a 
"Critical Bug" on JIRA just to get attention for your question. This is, at 
best, suggesting a very minor change to an error message.

Please first read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and 
then consider whether you clearly explained your problem and proposed change. 
I think it's a lot simpler than this.

When you're ready to open a pull request with a change to the message, *then* 
make a "Minor Improvement" / "Documentation" JIRA which your PR references.

> Inappropriate error messages in DataFrame StatFunctions 
> --------------------------------------------------------
>
>                 Key: SPARK-12325
>                 URL: https://issues.apache.org/jira/browse/SPARK-12325
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Narine Kokhlikyan
>            Priority: Critical
>
> Hi there,
> I mentioned this issue earlier in one of my pull requests for the SQL 
> component, but I have not received feedback on any of them.
> https://github.com/apache/spark/pull/9366#issuecomment-155171975
> Although this has been very frustrating, I'll try to list the facts again:
> 1. I call the DataFrame correlation method, and the error says that the 
> covariance calculation is not supported.
> I do not think that this is an appropriate message to show here.
> scala> df.stat.corr("rating", "income")
> java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns with dataType StringType not supported.
>     at scala.Predef$.require(Predef.scala:233)
>     at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)
> 2. The biggest issue here is not the message shown, but the design.
> A class called CovarianceCounter does the computations for both correlation 
> and covariance. This may be convenient from a certain perspective, but it 
> makes the code harder to understand and extend, especially if you want to 
> add another algorithm, e.g. Spearman correlation.
> There are many possible solutions here, for example:
> 1. just fixing the message
> 2. fixing the message and renaming CovarianceCounter and the corresponding 
> methods
> 3. creating a CorrelationCounter and splitting the computations for 
> correlation and covariance
> and many more.
> Since I'm not getting any response and, according to GitHub, all five of you 
> have been working on this, I'll try again:
> [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]
> Can any of you please explain this behavior of the stat functions to me, or 
> communicate more about it?
> In case you are planning to remove it or do something else, we'd truly 
> appreciate it if you communicated that.
> In fact, I would like to open a pull request for this, but since my pull 
> requests in the SQL/ML components are sitting there without any response, 
> I'll wait for your response first.
> cc: [~shivaram], [~mengxr]
> Thank you,
> Narine
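
For illustration only, the simplest change discussed in the quoted report (keep one shared computation, but make the error message name the statistic that was actually requested) might look roughly like the sketch below. It is self-contained and does not use Spark: `ColType`, `Column`, `CoMomentCounter`, `StatSketch`, and the `functionName` parameter are hypothetical names chosen for this example, and the single-pass update is a standard Welford-style co-moment recurrence, assumed here rather than taken from the Spark source.

```scala
// Hypothetical stand-ins for column metadata; these are not Spark classes.
sealed trait ColType
case object DoubleType extends ColType
case object StringType extends ColType

final case class Column(name: String, dataType: ColType, values: Seq[Double])

// One accumulator shared by both statistics, playing the role the report
// attributes to CovarianceCounter. Welford-style single-pass update.
final class CoMomentCounter {
  private var n = 0L
  private var xAvg = 0.0
  private var yAvg = 0.0
  private var ck = 0.0  // running sum of (x - xAvg) * (y - yAvg)
  private var mkX = 0.0 // running sum of (x - xAvg)^2
  private var mkY = 0.0 // running sum of (y - yAvg)^2

  def add(x: Double, y: Double): this.type = {
    n += 1
    val dx = x - xAvg // deviations from the means before this update
    val dy = y - yAvg
    xAvg += dx / n
    yAvg += dy / n
    ck += dx * (y - yAvg)
    mkX += dx * (x - xAvg)
    mkY += dy * (y - yAvg)
    this
  }

  def sampleCovariance: Double = ck / (n - 1)
  def pearson: Double = ck / math.sqrt(mkX * mkY)
}

object StatSketch {
  // The caller passes the name of the statistic it is computing, so a type
  // error reports "Correlation" or "Covariance" as appropriate.
  private def collect(a: Column, b: Column, functionName: String): CoMomentCounter = {
    Seq(a, b).foreach { c =>
      require(c.dataType == DoubleType,
        s"$functionName calculation for column '${c.name}' with dataType ${c.dataType} not supported.")
    }
    a.values.zip(b.values).foldLeft(new CoMomentCounter)((acc, p) => acc.add(p._1, p._2))
  }

  def cov(a: Column, b: Column): Double = collect(a, b, "Covariance").sampleCovariance
  def corr(a: Column, b: Column): Double = collect(a, b, "Correlation").pearson
}
```

With this shape, calling `StatSketch.corr` on a string-typed column fails with a message that starts with "requirement failed: Correlation calculation", while the accumulator itself stays shared. Adding a separate counter per statistic, as the third option in the report suggests, would be a larger change than this sketch covers.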


