Narine Kokhlikyan created SPARK-12325:
-----------------------------------------

             Summary: Inappropriate error messages in DataFrame StatFunctions 
                 Key: SPARK-12325
                 URL: https://issues.apache.org/jira/browse/SPARK-12325
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Narine Kokhlikyan
            Priority: Critical


Hi there,

I mentioned this issue earlier in one of my pull requests for the SQL 
component, but I have never received feedback on any of them.
https://github.com/apache/spark/pull/9366#issuecomment-155171975

Although this has been very frustrating, I'll try to list certain facts again:

1. I call the DataFrame correlation method, and the error message complains about a covariance calculation.
I do not think that this is an appropriate message to show here.

scala> df.stat.corr("rating", "income")
java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns with dataType StringType not supported.
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)
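
To illustrate the first point, a type check could take the name of the statistic the user actually called, so that corr reports a correlation error rather than a covariance one. This is only a minimal sketch under assumptions: StatChecks, requireNumeric, and the ColType stand-ins are hypothetical names for illustration and are not part of Spark.

```scala
// Hypothetical sketch, not Spark's actual code: the check receives the name
// of the calling statistic so the message matches what the user invoked.
object StatChecks {
  // Stand-ins for a column's data type; Spark's real types are not used here.
  sealed trait ColType
  case object NumericCol extends ColType
  case object StringCol extends ColType

  // Fails with IllegalArgumentException when the column is not numeric.
  def requireNumeric(functionName: String, colName: String, colType: ColType): Unit =
    require(
      colType == NumericCol,
      s"$functionName calculation for column $colName with dataType $colType not supported.")
}
```

With this shape, StatChecks.requireNumeric("Correlation", "rating", StatChecks.StringCol) fails with a message that starts with "requirement failed: Correlation calculation", while the covariance path would pass "Covariance" instead.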


2. The biggest issue here is not the message shown, but the design.
A class called CovarianceCounter does the computations for both correlation and 
covariance. This may be convenient from a certain perspective, but it makes the 
code harder to understand and extend, especially if you want to add another 
algorithm, e.g. Spearman correlation.

There are many possible solutions here, for example:
1. just fixing the message
2. fixing the message and renaming CovarianceCounter and the corresponding methods
3. creating a CorrelationCounter and splitting the computations for correlation and 
covariance

and many more.
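
The third option could be sketched roughly as follows. This is not Spark's implementation: Moments, PairStatistic, CovarianceStat, and CorrelationStat are hypothetical names chosen for illustration, and the sketch assumes a single-pass update of running means and co-moments that both statistics share.

```scala
// Hypothetical sketch of option 3; none of these definitions are Spark's code.

// Running means and co-moments, updated one (x, y) pair at a time.
final class Moments {
  var count: Long = 0L
  var xAvg: Double = 0.0
  var yAvg: Double = 0.0
  var ck: Double = 0.0   // running sum of (x - xAvg) * (y - yAvg)
  var mkX: Double = 0.0  // running sum of (x - xAvg)^2
  var mkY: Double = 0.0  // running sum of (y - yAvg)^2

  def add(x: Double, y: Double): Unit = {
    count += 1
    val dx = x - xAvg
    val dy = y - yAvg
    xAvg += dx / count
    yAvg += dy / count
    ck += dx * (y - yAvg)
    mkX += dx * (x - xAvg)
    mkY += dy * (y - yAvg)
  }
}

// Each statistic carries its own name (usable in error messages) and its
// own result formula, while the accumulation logic stays shared.
trait PairStatistic {
  def name: String
  protected val moments = new Moments
  def add(x: Double, y: Double): Unit = moments.add(x, y)
  def result: Double
}

final class CovarianceStat extends PairStatistic {
  val name = "Covariance"
  // Sample covariance: co-moment divided by (n - 1).
  def result: Double = moments.ck / (moments.count - 1)
}

final class CorrelationStat extends PairStatistic {
  val name = "Correlation"
  // Pearson correlation: co-moment normalized by both second moments.
  def result: Double = moments.ck / math.sqrt(moments.mkX * moments.mkY)
}
```

Under this split, another algorithm such as Spearman correlation would be one more PairStatistic implementation with its own name, rather than another branch inside a class named after covariance.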

Since I'm not getting any response, and according to GitHub all five of you have 
been working on this, I'll try again:
[~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]

Could any of you please explain this behavior or say more about it?
In case you are planning to remove it or do something else, we'd truly appreciate 
it if you let us know.

In fact, I would like to submit a pull request for this, but since my pull requests 
in the SQL/ML components are just sitting there without any response, I'll wait for 
your response first.

cc: [~shivaram], [~mengxr]

Thank you,
Narine




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
