[ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254510#comment-15254510
 ] 

Xiangrui Meng commented on SPARK-14831:
---------------------------------------

We have been trying to mimic existing R APIs in SparkR. That gave users the 
impression that existing R code should work magically after they convert the 
input data.frame to SparkR's DataFrame. However, this is not true for the 
DataFrame APIs, nor for the ML APIs in SparkR. For example, we defined an 
`algorithm` argument in `kmeans` because R's kmeans has that argument, but the 
two actually mean different things: one selects the initialization algorithm 
and the other the training algorithm. This is quite annoying for users when the 
methods look similar but differ in subtle ways. If we didn't use the same 
method name, users would probably look at the help first before trying it.
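
As a concrete illustration, a minimal sketch of the clash (the SparkR signature 
and its accepted `algorithm` values are assumptions based on the description 
above, and the DataFrame conversion call varies by Spark version):

{code:none}
# Base R: `algorithm` selects the training algorithm (Hartigan-Wong, Lloyd, ...).
fit_r <- kmeans(iris[, 1:4], centers = 3, algorithm = "Lloyd")

# SparkR (assumed current-master signature): `algorithm` selects the
# initialization mode instead, e.g. "random" or "k-means||".
df <- createDataFrame(sqlContext, iris[, 1:4])  # conversion call depends on the Spark version
fit_spark <- kmeans(df, centers = 3, algorithm = "random")
{code}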

> Make ML APIs in SparkR consistent
> ---------------------------------
>
>                 Key: SPARK-14831
>                 URL: https://issues.apache.org/jira/browse/SPARK-14831
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SparkR
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without being 
> associated with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because the 
> signatures are different. We can use `ml.kmeans`, `ml.glm`, etc. (a hypothetical 
> sketch of such calls follows below).
> Sorry for discussing API changes at the last minute. But I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]
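
For illustration only, a hypothetical sketch of calls under the proposed 
`algorithm(df, formula, [required params], [optional params])` convention; the 
`ml.`-prefixed names are the ones suggested in the description, not an existing 
API, and the column names and extra parameters are made up:

{code:none}
# Hypothetical API following the proposed convention: data first, then formula,
# then required parameters, then optional ones. None of these functions exist yet.
model_glm    <- ml.glm(df, y ~ x1 + x2, family = "gaussian")
model_kmeans <- ml.kmeans(df, ~ x1 + x2, centers = 3)
model_nb     <- ml.naiveBayes(df, label ~ ., laplace = 0)
{code}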


