[jira] [Commented] (SPARK-14831) Make ML APIs in SparkR consistent
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265190#comment-15265190 ]

Apache Spark commented on SPARK-14831:
--------------------------------------

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/12807

> Make ML APIs in SparkR consistent
> ---------------------------------
>
>                 Key: SPARK-14831
>                 URL: https://issues.apache.org/jira/browse/SPARK-14831
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SparkR
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Timothy Hunter
>            Priority: Critical
>             Fix For: 2.0.0
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we
> put them together, they are not consistent. One example is k-means, which
> doesn't accept a formula. Instead of looking at each method independently, we
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts
> `family` before `data` while `kmeans` puts `centers` after `data`. This is
> not consistent. And logically, the formula doesn't mean anything without
> being associated with a DataFrame. So it makes more sense to me to have the
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes at the last minute. But I think it would be
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264865#comment-15264865 ]

Apache Spark commented on SPARK-14831:
--------------------------------------

User 'thunterdb' has created a pull request for this issue:
https://github.com/apache/spark/pull/12789
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264328#comment-15264328 ]

Xiangrui Meng commented on SPARK-14831:
---------------------------------------

Talked to [~timhunter] offline and he will submit a PR soon.
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261489#comment-15261489 ]

Yanbo Liang commented on SPARK-14831:
-------------------------------------

+1
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261005#comment-15261005 ]

Shivaram Venkataraman commented on SPARK-14831:
-----------------------------------------------

+1
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260981#comment-15260981 ]

Xiangrui Meng commented on SPARK-14831:
---------------------------------------

+1 on `read.ml` and `write.ml`, which are consistent with `read.df` and `write.df` and leave space for future features.

Putting the discussions together, we have:
* read.ml and write.ml for saving/loading ML models
* "spark." prefix for ML algorithms, especially if we cannot closely match existing R methods or have to shadow them. This includes:
** spark.glm and glm (which doesn't shadow stats::glm)
** spark.kmeans
** spark.naiveBayes
** spark.survreg

For methods with the `spark.` prefix, I suggest the following signature:

{code:none}
spark.kmeans(df, formula, [required params], [optional params], ...)
{code}

Sound good?
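(Editorial note: an illustrative sketch of how such a `spark.`-prefixed convention could look as an S4 generic. This is hypothetical code, not actual SparkR; `k`, `maxIter`, and `initMode` are placeholder parameter names.)

{code}
# Hypothetical sketch of the proposed convention: DataFrame first, then
# formula, then required params, then optional params. Not actual SparkR code.
setGeneric("spark.kmeans", function(df, formula, k, ...) {
  standardGeneric("spark.kmeans")
})

setMethod("spark.kmeans", signature(df = "DataFrame", formula = "formula"),
          function(df, formula, k, maxIter = 20, initMode = "k-means||") {
            # ... dispatch to the JVM-side MLlib implementation ...
          })

# Call sites would then read consistently across algorithms, e.g.:
# model <- spark.kmeans(df, ~ Sepal_Length + Sepal_Width, k = 3)
# model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian")
{code}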
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256983#comment-15256983 ]

Joseph K. Bradley commented on SPARK-14831:
-------------------------------------------

2. {{spark.glm}}, etc. SGTM.

For save/load, I'd prefer either {{spark.save/load}} (if that works for DataFrames too), or {{read.ml}} (rather than {{read.model}}, since that leaves open the possibility of supporting Estimators and Pipelines in R someday).
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254641#comment-15254641 ]

Felix Cheung commented on SPARK-14831:
--------------------------------------

2. +1. read.spark.model and write.spark.model might be more consistent with the existing R convention.
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254551#comment-15254551 ]

Shivaram Venkataraman commented on SPARK-14831:
-----------------------------------------------

1. Agreed. I think a valid policy could be that if we are able to support, say, most of the functionality in the base R function, then we add the overloaded method. All methods, though, will have the `spark.` variant. We can do one pass right now to add `spark.` and remove the overloads that don't match the base R functionality well enough.

2. We have so far used `read.df` and `write.df` to save and load data frames. I think read.model and write.model might work (I can't find an overloaded method in R for that), but I'm also fine if we just want to have a separate set of commands for models.
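(Editorial note: one way to check whether a proposed name would collide with an existing R function, as done informally above; illustrative commands, run in a fresh R session.)

{code}
exists("read.model")                           # FALSE in base R, so the name is free
exists("kmeans", envir = asNamespace("stats")) # TRUE: an overload would shadow stats::kmeans
getAnywhere("read.df")                         # shows which package(s) define a symbol
{code}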
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254516#comment-15254516 ]

Xiangrui Meng commented on SPARK-14831:
---------------------------------------

1. Please see my reply to Felix above for the issue with methods that are similar but slightly different. I totally agree that we should use the same method name if we can safely override the base R functions and match the features 100%. However, it depends on how the existing R methods are defined and on their signatures. Some ML methods in SparkR actually shadow the existing ones; users need to specify the namespace after SparkR is loaded.

2. +1 on the `spark.` prefix. A related task is save/load for MLlib models. If we want to call them `spark.save` and `spark.load`, we need to discuss how to implement it. It would be nice if save/load worked for both DataFrames and ML models.
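(Editorial note: the shadowing in point 1 can be seen in a session sketch. Assumes a running SparkR session with the Spark 2.0-style API; note that `createDataFrame(iris)` replaces `.` in column names with `_`.)

{code}
library(SparkR)   # masks stats::glm, among others

df <- createDataFrame(iris)
sparkModel <- glm(Sepal_Length ~ Sepal_Width, data = df, family = "gaussian")

# Reaching the base implementation now requires an explicit namespace:
baseModel <- stats::glm(Sepal.Length ~ Sepal.Width, data = iris)
{code}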
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254510#comment-15254510 ]

Xiangrui Meng commented on SPARK-14831:
---------------------------------------

We have been trying to mimic existing R APIs in SparkR. That gave users the impression that existing R code should work magically after they convert the input data.frame to SparkR's DataFrame. However, this is not true for the DataFrame APIs, nor for the ML APIs in SparkR. For example, we have `algorithm` defined in `kmeans` because R's kmeans has this argument, but they actually mean different things: one selects the initialization algorithm and the other the training algorithm. This is quite annoying to users when the methods are similar but have subtle differences. If we don't use the same method name, users would probably look at the help first before trying it.
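(Editorial note: the `algorithm` mismatch, concretely. The base R call is real; the SparkR signature is the one in current master, quoted elsewhere in this thread.)

{code}
# base R: `algorithm` selects the training algorithm
stats::kmeans(iris[, 1:4], centers = 3, algorithm = "Lloyd")

# SparkR in current master: `algorithm` selects the initialization mode
# kmeans(df, centers = 3, algorithm = "k-means||")   # one of "random", "k-means||"
{code}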
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254447#comment-15254447 ]

Shivaram Venkataraman commented on SPARK-14831:
-----------------------------------------------

Yeah, I think there are a couple of factors to consider here:

1. Existing R users who want to use SparkR: for this case I think it's valuable to have the methods mimic the ordering used by the corresponding R function. So we will then have kmeans(data, centers, ...) and glm(formula, family, data, ...). I think it's useful to mimic the ordering for two reasons: (a) it helps with familiarity, and (b) it ensures we can safely override the base R functions as they are now.

2. New users of SparkR / Spark ML: I think having internal consistency is useful for these users. My take on the SparkR API has always been that it doesn't hurt to support multiple ways to do things as long as they don't collide. In this scenario, if we want to define a new set of consistent APIs, we should adopt a new namespace as [~mengxr] indicated. I would suggest `spark.kmeans` and `spark.glm` as opposed to `ml.glm`, to make it clearer that these are SparkR functions (we are also using spark.lapply, for example).
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254032#comment-15254032 ]

Yanbo Liang commented on SPARK-14831:
-------------------------------------

This change looks good to me. Thanks!

BTW, I think we should also restructure the functions in mllib.R: we should group the functions related to one model into a single code block, like the following:
* glm, summary, predict
* kmeans, summary, fitted, predict
* naiveBayes, summary, predict
* survreg, summary, predict

This will help developers and contributors understand the code clearly.
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253285#comment-15253285 ]

Felix Cheung commented on SPARK-14831:
--------------------------------------

I'd argue it is more important that they are like the existing R functions. Granted, they are not consistent and they don't always match what Spark supports, but I think we are expecting a large number of long-time R users, who are very familiar with how to call kmeans, to try to use Spark.

However, taking kmeans as an example, these are S4 methods, so it should be possible to define them in such a way that they would look like R's kmeans by default. For example,

{code}
setMethod("kmeans", signature(x = "DataFrame"),
          function(x, centers, iter.max = 10,
                   algorithm = c("random", "k-means||"))
{code}

could be changed, as you later suggested (DataFrame followed by formula), to

{code}
setMethod("kmeans", signature(data = "DataFrame"),
          function(data, formula = NULL, centers, iter.max = 10,
                   algorithm = c("random", "k-means||"))
{code}
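(Editorial note: one possible elaboration of the second signature above: if `formula` is NULL, fall back to using all columns, mirroring base R's matrix-input behavior. A hypothetical sketch, not a committed design.)

{code}
setMethod("kmeans", signature(data = "DataFrame"),
          function(data, formula = NULL, centers, iter.max = 10,
                   algorithm = c("random", "k-means||")) {
            if (is.null(formula)) {
              # No formula given: cluster on all columns, like base R's kmeans(x, ...)
              formula <- as.formula(paste("~", paste(columns(data), collapse = " + ")))
            }
            algorithm <- match.arg(algorithm)
            # ... fit via the JVM backend ...
          })
{code}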