[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
[ https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanbo Liang updated SPARK-14657:
--------------------------------
    Affects Version/s: 2.2.0
     Target Version/s: 2.3.0

> RFormula output wrong features when formula w/o intercept
> ----------------------------------------------------------
>
>                 Key: SPARK-14657
>                 URL: https://issues.apache.org/jira/browse/SPARK-14657
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Yanbo Liang
>            Assignee: Yanbo Liang
>
> SparkR::glm outputs different features compared with R glm when fit w/o
> intercept and having string/category features. Take the following example:
> SparkR outputs three features, compared with four features for native R.
> SparkR::glm
> {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
>                     Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length        0.67468   0.0093013   72.536   0
> Species_versicolor  -1.2349   0.07269     -16.989  0
> Species_virginica   -1.4708   0.077397    -19.003  0
> {quote}
> stats::glm
> {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>                    Estimate  Std. Error  t value  Pr(>|t|)
> Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
> Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
> Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
> Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
> {quote}
> The encoder for string/category features is different. R did not drop any
> category, but SparkR drops the last one.
> I searched online and tested some other cases, and found that when we fit an
> R glm model (or other models powered by R formula) w/o intercept on a dataset
> including string/category features, one of the categories in the first
> category feature is being used as reference category, and no category is
> dropped for that feature.
> I think we should keep consistent semantics between Spark RFormula and R
> formula.
> cc [~mengxr]

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
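
The encoding difference described in the report can be sketched in a few lines. This is only an illustration under stated assumptions, not Spark's or R's actual implementation: the function name design_columns and the flag keep_all_for_first are hypothetical, numeric columns are simply listed before categorical ones, and the reference level is assumed to be the first level of each categorical feature.

```python
def design_columns(numeric, categorical, intercept, keep_all_for_first=True):
    """Return illustrative design-matrix column names.

    numeric: list of numeric column names.
    categorical: dict mapping a feature name to its ordered list of levels.
    intercept: whether the formula includes an intercept term.
    keep_all_for_first: hypothetical switch. True mimics the R behaviour
        described in the report (no level dropped for the first categorical
        feature when there is no intercept); False always drops one level.
    """
    columns = ["(Intercept)"] if intercept else []
    columns.extend(numeric)
    for position, (name, levels) in enumerate(categorical.items()):
        # Full indicator coding only for the first categorical feature,
        # and only when the formula has no intercept.
        full_coding = (not intercept) and keep_all_for_first and position == 0
        kept = levels if full_coding else levels[1:]
        columns.extend(f"{name}{level}" for level in kept)
    return columns


species = ["setosa", "versicolor", "virginica"]

# Formula without intercept, R-style: four features.
r_style = design_columns(["Sepal.Length"], {"Species": species}, intercept=False)

# Same formula, but always dropping one level: three features.
dropping = design_columns(
    ["Sepal.Length"], {"Species": species}, intercept=False,
    keep_all_for_first=False,
)

print(r_style)
print(dropping)
```

With the iris formula from the report, the first call yields one column per Species level plus Sepal.Length (four features, as in the stats::glm output), while the second yields only three, which is the mismatch this issue is about.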
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
Joseph K. Bradley updated SPARK-14657:
--------------------------------------
    Target Version/s:   (was: 2.2.0)
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
Joseph K. Bradley updated SPARK-14657:
--------------------------------------
    Shepherd:   (was: Xiangrui Meng)
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
Joseph K. Bradley updated SPARK-14657:
--------------------------------------
    Target Version/s: 2.2.0  (was: 2.1.0)
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
Joseph K. Bradley updated SPARK-14657:
--------------------------------------
    Target Version/s: 2.1.0  (was: 2.0.0)
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
Xiangrui Meng updated SPARK-14657:
----------------------------------
    Target Version/s: 2.0.0
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
Xiangrui Meng updated SPARK-14657:
----------------------------------
    Shepherd: Xiangrui Meng
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
Xiangrui Meng updated SPARK-14657:
----------------------------------
    Assignee: Yanbo Liang
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
Yanbo Liang updated SPARK-14657:
--------------------------------
    Description: [...] I think we should keep consistent semantics between Spark RFormula and R formula. cc [~mengxr]
    was: [...] I think we should keep consistent semantics between Spark RFormula and R formula.
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
Yanbo Liang updated SPARK-14657:
--------------------------------
    Description: SparkR::glm output different features compared with R glm when fit w/o intercept and having string/category features. [...]
    was: SparkR::glm output different features compared with R glm when fit w/o intercept. [...]
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
Yanbo Liang updated SPARK-14657:
--------------------------------
    Description: SparkR::glm output different features compared with R glm when fit w/o intercept. Take the following example, [...]
    was: SparkR::glm output different features compared with R glm. Take the following example, [...]
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
[ https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanbo Liang updated SPARK-14657:

Description:
SparkR::glm output different features compared with R glm. Take the following
example, SparkR output three features compared with four features for native R.
SparkR::glm
{quote}
training <- suppressWarnings(createDataFrame(sqlContext, iris))
model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
summary(model)
Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
Sepal_Length        0.67468   0.0093013   72.536   0
Species_versicolor  -1.2349   0.07269     -16.989  0
Species_virginica   -1.4708   0.077397    -19.003  0
{quote}
stats::glm
{quote}
summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
{quote}
The encoder for string/category feature is different. R did not drop any
category but SparkR drop the last one.
I searched online and test some other cases, found when we fit R glm model(or
other models powered by R formula) w/o intercept on a dataset including
string/category features, one of the categories in the first category feature
is being used as reference category, we will not drop any category for that
feature.
I think we should keep consistent semantics between Spark RFormula and R
formula.

was:
SparkR::glm output different features compared with R glm.
SparkR::glm
{quote}
training <- suppressWarnings(createDataFrame(sqlContext, iris))
model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
summary(model)
Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
Sepal_Length        0.67468   0.0093013   72.536   0
Species_versicolor  -1.2349   0.07269     -16.989  0
Species_virginica   -1.4708   0.077397    -19.003  0
{quote}
stats::glm
{quote}
summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
{quote}
The encoder for string/category feature is different. R did not drop any
category but SparkR drop the last one.
I searched online and test some other cases, found when we fit R glm model(or
other models powered by R formula) w/o intercept on a dataset including
string/category features, one of the categories in the first category feature
is being used as reference category, we will not drop any category for that
feature.
I think we should keep consistent semantics between Spark RFormula and R
formula.

> RFormula output wrong features when formula w/o intercept
> ---------------------------------------------------------
>
> Key: SPARK-14657
> URL: https://issues.apache.org/jira/browse/SPARK-14657
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
>
> SparkR::glm output different features compared with R glm. Take the following
> example, SparkR output three features compared with four features for native
> R.
> SparkR::glm
> {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
>                     Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length        0.67468   0.0093013   72.536   0
> Species_versicolor  -1.2349   0.07269     -16.989  0
> Species_virginica   -1.4708   0.077397    -19.003  0
> {quote}
> stats::glm
> {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>                    Estimate  Std. Error  t value  Pr(>|t|)
> Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
> Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
> Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
> Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
> {quote}
> The encoder for string/category feature is different. R did not drop any
> category but SparkR drop the last one.
> I searched online and test some other cases, found when we fit R glm model(or
> other models powered by R formula) w/o intercept on a dataset including
> string/category features, one of the categories in the first category feature
> is being used as reference category, we will not drop any category for that
> feature.
> I think we should keep consistent semantics between Spark RFormula and R
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
[ https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanbo Liang updated SPARK-14657:

Description:
SparkR::glm output different features compared with R glm.
SparkR::glm
{quote}
training <- suppressWarnings(createDataFrame(sqlContext, iris))
model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
summary(model)
Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
Sepal_Length        0.67468   0.0093013   72.536   0
Species_versicolor  -1.2349   0.07269     -16.989  0
Species_virginica   -1.4708   0.077397    -19.003  0
{quote}
stats::glm
{quote}
summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
{quote}
The encoder for string/category feature is different. R did not drop any
category but SparkR drop the last one.
I searched online and test some other cases, found when we fit R glm model(or
other models powered by R formula) w/o intercept on a dataset including
string/category features, one of the categories in the first category feature
is being used as reference category, we will not drop any category for that
feature.
I think we should keep consistent semantics between Spark RFormula and R
formula.

was:
SparkR::glm output different features compared with R glm.
SparkR::glm
{quote}
training <- suppressWarnings(createDataFrame(sqlContext, iris))
model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
summary(model)
Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
Sepal_Length        0.67468   0.0093013   72.536   0
Species_versicolor  -1.2349   0.07269     -16.989  0
Species_virginica   -1.4708   0.077397    -19.003  0
{quote}
stats::glm
{quote}
summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
{quote}
The encoder for string/category feature is different. R did not drop any
category but SparkR drop the last one.
I searched online and test some other cases, found when we fit R glm model(or
other models powered by R formula) w/o intercept on a dataset including
string/category features, one of the levels in the first category feature is
being used as reference level, we will not drop any category for that feature.
I think we should keep consistent semantics between Spark RFormula and R
formula.

> RFormula output wrong features when formula w/o intercept
> ---------------------------------------------------------
>
> Key: SPARK-14657
> URL: https://issues.apache.org/jira/browse/SPARK-14657
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
>
> SparkR::glm output different features compared with R glm.
> SparkR::glm
> {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
>                     Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length        0.67468   0.0093013   72.536   0
> Species_versicolor  -1.2349   0.07269     -16.989  0
> Species_virginica   -1.4708   0.077397    -19.003  0
> {quote}
> stats::glm
> {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>                    Estimate  Std. Error  t value  Pr(>|t|)
> Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
> Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
> Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
> Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
> {quote}
> The encoder for string/category feature is different. R did not drop any
> category but SparkR drop the last one.
> I searched online and test some other cases, found when we fit R glm model(or
> other models powered by R formula) w/o intercept on a dataset including
> string/category features, one of the categories in the first category feature
> is being used as reference category, we will not drop any category for that
> feature.
> I think we should keep consistent semantics between Spark RFormula and R
> formula.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additi
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
[ https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanbo Liang updated SPARK-14657:

Description:
SparkR::glm output different features compared with R glm.
SparkR::glm
{quote}
training <- suppressWarnings(createDataFrame(sqlContext, iris))
model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
summary(model)
Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
Sepal_Length        0.67468   0.0093013   72.536   0
Species_versicolor  -1.2349   0.07269     -16.989  0
Species_virginica   -1.4708   0.077397    -19.003  0
{quote}
stats::glm
{quote}
summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
{quote}
The encoder for string/category feature is different. R did not drop any
category but SparkR drop the last one.
I searched online and test some other cases, found when we fit R glm model(or
other models powered by R formula) w/o intercept on a dataset including
string/category features, one of the levels in the first category feature is
being used as reference level, we will not drop any category for that feature.
I think we should keep consistent semantics between Spark RFormula and R
formula.

was:
SparkR::glm output different features compared with R glm.
SparkR::glm
{quote}
training <- suppressWarnings(createDataFrame(sqlContext, iris))
model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
summary(model)
Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
Sepal_Length        0.67468   0.0093013   72.536   0
Species_versicolor  -1.2349   0.07269     -16.989  0
Species_virginica   -1.4708   0.077397    -19.003  0
{quote}
stats::glm
{quote}
summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
{quote}
The encoder for string type feature is different. R did not drop any category
but SparkR drop the last one.
I searched online and test some other cases, found when we fit R glm model(or
other models powered by R formula) w/o intercept on a dataset including
string/category features, one of the levels in the first category feature is
being used as reference level, we will not drop any category for that feature.
I think we should keep consistent semantics between Spark RFormula and R
formula.

> RFormula output wrong features when formula w/o intercept
> ---------------------------------------------------------
>
> Key: SPARK-14657
> URL: https://issues.apache.org/jira/browse/SPARK-14657
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
>
> SparkR::glm output different features compared with R glm.
> SparkR::glm
> {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
>                     Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length        0.67468   0.0093013   72.536   0
> Species_versicolor  -1.2349   0.07269     -16.989  0
> Species_virginica   -1.4708   0.077397    -19.003  0
> {quote}
> stats::glm
> {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>                    Estimate  Std. Error  t value  Pr(>|t|)
> Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
> Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
> Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
> Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
> {quote}
> The encoder for string/category feature is different. R did not drop any
> category but SparkR drop the last one.
> I searched online and test some other cases, found when we fit R glm model(or
> other models powered by R formula) w/o intercept on a dataset including
> string/category features, one of the levels in the first category feature is
> being used as reference level, we will not drop any category for that feature.
> I think we should keep consistent semantics between Spark RFormula and R
> formula.
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
[ https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanbo Liang updated SPARK-14657:

Description:
SparkR::glm output different features compared with R glm.
SparkR::glm
{quote}
training <- suppressWarnings(createDataFrame(sqlContext, iris))
model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
summary(model)
Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
Sepal_Length        0.67468   0.0093013   72.536   0
Species_versicolor  -1.2349   0.07269     -16.989  0
Species_virginica   -1.4708   0.077397    -19.003  0
{quote}
stats::glm
{quote}
summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
{quote}
The encoder for string type feature is different. R did not drop any category
but SparkR drop the last one.
I searched online and test some other cases, found when we fit R glm model(or
other models powered by R formula) w/o intercept on a dataset including
string/category features, one of the levels in the first category feature is
being used as reference level, we will not drop any category for that feature.
I think we should keep consistent semantics between Spark RFormula and R
formula.

was:
SparkR::glm output different features compared with R glm.
SparkR::glm
{quote}
training <- suppressWarnings(createDataFrame(sqlContext, iris))
model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
summary(model)
Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
Sepal_Length        0.67468   0.0093013   72.536   0
Species_versicolor  -1.2349   0.07269     -16.989  0
Species_virginica   -1.4708   0.077397    -19.003  0
{quote}
stats::glm
{quote}
summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
{quote}
The encoder for string type feature is different. R did not drop any category
but SparkR drop the last one.
I searched online and test some other cases, found when we fit R glm model(or
other models powered by R formula) w/o intercept on a dataset including
string/category features, one of the levels in the first category feature is
being used as reference level, we will not drop any category for that feature.
I think we should keep consistent semantics for Spark RFormula.

> RFormula output wrong features when formula w/o intercept
> ---------------------------------------------------------
>
> Key: SPARK-14657
> URL: https://issues.apache.org/jira/browse/SPARK-14657
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
>
> SparkR::glm output different features compared with R glm.
> SparkR::glm
> {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
>                     Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length        0.67468   0.0093013   72.536   0
> Species_versicolor  -1.2349   0.07269     -16.989  0
> Species_virginica   -1.4708   0.077397    -19.003  0
> {quote}
> stats::glm
> {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>                    Estimate  Std. Error  t value  Pr(>|t|)
> Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
> Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
> Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
> Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
> {quote}
> The encoder for string type feature is different. R did not drop any category
> but SparkR drop the last one.
> I searched online and test some other cases, found when we fit R glm model(or
> other models powered by R formula) w/o intercept on a dataset including
> string/category features, one of the levels in the first category feature is
> being used as reference level, we will not drop any category for that feature.
> I think we should keep consistent semantics between Spark RFormula and R
> formula.
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
[ https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanbo Liang updated SPARK-14657:

Description:
SparkR::glm output different features compared with R glm.
SparkR::glm
{quote}
training <- suppressWarnings(createDataFrame(sqlContext, iris))
model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
summary(model)
Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
Sepal_Length        0.67468   0.0093013   72.536   0
Species_versicolor  -1.2349   0.07269     -16.989  0
Species_virginica   -1.4708   0.077397    -19.003  0
{quote}
stats::glm
{quote}
summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
{quote}
The encoder for string type feature is different. R did not drop any category
but SparkR drop the last one.
I searched online and test some other cases, found when we fit R glm model(or
other models powered by R formula) w/o intercept on a dataset including
string/category features, one of the levels in the first category feature is
being used as reference level, we will not drop any category for that feature.
I think we should keep consistent semantics for Spark RFormula.

was:
SparkR::glm output different features compared with R glm.
SparkR::glm
{quote}
training <- suppressWarnings(createDataFrame(sqlContext, iris))
model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
summary(model)
Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
Sepal_Length        0.67468   0.0093013   72.536   0
Species_versicolor  -1.2349   0.07269     -16.989  0
Species_virginica   -1.4708   0.077397    -19.003  0
{quote}
stats::glm
{quote}
summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
{quote}
The encoder for feature of string type is difference. R did not drop any
category but SparkR drop the last one.
I refer R documents and search online, found when we fit a R glm model(or
other models powered by R formula) w/o intercept on a dataset which including
string/category features, one of the levels in the first category feature is
being used as reference level, we will not drop any category for that feature.
I think we should keep consistent sementics for Spark RFormula.

> RFormula output wrong features when formula w/o intercept
> ---------------------------------------------------------
>
> Key: SPARK-14657
> URL: https://issues.apache.org/jira/browse/SPARK-14657
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
>
> SparkR::glm output different features compared with R glm.
> SparkR::glm
> {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
>                     Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length        0.67468   0.0093013   72.536   0
> Species_versicolor  -1.2349   0.07269     -16.989  0
> Species_virginica   -1.4708   0.077397    -19.003  0
> {quote}
> stats::glm
> {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>                    Estimate  Std. Error  t value  Pr(>|t|)
> Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
> Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
> Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
> Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
> {quote}
> The encoder for string type feature is different. R did not drop any category
> but SparkR drop the last one.
> I searched online and test some other cases, found when we fit R glm model(or
> other models powered by R formula) w/o intercept on a dataset including
> string/category features, one of the levels in the first category feature is
> being used as reference level, we will not drop any category for that feature.
> I think we should keep consistent semantics for Spark RFormula.
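For readers following the thread, the R-side rule described in the issue (without an intercept, the first categorical feature keeps all of its levels, while later categorical features still drop a reference level) can be sketched in plain Python. This is only an illustration under stated assumptions: `r_style_columns` is a hypothetical helper written for this note, not part of Spark or R; it handles main-effect terms only (no interactions, no custom contrasts) and treats the first level of each factor as the reference level.

```python
from typing import Dict, List, Sequence


def r_style_columns(
    terms: Sequence[str],
    levels: Dict[str, List[str]],
    intercept: bool,
) -> List[str]:
    """Name design-matrix columns following the rule described in the issue.

    `terms` lists the right-hand-side variables in formula order. `levels`
    maps each categorical variable to its ordered levels; variables missing
    from the mapping are treated as numeric. (Hypothetical helper.)
    """
    columns: List[str] = ["(Intercept)"] if intercept else []
    # Without an intercept, the first categorical term keeps every level.
    keep_all_next = not intercept
    for term in terms:
        if term not in levels:
            columns.append(term)  # numeric term contributes one column
            continue
        term_levels = levels[term]
        if keep_all_next:
            kept = term_levels
            keep_all_next = False
        else:
            kept = term_levels[1:]  # drop the first (reference) level
        columns.extend(f"{term}{level}" for level in kept)
    return columns


iris_levels = {"Species": ["setosa", "versicolor", "virginica"]}
no_intercept = r_style_columns(
    ["Sepal.Length", "Species"], iris_levels, intercept=False
)
with_intercept = r_style_columns(
    ["Sepal.Length", "Species"], iris_levels, intercept=True
)
print(no_intercept)
print(with_intercept)
```

With `intercept=False` the sketch yields the same four feature names as the stats::glm output quoted in the issue (Sepal.Length plus all three Species levels); with `intercept=True` one Species level is absorbed as the reference.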