[jira] [Created] (SPARK-19452) Fix bug in the name assignment method in SparkR
Wayne Zhang created SPARK-19452: --- Summary: Fix bug in the name assignment method in SparkR Key: SPARK-19452 URL: https://issues.apache.org/jira/browse/SPARK-19452 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0, 2.2.0 Reporter: Wayne Zhang The names assignment method fails to check the validity of the assignment values. This can be fixed by calling colnames within names. See the example below.
{code}
df <- suppressWarnings(createDataFrame(iris))
# this reports an error
colnames(df) <- NULL
# this should report an error, but does not
names(df) <- NULL
{code}
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19473) Several DataFrame Method still fail with dot in column names
Wayne Zhang created SPARK-19473: --- Summary: Several DataFrame Method still fail with dot in column names Key: SPARK-19473 URL: https://issues.apache.org/jira/browse/SPARK-19473 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Wayne Zhang Here is an example: {code} val df = Seq((1.0, 2.0), (2.0, 3.0)).toDF("y.a", "x.b") df.select("y.a") org.apache.spark.sql.AnalysisException: cannot resolve '`y.a`' given input columns: [y.a, x.b];; df.withColumn("d", col("y.a") + col("x.b")) org.apache.spark.sql.AnalysisException: cannot resolve '`y.a`' given input columns: [y.a, x.b];; {code} We can use backquote to avoid the errors, but this behavior is affecting some downstream work such as RFormula and SparkR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
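For reference, a minimal sketch of the backtick workaround mentioned above (assuming a spark-shell session where the toDF implicits are in scope): quoting the full column name keeps the dot from being parsed as a struct-field accessor.
{code}
import org.apache.spark.sql.functions.col

val df = Seq((1.0, 2.0), (2.0, 3.0)).toDF("y.a", "x.b")

// Backticks make "y.a" resolve as a single column name rather than field "a" of struct "y".
df.select("`y.a`").show()
df.withColumn("d", col("`y.a`") + col("`x.b`")).show()
{code}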
[jira] [Updated] (SPARK-19473) Several DataFrame Methods still fail with dot in column names
[ https://issues.apache.org/jira/browse/SPARK-19473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang updated SPARK-19473: Summary: Several DataFrame Methods still fail with dot in column names (was: Several DataFrame Method still fail with dot in column names ) > Several DataFrame Methods still fail with dot in column names > -- > > Key: SPARK-19473 > URL: https://issues.apache.org/jira/browse/SPARK-19473 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Wayne Zhang > > Here is an example: > {code} > val df = Seq((1.0, 2.0), (2.0, 3.0)).toDF("y.a", "x.b") > df.select("y.a") > org.apache.spark.sql.AnalysisException: cannot resolve '`y.a`' given input > columns: [y.a, x.b];; > df.withColumn("d", col("y.a") + col("x.b")) > org.apache.spark.sql.AnalysisException: cannot resolve '`y.a`' given input > columns: [y.a, x.b];; > {code} > We can use backquote to avoid the errors, but this behavior is affecting some > downstream work such as RFormula and SparkR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-18710) Add offset to GeneralizedLinearRegression models
[ https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang reopened SPARK-18710: - > Add offset to GeneralizedLinearRegression models > > > Key: SPARK-18710 > URL: https://issues.apache.org/jira/browse/SPARK-18710 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Labels: features > Original Estimate: 10h > Remaining Estimate: 10h > > The current GeneralizedLinearRegression model does not support offset. The > offset can be useful to take into account exposure, or for testing > incremental effect of new variables. It is possible to use weights in current > environment to achieve the same effect of specifying offset for certain > models, e.g., Poisson & Binomial with log offset, it is desirable to have the > offset option to work with more general cases, e.g., negative offset or > offset that is hard to specify using weights (e.g., offset to the probability > rather than odds in logistic regression). > Effort would involve: > * update regression class to support offsetCol > * update IWLS to take into account of offset > * add test case for offset > I can start working on this if the community approves this feature. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19391) Tweedie GLM API in SparkR
Wayne Zhang created SPARK-19391: --- Summary: Tweedie GLM API in SparkR Key: SPARK-19391 URL: https://issues.apache.org/jira/browse/SPARK-19391 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Wayne Zhang Port Tweedie GLM to SparkR https://github.com/apache/spark/pull/16344 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-19395) Convert coefficients in summary to matrix
Wayne Zhang created an issue Spark / SPARK-19395 Convert coefficients in summary to matrix Issue Type: Bug Assignee: Unassigned Components: SparkR Created: 29/Jan/17 18:28 Priority: Major Reporter: Wayne Zhang The coefficients component in model summary should be a 'matrix', but the underlying structure is actually a list. This affects several models, except for 'AFTSurvivalRegressionModel', which has the correct implementation. The fix is to first unlist the coefficients returned from callJMethod before converting to a matrix.
[jira] (SPARK-19400) GLM fails for intercept only model
Wayne Zhang updated an issue Spark / SPARK-19400 GLM fails for intercept only model Change By: Wayne Zhang Component/s: ML This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
[jira] (SPARK-19400) GLM fails for intercept only model
Wayne Zhang created an issue Spark / SPARK-19400 GLM fails for intercept only model Issue Type: Bug Assignee: Unassigned Created: 30/Jan/17 08:07 Priority: Major Reporter: Wayne Zhang Intercept-only GLM fails for non-Gaussian family because of reducing an empty array in IWLS.
{code}
val dataset = Seq(
  (1.0, 1.0, 2.0, 0.0, 5.0),
  (0.5, 2.0, 1.0, 1.0, 2.0),
  (1.0, 3.0, 0.5, 2.0, 1.0),
  (2.0, 4.0, 1.5, 3.0, 3.0)
).toDF("y", "w", "off", "x1", "x2")

val formula = new RFormula().setFormula("y ~ 1")
val output = formula.fit(dataset).transform(dataset)
val glr = new GeneralizedLinearRegression().setFamily("poisson")
val model = glr.fit(output)

java.lang.UnsupportedOperationException: empty.reduceLeft
{code}
[jira] [Created] (SPARK-19682) Issue warning (or error) when subset method "[[" takes vector index
Wayne Zhang created SPARK-19682: --- Summary: Issue warning (or error) when subset method "[[" takes vector index Key: SPARK-19682 URL: https://issues.apache.org/jira/browse/SPARK-19682 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.1.0 Reporter: Wayne Zhang Priority: Minor The `[[` method is supposed to take a single index and return a column. This is different from base R, which takes a vector index. We should check for this and issue a warning or error when a vector index is supplied (which is likely given the behavior in base R). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-19473) Several DataFrame Methods still fail with dot in column names
[ https://issues.apache.org/jira/browse/SPARK-19473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang closed SPARK-19473. --- Resolution: Not A Problem > Several DataFrame Methods still fail with dot in column names > -- > > Key: SPARK-19473 > URL: https://issues.apache.org/jira/browse/SPARK-19473 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Wayne Zhang > > Here is an example: > {code} > val df = Seq((1.0, 2.0), (2.0, 3.0)).toDF("y.a", "x.b") > df.select("y.a") > org.apache.spark.sql.AnalysisException: cannot resolve '`y.a`' given input > columns: [y.a, x.b];; > df.withColumn("d", col("y.a") + col("x.b")) > org.apache.spark.sql.AnalysisException: cannot resolve '`y.a`' given input > columns: [y.a, x.b];; > {code} > We can use backquote to avoid the errors, but this behavior is affecting some > downstream work such as RFormula and SparkR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19270) Add summary table to GLM summary
Wayne Zhang created SPARK-19270: --- Summary: Add summary table to GLM summary Key: SPARK-19270 URL: https://issues.apache.org/jira/browse/SPARK-19270 Project: Spark Issue Type: Improvement Components: ML Reporter: Wayne Zhang Priority: Minor Add an R-like summary table to the GLM summary, which includes feature names (if they exist), parameter estimates, standard errors, t-statistics and p-values. This allows Scala users to easily gather these commonly used inference results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
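For context, a rough sketch of the pieces such a table would assemble, using the statistics the training summary already exposes individually (assuming a fitted GeneralizedLinearRegressionModel named model trained with the default IRLS solver):
{code}
// Estimates, standard errors, t-statistics and p-values per term; with
// fitIntercept = true, the intercept is the last entry of each array.
val summary   = model.summary
val estimates = model.coefficients.toArray :+ model.intercept
val stdErrors = summary.coefficientStandardErrors
val tStats    = summary.tValues
val pVals     = summary.pValues

estimates.zip(stdErrors).zip(tStats).zip(pVals).foreach {
  case (((est, se), t), p) => println(f"$est%12.4f $se%12.4f $t%10.3f $p%10.4f")
}
{code}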
[jira] [Updated] (SPARK-19270) Add summary table to GLM summary
[ https://issues.apache.org/jira/browse/SPARK-19270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang updated SPARK-19270: Shepherd: Yanbo Liang > Add summary table to GLM summary > > > Key: SPARK-19270 > URL: https://issues.apache.org/jira/browse/SPARK-19270 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Wayne Zhang >Priority: Minor > > Add R-like summary table to GLM summary, which includes feature name (if > exist), parameter estimate, standard error, t-stat and p-value. This allows > scala users to easily gather these commonly used inference results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14659) OneHotEncoder support drop first category alphabetically in the encoded vector
[ https://issues.apache.org/jira/browse/SPARK-14659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828618#comment-15828618 ] Wayne Zhang commented on SPARK-14659: - [~yanboliang] [~josephkb] Has anyone been working on this ticket? It will also be helpful to support 'dropFirst', since in practice there is often need to set the most frequent as base for interpretability. I'll be happy to work on this (and already have some fix). > OneHotEncoder support drop first category alphabetically in the encoded > vector > --- > > Key: SPARK-14659 > URL: https://issues.apache.org/jira/browse/SPARK-14659 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang > > R formula drop the first category alphabetically when encode string/category > feature. Spark RFormula use OneHotEncoder to encode string/category feature > into vector, but only supporting "dropLast" by string/category frequencies. > This will cause SparkR produce different models compared with native R. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19773) SparkDataFrame should not allow duplicate names
Wayne Zhang created SPARK-19773: --- Summary: SparkDataFrame should not allow duplicate names Key: SPARK-19773 URL: https://issues.apache.org/jira/browse/SPARK-19773 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0 Reporter: Wayne Zhang Priority: Minor SparkDataFrame in SparkR seems to accept duplicate names at creation, but incurs errors when calling methods downstream. For example, we can do:
{code}
l <- list(list(1, 2), list(3, 4))
df <- createDataFrame(l, c("a", "a"))
head(df)
{code}
But an error occurs when we do df$a = df$a * 2.0. I suggest we add a validity check for duplicate names at initialization. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18166) GeneralizedLinearRegression Wrong Value Range for Poisson Distribution
Wayne Zhang created SPARK-18166: --- Summary: GeneralizedLinearRegression Wrong Value Range for Poisson Distribution Key: SPARK-18166 URL: https://issues.apache.org/jira/browse/SPARK-18166 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.0.0 Reporter: Wayne Zhang The current implementation of Poisson GLM seems to allow only positive values (see below). This is not correct since the support of the Poisson distribution includes the origin.
{code}
override def initialize(y: Double, weight: Double): Double = {
  require(y > 0.0, "The response variable of Poisson family " +
    s"should be positive, but got $y")
  y
}
{code}
The fix is easy: just relax the strict inequality to
{code}
require(y >= 0.0, "The response variable of Poisson family " +
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18710) Add offset to GeneralizedLinearRegression models
[ https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang updated SPARK-18710: Shepherd: Yanbo Liang (was: Sean Owen) Remaining Estimate: 10h (was: 336h) Original Estimate: 10h (was: 336h) > Add offset to GeneralizedLinearRegression models > > > Key: SPARK-18710 > URL: https://issues.apache.org/jira/browse/SPARK-18710 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang > Labels: features > Fix For: 2.2.0 > > Original Estimate: 10h > Remaining Estimate: 10h > > The current GeneralizedLinearRegression model does not support offset. The > offset can be useful to take into account exposure, or for testing > incremental effect of new variables. It is possible to use weights in current > environment to achieve the same effect of specifying offset for certain > models, e.g., Poisson & Binomial with log offset, it is desirable to have the > offset option to work with more general cases, e.g., negative offset or > offset that is hard to specify using weights (e.g., offset to the probability > rather than odds in logistic regression). > Effort would involve: > * update regression class to support offsetCol > * update IWLS to take into account of offset > * add test case for offset > I can start working on this if the community approves this feature. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-18929) Add Tweedie distribution in GLM
[ https://issues.apache.org/jira/browse/SPARK-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang reopened SPARK-18929: - > Add Tweedie distribution in GLM > --- > > Key: SPARK-18929 > URL: https://issues.apache.org/jira/browse/SPARK-18929 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Labels: features > Original Estimate: 72h > Remaining Estimate: 72h > > I propose to add the full Tweedie family into the GeneralizedLinearRegression > model. The Tweedie family is characterized by a power variance function. > Currently supported distributions such as Gaussian, Poisson and Gamma > families are a special case of the > [Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution]. > I propose to add support for the other distributions: > * compound Poisson: 1 < variancePower < 2. This one is widely used to model > zero-inflated continuous distributions. > * positive stable: variancePower > 2 and variancePower != 3. Used to model > extreme values. > * inverse Gaussian: variancePower = 3. > The Tweedie family is supported in most statistical packages such as R > (statmod), SAS, h2o etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18701) Poisson GLM fails due to wrong initialization
Wayne Zhang created SPARK-18701: --- Summary: Poisson GLM fails due to wrong initialization Key: SPARK-18701 URL: https://issues.apache.org/jira/browse/SPARK-18701 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.0.2 Reporter: Wayne Zhang Priority: Critical Fix For: 2.2.0 Poisson GLM fails for many standard data sets. The issue is incorrect initialization leading to almost zero probability and weights. The following simple example reproduces the error. {code:borderStyle=solid} val datasetPoissonLogWithZero = Seq( LabeledPoint(0.0, Vectors.dense(18, 1.0)), LabeledPoint(1.0, Vectors.dense(12, 0.0)), LabeledPoint(0.0, Vectors.dense(15, 0.0)), LabeledPoint(0.0, Vectors.dense(13, 2.0)), LabeledPoint(0.0, Vectors.dense(15, 1.0)), LabeledPoint(1.0, Vectors.dense(16, 1.0)), LabeledPoint(0.0, Vectors.dense(10, 0.0)), LabeledPoint(0.0, Vectors.dense(15, 0.0)), LabeledPoint(0.0, Vectors.dense(12, 2.0)), LabeledPoint(0.0, Vectors.dense(13, 0.0)), LabeledPoint(1.0, Vectors.dense(15, 0.0)), LabeledPoint(1.0, Vectors.dense(15, 0.0)), LabeledPoint(0.0, Vectors.dense(15, 0.0)), LabeledPoint(0.0, Vectors.dense(12, 2.0)), LabeledPoint(1.0, Vectors.dense(12, 2.0)) ).toDF() val glr = new GeneralizedLinearRegression() .setFamily("poisson") .setLink("log") .setMaxIter(20) .setRegParam(0) val model = glr.fit(datasetPoissonLogWithZero) {code} The issue is in the initialization: the mean is initialized as the response, which could be zero. Applying the log link results in very negative numbers (protected against -Inf), which again leads to close to zero probability and weights in the weighted least squares. The fix is easy: just add a small constant, highlighted in red below. override def initialize(y: Double, weight: Double): Double = { require(y >= 0.0, "The response variable of Poisson family " + s"should be non-negative, but got $y") y {color:red}+ 0.1 {color} } I already have a fix and test code. Will create a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18710) Add offset to GeneralizedLinearRegression models
Wayne Zhang created SPARK-18710: --- Summary: Add offset to GeneralizedLinearRegression models Key: SPARK-18710 URL: https://issues.apache.org/jira/browse/SPARK-18710 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.0.2 Reporter: Wayne Zhang Fix For: 2.2.0 The current GeneralizedLinearRegression model does not support an offset. An offset can be useful for taking exposure into account, or for testing the incremental effect of new variables. It is possible to use weights in the current implementation to achieve the same effect as specifying an offset for certain models, e.g., Poisson & Binomial with a log offset, but it is desirable to have the offset option work with more general cases, e.g., a negative offset or an offset that is hard to specify using weights (e.g., an offset to the probability rather than the odds in logistic regression). Effort would involve:
* update the regression class to support offsetCol
* update IWLS to take the offset into account
* add test cases for offset
I can start working on this if the community approves this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
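A minimal sketch of the weight-based workaround mentioned above, for a Poisson model with log link and an exposure column (the DataFrame dataset and the count/exposure column names are hypothetical): fitting the rate count/exposure with the exposure as the weight yields the same estimating equations as an offset of log(exposure).
{code}
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.sql.functions.col

// Model the rate instead of the raw count, weighting each row by its exposure.
val withRate = dataset.withColumn("rate", col("count") / col("exposure"))

val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setLabelCol("rate")
  .setWeightCol("exposure")

val model = glr.fit(withRate)
{code}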
[jira] [Created] (SPARK-18715) Correct AIC calculation in Binomial GLM
Wayne Zhang created SPARK-18715: --- Summary: Correct AIC calculation in Binomial GLM Key: SPARK-18715 URL: https://issues.apache.org/jira/browse/SPARK-18715 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.0.2 Reporter: Wayne Zhang Priority: Critical Fix For: 2.2.0 The AIC calculation in Binomial GLM seems to be wrong when there are weights. The weight adjustment should be applied to only the part of the Binomial density involving the parameters, not the normalizing constant. The current implementation is: {code} -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) => weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt) }.sum() {code} Suggest changing this to {code} -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) => val wt = math.round(weight).toInt if (wt == 0){ 0.0 } else { dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt) } }.sum() {code} The following is an example to illustrate the problem. {code} val dataset = Seq( LabeledPoint(0.0, Vectors.dense(18, 1.0)), LabeledPoint(0.5, Vectors.dense(12, 0.0)), LabeledPoint(1.0, Vectors.dense(15, 0.0)), LabeledPoint(0.0, Vectors.dense(13, 2.0)), LabeledPoint(0.0, Vectors.dense(15, 1.0)), LabeledPoint(0.5, Vectors.dense(16, 1.0)) ).toDF().withColumn("weight", col("label") + 1.0) val glr = new GeneralizedLinearRegression() .setFamily("binomial") .setWeightCol("weight") .setRegParam(0) val model = glr.fit(dataset) model.summary.aic {code} This calculation shows the AIC is 14.189026847171382. To verify whether this is correct, I run the same analysis in R but got AIC = 11.66092, -2 * LogLik = 5.660918. {code} da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",") 0,18,1,1 0.5,12,0,1.5 1,15,0,2 0,13,2,1 0,15,1,1 0.5,16,1,1.5 da <- as.data.frame(da) f <- glm(y ~ x1 + x2 , data = da, family = binomial(), weight = w) AIC(f) -2 * logLik(f) {code} Now, I check whether the proposed change is correct. The following calculates -2 * LogLik manually and get 5.6609177228379055, the same as that in R. {code} val predictions = model.transform(dataset) -2.0 * predictions.select("label", "prediction", "weight").rdd.map {case Row(y: Double, mu: Double, weight: Double) => val wt = math.round(weight).toInt if (wt == 0){ 0.0 } else { dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt) } }.sum() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18715) Fix wrong AIC calculation in Binomial GLM
[ https://issues.apache.org/jira/browse/SPARK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang updated SPARK-18715: Summary: Fix wrong AIC calculation in Binomial GLM (was: Correct AIC calculation in Binomial GLM) > Fix wrong AIC calculation in Binomial GLM > - > > Key: SPARK-18715 > URL: https://issues.apache.org/jira/browse/SPARK-18715 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang >Priority: Critical > Labels: patch > Fix For: 2.2.0 > > Original Estimate: 120h > Remaining Estimate: 120h > > The AIC calculation in Binomial GLM seems to be wrong when there are weights. > The weight adjustment should be applied to only the part of the Binomial > density involving the parameters, not the normalizing constant. > The current implementation is: > {code} > -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) => > weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt) > }.sum() > {code} > Suggest changing this to > {code} > -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) => > val wt = math.round(weight).toInt > if (wt == 0){ > 0.0 > } else { > dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt) > } > }.sum() > {code} > > > The following is an example to illustrate the problem. > {code} > val dataset = Seq( > LabeledPoint(0.0, Vectors.dense(18, 1.0)), > LabeledPoint(0.5, Vectors.dense(12, 0.0)), > LabeledPoint(1.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(13, 2.0)), > LabeledPoint(0.0, Vectors.dense(15, 1.0)), > LabeledPoint(0.5, Vectors.dense(16, 1.0)) > ).toDF().withColumn("weight", col("label") + 1.0) > val glr = new GeneralizedLinearRegression() > .setFamily("binomial") > .setWeightCol("weight") > .setRegParam(0) > val model = glr.fit(dataset) > model.summary.aic > {code} > This calculation shows the AIC is 14.189026847171382. To verify whether this > is correct, I run the same analysis in R but got AIC = 11.66092, -2 * LogLik > = 5.660918. > {code} > da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",") > 0,18,1,1 > 0.5,12,0,1.5 > 1,15,0,2 > 0,13,2,1 > 0,15,1,1 > 0.5,16,1,1.5 > da <- as.data.frame(da) > f <- glm(y ~ x1 + x2 , data = da, family = binomial(), weight = w) > AIC(f) > -2 * logLik(f) > {code} > Now, I check whether the proposed change is correct. The following calculates > -2 * LogLik manually and get 5.6609177228379055, the same as that in R. > {code} > val predictions = model.transform(dataset) > -2.0 * predictions.select("label", "prediction", "weight").rdd.map {case > Row(y: Double, mu: Double, weight: Double) => > val wt = math.round(weight).toInt > if (wt == 0){ > 0.0 > } else { > dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt) > } > }.sum() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18715) Fix wrong AIC calculation in Binomial GLM
[ https://issues.apache.org/jira/browse/SPARK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang updated SPARK-18715: Description: The AIC calculation in Binomial GLM seems to be wrong when there are weights. The result is different from that in R. The current implementation is: {code} -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) => weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt) }.sum() {code} Suggest changing this to {code} -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) => val wt = math.round(weight).toInt if (wt == 0){ 0.0 } else { dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt) } }.sum() {code} The following is an example to illustrate the problem. {code} val dataset = Seq( LabeledPoint(0.0, Vectors.dense(18, 1.0)), LabeledPoint(0.5, Vectors.dense(12, 0.0)), LabeledPoint(1.0, Vectors.dense(15, 0.0)), LabeledPoint(0.0, Vectors.dense(13, 2.0)), LabeledPoint(0.0, Vectors.dense(15, 1.0)), LabeledPoint(0.5, Vectors.dense(16, 1.0)) ).toDF().withColumn("weight", col("label") + 1.0) val glr = new GeneralizedLinearRegression() .setFamily("binomial") .setWeightCol("weight") .setRegParam(0) val model = glr.fit(dataset) model.summary.aic {code} This calculation shows the AIC is 14.189026847171382. To verify whether this is correct, I run the same analysis in R but got AIC = 11.66092, -2 * LogLik = 5.660918. {code} da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",") 0,18,1,1 0.5,12,0,1.5 1,15,0,2 0,13,2,1 0,15,1,1 0.5,16,1,1.5 da <- as.data.frame(da) f <- glm(y ~ x1 + x2 , data = da, family = binomial(), weight = w) AIC(f) -2 * logLik(f) {code} Now, I check whether the proposed change is correct. The following calculates -2 * LogLik manually and get 5.6609177228379055, the same as that in R. {code} val predictions = model.transform(dataset) -2.0 * predictions.select("label", "prediction", "weight").rdd.map {case Row(y: Double, mu: Double, weight: Double) => val wt = math.round(weight).toInt if (wt == 0){ 0.0 } else { dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt) } }.sum() {code} was: The AIC calculation in Binomial GLM seems to be wrong when there are weights. The weight adjustment should be applied to only the part of the Binomial density involving the parameters, not the normalizing constant. The current implementation is: {code} -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) => weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt) }.sum() {code} Suggest changing this to {code} -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) => val wt = math.round(weight).toInt if (wt == 0){ 0.0 } else { dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt) } }.sum() {code} The following is an example to illustrate the problem. {code} val dataset = Seq( LabeledPoint(0.0, Vectors.dense(18, 1.0)), LabeledPoint(0.5, Vectors.dense(12, 0.0)), LabeledPoint(1.0, Vectors.dense(15, 0.0)), LabeledPoint(0.0, Vectors.dense(13, 2.0)), LabeledPoint(0.0, Vectors.dense(15, 1.0)), LabeledPoint(0.5, Vectors.dense(16, 1.0)) ).toDF().withColumn("weight", col("label") + 1.0) val glr = new GeneralizedLinearRegression() .setFamily("binomial") .setWeightCol("weight") .setRegParam(0) val model = glr.fit(dataset) model.summary.aic {code} This calculation shows the AIC is 14.189026847171382. To verify whether this is correct, I run the same analysis in R but got AIC = 11.66092, -2 * LogLik = 5.660918. 
{code} da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",") 0,18,1,1 0.5,12,0,1.5 1,15,0,2 0,13,2,1 0,15,1,1 0.5,16,1,1.5 da <- as.data.frame(da) f <- glm(y ~ x1 + x2 , data = da, family = binomial(), weight = w) AIC(f) -2 * logLik(f) {code} Now, I check whether the proposed change is correct. The following calculates -2 * LogLik manually and get 5.6609177228379055, the same as that in R. {code} val predictions = model.transform(dataset) -2.0 * predictions.select("label", "prediction", "weight").rdd.map {case Row(y: Double, mu: Double, weight: Double) => val wt = math.round(weight).toInt if (wt == 0){ 0.0 } else { dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt) } }.sum() {code} > Fix wrong AIC calculation in Binomial GLM > - > > Key: SPARK-18715 > URL: https://issues.apache.org/jira/browse/SPARK-18715 > Project: Spark > Issue Type: Bug >
[jira] [Updated] (SPARK-18701) Poisson GLM fails due to wrong initialization
[ https://issues.apache.org/jira/browse/SPARK-18701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang updated SPARK-18701: Shepherd: Sean Owen (was: sean corkum) Issue Type: Bug (was: New Feature) > Poisson GLM fails due to wrong initialization > - > > Key: SPARK-18701 > URL: https://issues.apache.org/jira/browse/SPARK-18701 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang >Priority: Critical > Fix For: 2.2.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Poisson GLM fails for many standard data sets. The issue is incorrect > initialization leading to almost zero probability and weights. The following > simple example reproduces the error. > {code:borderStyle=solid} > val datasetPoissonLogWithZero = Seq( > LabeledPoint(0.0, Vectors.dense(18, 1.0)), > LabeledPoint(1.0, Vectors.dense(12, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(13, 2.0)), > LabeledPoint(0.0, Vectors.dense(15, 1.0)), > LabeledPoint(1.0, Vectors.dense(16, 1.0)), > LabeledPoint(0.0, Vectors.dense(10, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(12, 2.0)), > LabeledPoint(0.0, Vectors.dense(13, 0.0)), > LabeledPoint(1.0, Vectors.dense(15, 0.0)), > LabeledPoint(1.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(12, 2.0)), > LabeledPoint(1.0, Vectors.dense(12, 2.0)) > ).toDF() > > val glr = new GeneralizedLinearRegression() > .setFamily("poisson") > .setLink("log") > .setMaxIter(20) > .setRegParam(0) > val model = glr.fit(datasetPoissonLogWithZero) > {code} > The issue is in the initialization: the mean is initialized as the response, > which could be zero. Applying the log link results in very negative numbers > (protected against -Inf), which again leads to close to zero probability and > weights in the weighted least squares. The fix is easy: just add a small > constant, highlighted in red below. > > override def initialize(y: Double, weight: Double): Double = { > require(y >= 0.0, "The response variable of Poisson family " + > s"should be non-negative, but got $y") > y {color:red}+ 0.1 {color} > } > I already have a fix and test code. Will create a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18710) Add offset to GeneralizedLinearRegression models
[ https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang closed SPARK-18710. --- Resolution: Unresolved > Add offset to GeneralizedLinearRegression models > > > Key: SPARK-18710 > URL: https://issues.apache.org/jira/browse/SPARK-18710 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Labels: features > Original Estimate: 10h > Remaining Estimate: 10h > > The current GeneralizedLinearRegression model does not support offset. The > offset can be useful to take into account exposure, or for testing > incremental effect of new variables. It is possible to use weights in current > environment to achieve the same effect of specifying offset for certain > models, e.g., Poisson & Binomial with log offset, it is desirable to have the > offset option to work with more general cases, e.g., negative offset or > offset that is hard to specify using weights (e.g., offset to the probability > rather than odds in logistic regression). > Effort would involve: > * update regression class to support offsetCol > * update IWLS to take into account of offset > * add test case for offset > I can start working on this if the community approves this feature. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18929) Add Tweedie distribution in GLM
[ https://issues.apache.org/jira/browse/SPARK-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang closed SPARK-18929. --- Resolution: Unresolved > Add Tweedie distribution in GLM > --- > > Key: SPARK-18929 > URL: https://issues.apache.org/jira/browse/SPARK-18929 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Labels: features > Original Estimate: 72h > Remaining Estimate: 72h > > I propose to add the full Tweedie family into the GeneralizedLinearRegression > model. The Tweedie family is characterized by a power variance function. > Currently supported distributions such as Gaussian, Poisson and Gamma > families are a special case of the > [Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution]. > I propose to add support for the other distributions: > * compound Poisson: 1 < variancePower < 2. This one is widely used to model > zero-inflated continuous distributions. > * positive stable: variancePower > 2 and variancePower != 3. Used to model > extreme values. > * inverse Gaussian: variancePower = 3. > The Tweedie family is supported in most statistical packages such as R > (statmod), SAS, h2o etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18710) Add offset to GeneralizedLinearRegression models
[ https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765733#comment-15765733 ] Wayne Zhang commented on SPARK-18710: - [~yanboliang] Thanks for the suggestion. I think the issue is a bit different in this case. The IRWLS relies on the _reweightFunc_, which is hard-coded to take an _Instance_ class: {code} val reweightFunc: (Instance, WeightedLeastSquaresModel) => (Double, Double) {code} I need to pass the offset column to this reweight function. Creating another GLRInstance won't solve the problem, will it? > Add offset to GeneralizedLinearRegression models > > > Key: SPARK-18710 > URL: https://issues.apache.org/jira/browse/SPARK-18710 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Labels: features > Original Estimate: 10h > Remaining Estimate: 10h > > The current GeneralizedLinearRegression model does not support offset. The > offset can be useful to take into account exposure, or for testing > incremental effect of new variables. It is possible to use weights in current > environment to achieve the same effect of specifying offset for certain > models, e.g., Poisson & Binomial with log offset, it is desirable to have the > offset option to work with more general cases, e.g., negative offset or > offset that is hard to specify using weights (e.g., offset to the probability > rather than odds in logistic regression). > Effort would involve: > * update regression class to support offsetCol > * update IWLS to take into account of offset > * add test case for offset > I can start working on this if the community approves this feature. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18710) Add offset to GeneralizedLinearRegression models
[ https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779760#comment-15779760 ] Wayne Zhang commented on SPARK-18710: - Thanks for the comment, Yanbo. In IRLS, the fit method expects RDD[Instance]. Does it still work if one feeds a RDD[GLRInstance] object to it? {code} def fit(instances: RDD[Instance]): IterativelyReweightedLeastSquaresModel = { {code} > Add offset to GeneralizedLinearRegression models > > > Key: SPARK-18710 > URL: https://issues.apache.org/jira/browse/SPARK-18710 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Labels: features > Original Estimate: 10h > Remaining Estimate: 10h > > The current GeneralizedLinearRegression model does not support offset. The > offset can be useful to take into account exposure, or for testing > incremental effect of new variables. It is possible to use weights in current > environment to achieve the same effect of specifying offset for certain > models, e.g., Poisson & Binomial with log offset, it is desirable to have the > offset option to work with more general cases, e.g., negative offset or > offset that is hard to specify using weights (e.g., offset to the probability > rather than odds in logistic regression). > Effort would involve: > * update regression class to support offsetCol > * update IWLS to take into account of offset > * add test case for offset > I can start working on this if the community approves this feature. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18929) Add Tweedie distribution in GLM
Wayne Zhang created SPARK-18929: --- Summary: Add Tweedie distribution in GLM Key: SPARK-18929 URL: https://issues.apache.org/jira/browse/SPARK-18929 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.0.2 Reporter: Wayne Zhang I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as the Gaussian, Poisson and Gamma families are special cases of the [Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution]. I propose to add support for the other distributions:
* compound Poisson: 1 < variancePower < 2. This one is widely used to model zero-inflated continuous distributions.
* positive stable: variancePower > 2 and variancePower != 3. Used to model extreme values.
* inverse Gaussian: variancePower = 3.
The Tweedie family is supported in most statistical packages such as R (statmod), SAS, h2o etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
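A hypothetical sketch of how the proposed family might look on GeneralizedLinearRegression; the family and variancePower names follow the proposal above and are not an existing API at the time of this ticket.
{code}
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// Compound Poisson case (1 < variancePower < 2), e.g., zero-inflated continuous
// responses such as insurance claim amounts.
val glr = new GeneralizedLinearRegression()
  .setFamily("tweedie")      // proposed new family name
  .setVariancePower(1.5)     // proposed new parameter
// val model = glr.fit(trainingData)  // trainingData is a hypothetical DataFrame
{code}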
[jira] [Commented] (SPARK-18710) Add offset to GeneralizedLinearRegression models
[ https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762752#comment-15762752 ] Wayne Zhang commented on SPARK-18710: - [~yanboliang] It seems that I would need to change the case class 'Instance' to include offset... That could be potentially disruptive if many other models also depend on this case class. Any suggestions regarding this? > Add offset to GeneralizedLinearRegression models > > > Key: SPARK-18710 > URL: https://issues.apache.org/jira/browse/SPARK-18710 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Labels: features > Original Estimate: 10h > Remaining Estimate: 10h > > The current GeneralizedLinearRegression model does not support offset. The > offset can be useful to take into account exposure, or for testing > incremental effect of new variables. It is possible to use weights in current > environment to achieve the same effect of specifying offset for certain > models, e.g., Poisson & Binomial with log offset, it is desirable to have the > offset option to work with more general cases, e.g., negative offset or > offset that is hard to specify using weights (e.g., offset to the probability > rather than odds in logistic regression). > Effort would involve: > * update regression class to support offsetCol > * update IWLS to take into account of offset > * add test case for offset > I can start working on this if the community approves this feature. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
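One possible direction for the discussion above, sketched under the assumption that the shared Instance case class stays untouched: a separate, hypothetical case class inside the ml package that carries the offset and converts back to an Instance where the existing code paths need one.
{code}
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.linalg.Vector

// Hypothetical wrapper used only by the GLR/IRLS code paths; Instance is
// package-private, so this would live alongside it under org.apache.spark.ml.
case class OffsetInstance(label: Double, weight: Double, offset: Double, features: Vector) {
  def toInstance: Instance = Instance(label, weight, features)
}
{code}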
[jira] [Commented] (SPARK-20026) Document R GLM Tweedie family support in programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955342#comment-15955342 ] Wayne Zhang commented on SPARK-20026: - [~felixcheung] Yes, I will work on this. Thanks. > Document R GLM Tweedie family support in programming guide and code example > --- > > Key: SPARK-20026 > URL: https://issues.apache.org/jira/browse/SPARK-20026 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19818) SparkR union should check for name consistency of input data frames
Wayne Zhang created SPARK-19818: --- Summary: SparkR union should check for name consistency of input data frames Key: SPARK-19818 URL: https://issues.apache.org/jira/browse/SPARK-19818 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0 Reporter: Wayne Zhang Priority: Minor The current implementation accepts data frames with different schemas. See issues below:
{code}
df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"), age = c(1, 30, 19)))
union(df, df[, c(2, 1)])
     name     age
1 Michael     1.0
2    Andy    30.0
3  Justin    19.0
4     1.0 Michael
{code}
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19819) Use concrete data in SparkR DataFrame examples
Wayne Zhang created SPARK-19819: --- Summary: Use concrete data in SparkR DataFrame examples Key: SPARK-19819 URL: https://issues.apache.org/jira/browse/SPARK-19819 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.1.0 Reporter: Wayne Zhang Priority: Minor Many examples in SparkDataFrame methods use:
{code}
path <- "path/to/file.json"
df <- read.json(path)
{code}
This is not directly runnable. Replace this with real numerical examples so that users can execute the examples directly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-19773) SparkDataFrame should not allow duplicate names
[ https://issues.apache.org/jira/browse/SPARK-19773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang closed SPARK-19773. --- Resolution: Not A Problem > SparkDataFrame should not allow duplicate names > --- > > Key: SPARK-19773 > URL: https://issues.apache.org/jira/browse/SPARK-19773 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Wayne Zhang >Priority: Minor > > SparkDataFrame in SparkR seems to accept duplicate names at creation, but > incurs error when calling methods downstream. For example, we can do: > {{{code}}} > l <- list(list(1, 2), list(3, 4)) > df <- createDataFrame(l, c("a", "a")) > head(df) > {{{code}}} > But an error occurs when we do df$a = df$a * 2.0. > I suggest we add validity check for duplicate names at initialization. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20258) SparkR logistic regression example did not converge in programming guide
Wayne Zhang created SPARK-20258: --- Summary: SparkR logistic regression example did not converge in programming guide Key: SPARK-20258 URL: https://issues.apache.org/jira/browse/SPARK-20258 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0 Reporter: Wayne Zhang SparkR logistic regression example did not converge in programming guide. All estimates are essentially zero:
{code}
training2 <- read.df("data/mllib/sample_binary_classification_data.txt", source = "libsvm")
df_list2 <- randomSplit(training2, c(7,3), 2)
binomialDF <- df_list2[[1]]
binomialTestDF <- df_list2[[2]]
binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial")

17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to singular covariance matrix. Retrying with Quasi-Newton solver.

> summary(binomialGLM)
Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
        Min          1Q      Median          3Q         Max
-2.4828e-06 -2.4063e-06  2.2778e-06  2.4350e-06  2.7722e-06

Coefficients:
              Estimate
(Intercept)  9.0255e+00
features_0   0.e+00
features_1   0.e+00
features_2   0.e+00
features_3   0.e+00
features_4   0.e+00
features_5   0.e+00
features_6   0.e+00
features_7   0.e+00
{code}
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21622) Support Offset in SparkR
Wayne Zhang created SPARK-21622: --- Summary: Support Offset in SparkR Key: SPARK-21622 URL: https://issues.apache.org/jira/browse/SPARK-21622 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.2.0 Reporter: Wayne Zhang Support offset in GLM in SparkR. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21275) Update GLM test to use supportedFamilyNames
Wayne Zhang created SPARK-21275: --- Summary: Update GLM test to use supportedFamilyNames Key: SPARK-21275 URL: https://issues.apache.org/jira/browse/SPARK-21275 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.1.1 Reporter: Wayne Zhang Priority: Minor Address this comment: https://github.com/apache/spark/pull/16699#discussion-diff-100574976R855 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21310) Add offset to PySpark GLM
Wayne Zhang created SPARK-21310: --- Summary: Add offset to PySpark GLM Key: SPARK-21310 URL: https://issues.apache.org/jira/browse/SPARK-21310 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 2.1.1 Reporter: Wayne Zhang -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20574) Allow Bucketizer to handle non-Double column
Wayne Zhang created SPARK-20574: --- Summary: Allow Bucketizer to handle non-Double column Key: SPARK-20574 URL: https://issues.apache.org/jira/browse/SPARK-20574 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.1.0 Reporter: Wayne Zhang Bucketizer currently requires input column to be Double, but the logic should work on any numeric data types. Many practical problems have integer/float data types, and it could get very tedious to manually cast them into Double before calling bucketizer. This transformer could be extended to handle all numeric types. The example below shows failure of Bucketizer on integer data. {code} val splits = Array(-3.0, 0.0, 3.0) val data: Array[Int] = Array(-2, -1, 0, 1, 2) val expectedBuckets = Array(0.0, 0.0, 1.0, 1.0, 1.0) val dataFrame = data.zip(expectedBuckets).toSeq.toDF("feature", "expected") val bucketizer = new Bucketizer() .setInputCol("feature") .setOutputCol("result") .setSplits(splits) bucketizer.transform(dataFrame) java.lang.IllegalArgumentException: requirement failed: Column feature must be of type DoubleType but was actually IntegerType. {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
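Until the transformer handles other numeric types, a minimal sketch of the manual workaround described above (reusing the dataFrame and bucketizer from the example) is to cast the column first:
{code}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Tedious but works today: cast the integer column to Double before bucketizing.
val casted = dataFrame.withColumn("feature", col("feature").cast(DoubleType))
bucketizer.transform(casted).show()
{code}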
[jira] [Created] (SPARK-20736) PySpark StringIndexer supports StringOrderType
Wayne Zhang created SPARK-20736: --- Summary: PySpark StringIndexer supports StringOrderType Key: SPARK-20736 URL: https://issues.apache.org/jira/browse/SPARK-20736 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.1.0 Reporter: Wayne Zhang Port new support of StringOrderType to PySpark StringIndexer. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20899) PySpark supports stringIndexerOrderType in RFormula
Wayne Zhang created SPARK-20899: --- Summary: PySpark supports stringIndexerOrderType in RFormula Key: SPARK-20899 URL: https://issues.apache.org/jira/browse/SPARK-20899 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 2.1.1 Reporter: Wayne Zhang -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20892) Add SQL trunc function to SparkR
Wayne Zhang created SPARK-20892: --- Summary: Add SQL trunc function to SparkR Key: SPARK-20892 URL: https://issues.apache.org/jira/browse/SPARK-20892 Project: Spark Issue Type: New Feature Components: SparkR Affects Versions: 2.1.1 Reporter: Wayne Zhang Add SQL trunc function to SparkR -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20917) SparkR supports string encoding consistent with R
Wayne Zhang created SPARK-20917: --- Summary: SparkR supports string encoding consistent with R Key: SPARK-20917 URL: https://issues.apache.org/jira/browse/SPARK-20917 Project: Spark Issue Type: New Feature Components: SparkR Affects Versions: 2.1.1 Reporter: Wayne Zhang Add stringIndexerOrderType to spark.glm and spark.survreg to support string encoding that is consistent with default R. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20619) StringIndexer supports multiple ways of label ordering
[ https://issues.apache.org/jira/browse/SPARK-20619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang updated SPARK-20619: Description: StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL. For example, the ordering will affect the result in one-hot encoding and RFormula. Propose to support other ordering methods and we add a parameter stringOrderType that supports the following four options: - 'freq_desc': descending order by label frequency (most frequent label assigned 0) - 'freq_asc': ascending order by label frequency (least frequent label assigned 0) - 'alphabet_desc': descending alphabetical order - 'alphabet_asc': ascending alphabetical order was: StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL, for example, in one-hot encoding. Propose to support alphabetic order, and ascending order of label frequency. For example, add a parameter stringOrderType to control how string is ordered which supports four options: - 'freq_desc': descending order by label frequency (most frequent label assigned 0) - 'freq_asc': ascending order by label frequency (least frequent label assigned 0) - 'alphabet_desc': descending alphabetical order - 'alphabet_asc': ascending alphabetical order > StringIndexer supports multiple ways of label ordering > -- > > Key: SPARK-20619 > URL: https://issues.apache.org/jira/browse/SPARK-20619 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Wayne Zhang > > StringIndexer maps labels to numbers according to the descending order of > label frequency. Other types of ordering (e.g., alphabetical) may be needed > in feature ETL. For example, the ordering will affect the result in one-hot > encoding and RFormula. Propose to support other ordering methods and we add a > parameter stringOrderType that supports the following four options: >- 'freq_desc': descending order by label frequency (most frequent label > assigned 0) >- 'freq_asc': ascending order by label frequency (least frequent label > assigned 0) >- 'alphabet_desc': descending alphabetical order >- 'alphabet_asc': ascending alphabetical order -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20619) StringIndexer supports multiple ways of label ordering
Wayne Zhang created SPARK-20619: --- Summary: StringIndexer supports multiple ways of label ordering Key: SPARK-20619 URL: https://issues.apache.org/jira/browse/SPARK-20619 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.1.0 Reporter: Wayne Zhang StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL, for example, in one-hot encoding. Propose to support alphabetic order, and ascending order of label frequency. For example, add a parameter stringOrderType to control how string is ordered which supports four options: - 'freq_desc': descending order by label frequency (most frequent label assigned 0) - 'freq_asc': ascending order by label frequency (least frequent label assigned 0) - 'alphabet_desc': descending alphabetical order - 'alphabet_asc': ascending alphabetical order -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
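A hypothetical usage sketch of the proposed parameter; the setter and option names follow the proposal in this ticket and may differ in the final implementation.
{code}
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setStringOrderType("alphabet_asc")  // proposed: ascending alphabetical order
// indexer.fit(df).transform(df)       // df is a hypothetical DataFrame with a "category" column
{code}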
[jira] [Updated] (SPARK-20604) Allow Imputer to handle all numeric types
[ https://issues.apache.org/jira/browse/SPARK-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang updated SPARK-20604: Description: Imputer currently requires input column to be Double or Float, but the logic should work on any numeric data types. Many practical problems have integer data types, and it could get very tedious to manually cast them into Double before calling imputer. This transformer could be extended to handle all numeric types. The example below shows failure of Imputer on integer data. {code} val df = spark.createDataFrame( Seq( (0, 1.0, 1.0, 1.0), (1, 11.0, 11.0, 11.0), (2, 1.5, 1.5, 1.5), (3, Double.NaN, 4.5, 1.5) )).toDF("id", "value1", "expected_mean_value1", "expected_median_value1") val imputer = new Imputer() .setInputCols(Array("value1")) .setOutputCols(Array("out1")) imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType))) java.lang.IllegalArgumentException: requirement failed: Column value1 must be of type equal to one of the following types: [DoubleType, FloatType] but was actually of type IntegerType. {code} was: Imputer currently requires input column to be Double or Float, but the logic should work on any numeric data types. Many practical problems have integer data types, and it could get very tedious to manually cast them into Double before calling imputer. This transformer could be extended to handle all numeric types. The example below shows failure of Bucketizer on integer data. {code} val df = spark.createDataFrame( Seq( (0, 1.0, 1.0, 1.0), (1, 11.0, 11.0, 11.0), (2, 1.5, 1.5, 1.5), (3, Double.NaN, 4.5, 1.5) )).toDF("id", "value1", "expected_mean_value1", "expected_median_value1") val imputer = new Imputer() .setInputCols(Array("value1")) .setOutputCols(Array("out1")) imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType))) java.lang.IllegalArgumentException: requirement failed: Column value1 must be of type equal to one of the following types: [DoubleType, FloatType] but was actually of type IntegerType. {code} > Allow Imputer to handle all numeric types > - > > Key: SPARK-20604 > URL: https://issues.apache.org/jira/browse/SPARK-20604 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Wayne Zhang >Assignee: Apache Spark > > Imputer currently requires input column to be Double or Float, but the logic > should work on any numeric data types. Many practical problems have integer > data types, and it could get very tedious to manually cast them into Double > before calling imputer. This transformer could be extended to handle all > numeric types. > The example below shows failure of Imputer on integer data. > {code} > val df = spark.createDataFrame( Seq( > (0, 1.0, 1.0, 1.0), > (1, 11.0, 11.0, 11.0), > (2, 1.5, 1.5, 1.5), > (3, Double.NaN, 4.5, 1.5) > )).toDF("id", "value1", "expected_mean_value1", "expected_median_value1") > val imputer = new Imputer() > .setInputCols(Array("value1")) > .setOutputCols(Array("out1")) > imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType))) > java.lang.IllegalArgumentException: requirement failed: Column value1 must be > of type equal to one of the following types: [DoubleType, FloatType] but was > actually of type IntegerType. > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20604) Allow Imputer to handle all numeric types
Wayne Zhang created SPARK-20604: --- Summary: Allow Imputer to handle all numeric types Key: SPARK-20604 URL: https://issues.apache.org/jira/browse/SPARK-20604 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.1.0 Reporter: Wayne Zhang Imputer currently requires input column to be Double or Float, but the logic should work on any numeric data types. Many practical problems have integer data types, and it could get very tedious to manually cast them into Double before calling imputer. This transformer could be extended to handle all numeric types. The example below shows failure of Bucketizer on integer data. {code} val df = spark.createDataFrame( Seq( (0, 1.0, 1.0, 1.0), (1, 11.0, 11.0, 11.0), (2, 1.5, 1.5, 1.5), (3, Double.NaN, 4.5, 1.5) )).toDF("id", "value1", "expected_mean_value1", "expected_median_value1") val imputer = new Imputer() .setInputCols(Array("value1")) .setOutputCols(Array("out1")) imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType))) java.lang.IllegalArgumentException: requirement failed: Column value1 must be of type equal to one of the following types: [DoubleType, FloatType] but was actually of type IntegerType. {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
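As with Bucketizer above, a minimal sketch of the manual workaround for an integer input column (reusing the imputer from the example; intDF is a hypothetical DataFrame whose value1 column is an integer type): cast to Double before fitting.
{code}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Workaround until Imputer accepts other numeric types: cast first, then fit.
val casted = intDF.withColumn("value1", col("value1").cast(DoubleType))
imputer.fit(casted).transform(casted).show()
{code}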
[jira] [Created] (SPARK-20889) SparkR grouped documentation for Column methods
Wayne Zhang created SPARK-20889: --- Summary: SparkR grouped documentation for Column methods Key: SPARK-20889 URL: https://issues.apache.org/jira/browse/SPARK-20889 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.1.1 Reporter: Wayne Zhang Group the documentation of individual methods defined for the Column class. This aims to create the following improvements:
- Centralized documentation for easy navigation (users can view multiple related methods on a single page).
- Reduced number of items in Seealso.
- Better examples using shared data. This avoids creating a data frame for each function if they are documented separately. And more importantly, users can copy and paste to run them directly!
- Cleaner structure and far fewer Rd files (removes a large number of Rd files).
- Removes duplicated parameter definitions (since the methods share exactly the same argument).
- No need to write meaningless examples for trivial functions (because of grouping).
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org