[jira] [Created] (SPARK-19452) Fix bug in the name assignment method in SparkR

2017-02-03 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-19452:
---

 Summary: Fix bug in the name assignment method in SparkR
 Key: SPARK-19452
 URL: https://issues.apache.org/jira/browse/SPARK-19452
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0, 2.2.0
Reporter: Wayne Zhang


The names assignment method fails to check the validity of the assigned values. This can
be fixed by calling colnames within names. See the example below.

{code}
df <- suppressWarnings(createDataFrame(iris))
# this correctly reports an error
colnames(df) <- NULL
# this should also report an error, but does not
names(df) <- NULL
{code}






[jira] [Created] (SPARK-19473) Several DataFrame Method still fail with dot in column names

2017-02-05 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-19473:
---

 Summary: Several DataFrame Method still fail with dot in column 
names 
 Key: SPARK-19473
 URL: https://issues.apache.org/jira/browse/SPARK-19473
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Wayne Zhang


Here is an example:
{code}
val df = Seq((1.0, 2.0), (2.0, 3.0)).toDF("y.a", "x.b")
df.select("y.a")
org.apache.spark.sql.AnalysisException: cannot resolve '`y.a`' given input 
columns: [y.a, x.b];;

df.withColumn("d", col("y.a") + col("x.b"))
org.apache.spark.sql.AnalysisException: cannot resolve '`y.a`' given input 
columns: [y.a, x.b];;
{code}

We can use backquotes to avoid these errors, but this behavior affects some
downstream work such as RFormula and SparkR.
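
For reference, a minimal sketch of the backquote workaround mentioned above,
reusing df from the example:
{code}
df.select("`y.a`")
df.withColumn("d", col("`y.a`") + col("`x.b`"))
{code}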






[jira] [Updated] (SPARK-19473) Several DataFrame Methods still fail with dot in column names

2017-02-05 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang updated SPARK-19473:

Summary: Several DataFrame Methods still fail with dot in column names   
(was: Several DataFrame Method still fail with dot in column names )

> Several DataFrame Methods still fail with dot in column names 
> --
>
> Key: SPARK-19473
> URL: https://issues.apache.org/jira/browse/SPARK-19473
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>
> Here is an example:
> {code}
> val df = Seq((1.0, 2.0), (2.0, 3.0)).toDF("y.a", "x.b")
> df.select("y.a")
> org.apache.spark.sql.AnalysisException: cannot resolve '`y.a`' given input 
> columns: [y.a, x.b];;
> df.withColumn("d", col("y.a") + col("x.b"))
> org.apache.spark.sql.AnalysisException: cannot resolve '`y.a`' given input 
> columns: [y.a, x.b];;
> {code}
> We can use backquote to avoid the errors, but this behavior is affecting some 
> downstream work such as RFormula and SparkR. 






[jira] [Reopened] (SPARK-18710) Add offset to GeneralizedLinearRegression models

2017-01-24 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang reopened SPARK-18710:
-

> Add offset to GeneralizedLinearRegression models
> 
>
> Key: SPARK-18710
> URL: https://issues.apache.org/jira/browse/SPARK-18710
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: features
>   Original Estimate: 10h
>  Remaining Estimate: 10h
>
> The current GeneralizedLinearRegression model does not support offset. The
> offset can be useful to account for exposure, or to test the incremental
> effect of new variables. It is possible to use weights in the current
> implementation to achieve the same effect as an offset for certain models
> (e.g., Poisson and Binomial with a log offset), but it is desirable to have
> the offset option work with more general cases, e.g., a negative offset or
> an offset that is hard to specify using weights (e.g., an offset to the
> probability rather than the odds in logistic regression).
> Effort would involve:
> * update the regression class to support offsetCol
> * update IWLS to take the offset into account
> * add test cases for offset
> I can start working on this if the community approves this feature. 
>  






[jira] [Created] (SPARK-19391) Tweedie GLM API in SparkR

2017-01-28 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-19391:
---

 Summary: Tweedie GLM API in SparkR
 Key: SPARK-19391
 URL: https://issues.apache.org/jira/browse/SPARK-19391
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Wayne Zhang


Port Tweedie GLM to SparkR
https://github.com/apache/spark/pull/16344






[jira] (SPARK-19395) Convert coefficients in summary to matrix

2017-01-29 Thread Wayne Zhang (JIRA)

Wayne Zhang created an issue

Spark / SPARK-19395
Convert coefficients in summary to matrix

Issue Type: Bug
Assignee: Unassigned
Components: SparkR
Created: 29/Jan/17 18:28
Priority: Major
Reporter: Wayne Zhang

The coefficients component in the model summary should be a 'matrix', but the
underlying structure is in fact a list. This affects several models, except
'AFTSurvivalRegressionModel', which has the correct implementation. The fix is
to first unlist the coefficients returned from callJMethod before converting
them to a matrix.

[jira] (SPARK-19400) GLM fails for intercept only model

2017-01-30 Thread Wayne Zhang (JIRA)

Wayne Zhang updated an issue

Spark / SPARK-19400
GLM fails for intercept only model

Change By: Wayne Zhang
Component/s: ML

[jira] (SPARK-19400) GLM fails for intercept only model

2017-01-30 Thread Wayne Zhang (JIRA)

Wayne Zhang created an issue

Spark / SPARK-19400
GLM fails for intercept only model

Issue Type: Bug
Assignee: Unassigned
Created: 30/Jan/17 08:07
Priority: Major
Reporter: Wayne Zhang

Intercept-only GLM fails for non-Gaussian families because IWLS reduces an
empty array.

{code}
val dataset = Seq(
  (1.0, 1.0, 2.0, 0.0, 5.0),
  (0.5, 2.0, 1.0, 1.0, 2.0),
  (1.0, 3.0, 0.5, 2.0, 1.0),
  (2.0, 4.0, 1.5, 3.0, 3.0)
).toDF("y", "w", "off", "x1", "x2")

val formula = new RFormula().setFormula("y ~ 1")
val output = formula.fit(dataset).transform(dataset)
val glr = new GeneralizedLinearRegression().setFamily("poisson")
val model = glr.fit(output)

java.lang.UnsupportedOperationException: empty.reduceLeft
{code}

[jira] [Created] (SPARK-19682) Issue warning (or error) when subset method "[[" takes vector index

2017-02-21 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-19682:
---

 Summary: Issue warning (or error) when subset method "[[" takes 
vector index
 Key: SPARK-19682
 URL: https://issues.apache.org/jira/browse/SPARK-19682
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Wayne Zhang
Priority: Minor


The `[[` method is supposed to take a single index and return a column. This is
different from base R, where `[[` takes a vector index. We should check for this
and issue a warning or error when a vector index is supplied (which is very
likely given the behavior in base R).






[jira] [Closed] (SPARK-19473) Several DataFrame Methods still fail with dot in column names

2017-02-10 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang closed SPARK-19473.
---
Resolution: Not A Problem

> Several DataFrame Methods still fail with dot in column names 
> --
>
> Key: SPARK-19473
> URL: https://issues.apache.org/jira/browse/SPARK-19473
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>
> Here is an example:
> {code}
> val df = Seq((1.0, 2.0), (2.0, 3.0)).toDF("y.a", "x.b")
> df.select("y.a")
> org.apache.spark.sql.AnalysisException: cannot resolve '`y.a`' given input 
> columns: [y.a, x.b];;
> df.withColumn("d", col("y.a") + col("x.b"))
> org.apache.spark.sql.AnalysisException: cannot resolve '`y.a`' given input 
> columns: [y.a, x.b];;
> {code}
> We can use backquotes to avoid these errors, but this behavior affects some
> downstream work such as RFormula and SparkR.






[jira] [Created] (SPARK-19270) Add summary table to GLM summary

2017-01-17 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-19270:
---

 Summary: Add summary table to GLM summary
 Key: SPARK-19270
 URL: https://issues.apache.org/jira/browse/SPARK-19270
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Wayne Zhang
Priority: Minor


Add an R-like summary table to the GLM summary, which includes the feature names
(if they exist), parameter estimates, standard errors, t-statistics and p-values.
This allows Scala users to easily gather these commonly used inference results.
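
For context, a sketch of how these statistics can currently be retrieved as
separate arrays from the training summary (assuming model is a fitted
GeneralizedLinearRegressionModel):
{code}
val summary = model.summary
// Parallel arrays today; the proposal is to combine them into one R-like table.
val se = summary.coefficientStandardErrors
val t = summary.tValues
val p = summary.pValues
{code}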








[jira] [Updated] (SPARK-19270) Add summary table to GLM summary

2017-01-18 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang updated SPARK-19270:

Shepherd: Yanbo Liang

> Add summary table to GLM summary
> 
>
> Key: SPARK-19270
> URL: https://issues.apache.org/jira/browse/SPARK-19270
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Wayne Zhang
>Priority: Minor
>
> Add an R-like summary table to the GLM summary, which includes the feature names
> (if they exist), parameter estimates, standard errors, t-statistics and p-values.
> This allows Scala users to easily gather these commonly used inference results.






[jira] [Commented] (SPARK-14659) OneHotEncoder support drop first category alphabetically in the encoded vector

2017-01-18 Thread Wayne Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828618#comment-15828618
 ] 

Wayne Zhang commented on SPARK-14659:
-

[~yanboliang] [~josephkb]
Has anyone been working on this ticket? It would also be helpful to support
'dropFirst', since in practice there is often a need to set the most frequent
category as the base for interpretability. I'd be happy to work on this (and
already have a partial fix).


> OneHotEncoder support drop first category alphabetically in the encoded 
> vector 
> ---
>
> Key: SPARK-14659
> URL: https://issues.apache.org/jira/browse/SPARK-14659
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> R's formula drops the first category alphabetically when encoding a
> string/category feature. Spark's RFormula uses OneHotEncoder to encode
> string/category features into vectors, but only supports "dropLast" by
> string/category frequency. This causes SparkR to produce different models
> than native R.






[jira] [Created] (SPARK-19773) SparkDataFrame should not allow duplicate names

2017-02-28 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-19773:
---

 Summary: SparkDataFrame should not allow duplicate names
 Key: SPARK-19773
 URL: https://issues.apache.org/jira/browse/SPARK-19773
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Wayne Zhang
Priority: Minor


SparkDataFrame in SparkR seems to accept duplicate names at creation, but
errors when downstream methods are called. For example, we can do:
{code}
l <- list(list(1, 2), list(3, 4))
df <- createDataFrame(l, c("a", "a"))
head(df)
{code}
But an error occurs when we do df$a = df$a * 2.0.

I suggest we add a validity check for duplicate names at initialization.







[jira] [Created] (SPARK-18166) GeneralizedLinearRegression Wrong Value Range for Poisson Distribution

2016-10-28 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-18166:
---

 Summary: GeneralizedLinearRegression Wrong Value Range for Poisson 
Distribution  
 Key: SPARK-18166
 URL: https://issues.apache.org/jira/browse/SPARK-18166
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.0.0
Reporter: Wayne Zhang


The current implementation of Poisson GLM allows only strictly positive response
values (see below). This is not correct since the support of the Poisson
distribution includes zero.

override def initialize(y: Double, weight: Double): Double = {
  require(y {color:red} > {color} 0.0, "The response variable of Poisson 
family " +
s"should be positive, but got $y")
  y
}

The fix is easy: just change it to
 require(y {color:red} >= {color} 0.0, "The response variable of Poisson family 
" +






[jira] [Updated] (SPARK-18710) Add offset to GeneralizedLinearRegression models

2016-12-11 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang updated SPARK-18710:

  Shepherd: Yanbo Liang  (was: Sean Owen)
Remaining Estimate: 10h  (was: 336h)
 Original Estimate: 10h  (was: 336h)

> Add offset to GeneralizedLinearRegression models
> 
>
> Key: SPARK-18710
> URL: https://issues.apache.org/jira/browse/SPARK-18710
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>  Labels: features
> Fix For: 2.2.0
>
>   Original Estimate: 10h
>  Remaining Estimate: 10h
>
> The current GeneralizedLinearRegression model does not support offset. The
> offset can be useful to account for exposure, or to test the incremental
> effect of new variables. It is possible to use weights in the current
> implementation to achieve the same effect as an offset for certain models
> (e.g., Poisson and Binomial with a log offset), but it is desirable to have
> the offset option work with more general cases, e.g., a negative offset or
> an offset that is hard to specify using weights (e.g., an offset to the
> probability rather than the odds in logistic regression).
> Effort would involve:
> * update the regression class to support offsetCol
> * update IWLS to take the offset into account
> * add test cases for offset
> I can start working on this if the community approves this feature. 
>  






[jira] [Reopened] (SPARK-18929) Add Tweedie distribution in GLM

2017-01-10 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang reopened SPARK-18929:
-

> Add Tweedie distribution in GLM
> ---
>
> Key: SPARK-18929
> URL: https://issues.apache.org/jira/browse/SPARK-18929
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: features
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> I propose to add the full Tweedie family to the GeneralizedLinearRegression
> model. The Tweedie family is characterized by a power variance function.
> Currently supported distributions such as the Gaussian, Poisson and Gamma
> families are special cases of the
> [Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution].
> I propose to add support for the other distributions:
> * compound Poisson: 1 < variancePower < 2. This is widely used to model
> zero-inflated continuous distributions.
> * positive stable: variancePower > 2 and variancePower != 3. Used to model
> extreme values.
> * inverse Gaussian: variancePower = 3.
> The Tweedie family is supported in most statistical packages such as R
> (statmod), SAS, and h2o.






[jira] [Created] (SPARK-18701) Poisson GLM fails due to wrong initialization

2016-12-03 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-18701:
---

 Summary: Poisson GLM fails due to wrong initialization
 Key: SPARK-18701
 URL: https://issues.apache.org/jira/browse/SPARK-18701
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.0.2
Reporter: Wayne Zhang
Priority: Critical
 Fix For: 2.2.0


Poisson GLM fails for many standard data sets. The issue is an incorrect
initialization that leads to nearly zero probabilities and weights. The
following simple example reproduces the error.

{code:borderStyle=solid}
val datasetPoissonLogWithZero = Seq(
  LabeledPoint(0.0, Vectors.dense(18, 1.0)),
  LabeledPoint(1.0, Vectors.dense(12, 0.0)),
  LabeledPoint(0.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(13, 2.0)),
  LabeledPoint(0.0, Vectors.dense(15, 1.0)),
  LabeledPoint(1.0, Vectors.dense(16, 1.0)),
  LabeledPoint(0.0, Vectors.dense(10, 0.0)),
  LabeledPoint(0.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(12, 2.0)),
  LabeledPoint(0.0, Vectors.dense(13, 0.0)),
  LabeledPoint(1.0, Vectors.dense(15, 0.0)),
  LabeledPoint(1.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(12, 2.0)),
  LabeledPoint(1.0, Vectors.dense(12, 2.0))
).toDF()

val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setMaxIter(20)
  .setRegParam(0)

val model = glr.fit(datasetPoissonLogWithZero)
{code}

The issue is in the initialization: the mean is initialized as the response,
which could be zero. Applying the log link results in very negative numbers
(protected against -Inf), which in turn leads to nearly zero probabilities and
weights in the weighted least squares. The fix is easy: just add a small
constant, highlighted in red below.
 

override def initialize(y: Double, weight: Double): Double = {
  require(y >= 0.0, "The response variable of Poisson family " +
s"should be non-negative, but got $y")
  y {color:red}+ 0.1 {color}
}
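
With the added constant, the initial linear predictor under the log link stays
finite even at y = 0:

\[
\eta^{(0)} = \log(y + 0.1)
\]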

I already have a fix and test code. Will create a PR. 






[jira] [Created] (SPARK-18710) Add offset to GeneralizedLinearRegression models

2016-12-04 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-18710:
---

 Summary: Add offset to GeneralizedLinearRegression models
 Key: SPARK-18710
 URL: https://issues.apache.org/jira/browse/SPARK-18710
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.0.2
Reporter: Wayne Zhang
 Fix For: 2.2.0


The current GeneralizedLinearRegression model does not support offset. The
offset can be useful to account for exposure, or to test the incremental
effect of new variables. It is possible to use weights in the current
implementation to achieve the same effect as an offset for certain models
(e.g., Poisson and Binomial with a log offset), but it is desirable to have
the offset option work with more general cases, e.g., a negative offset or
an offset that is hard to specify using weights (e.g., an offset to the
probability rather than the odds in logistic regression).

Effort would involve:
* update the regression class to support offsetCol
* update IWLS to take the offset into account
* add test cases for offset

I can start working on this if the community approves this feature. 
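
For reference, the offset enters the linear predictor with a fixed coefficient
of one, so for a link function g:

\[
g(\mathbb{E}[y_i]) = \mathrm{offset}_i + x_i^{\top}\beta
\]

For example, a log-link Poisson model with exposure e_i would use
offset_i = log(e_i).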

 






[jira] [Created] (SPARK-18715) Correct AIC calculation in Binomial GLM

2016-12-04 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-18715:
---

 Summary: Correct AIC calculation in Binomial GLM
 Key: SPARK-18715
 URL: https://issues.apache.org/jira/browse/SPARK-18715
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.0.2
Reporter: Wayne Zhang
Priority: Critical
 Fix For: 2.2.0


The AIC calculation in Binomial GLM seems to be wrong when there are weights.
The weight adjustment should be applied only to the part of the Binomial
density involving the parameters, not to the normalizing constant.
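
Concretely, assuming (as in the proposal below) that the weight w_i is rounded
to an integer number of trials and the fractional response y_i is converted to
w_i * y_i successes, the log-likelihood should be

\[
\log L = \sum_i \left[ \log\binom{w_i}{w_i y_i} + w_i y_i \log\mu_i
  + w_i (1 - y_i) \log(1 - \mu_i) \right],
\]

whereas the current code scales the Bernoulli log-density of the rounded
response by w_i.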

The current implementation is:
{code}
  -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
  }.sum()
{code} 

Suggest changing this to 
{code}
  -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
val wt = math.round(weight).toInt
if (wt == 0){
  0.0
} else {
  dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
}
  }.sum()
{code} 



The following is an example to illustrate the problem.
{code}
val dataset = Seq(
  LabeledPoint(0.0, Vectors.dense(18, 1.0)),
  LabeledPoint(0.5, Vectors.dense(12, 0.0)),
  LabeledPoint(1.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(13, 2.0)),
  LabeledPoint(0.0, Vectors.dense(15, 1.0)),
  LabeledPoint(0.5, Vectors.dense(16, 1.0))
).toDF().withColumn("weight", col("label") + 1.0)
val glr = new GeneralizedLinearRegression()
.setFamily("binomial")
.setWeightCol("weight")
.setRegParam(0)
val model = glr.fit(dataset)
model.summary.aic
{code}

This calculation shows the AIC is 14.189026847171382. To verify whether this is
correct, I ran the same analysis in R and got AIC = 11.66092, -2 * LogLik =
5.660918.
{code}
da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",")
0,18,1,1
0.5,12,0,1.5
1,15,0,2
0,13,2,1
0,15,1,1
0.5,16,1,1.5
da <- as.data.frame(da)
f <- glm(y ~ x1 + x2 , data = da, family = binomial(), weight = w)
AIC(f)
-2 * logLik(f)
{code}

Now, I check whether the proposed change is correct. The following calculates
-2 * LogLik manually and gets 5.6609177228379055, the same as that in R.
{code}
val predictions = model.transform(dataset)
-2.0 * predictions.select("label", "prediction", "weight").rdd.map {case Row(y: 
Double, mu: Double, weight: Double) =>
  val wt = math.round(weight).toInt
  if (wt == 0){
0.0
  } else {
dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
  }
  }.sum()
{code}










[jira] [Updated] (SPARK-18715) Fix wrong AIC calculation in Binomial GLM

2016-12-04 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang updated SPARK-18715:

Summary: Fix wrong AIC calculation in Binomial GLM  (was: Correct AIC 
calculation in Binomial GLM)

> Fix wrong AIC calculation in Binomial GLM
> -
>
> Key: SPARK-18715
> URL: https://issues.apache.org/jira/browse/SPARK-18715
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Priority: Critical
>  Labels: patch
> Fix For: 2.2.0
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The AIC calculation in Binomial GLM seems to be wrong when there are weights.
> The weight adjustment should be applied only to the part of the Binomial
> density involving the parameters, not to the normalizing constant.
> The current implementation is:
> {code}
>   -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
> weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
>   }.sum()
> {code} 
> Suggest changing this to 
> {code}
>   -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
> val wt = math.round(weight).toInt
> if (wt == 0){
>   0.0
> } else {
>   dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
> }
>   }.sum()
> {code} 
> 
> 
> The following is an example to illustrate the problem.
> {code}
> val dataset = Seq(
>   LabeledPoint(0.0, Vectors.dense(18, 1.0)),
>   LabeledPoint(0.5, Vectors.dense(12, 0.0)),
>   LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(13, 2.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 1.0)),
>   LabeledPoint(0.5, Vectors.dense(16, 1.0))
> ).toDF().withColumn("weight", col("label") + 1.0)
> val glr = new GeneralizedLinearRegression()
> .setFamily("binomial")
> .setWeightCol("weight")
> .setRegParam(0)
> val model = glr.fit(dataset)
> model.summary.aic
> {code}
> This calculation shows the AIC is 14.189026847171382. To verify whether this
> is correct, I ran the same analysis in R and got AIC = 11.66092, -2 * LogLik
> = 5.660918.
> {code}
> da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",")
> 0,18,1,1
> 0.5,12,0,1.5
> 1,15,0,2
> 0,13,2,1
> 0,15,1,1
> 0.5,16,1,1.5
> da <- as.data.frame(da)
> f <- glm(y ~ x1 + x2 , data = da, family = binomial(), weight = w)
> AIC(f)
> -2 * logLik(f)
> {code}
> Now, I check whether the proposed change is correct. The following calculates
> -2 * LogLik manually and gets 5.6609177228379055, the same as that in R.
> {code}
> val predictions = model.transform(dataset)
> -2.0 * predictions.select("label", "prediction", "weight").rdd.map {case 
> Row(y: Double, mu: Double, weight: Double) =>
>   val wt = math.round(weight).toInt
>   if (wt == 0){
> 0.0
>   } else {
> dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
>   }
>   }.sum()
> {code}






[jira] [Updated] (SPARK-18715) Fix wrong AIC calculation in Binomial GLM

2016-12-04 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang updated SPARK-18715:

Description: 
The AIC calculation in Binomial GLM seems to be wrong when there are weights. 
The result is different from that in R.

The current implementation is:
{code}
  -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
  }.sum()
{code} 

Suggest changing this to 
{code}
  -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
val wt = math.round(weight).toInt
if (wt == 0){
  0.0
} else {
  dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
}
  }.sum()
{code} 



The following is an example to illustrate the problem.
{code}
val dataset = Seq(
  LabeledPoint(0.0, Vectors.dense(18, 1.0)),
  LabeledPoint(0.5, Vectors.dense(12, 0.0)),
  LabeledPoint(1.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(13, 2.0)),
  LabeledPoint(0.0, Vectors.dense(15, 1.0)),
  LabeledPoint(0.5, Vectors.dense(16, 1.0))
).toDF().withColumn("weight", col("label") + 1.0)
val glr = new GeneralizedLinearRegression()
.setFamily("binomial")
.setWeightCol("weight")
.setRegParam(0)
val model = glr.fit(dataset)
model.summary.aic
{code}

This calculation shows the AIC is 14.189026847171382. To verify whether this is
correct, I ran the same analysis in R and got AIC = 11.66092, -2 * LogLik =
5.660918.
{code}
da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",")
0,18,1,1
0.5,12,0,1.5
1,15,0,2
0,13,2,1
0,15,1,1
0.5,16,1,1.5
da <- as.data.frame(da)
f <- glm(y ~ x1 + x2 , data = da, family = binomial(), weight = w)
AIC(f)
-2 * logLik(f)
{code}

Now, I check whether the proposed change is correct. The following calculates
-2 * LogLik manually and gets 5.6609177228379055, the same as that in R.
{code}
val predictions = model.transform(dataset)
-2.0 * predictions.select("label", "prediction", "weight").rdd.map {case Row(y: 
Double, mu: Double, weight: Double) =>
  val wt = math.round(weight).toInt
  if (wt == 0){
0.0
  } else {
dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
  }
  }.sum()
{code}





  was:
The AIC calculation in Binomial GLM seems to be wrong when there are weights.
The weight adjustment should be applied only to the part of the Binomial
density involving the parameters, not to the normalizing constant.

The current implementation is:
{code}
  -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
  }.sum()
{code} 

Suggest changing this to 
{code}
  -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
val wt = math.round(weight).toInt
if (wt == 0){
  0.0
} else {
  dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
}
  }.sum()
{code} 



The following is an example to illustrate the problem.
{code}
val dataset = Seq(
  LabeledPoint(0.0, Vectors.dense(18, 1.0)),
  LabeledPoint(0.5, Vectors.dense(12, 0.0)),
  LabeledPoint(1.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(13, 2.0)),
  LabeledPoint(0.0, Vectors.dense(15, 1.0)),
  LabeledPoint(0.5, Vectors.dense(16, 1.0))
).toDF().withColumn("weight", col("label") + 1.0)
val glr = new GeneralizedLinearRegression()
.setFamily("binomial")
.setWeightCol("weight")
.setRegParam(0)
val model = glr.fit(dataset)
model.summary.aic
{code}

This calculation shows the AIC is 14.189026847171382. To verify whether this is
correct, I ran the same analysis in R and got AIC = 11.66092, -2 * LogLik =
5.660918.
{code}
da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",")
0,18,1,1
0.5,12,0,1.5
1,15,0,2
0,13,2,1
0,15,1,1
0.5,16,1,1.5
da <- as.data.frame(da)
f <- glm(y ~ x1 + x2 , data = da, family = binomial(), weight = w)
AIC(f)
-2 * logLik(f)
{code}

Now, I check whether the proposed change is correct. The following calculates
-2 * LogLik manually and gets 5.6609177228379055, the same as that in R.
{code}
val predictions = model.transform(dataset)
-2.0 * predictions.select("label", "prediction", "weight").rdd.map {case Row(y: 
Double, mu: Double, weight: Double) =>
  val wt = math.round(weight).toInt
  if (wt == 0){
0.0
  } else {
dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
  }
  }.sum()
{code}






> Fix wrong AIC calculation in Binomial GLM
> -
>
> Key: SPARK-18715
> URL: https://issues.apache.org/jira/browse/SPARK-18715
> Project: Spark
>  Issue Type: Bug
>  

[jira] [Updated] (SPARK-18701) Poisson GLM fails due to wrong initialization

2016-12-04 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang updated SPARK-18701:

  Shepherd: Sean Owen  (was: sean corkum)
Issue Type: Bug  (was: New Feature)

> Poisson GLM fails due to wrong initialization
> -
>
> Key: SPARK-18701
> URL: https://issues.apache.org/jira/browse/SPARK-18701
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Priority: Critical
> Fix For: 2.2.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Poisson GLM fails for many standard data sets. The issue is an incorrect
> initialization that leads to nearly zero probabilities and weights. The
> following simple example reproduces the error.
> {code:borderStyle=solid}
> val datasetPoissonLogWithZero = Seq(
>   LabeledPoint(0.0, Vectors.dense(18, 1.0)),
>   LabeledPoint(1.0, Vectors.dense(12, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(13, 2.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 1.0)),
>   LabeledPoint(1.0, Vectors.dense(16, 1.0)),
>   LabeledPoint(0.0, Vectors.dense(10, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(12, 2.0)),
>   LabeledPoint(0.0, Vectors.dense(13, 0.0)),
>   LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(12, 2.0)),
>   LabeledPoint(1.0, Vectors.dense(12, 2.0))
> ).toDF()
> 
> val glr = new GeneralizedLinearRegression()
>   .setFamily("poisson")
>   .setLink("log")
>   .setMaxIter(20)
>   .setRegParam(0)
> val model = glr.fit(datasetPoissonLogWithZero)
> {code}
> The issue is in the initialization: the mean is initialized as the response,
> which could be zero. Applying the log link results in very negative numbers
> (protected against -Inf), which in turn leads to nearly zero probabilities
> and weights in the weighted least squares. The fix is easy: just add a small
> constant, highlighted in red below.
>  
> override def initialize(y: Double, weight: Double): Double = {
>   require(y >= 0.0, "The response variable of Poisson family " +
> s"should be non-negative, but got $y")
>   y {color:red}+ 0.1 {color}
> }
> I already have a fix and test code. Will create a PR. 






[jira] [Closed] (SPARK-18710) Add offset to GeneralizedLinearRegression models

2017-01-06 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang closed SPARK-18710.
---
Resolution: Unresolved

> Add offset to GeneralizedLinearRegression models
> 
>
> Key: SPARK-18710
> URL: https://issues.apache.org/jira/browse/SPARK-18710
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: features
>   Original Estimate: 10h
>  Remaining Estimate: 10h
>
> The current GeneralizedLinearRegression model does not support offset. The
> offset can be useful to account for exposure, or to test the incremental
> effect of new variables. It is possible to use weights in the current
> implementation to achieve the same effect as an offset for certain models
> (e.g., Poisson and Binomial with a log offset), but it is desirable to have
> the offset option work with more general cases, e.g., a negative offset or
> an offset that is hard to specify using weights (e.g., an offset to the
> probability rather than the odds in logistic regression).
> Effort would involve:
> * update the regression class to support offsetCol
> * update IWLS to take the offset into account
> * add test cases for offset
> I can start working on this if the community approves this feature. 
>  






[jira] [Closed] (SPARK-18929) Add Tweedie distribution in GLM

2017-01-06 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang closed SPARK-18929.
---
Resolution: Unresolved

> Add Tweedie distribution in GLM
> ---
>
> Key: SPARK-18929
> URL: https://issues.apache.org/jira/browse/SPARK-18929
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: features
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> I propose to add the full Tweedie family to the GeneralizedLinearRegression
> model. The Tweedie family is characterized by a power variance function.
> Currently supported distributions such as the Gaussian, Poisson and Gamma
> families are special cases of the
> [Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution].
> I propose to add support for the other distributions:
> * compound Poisson: 1 < variancePower < 2. This is widely used to model
> zero-inflated continuous distributions.
> * positive stable: variancePower > 2 and variancePower != 3. Used to model
> extreme values.
> * inverse Gaussian: variancePower = 3.
> The Tweedie family is supported in most statistical packages such as R
> (statmod), SAS, and h2o.






[jira] [Commented] (SPARK-18710) Add offset to GeneralizedLinearRegression models

2016-12-20 Thread Wayne Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765733#comment-15765733
 ] 

Wayne Zhang commented on SPARK-18710:
-

[~yanboliang] Thanks for the suggestion. I think the issue is a bit different 
in this case. The IRWLS relies on the _reweightFunc_, which is hard-coded to 
take an _Instance_ class:
{code}  
val reweightFunc: (Instance, WeightedLeastSquaresModel) => (Double, Double)
{code}

I need to pass the offset column to this reweight function. Creating another 
GLRInstance won't solve the problem, will it?

> Add offset to GeneralizedLinearRegression models
> 
>
> Key: SPARK-18710
> URL: https://issues.apache.org/jira/browse/SPARK-18710
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: features
>   Original Estimate: 10h
>  Remaining Estimate: 10h
>
> The current GeneralizedLinearRegression model does not support offset. The
> offset can be useful to account for exposure, or to test the incremental
> effect of new variables. It is possible to use weights in the current
> implementation to achieve the same effect as an offset for certain models
> (e.g., Poisson and Binomial with a log offset), but it is desirable to have
> the offset option work with more general cases, e.g., a negative offset or
> an offset that is hard to specify using weights (e.g., an offset to the
> probability rather than the odds in logistic regression).
> Effort would involve:
> * update the regression class to support offsetCol
> * update IWLS to take the offset into account
> * add test cases for offset
> I can start working on this if the community approves this feature. 
>  






[jira] [Commented] (SPARK-18710) Add offset to GeneralizedLinearRegression models

2016-12-26 Thread Wayne Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779760#comment-15779760
 ] 

Wayne Zhang commented on SPARK-18710:
-

Thanks for the comment, Yanbo. In IRLS, the fit method expects RDD[Instance].
Does it still work if one feeds an RDD[GLRInstance] object to it?

{code}
  def fit(instances: RDD[Instance]): IterativelyReweightedLeastSquaresModel = {
{code}

> Add offset to GeneralizedLinearRegression models
> 
>
> Key: SPARK-18710
> URL: https://issues.apache.org/jira/browse/SPARK-18710
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: features
>   Original Estimate: 10h
>  Remaining Estimate: 10h
>
> The current GeneralizedLinearRegression model does not support offset. The
> offset can be useful to account for exposure, or to test the incremental
> effect of new variables. It is possible to use weights in the current
> implementation to achieve the same effect as an offset for certain models
> (e.g., Poisson and Binomial with a log offset), but it is desirable to have
> the offset option work with more general cases, e.g., a negative offset or
> an offset that is hard to specify using weights (e.g., an offset to the
> probability rather than the odds in logistic regression).
> Effort would involve:
> * update the regression class to support offsetCol
> * update IWLS to take the offset into account
> * add test cases for offset
> I can start working on this if the community approves this feature. 
>  






[jira] [Created] (SPARK-18929) Add Tweedie distribution in GLM

2016-12-19 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-18929:
---

 Summary: Add Tweedie distribution in GLM
 Key: SPARK-18929
 URL: https://issues.apache.org/jira/browse/SPARK-18929
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.0.2
Reporter: Wayne Zhang


I propose to add the full Tweedie family to the GeneralizedLinearRegression
model. The Tweedie family is characterized by a power variance function.
Currently supported distributions such as the Gaussian, Poisson and Gamma
families are special cases of the
[Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution].

I propose to add support for the other distributions:
* compound Poisson: 1 < variancePower < 2. This is widely used to model
zero-inflated continuous distributions.
* positive stable: variancePower > 2 and variancePower != 3. Used to model
extreme values.
* inverse Gaussian: variancePower = 3.

The Tweedie family is supported in most statistical packages such as R
(statmod), SAS, and h2o.
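
For reference, the power variance function that characterizes the family is

\[
\mathrm{Var}(Y) = \phi\,\mu^{p},
\]

where p is the variancePower and \phi is the dispersion; p = 0, 1, and 2
recover the Gaussian, Poisson, and Gamma families, respectively.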








[jira] [Commented] (SPARK-18710) Add offset to GeneralizedLinearRegression models

2016-12-19 Thread Wayne Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762752#comment-15762752
 ] 

Wayne Zhang commented on SPARK-18710:
-

[~yanboliang] It seems that I would need to change the case class 'Instance' to 
include offset...  That could be potentially disruptive if many other models 
also depend on this case class. Any suggestions regarding this?  

> Add offset to GeneralizedLinearRegression models
> 
>
> Key: SPARK-18710
> URL: https://issues.apache.org/jira/browse/SPARK-18710
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: features
>   Original Estimate: 10h
>  Remaining Estimate: 10h
>
> The current GeneralizedLinearRegression model does not support offset. The
> offset can be useful to account for exposure, or to test the incremental
> effect of new variables. It is possible to use weights in the current
> implementation to achieve the same effect as an offset for certain models
> (e.g., Poisson and Binomial with a log offset), but it is desirable to have
> the offset option work with more general cases, e.g., a negative offset or
> an offset that is hard to specify using weights (e.g., an offset to the
> probability rather than the odds in logistic regression).
> Effort would involve:
> * update the regression class to support offsetCol
> * update IWLS to take the offset into account
> * add test cases for offset
> I can start working on this if the community approves this feature. 
>  






[jira] [Commented] (SPARK-20026) Document R GLM Tweedie family support in programming guide and code example

2017-04-04 Thread Wayne Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955342#comment-15955342
 ] 

Wayne Zhang commented on SPARK-20026:
-

[~felixcheung] Yes, I will work on this. Thanks. 

> Document R GLM Tweedie family support in programming guide and code example
> ---
>
> Key: SPARK-20026
> URL: https://issues.apache.org/jira/browse/SPARK-20026
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>







[jira] [Created] (SPARK-19818) SparkR union should check for name consistency of input data frames

2017-03-03 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-19818:
---

 Summary: SparkR union should check for name consistency of input 
data frames 
 Key: SPARK-19818
 URL: https://issues.apache.org/jira/browse/SPARK-19818
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Wayne Zhang
Priority: Minor


The current implementation accepts data frames with different schemas. See
issues below:
{code}
df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"),
                                 age = c(1, 30, 19)))
union(df, df[, c(2, 1)])
     name     age
1 Michael     1.0
2    Andy    30.0
3  Justin    19.0
4     1.0 Michael
{code}






[jira] [Created] (SPARK-19819) Use concrete data in SparkR DataFrame examples

2017-03-04 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-19819:
---

 Summary: Use concrete data in SparkR DataFrame examples 
 Key: SPARK-19819
 URL: https://issues.apache.org/jira/browse/SPARK-19819
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Wayne Zhang
Priority: Minor


Many examples in SparkDataFrame methods use:
{code}
path <- "path/to/file.json"
df <- read.json(path)
{code}

This is not directly runnable. Replace it with concrete numerical examples so
that users can execute the examples directly.






[jira] [Closed] (SPARK-19773) SparkDataFrame should not allow duplicate names

2017-03-01 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang closed SPARK-19773.
---
Resolution: Not A Problem

> SparkDataFrame should not allow duplicate names
> ---
>
> Key: SPARK-19773
> URL: https://issues.apache.org/jira/browse/SPARK-19773
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Priority: Minor
>
> SparkDataFrame in SparkR seems to accept duplicate names at creation, but
> errors when downstream methods are called. For example, we can do:
> {code}
> l <- list(list(1, 2), list(3, 4))
> df <- createDataFrame(l, c("a", "a"))
> head(df)
> {code}
> But an error occurs when we do df$a = df$a * 2.0.
> I suggest we add a validity check for duplicate names at initialization.






[jira] [Created] (SPARK-20258) SparkR logistic regression example did not converge in programming guide

2017-04-07 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-20258:
---

 Summary: SparkR logistic regression example did not converge in 
programming guide
 Key: SPARK-20258
 URL: https://issues.apache.org/jira/browse/SPARK-20258
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Wayne Zhang


The SparkR logistic regression example in the programming guide did not
converge. All estimates are essentially zero:

{code}
training2 <- read.df("data/mllib/sample_binary_classification_data.txt", source 
= "libsvm")
df_list2 <- randomSplit(training2, c(7,3), 2)
binomialDF <- df_list2[[1]]
binomialTestDF <- df_list2[[2]]
binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial")


17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to 
singular covariance matrix. Retrying with Quasi-Newton solver.

> summary(binomialGLM)

Deviance Residuals: 
(Note: These are approximate quantiles with relative error <= 0.01)
Min   1Q   Median   3Q  Max  
-2.4828e-06  -2.4063e-06   2.2778e-06   2.4350e-06   2.7722e-06  

Coefficients:
               Estimate
(Intercept)  9.0255e+00
features_0   0.0000e+00
features_1   0.0000e+00
features_2   0.0000e+00
features_3   0.0000e+00
features_4   0.0000e+00
features_5   0.0000e+00
features_6   0.0000e+00
features_7   0.0000e+00
{code}






[jira] [Created] (SPARK-21622) Support Offset in SparkR

2017-08-03 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-21622:
---

 Summary: Support Offset in SparkR
 Key: SPARK-21622
 URL: https://issues.apache.org/jira/browse/SPARK-21622
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Wayne Zhang


Support offset in GLM in SparkR.






[jira] [Created] (SPARK-21275) Update GLM test to use supportedFamilyNames

2017-06-30 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-21275:
---

 Summary: Update GLM test to use supportedFamilyNames
 Key: SPARK-21275
 URL: https://issues.apache.org/jira/browse/SPARK-21275
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.1.1
Reporter: Wayne Zhang
Priority: Minor


Address this comment: 
https://github.com/apache/spark/pull/16699#discussion-diff-100574976R855






[jira] [Created] (SPARK-21310) Add offset to PySpark GLM

2017-07-04 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-21310:
---

 Summary: Add offset to PySpark GLM 
 Key: SPARK-21310
 URL: https://issues.apache.org/jira/browse/SPARK-21310
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 2.1.1
Reporter: Wayne Zhang









[jira] [Created] (SPARK-20574) Allow Bucketizer to handle non-Double column

2017-05-03 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-20574:
---

 Summary: Allow Bucketizer to handle non-Double column
 Key: SPARK-20574
 URL: https://issues.apache.org/jira/browse/SPARK-20574
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.0
Reporter: Wayne Zhang


Bucketizer currently requires the input column to be Double, but the logic
should work on any numeric data type. Many practical problems have
integer/float data types, and it can get very tedious to manually cast them
to Double before calling Bucketizer. This transformer could be extended to
handle all numeric types.

The example below shows failure of Bucketizer on integer data. 
{code}
val splits = Array(-3.0, 0.0, 3.0)
val data: Array[Int] = Array(-2, -1, 0, 1, 2)
val expectedBuckets = Array(0.0, 0.0, 1.0, 1.0, 1.0)
val dataFrame = data.zip(expectedBuckets).toSeq.toDF("feature", "expected")
val bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
bucketizer.transform(dataFrame)  

java.lang.IllegalArgumentException: requirement failed: Column feature must be 
of type DoubleType but was actually IntegerType.
{code}
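
Until the transformer handles other numeric types directly, a possible
workaround is to cast the column manually (a sketch, reusing dataFrame and
bucketizer from the example above):
{code}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Cast the integer column to Double so Bucketizer accepts it.
val casted = dataFrame.withColumn("feature", col("feature").cast(DoubleType))
bucketizer.transform(casted)
{code}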








[jira] [Created] (SPARK-20736) PySpark StringIndexer supports StringOrderType

2017-05-14 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-20736:
---

 Summary: PySpark StringIndexer supports StringOrderType
 Key: SPARK-20736
 URL: https://issues.apache.org/jira/browse/SPARK-20736
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.1.0
Reporter: Wayne Zhang


Port the new stringOrderType support to PySpark StringIndexer.






[jira] [Created] (SPARK-20899) PySpark supports stringIndexerOrderType in RFormula

2017-05-26 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-20899:
---

 Summary: PySpark supports stringIndexerOrderType in RFormula
 Key: SPARK-20899
 URL: https://issues.apache.org/jira/browse/SPARK-20899
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 2.1.1
Reporter: Wayne Zhang









[jira] [Created] (SPARK-20892) Add SQL trunc function to SparkR

2017-05-25 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-20892:
---

 Summary: Add SQL trunc function to SparkR
 Key: SPARK-20892
 URL: https://issues.apache.org/jira/browse/SPARK-20892
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Affects Versions: 2.1.1
Reporter: Wayne Zhang


Add SQL trunc function to SparkR






[jira] [Created] (SPARK-20917) SparkR supports string encoding consistent with R

2017-05-29 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-20917:
---

 Summary: SparkR supports string encoding consistent with R
 Key: SPARK-20917
 URL: https://issues.apache.org/jira/browse/SPARK-20917
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Affects Versions: 2.1.1
Reporter: Wayne Zhang


Add stringIndexerOrderType to spark.glm and spark.survreg to support string 
encoding that is consistent with default R.






[jira] [Updated] (SPARK-20619) StringIndexer supports multiple ways of label ordering

2017-05-06 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang updated SPARK-20619:

Description: 
StringIndexer maps labels to numbers according to the descending order of label 
frequency. Other orderings (e.g., alphabetical) may be needed in feature ETL; 
for example, the ordering affects the result of one-hot encoding and RFormula. 
We propose to support additional ordering methods by adding a parameter 
stringOrderType that supports the following four options:

   - 'freq_desc': descending order by label frequency (most frequent label 
assigned 0)
   - 'freq_asc': ascending order by label frequency (least frequent label 
assigned 0)
   - 'alphabet_desc': descending alphabetical order
   - 'alphabet_asc': ascending alphabetical order

  was:
StringIndexer maps labels to numbers according to the descending order of label 
frequency. Other types of ordering (e.g., alphabetical) may be needed in 
feature ETL, for example, in one-hot encoding. Propose to support alphabetic 
order, and ascending order of label frequency. For example, add a parameter 
stringOrderType to control how string is ordered which supports four options:

   - 'freq_desc': descending order by label frequency (most frequent label 
assigned 0)
   - 'freq_asc': ascending order by label frequency (least frequent label 
assigned 0)
   - 'alphabet_desc': descending alphabetical order
   - 'alphabet_asc': ascending alphabetical order


> StringIndexer supports multiple ways of label ordering
> --
>
> Key: SPARK-20619
> URL: https://issues.apache.org/jira/browse/SPARK-20619
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>
> StringIndexer maps labels to numbers according to the descending order of 
> label frequency. Other orderings (e.g., alphabetical) may be needed in 
> feature ETL; for example, the ordering affects the result of one-hot 
> encoding and RFormula. We propose to support additional ordering methods by 
> adding a parameter stringOrderType that supports the following four options:
>- 'freq_desc': descending order by label frequency (most frequent label 
> assigned 0)
>- 'freq_asc': ascending order by label frequency (least frequent label 
> assigned 0)
>- 'alphabet_desc': descending alphabetical order
>- 'alphabet_asc': ascending alphabetical order



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20619) StringIndexer supports multiple ways of label ordering

2017-05-06 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-20619:
---

 Summary: StringIndexer supports multiple ways of label ordering
 Key: SPARK-20619
 URL: https://issues.apache.org/jira/browse/SPARK-20619
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.1.0
Reporter: Wayne Zhang


StringIndexer maps labels to numbers according to the descending order of label 
frequency. Other orderings (e.g., alphabetical) may be needed in feature ETL, 
for example in one-hot encoding. We propose to support alphabetical order and 
ascending order of label frequency via a parameter stringOrderType that 
controls how strings are ordered, with four options (see the sketch after the 
list):

   - 'freq_desc': descending order by label frequency (most frequent label 
assigned 0)
   - 'freq_asc': ascending order by label frequency (least frequent label 
assigned 0)
   - 'alphabet_desc': descending alphabetical order
   - 'alphabet_asc': ascending alphabetical order
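
A hedged sketch of the proposed usage, based on the option names above (the 
parameter values that finally ship may differ; column names are illustrative):
{code}
import org.apache.spark.ml.feature.StringIndexer

// assumes a spark-shell session, matching the repro style used elsewhere
val df = Seq("b", "a", "b", "c").toDF("label")

// Under 'alphabet_asc', "a" would map to 0.0, "b" to 1.0, "c" to 2.0,
// regardless of label frequency.
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("labelIndex")
  .setStringOrderType("alphabet_asc")
indexer.fit(df).transform(df).show()
{code}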



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20604) Allow Imputer to handle all numeric types

2017-05-04 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang updated SPARK-20604:

Description: 
Imputer currently requires the input column to be Double or Float, but the 
logic should work on any numeric data type. Many practical problems involve 
integer data, and it is tedious to cast such columns to Double manually before 
calling Imputer. This transformer could be extended to handle all numeric 
types.

The example below shows the failure of Imputer on integer data. 
{code}
val df = spark.createDataFrame( Seq(
  (0, 1.0, 1.0, 1.0),
  (1, 11.0, 11.0, 11.0),
  (2, 1.5, 1.5, 1.5),
  (3, Double.NaN, 4.5, 1.5)
)).toDF("id", "value1", "expected_mean_value1", "expected_median_value1")
val imputer = new Imputer()
  .setInputCols(Array("value1"))
  .setOutputCols(Array("out1"))
imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType)))

java.lang.IllegalArgumentException: requirement failed: Column value1 must be 
of type equal to one of the following types: [DoubleType, FloatType] but was 
actually of type IntegerType.

{code}



  was:
Imputer currently requires input column to be Double or Float, but the logic 
should work on any numeric data types. Many practical problems have integer  
data types, and it could get very tedious to manually cast them into Double 
before calling imputer. This transformer could be extended to handle all 
numeric types.  

The example below shows failure of Bucketizer on integer data. 
{code}
val df = spark.createDataFrame( Seq(
  (0, 1.0, 1.0, 1.0),
  (1, 11.0, 11.0, 11.0),
  (2, 1.5, 1.5, 1.5),
  (3, Double.NaN, 4.5, 1.5)
)).toDF("id", "value1", "expected_mean_value1", "expected_median_value1")
val imputer = new Imputer()
  .setInputCols(Array("value1"))
  .setOutputCols(Array("out1"))
imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType)))

java.lang.IllegalArgumentException: requirement failed: Column value1 must be 
of type equal to one of the following types: [DoubleType, FloatType] but was 
actually of type IntegerType.

{code}




> Allow Imputer to handle all numeric types
> -
>
> Key: SPARK-20604
> URL: https://issues.apache.org/jira/browse/SPARK-20604
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Assignee: Apache Spark
>
> Imputer currently requires the input column to be Double or Float, but the 
> logic should work on any numeric data type. Many practical problems involve 
> integer data, and it is tedious to cast such columns to Double manually 
> before calling Imputer. This transformer could be extended to handle all 
> numeric types.
> The example below shows the failure of Imputer on integer data. 
> {code}
> val df = spark.createDataFrame( Seq(
>   (0, 1.0, 1.0, 1.0),
>   (1, 11.0, 11.0, 11.0),
>   (2, 1.5, 1.5, 1.5),
>   (3, Double.NaN, 4.5, 1.5)
> )).toDF("id", "value1", "expected_mean_value1", "expected_median_value1")
> val imputer = new Imputer()
>   .setInputCols(Array("value1"))
>   .setOutputCols(Array("out1"))
> imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType)))
> java.lang.IllegalArgumentException: requirement failed: Column value1 must be 
> of type equal to one of the following types: [DoubleType, FloatType] but was 
> actually of type IntegerType.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20604) Allow Imputer to handle all numeric types

2017-05-04 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-20604:
---

 Summary: Allow Imputer to handle all numeric types
 Key: SPARK-20604
 URL: https://issues.apache.org/jira/browse/SPARK-20604
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.0
Reporter: Wayne Zhang


Imputer currently requires the input column to be Double or Float, but the 
logic should work on any numeric data type. Many practical problems involve 
integer data, and it is tedious to cast such columns to Double manually before 
calling Imputer. This transformer could be extended to handle all numeric 
types.

The example below shows the failure of Imputer on integer data. 
{code}
val df = spark.createDataFrame( Seq(
  (0, 1.0, 1.0, 1.0),
  (1, 11.0, 11.0, 11.0),
  (2, 1.5, 1.5, 1.5),
  (3, Double.NaN, 4.5, 1.5)
)).toDF("id", "value1", "expected_mean_value1", "expected_median_value1")
val imputer = new Imputer()
  .setInputCols(Array("value1"))
  .setOutputCols(Array("out1"))
imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType)))

java.lang.IllegalArgumentException: requirement failed: Column value1 must be 
of type equal to one of the following types: [DoubleType, FloatType] but was 
actually of type IntegerType.

{code}
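
Until such support lands, a hedged workaround sketch (reusing the df and 
imputer defined above) is to cast the column to Double before fitting; this 
illustrates only that the type check passes, not end-to-end missing-value 
handling:
{code}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, IntegerType}

// Casting the integer column back to Double satisfies the current
// DoubleType/FloatType requirement, so fit() no longer throws.
val intDf = df.withColumn("value1", col("value1").cast(IntegerType))
val casted = intDf.withColumn("value1", col("value1").cast(DoubleType))
imputer.fit(casted).transform(casted).show()
{code}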





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20889) SparkR grouped documentation for Column methods

2017-05-25 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-20889:
---

 Summary: SparkR grouped documentation for Column methods
 Key: SPARK-20889
 URL: https://issues.apache.org/jira/browse/SPARK-20889
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.1.1
Reporter: Wayne Zhang


Group the documentation of the individual methods defined for the Column 
class. This aims to deliver the following improvements:

- Centralized documentation for easy navigation (users can view multiple 
related methods on a single page).
- Fewer items in the Seealso section.
- Better examples that use shared data. This avoids creating a data frame for 
each function, as documenting them separately would require; more importantly, 
users can copy and paste the examples and run them directly.
- Cleaner structure and far fewer Rd files.
- No duplicated parameter definitions (the methods share exactly the same 
argument).
- No need to write filler examples for trivial functions, thanks to the 
grouping.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org