[jira] [Updated] (SPARK-19319) SparkR Kmeans summary returns error when the cluster size doesn't equal to k

2017-02-12 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19319:
-
Target Version/s: 2.1.1, 2.2.0  (was: 2.2.0)

> SparkR Kmeans summary returns error when the cluster size doesn't equal to k
> 
>
> Key: SPARK-19319
> URL: https://issues.apache.org/jira/browse/SPARK-19319
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.1.1, 2.2.0
>
>
> When k-means is run with initMode = "random" and certain random seeds, the 
> actual number of clusters can be smaller than the configured `k`.
> In this case, summary(model) returns an error because the number of columns 
> of the coefficient matrix doesn't equal k.
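
A minimal sketch of the kind of setup that can trigger this, assuming a 
degenerate dataset for which "random" initialization yields fewer than k 
non-empty clusters (the data and k below are illustrative, not from the 
original report):

{code}
library(SparkR)
sparkR.session(master = "local")

# Four identical points but k = 3: the fitted model can end up with fewer
# than k distinct clusters.
df <- createDataFrame(data.frame(x = c(1, 1, 1, 1), y = c(2, 2, 2, 2)))
model <- spark.kmeans(df, ~ x + y, k = 3, initMode = "random")

# Before the fix, summary() could fail here because the coefficient matrix
# is built assuming exactly k columns.
summary(model)
{code}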



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19319) SparkR Kmeans summary returns error when the cluster size doesn't equal to k

2017-02-12 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19319:
-
Fix Version/s: 2.1.1

> SparkR Kmeans summary returns error when the cluster size doesn't equal to k
> 
>
> Key: SPARK-19319
> URL: https://issues.apache.org/jira/browse/SPARK-19319
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.1.1, 2.2.0
>
>
> When k-means is run with initMode = "random" and certain random seeds, the 
> actual number of clusters can be smaller than the configured `k`.
> In this case, summary(model) returns an error because the number of columns 
> of the coefficient matrix doesn't equal k.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19342) Datatype timestamp is converted to numeric in collect method

2017-02-12 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19342:
-
Target Version/s: 2.1.1, 2.2.0
   Fix Version/s: 2.2.0
  2.1.1

> Datatype timestamp is converted to numeric in collect method 
> -
>
> Key: SPARK-19342
> URL: https://issues.apache.org/jira/browse/SPARK-19342
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Fangzhou Yang
> Fix For: 2.1.1, 2.2.0
>
>
> collect() returns double instead of POSIXct for a timestamp column datatype 
> when an NA exists at the top of the column.
> The following code and output show how the bug can be reproduced:
> {code}
> > sparkR.session(master = "local")
> Spark package found in SPARK_HOME: /home/titicaca/spark-2.1
> Launching java with spark-submit command 
> /home/titicaca/spark-2.1/bin/spark-submit   sparkr-shell 
> /tmp/RtmpqmpZUg/backend_port363a898be92 
> Java ref type org.apache.spark.sql.SparkSession id 1 
> > df <- data.frame(col1 = c(0, 1, 2), 
> +  col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, 
> as.POSIXct("2017-01-01 12:00:01")))
> > sdf1 <- createDataFrame(df)
> > print(dtypes(sdf1))
> [[1]]
> [1] "col1"   "double"
> [[2]]
> [1] "col2"  "timestamp"
> > df1 <- collect(sdf1)
> > print(lapply(df1, class))
> $col1
> [1] "numeric"
> $col2
> [1] "POSIXct" "POSIXt" 
> > sdf2 <- filter(sdf1, "col1 > 0")
> > print(dtypes(sdf2))
> [[1]]
> [1] "col1"   "double"
> [[2]]
> [1] "col2"  "timestamp"
> > df2 <- collect(sdf2)
> > print(lapply(df2, class))
> $col1
> [1] "numeric"
> $col2
> [1] "numeric"
> {code}
> As we can see, the data type of col2 is unexpectedly converted to numeric in 
> the collected local data frame df2.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19452) Fix bug in the name assignment method in SparkR

2017-02-05 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19452.
--
  Resolution: Fixed
Assignee: Wayne Zhang
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> Fix bug in the name assignment method in SparkR
> ---
>
> Key: SPARK-19452
> URL: https://issues.apache.org/jira/browse/SPARK-19452
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
> Fix For: 2.2.0
>
>
> The names() replacement method fails to validate the assigned values. This 
> can be fixed by calling colnames() within names(). See the example below.
> {code}
> df <- suppressWarnings(createDataFrame(iris))
> # this correctly raises an error
> colnames(df) <- NULL
> # this should also raise an error, but currently does not
> names(df) <- NULL
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19460) Update dataset used in R documentation, examples to reduce warning noise and confusion

2017-02-04 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19460:


 Summary: Update dataset used in R documentation, examples to 
reduce warning noise and confusion
 Key: SPARK-19460
 URL: https://issues.apache.org/jira/browse/SPARK-19460
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung


When running the build we get a number of warnings from using the `iris` 
dataset, for example:

Warning in FUN(X[[1L]], ...) :
Use Sepal_Length instead of Sepal.Length as column name
Warning in FUN(X[[2L]], ...) :
Use Sepal_Width instead of Sepal.Width as column name
Warning in FUN(X[[3L]], ...) :
Use Petal_Length instead of Petal.Length as column name
Warning in FUN(X[[4L]], ...) :
Use Petal_Width instead of Petal.Width as column name
Warning in FUN(X[[1L]], ...) :
Use Sepal_Length instead of Sepal.Length as column name
Warning in FUN(X[[2L]], ...) :
Use Sepal_Width instead of Sepal.Width as column name
Warning in FUN(X[[3L]], ...) :
Use Petal_Length instead of Petal.Length as column name
Warning in FUN(X[[4L]], ...) :
Use Petal_Width instead of Petal.Width as column name
Warning in FUN(X[[1L]], ...) :
Use Sepal_Length instead of Sepal.Length as column name
Warning in FUN(X[[2L]], ...) :
Use Sepal_Width instead of Sepal.Width as column name
Warning in FUN(X[[3L]], ...) :
Use Petal_Length instead of Petal.Length as column name

These warnings are the result of having `.` in the column names. For reference, 
see SPARK-12191 and SPARK-11976. Since supporting that would involve changes on 
the SQL side, if we can't support it there we should strongly consider using a 
dataset without `.` in its column names, e.g. `cars` (see the sketch below).

We should then update the API doc (roxygen2 doc strings), vignettes, programming 
guide, and R code examples accordingly.
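
A small illustration of the warning and of dot-free alternatives; the rename 
below is only one possible way to keep examples quiet, not necessarily the 
approach this issue will take:

{code}
library(SparkR)
sparkR.session(master = "local")

# Each `.` in a column name triggers a rename warning:
df <- createDataFrame(iris)    # "Use Sepal_Length instead of Sepal.Length ..."

# Renaming up front keeps the examples quiet:
iris2 <- iris
names(iris2) <- gsub("\\.", "_", names(iris2))
df2 <- createDataFrame(iris2)  # no warnings

# `cars` already has dot-free column names (speed, dist):
df3 <- createDataFrame(cars)
{code}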




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (ZEPPELIN-2058) Reduce test matrix on Travis

2017-02-04 Thread Felix Cheung (JIRA)
Felix Cheung created ZEPPELIN-2058:
--

 Summary: Reduce test matrix on Travis
 Key: ZEPPELIN-2058
 URL: https://issues.apache.org/jira/browse/ZEPPELIN-2058
 Project: Zeppelin
  Issue Type: Bug
  Components: build
Reporter: Felix Cheung


We have 11 profiles in the Travis matrix and the tests run for a long time. 
We should consider streamlining it:

- Do we really support that many versions of Spark? How about just 1.6.x, 2.0.x 
and 2.1.x?
- Could we merge the Python 2 and 3 tests into other profiles?
- Could we merge the Livy test into another profile?




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (SPARK-19386) Bisecting k-means in SparkR documentation

2017-02-03 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-19386:


Assignee: Krishna Kalyan  (was: Miao Wang)

> Bisecting k-means in SparkR documentation
> -
>
> Key: SPARK-19386
> URL: https://issues.apache.org/jira/browse/SPARK-19386
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Krishna Kalyan
> Fix For: 2.2.0
>
>
> We need updates to the programming guide, examples and vignettes.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19386) Bisecting k-means in SparkR documentation

2017-02-03 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19386.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

> Bisecting k-means in SparkR documentation
> -
>
> Key: SPARK-19386
> URL: https://issues.apache.org/jira/browse/SPARK-19386
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Krishna Kalyan
> Fix For: 2.2.0
>
>
> We need updates to the programming guide, examples and vignettes.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19319) SparkR Kmeans summary returns error when the cluster size doesn't equal to k

2017-02-01 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19319:
-
Component/s: SparkR
 ML

> SparkR Kmeans summary returns error when the cluster size doesn't equal to k
> 
>
> Key: SPARK-19319
> URL: https://issues.apache.org/jira/browse/SPARK-19319
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.2.0
>
>
> When k-means is run with initMode = "random" and certain random seeds, the 
> actual number of clusters can be smaller than the configured `k`.
> In this case, summary(model) returns an error because the number of columns 
> of the coefficient matrix doesn't equal k.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19319) SparkR Kmeans summary returns error when the cluster size doesn't equal to k

2017-01-31 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19319.
--
  Resolution: Fixed
Assignee: Miao Wang
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> SparkR Kmeans summary returns error when the cluster size doesn't equal to k
> 
>
> Key: SPARK-19319
> URL: https://issues.apache.org/jira/browse/SPARK-19319
> Project: Spark
>  Issue Type: Bug
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.2.0
>
>
> When k-means is run with initMode = "random" and certain random seeds, the 
> actual number of clusters can be smaller than the configured `k`.
> In this case, summary(model) returns an error because the number of columns 
> of the coefficient matrix doesn't equal k.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19386) Bisecting k-means in SparkR documentation

2017-01-31 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19386:
-
Target Version/s: 2.2.0

> Bisecting k-means in SparkR documentation
> -
>
> Key: SPARK-19386
> URL: https://issues.apache.org/jira/browse/SPARK-19386
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Miao Wang
>
> We need updates to the programming guide, examples and vignettes.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2

2017-01-31 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847452#comment-15847452
 ] 

Felix Cheung commented on SPARK-18864:
--

SPARK-19066 LDA doesn't set optimizer correctly
(This is also in 2.1.1 though)

> Changes of MLlib and SparkR behavior for 2.2
> 
>
> Key: SPARK-18864
> URL: https://issues.apache.org/jira/browse/SPARK-18864
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>
> This JIRA is for tracking changes of behavior within MLlib and SparkR for the 
> Spark 2.2 release.  If any JIRAs change behavior, please list them below with 
> a short description of the change.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2

2017-01-31 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18864:
-
Comment: was deleted

(was: SPARK-19291 spark.gaussianMixture supports output log-likelihood)

> Changes of MLlib and SparkR behavior for 2.2
> 
>
> Key: SPARK-18864
> URL: https://issues.apache.org/jira/browse/SPARK-18864
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>
> This JIRA is for tracking changes of behavior within MLlib and SparkR for the 
> Spark 2.2 release.  If any JIRAs change behavior, please list them below with 
> a short description of the change.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2

2017-01-31 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847449#comment-15847449
 ] 

Felix Cheung commented on SPARK-18864:
--

SPARK-19291 spark.gaussianMixture supports output log-likelihood

> Changes of MLlib and SparkR behavior for 2.2
> 
>
> Key: SPARK-18864
> URL: https://issues.apache.org/jira/browse/SPARK-18864
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>
> This JIRA is for tracking changes of behavior within MLlib and SparkR for the 
> Spark 2.2 release.  If any JIRAs change behavior, please list them below with 
> a short description of the change.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19395) Convert coefficients in summary to matrix

2017-01-31 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19395.
--
  Resolution: Fixed
Assignee: Wayne Zhang
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> Convert coefficients in summary to matrix
> -
>
> Key: SPARK-19395
> URL: https://issues.apache.org/jira/browse/SPARK-19395
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
> Fix For: 2.2.0
>
>
> The coefficients component in the model summary should be a 'matrix', but the 
> underlying structure is actually a list. This affects several models; 
> 'AFTSurvivalRegressionModel' is the exception with the correct implementation. 
> The fix is to unlist the coefficients returned from callJMethod before 
> converting them to a matrix.
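
A plain-R illustration of the underlying problem (this is not the actual 
SparkR wrapper code, just the list-vs-numeric distinction it has to handle):

{code}
# What comes back from the JVM side is a list of numbers; wrapping it in
# matrix() directly gives a matrix of mode "list".
coefs <- list(0.5, -1.2, 3.0)
bad <- matrix(coefs, ncol = 1)
is.numeric(bad[, 1])           # FALSE -- the column is still a list

# unlist() first, then matrix(), yields a proper numeric matrix
# (the row/column names here are just placeholders):
good <- matrix(unlist(coefs), ncol = 1,
               dimnames = list(c("(Intercept)", "x1", "x2"), "Estimate"))
is.numeric(good[, 1])          # TRUE
{code}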



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2

2017-01-31 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847442#comment-15847442
 ] 

Felix Cheung edited comment on SPARK-18864 at 1/31/17 8:24 PM:
---

SPARK-19395 changes to summary format for coefficients 


was (Author: felixcheung):
SPARK-19395 changes to summary format

> Changes of MLlib and SparkR behavior for 2.2
> 
>
> Key: SPARK-18864
> URL: https://issues.apache.org/jira/browse/SPARK-18864
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>
> This JIRA is for tracking changes of behavior within MLlib and SparkR for the 
> Spark 2.2 release.  If any JIRAs change behavior, please list them below with 
> a short description of the change.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2

2017-01-31 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847442#comment-15847442
 ] 

Felix Cheung commented on SPARK-18864:
--

SPARK-19395 changes to summary format

> Changes of MLlib and SparkR behavior for 2.2
> 
>
> Key: SPARK-18864
> URL: https://issues.apache.org/jira/browse/SPARK-18864
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>
> This JIRA is for tracking changes of behavior within MLlib and SparkR for the 
> Spark 2.2 release.  If any JIRAs change behavior, please list them below with 
> a short description of the change.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19399) R Coalesce on DataFrame and coalesce on column

2017-01-29 Thread Felix Cheung (JIRA)
Felix Cheung updated an issue

Spark / SPARK-19399
R Coalesce on DataFrame and coalesce on column

Change By: Felix Cheung

coalesce on DataFrame is different from repartition, where shuffling is 
avoided. We should have that in SparkR.
coalesce on Column is convenient to have in expressions.

This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
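
A sketch of how the two proposed APIs above might be used from SparkR; the 
function names and signatures are assumptions based on the description, not 
a committed interface:

{code}
library(SparkR)
sparkR.session(master = "local")

df <- createDataFrame(data.frame(a = c(1, NA, 3), b = c(NA, 2, 3)))

# DataFrame-level coalesce: reduce the number of partitions without a full
# shuffle, in contrast to repartition() which always shuffles.
df1 <- coalesce(df, 1L)

# Column-level coalesce: the first non-NA value across columns, usable
# inside expressions such as withColumn()/select().
df2 <- withColumn(df, "ab", coalesce(df$a, df$b))
head(df2)
{code}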
 
 
 
 
  
 
 
 
 
 
 
 
 
   



[jira] (SPARK-19399) R Coalesce on DataFrame and coalesce on column

2017-01-29 Thread Felix Cheung (JIRA)
Felix Cheung created an issue

Spark / SPARK-19399
R Coalesce on DataFrame and coalesce on column

Issue Type: Bug
Affects Versions: 2.1.0
Assignee: Unassigned
Components: SparkR
Created: 30/Jan/17 07:47
Priority: Major
Reporter: Felix Cheung

This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)

[jira] [Comment Edited] (SPARK-14709) spark.ml API for linear SVM

2017-01-28 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15835554#comment-15835554
 ] 

Felix Cheung edited comment on SPARK-14709 at 1/28/17 11:17 PM:


[~josephkb] should we add SparkR API as one follow up tasks? (I could shepherd 
that)


was (Author: felixcheung):
[~josephkb] should we add SparR API as one follow up tasks? (I could shepherd 
that)

> spark.ml API for linear SVM
> ---
>
> Key: SPARK-14709
> URL: https://issues.apache.org/jira/browse/SPARK-14709
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
> Fix For: 2.2.0
>
>
> Provide API for SVM algorithm for DataFrames.  I would recommend using 
> OWL-QN, rather than wrapping spark.mllib's SGD-based implementation.
> The API should mimic existing spark.ml.classification APIs.
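
For the SparkR follow-up suggested in the comment above, a sketch of what the 
wrapper call could look like; the function name and arguments are assumptions 
in the spirit of the existing spark.* wrappers, not a committed API:

{code}
library(SparkR)
sparkR.session(master = "local")

t <- as.data.frame(Titanic)
training <- createDataFrame(t)

# Hypothetical wrapper name, mirroring spark.glm()/spark.kmeans() style.
model <- spark.svmLinear(training, Survived ~ ., regParam = 0.01, maxIter = 10)
summary(model)
head(predict(model, training))
{code}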



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19387) CRAN tests do not run with SparkR source package

2017-01-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19387:
-
Description: 
It looks like sparkR.session() is not installing Spark - as a result, running R 
CMD check --as-cran SparkR_*.tar.gz fails, blocking possible submission to CRAN.
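
A minimal sketch of the intended flow in a bare environment, assuming SparkR's 
install.spark() is used to populate the local cache before the session starts:

{code}
library(SparkR)

# On a machine with no SPARK_HOME set (such as a CRAN check host), download
# and cache a Spark distribution first, then start the session.
install.spark()
sparkR.session(master = "local")
{code}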


> CRAN tests do not run with SparkR source package
> 
>
> Key: SPARK-19387
> URL: https://issues.apache.org/jira/browse/SPARK-19387
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> It looks like sparkR.session() is not installing Spark - as a result, running 
> R CMD check --as-cran SparkR_*.tar.gz fails, blocking possible submission to 
> CRAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19387) CRAN tests do not run with SparkR source package

2017-01-27 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19387:


 Summary: CRAN tests do not run with SparkR source package
 Key: SPARK-19387
 URL: https://issues.apache.org/jira/browse/SPARK-19387
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung
Assignee: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19386) Bisecting k-means in SparkR documentation

2017-01-27 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19386:


 Summary: Bisecting k-means in SparkR documentation
 Key: SPARK-19386
 URL: https://issues.apache.org/jira/browse/SPARK-19386
 Project: Spark
  Issue Type: Bug
  Components: ML, SparkR
Affects Versions: 2.2.0
Reporter: Felix Cheung
Assignee: Miao Wang


We need updates to the programming guide, examples and vignettes.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19333) Files out of compliance with ASF policy

2017-01-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19333:
-
Affects Version/s: 2.0.0
   2.1.0

> Files out of compliance with ASF policy
> ---
>
> Key: SPARK-19333
> URL: https://issues.apache.org/jira/browse/SPARK-19333
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: John D. Ament
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> ASF policy is that source files include our headers:
> http://www.apache.org/legal/release-policy.html#license-headers
> However, there are a few files in Spark's release that are missing headers. 
> This list is not exhaustive:
> https://github.com/apache/spark/blob/master/R/pkg/DESCRIPTION
> https://github.com/apache/spark/blob/master/R/pkg/NAMESPACE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19333) Files out of compliance with ASF policy

2017-01-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19333:
-
Affects Version/s: 2.0.1
   2.0.2

> Files out of compliance with ASF policy
> ---
>
> Key: SPARK-19333
> URL: https://issues.apache.org/jira/browse/SPARK-19333
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: John D. Ament
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> ASF policy is that source files include our headers:
> http://www.apache.org/legal/release-policy.html#license-headers
> However, there are a few files in Spark's release that are missing headers. 
> This list is not exhaustive:
> https://github.com/apache/spark/blob/master/R/pkg/DESCRIPTION
> https://github.com/apache/spark/blob/master/R/pkg/NAMESPACE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19333) Files out of compliance with ASF policy

2017-01-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19333.
--
  Resolution: Fixed
   Fix Version/s: 2.2.0
  2.1.1
  2.0.3
Target Version/s: 2.0.3, 2.1.1, 2.2.0

> Files out of compliance with ASF policy
> ---
>
> Key: SPARK-19333
> URL: https://issues.apache.org/jira/browse/SPARK-19333
> Project: Spark
>  Issue Type: Improvement
>Reporter: John D. Ament
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> ASF policy is that source files include our headers:
> http://www.apache.org/legal/release-policy.html#license-headers
> However, there are a few files in Spark's release that are missing headers. 
> This list is not exhaustive:
> https://github.com/apache/spark/blob/master/R/pkg/DESCRIPTION
> https://github.com/apache/spark/blob/master/R/pkg/NAMESPACE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18788) Add getNumPartitions() to SparkR

2017-01-26 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-18788.
--
  Resolution: Fixed
   Fix Version/s: 2.2.0
  2.1.1
Target Version/s: 2.1.1, 2.2.0

> Add getNumPartitions() to SparkR
> 
>
> Key: SPARK-18788
> URL: https://issues.apache.org/jira/browse/SPARK-18788
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Raela Wang
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> Would be really convenient to have getNumPartitions() in SparkR, which was in 
> the RDD API.
> rdd <- SparkR:::toRDD(df)
> SparkR:::getNumPartitions(rdd)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18821) Bisecting k-means wrapper in SparkR

2017-01-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15841004#comment-15841004
 ] 

Felix Cheung commented on SPARK-18821:
--

Need to follow up with programming guide, example and vignettes

> Bisecting k-means wrapper in SparkR
> ---
>
> Key: SPARK-18821
> URL: https://issues.apache.org/jira/browse/SPARK-18821
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Felix Cheung
>Assignee: Miao Wang
> Fix For: 2.2.0
>
>
> Implement a wrapper in SparkR to support bisecting k-means
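
For the docs follow-up, a usage sketch of the wrapper this ticket added; the 
exact argument set is assumed here and should be checked against the merged API:

{code}
library(SparkR)
sparkR.session(master = "local")

df <- createDataFrame(cars)    # columns: speed, dist
model <- spark.bisectingKmeans(df, ~ speed + dist, k = 3)
summary(model)
head(predict(model, df))
{code}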



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18821) Bisecting k-means wrapper in SparkR

2017-01-26 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18821:
-
Fix Version/s: 2.2.0

> Bisecting k-means wrapper in SparkR
> ---
>
> Key: SPARK-18821
> URL: https://issues.apache.org/jira/browse/SPARK-18821
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Felix Cheung
>Assignee: Miao Wang
> Fix For: 2.2.0
>
>
> Implement a wrapper in SparkR to support bisecting k-means



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18821) Bisecting k-means wrapper in SparkR

2017-01-26 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-18821.
--
Resolution: Fixed
  Assignee: Miao Wang

> Bisecting k-means wrapper in SparkR
> ---
>
> Key: SPARK-18821
> URL: https://issues.apache.org/jira/browse/SPARK-18821
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Felix Cheung
>Assignee: Miao Wang
>
> Implement a wrapper in SparkR to support bisecting k-means



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19366) Dataset should have getNumPartitions method

2017-01-25 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19366:
-
Component/s: (was: Spark Core)
 SQL

> Dataset should have getNumPartitions method
> ---
>
> Key: SPARK-19366
> URL: https://issues.apache.org/jira/browse/SPARK-19366
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
>
> This would avoid inefficiency in converting Dataset/DataFrame into RDD in 
> non-JVM languages (specifically in R where the conversion can be expensive)
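
In SparkR terms the inefficiency looks like this today; the direct method named 
at the end is the proposal, not an existing API at the time of writing:

{code}
df <- createDataFrame(mtcars)

# Current workaround: convert to an RDD through private APIs just to read
# the partition count (the conversion itself can be expensive).
rdd <- SparkR:::toRDD(df)
SparkR:::getNumPartitions(rdd)

# Proposed: expose it on the Dataset/DataFrame directly, e.g.
# getNumPartitions(df)
{code}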



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19366) Dataset should have getNumPartitions method

2017-01-25 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-19366:


Assignee: Felix Cheung

> Dataset should have getNumPartitions method
> ---
>
> Key: SPARK-19366
> URL: https://issues.apache.org/jira/browse/SPARK-19366
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
>
> This would avoid inefficiency in converting Dataset/DataFrame into RDD in 
> non-JVM languages (specifically in R where the conversion can be expensive)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19366) Dataset should have getNumPartitions method

2017-01-25 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19366:
-
Description: This would avoid inefficiency in converting Dataset/DataFrame 
into RDD in non-JVM languages (specifically in R where the conversion can be 
expensive)

> Dataset should have getNumPartitions method
> ---
>
> Key: SPARK-19366
> URL: https://issues.apache.org/jira/browse/SPARK-19366
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Priority: Minor
>
> This would avoid inefficiency in converting Dataset/DataFrame into RDD in 
> non-JVM languages (specifically in R where the conversion can be expensive)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19366) Dataset should have getNumPartitions method

2017-01-25 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19366:


 Summary: Dataset should have getNumPartitions method
 Key: SPARK-19366
 URL: https://issues.apache.org/jira/browse/SPARK-19366
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Felix Cheung
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19324) JVM stdout output is dropped in SparkR

2017-01-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19324:
-
Description: 
Whenever there are stdout outputs from Spark in JVM (typically when calling 
println()) they are dropped by SparkR.

For example, explain() for Column
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L


> JVM stdout output is dropped in SparkR
> --
>
> Key: SPARK-19324
> URL: https://issues.apache.org/jira/browse/SPARK-19324
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> Whenever there are stdout outputs from Spark in JVM (typically when calling 
> println()) they are dropped by SparkR.
> For example, explain() for Column
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19324) JVM stdout output is dropped in SparkR

2017-01-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-19324:


Assignee: Felix Cheung

> JVM stdout output is dropped in SparkR
> --
>
> Key: SPARK-19324
> URL: https://issues.apache.org/jira/browse/SPARK-19324
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> Whenever there are stdout outputs from Spark in JVM (typically when calling 
> println()) they are dropped by SparkR.
> For example, explain() for Column
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19324) JVM stdout output is dropped in SparkR

2017-01-21 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19324:


 Summary: JVM stdout output is dropped in SparkR
 Key: SPARK-19324
 URL: https://issues.apache.org/jira/browse/SPARK-19324
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16693) Remove R deprecated methods

2017-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833206#comment-15833206
 ] 

Felix Cheung commented on SPARK-16693:
--

I want to bring this up again, and potentially discuss it on dev@, now that we 
are on Spark 2.2.

Not only are these deprecated methods harder to work with, making it harder to 
add new parameters (e.g. numPartitions) and so on, but more importantly the 
wrapper/stub methods (e.g. createDataFrame.default) show up in auto-complete, 
tooltips, help docs and so on and create confusion.

Moreover, on a slightly orthogonal note, we should also consider removing all 
the internal RDD methods, or at least making them non-exported S3 methods. 
Right now, every time we add a new method with the same name as an existing, 
internal-only RDD method in R, we need to update the generic, rename the 
existing method (by appending "RDD" to its name), and update all of its call 
sites; otherwise we get a check-cran warning about missing documentation. And 
by adding such a method to the NAMESPACE file, the existing RDD-only method 
gets exposed as well (again showing up in auto-complete etc.), unless it is 
renamed. Since we are renaming the existing method to fix this, we are breaking 
backward compatibility anyway (although the method was not public, so strictly 
speaking there has never been any guarantee).

But I could scope this JIRA to only the sqlContext methods and leave this 
second issue to a different JIRA.


> Remove R deprecated methods
> ---
>
> Key: SPARK-16693
> URL: https://issues.apache.org/jira/browse/SPARK-16693
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>
> For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18788) Add getNumPartitions() to SparkR

2017-01-20 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-18788:


Assignee: Felix Cheung

> Add getNumPartitions() to SparkR
> 
>
> Key: SPARK-18788
> URL: https://issues.apache.org/jira/browse/SPARK-18788
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Raela Wang
>Assignee: Felix Cheung
>Priority: Minor
>
> Would be really convenient to have getNumPartitions() in SparkR, which was in 
> the RDD API.
> rdd <- SparkR:::toRDD(df)
> SparkR:::getNumPartitions(rdd)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19288) Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in R/run-tests.sh

2017-01-20 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832837#comment-15832837
 ] 

Felix Cheung commented on SPARK-19288:
--

Hmm, that's odd. What system and R version is this?
I'm wondering if this is related to the time zone.

> Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in 
> R/run-tests.sh
> --
>
> Key: SPARK-19288
> URL: https://issues.apache.org/jira/browse/SPARK-19288
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL, Tests
>Affects Versions: 2.0.1
> Environment: Ubuntu 16.04, X86_64, ppc64le
>Reporter: Nirman Narang
>
> Full log here.
> {code:title=R/run-tests.sh|borderStyle=solid}
> Loading required package: methods
> Attaching package: 'SparkR'
> The following object is masked from 'package:testthat':
> describe
> The following objects are masked from 'package:stats':
> cov, filter, lag, na.omit, predict, sd, var, window
> The following objects are masked from 'package:base':
> as.data.frame, colnames, colnames<-, drop, intersect, rank, rbind,
> sample, subset, summary, transform, union
> functions on binary files : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> 
> binary functions : ...
> broadcast variables : ..
> functions in client.R : .
> test functions in sparkR.R : .Re-using existing Spark Context. Call 
> sparkR.session.stop() or restart R to create a new Spark Context
> ...
> include R packages : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> JVM API : ..
> MLlib functions : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> .SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> .Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Dictionary is on
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Validation is off
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Writer version is: PARQUET_1_0
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem 
> columnStore to file. allocated memory: 65,622
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 70B for [label] 
> BINARY: 1 values, 21B raw, 23B comp, 1 pages, encodings: [PLAIN, BIT_PACKED, 
> RLE]
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 87B for [terms, 
> list, element, list, element] BINARY: 2 values, 42B raw, 43B comp, 1 pages, 
> encodings: [PLAIN, RLE]
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 30B for 
> [hasIntercept] BOOLEAN: 1 values, 1B raw, 3B comp, 1 pages, encodings: 
> [PLAIN, BIT_PACKED]
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Dictionary is on
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Validation is off
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Writer version is: PARQUET_1_0
> Jan 19, 2017 5:40:55 PM INFO: 
> org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem 
> columnStore to file. allocated memory: 49
> Jan 19, 2017 5:40:55 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 90B for [labels, 
> list, element] BINARY: 3 values, 50B raw, 50B comp, 1 pages, encodings: 
> [PLAIN, RLE]
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> 

[jira] [Updated] (SPARK-12347) Write script to run all MLlib examples for testing

2017-01-19 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-12347:
-
Shepherd: Felix Cheung

> Write script to run all MLlib examples for testing
> --
>
> Key: SPARK-12347
> URL: https://issues.apache.org/jira/browse/SPARK-12347
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib, PySpark, SparkR, Tests
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> It would facilitate testing to have a script which runs all MLlib examples 
> for all languages.
> Design sketch to ensure all examples are run:
> * Generate a list of examples to run programmatically (not from a fixed list).
> * Use a list of special examples to handle examples which require command 
> line arguments.
> * Make sure data, etc. used are small to keep the tests quick.
> This could be broken into subtasks for each language, though it would be nice 
> to provide a single script.
> Not sure where the script should live; perhaps in {{bin/}}?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing

2017-01-19 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831191#comment-15831191
 ] 

Felix Cheung commented on SPARK-12347:
--

Great!

> Write script to run all MLlib examples for testing
> --
>
> Key: SPARK-12347
> URL: https://issues.apache.org/jira/browse/SPARK-12347
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib, PySpark, SparkR, Tests
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> It would facilitate testing to have a script which runs all MLlib examples 
> for all languages.
> Design sketch to ensure all examples are run:
> * Generate a list of examples to run programmatically (not from a fixed list).
> * Use a list of special examples to handle examples which require command 
> line arguments.
> * Make sure data, etc. used are small to keep the tests quick.
> This could be broken into subtasks for each language, though it would be nice 
> to provide a single script.
> Not sure where the script should live; perhaps in {{bin/}}?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19288) Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in R/run-tests.sh

2017-01-19 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830284#comment-15830284
 ] 

Felix Cheung commented on SPARK-19288:
--

We are not seeing this in Jenkins? Which branch are you running this from?

> Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in 
> R/run-tests.sh
> --
>
> Key: SPARK-19288
> URL: https://issues.apache.org/jira/browse/SPARK-19288
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL, Tests
>Affects Versions: 2.0.1
> Environment: Ubuntu 16.04, X86_64, ppc64le
>Reporter: Nirman Narang
>
> Full log here.
> {code:title=R/run-tests.sh|borderStyle=solid}
> Loading required package: methods
> Attaching package: 'SparkR'
> The following object is masked from 'package:testthat':
> describe
> The following objects are masked from 'package:stats':
> cov, filter, lag, na.omit, predict, sd, var, window
> The following objects are masked from 'package:base':
> as.data.frame, colnames, colnames<-, drop, intersect, rank, rbind,
> sample, subset, summary, transform, union
> functions on binary files : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> 
> binary functions : ...
> broadcast variables : ..
> functions in client.R : .
> test functions in sparkR.R : .Re-using existing Spark Context. Call 
> sparkR.session.stop() or restart R to create a new Spark Context
> ...
> include R packages : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> JVM API : ..
> MLlib functions : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> .SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> .Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Dictionary is on
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Validation is off
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Writer version is: PARQUET_1_0
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem 
> columnStore to file. allocated memory: 65,622
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 70B for [label] 
> BINARY: 1 values, 21B raw, 23B comp, 1 pages, encodings: [PLAIN, BIT_PACKED, 
> RLE]
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 87B for [terms, 
> list, element, list, element] BINARY: 2 values, 42B raw, 43B comp, 1 pages, 
> encodings: [PLAIN, RLE]
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 30B for 
> [hasIntercept] BOOLEAN: 1 values, 1B raw, 3B comp, 1 pages, encodings: 
> [PLAIN, BIT_PACKED]
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Dictionary is on
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Validation is off
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Writer version is: PARQUET_1_0
> Jan 19, 2017 5:40:55 PM INFO: 
> org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem 
> columnStore to file. allocated memory: 49
> Jan 19, 2017 5:40:55 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 90B for [labels, 
> list, element] BINARY: 3 values, 50B raw, 50B comp, 1 pages, encodings: 
> [PLAIN, RLE]
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size 

[jira] [Updated] (SPARK-18569) Support R formula arithmetic

2017-01-18 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18569:
-
Shepherd: Felix Cheung

> Support R formula arithmetic 
> -
>
> Key: SPARK-18569
> URL: https://issues.apache.org/jira/browse/SPARK-18569
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Felix Cheung
>
> I think we should support arithmetic, which makes it a lot more convenient to 
> build models. Something like
> {code}
>   log(y) ~ a + log(x)
> {code}
> And to avoid resolution confusion we should support the I() operator:
> {code}
> I
>  I(X∗Z) as is: include a new variable consisting of these variables multiplied
> {code}
> such that this works:
> {code}
> y ~ a + I(b+c)
> {code}
> where the term b+c is interpreted as the sum of b and c.
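
For reference, base R's interpretation of these formulas, which is the behavior 
being asked for (plain R below, not the SparkR/RFormula implementation):

{code}
d <- data.frame(y = c(1, 2, 4, 8), a = 1:4, b = c(10, 20, 30, 40),
                c = c(5, 5, 6, 6), x = c(1, 10, 100, 1000))

# Transformations on either side of the formula:
model.matrix(log(y) ~ a + log(x), d)

# I() protects the arithmetic, so b + c becomes a single summed term
# rather than two separate main effects:
model.matrix(y ~ a + I(b + c), d)
{code}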



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19231) SparkR hangs when there is download or untar failure

2017-01-18 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19231.
--
  Resolution: Fixed
   Fix Version/s: 2.2.0
  2.1.1
Target Version/s: 2.1.1, 2.2.0

> SparkR hangs when there is download or untar failure
> 
>
> Key: SPARK-19231
> URL: https://issues.apache.org/jira/browse/SPARK-19231
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.1, 2.2.0
>
>
> When there is a partial download or a download error, it is not cleaned up, 
> and sparkR.session() remains stuck with no error message.
> {code}
> > sparkR.session()
> Spark not found in SPARK_HOME:
> Spark not found in the cache directory. Installation will start.
> MirrorUrl not provided.
> Looking for preferred site from apache website...
> Preferred mirror site found: http://www-eu.apache.org/dist/spark
> Downloading spark-2.1.0 for Hadoop 2.7 from:
> - 
> http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
> trying URL 
> 'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
> Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
> downloaded 31.9 MB
>  
> Installing to C:\Users\felix\AppData\Local\spark\spark\Cache
> Error in untar2(tarfile, files, list, exdir) : incomplete block on file
> In addition: Warning message:
> In download.file(remotePath, localPath) :
>   downloaded length 33471940 != reported length 195636829
> > sparkR.session()
> Spark not found in SPARK_HOME:
> spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
> C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7
> Launching java with spark-submit command 
> C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7/bin/spark-submit2.cmd
>sparkr-shell 
> C:\Users\felix\AppData\Local\Temp\RtmpCqNdne\backend_port16d04191e7
> {code}
> {code}
> Directory of C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}
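
Until the fix, one manual workaround is to clear the partial download from the 
cache and retry; the path below is the one from the report above and will 
differ per OS and user:

{code}
cache <- "C:/Users/felix/AppData/Local/spark/spark/Cache"

# Remove the truncated archive and any half-extracted directory, then retry.
unlink(file.path(cache, "spark-2.1.0-bin-hadoop2.7.tgz"))
unlink(file.path(cache, "spark-2.1.0-bin-hadoop2.7"), recursive = TRUE)

sparkR.session()   # triggers a fresh download and install
{code}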



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18569) Support R formula arithmetic

2017-01-18 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828435#comment-15828435
 ] 

Felix Cheung commented on SPARK-18569:
--

Yes, I'll put together a proposal and shepherd this

> Support R formula arithmetic 
> -
>
> Key: SPARK-18569
> URL: https://issues.apache.org/jira/browse/SPARK-18569
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Felix Cheung
>
> I think we should support arithmetic, which makes it a lot more convenient to 
> build models. Something like
> {code}
>   log(y) ~ a + log(x)
> {code}
> And to avoid resolution confusions we should support the I() operator:
> {code}
> I
>  I(X∗Z) as is: include a new variable consisting of these variables multiplied
> {code}
> Such that this works:
> {code}
> y ~ a + I(b+c)
> {code}
> where the term b+c is interpreted as the sum of b and c.
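For reference, base R already gives these expressions the semantics described above; a minimal sketch (the toy data and lm() calls are illustrative only, not the proposed SparkR implementation):
{code}
# base R behavior the proposal would mirror
df <- data.frame(y = rexp(100), a = rnorm(100), b = rnorm(100), c = rnorm(100), x = rexp(100))

# log() inside a formula transforms the column before fitting
m1 <- lm(log(y) ~ a + log(x), data = df)

# without I(), "+" is formula syntax (separate terms); inside I(), "+" is arithmetic,
# so b + c becomes a single derived regressor
m2 <- lm(y ~ a + I(b + c), data = df)
names(coef(m2))  # "(Intercept)" "a" "I(b + c)"
{code}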



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18570) Consider supporting other R formula operators

2017-01-18 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828432#comment-15828432
 ] 

Felix Cheung edited comment on SPARK-18570 at 1/18/17 5:27 PM:
---

[~KrishnaKalyan3] I think supporting 
{code}
x * y
(a+b+c)^2
{code}

and double checking
{code}
- a:b
{code}
is supported (i.e., "-" applied to a non-constant term)

would be a great starting point!



was (Author: felixcheung):
[~KrishnaKalyan3] I think supporting 
x * y
(a+b+c)^2

and double checking
- a:b
is supported (i.e., "-" applied to a non-constant term)

would be a great starting point!


> Consider supporting other R formula operators
> -
>
> Key: SPARK-18570
> URL: https://issues.apache.org/jira/browse/SPARK-18570
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Felix Cheung
>Priority: Minor
>
> Such as
> {code}
> ∗ 
>  X∗Y include these variables and the interactions between them
> ^
>  (X + Z + W)^3 include these variables and all interactions up to three way
> |
>  X | Z conditioning: include x given z
> {code}
> Other includes, %in%, ` (backtick)
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html
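For reference, this is how base R expands the first two operators; a quick terms() check, shown only as the behavior SparkR's RFormula would need to match:
{code}
# "*" expands to main effects plus the interaction
attr(terms(~ x * y), "term.labels")
# [1] "x"   "y"   "x:y"

# "^" expands to all interactions up to the given order
attr(terms(~ (X + Z + W)^3), "term.labels")
# [1] "X"     "Z"     "W"     "X:Z"   "X:W"   "Z:W"   "X:Z:W"
{code}
The "|" conditioning operator is not expanded by base terms(); it is mainly interpreted by packages such as lattice and lme4.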



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18570) Consider supporting other R formula operators

2017-01-18 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828432#comment-15828432
 ] 

Felix Cheung commented on SPARK-18570:
--

[~KrishnaKalyan3] I think supporting 
x * y
(a+b+c)^2

and double checking
- a:b
is supported (i.e., "-" applied to a non-constant term)

would be a great starting point!


> Consider supporting other R formula operators
> -
>
> Key: SPARK-18570
> URL: https://issues.apache.org/jira/browse/SPARK-18570
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Felix Cheung
>Priority: Minor
>
> Such as
> {code}
> ∗ 
>  X∗Y include these variables and the interactions between them
> ^
>  (X + Z + W)^3 include these variables and all interactions up to three way
> |
>  X | Z conditioning: include x given z
> {code}
> Other includes, %in%, ` (backtick)
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18011) SparkR serialize "NA" throws exception

2017-01-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827478#comment-15827478
 ] 

Felix Cheung commented on SPARK-18011:
--

Very cool, thanks for all the investigation. What version of R are you running? 
Is this on Mac or Linux?

> SparkR serialize "NA" throws exception
> --
>
> Key: SPARK-18011
> URL: https://issues.apache.org/jira/browse/SPARK-18011
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>
> For some versions of R, if a Date column has an "NA" field, the backend will 
> throw a NegativeArraySizeException.
> To reproduce the problem:
> {code}
> > a <- as.Date(c("2016-11-11", "NA"))
> > b <- as.data.frame(a)
> > c <- createDataFrame(b)
> > dim(c)
> 16/10/19 10:31:24 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.NegativeArraySizeException
>   at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
>   at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
>   at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128)
>   at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77)
>   at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.Range.foreach(Range.scala:160)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing

2017-01-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827475#comment-15827475
 ] 

Felix Cheung commented on SPARK-12347:
--

[~ethanlu...@gmail.com] would you be willing and able to spend more time on 
this? I would be interested in working with you to shepherd this for 2.2.

> Write script to run all MLlib examples for testing
> --
>
> Key: SPARK-12347
> URL: https://issues.apache.org/jira/browse/SPARK-12347
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib, PySpark, SparkR, Tests
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> It would facilitate testing to have a script which runs all MLlib examples 
> for all languages.
> Design sketch to ensure all examples are run:
> * Generate a list of examples to run programmatically (not from a fixed list).
> * Use a list of special examples to handle examples which require command 
> line arguments.
> * Make sure data, etc. used are small to keep the tests quick.
> This could be broken into subtasks for each language, though it would be nice 
> to provide a single script.
> Not sure where the script should live; perhaps in {{bin/}}?
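A minimal sketch of the "generate the list programmatically" idea, restricted to the R examples only; the paths and the special-case entry are assumptions for illustration, not an agreed design:
{code}
spark_home <- Sys.getenv("SPARK_HOME")
r_examples <- list.files(file.path(spark_home, "examples", "src", "main", "r"),
                         pattern = "\\.R$", full.names = TRUE, recursive = TRUE)

# examples that need command-line arguments get them from a small lookup table
# (the entry below is purely illustrative)
special_args <- list("data-manipulation.R" = "path/to/flights.csv")

for (ex in r_examples) {
  args <- special_args[[basename(ex)]]
  status <- system2(file.path(spark_home, "bin", "spark-submit"),
                    c(ex, if (is.null(args)) character(0) else args))
  if (status != 0) stop("example failed: ", ex)
}
{code}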



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary

2017-01-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827474#comment-15827474
 ] 

Felix Cheung commented on SPARK-18348:
--

[~yanboliang] would you like to run with this? You had a few great points when 
we discussed this earlier.

> Improve tree ensemble model summary
> ---
>
> Key: SPARK-18348
> URL: https://issues.apache.org/jira/browse/SPARK-18348
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>
> During work on the R APIs for tree ensemble models (e.g., Random Forest, GBT) it was 
> discovered and discussed that:
> - we don't have a good summary on nodes or trees for their observations, 
> loss, probability and so on
> - we don't have a shared API with nicely formatted output
> We believe this could be a shared API that benefits multiple language 
> bindings, including R, when available.
> For example, here is what R {code}rpart{code} shows for model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
> method = "class")
>   n= 81
>           CP nsplit rel error   xerror      xstd
> 1 0.17647059  0 1.000 1.000 0.2155872
> 2 0.01960784  1 0.8235294 0.9411765 0.2107780
> 3 0.0100  4 0.7647059 1.0588235 0.2200975
> Variable importance
>  Start    Age Number
> 64 24 12
> Node number 1: 81 observations,complexity param=0.1764706
>   predicted class=absent   expected loss=0.2098765  P(node) =1
>     class counts:    64    17
>probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>   Start  < 8.5  to the right, improve=6.762330, (0 missing)
>   Number < 5.5  to the left,  improve=2.866795, (0 missing)
>   Age< 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>   Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
> Node number 2: 62 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
> class counts:56 6
>probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>   Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>   Age< 55   to the left,  improve=0.6848635, (0 missing)
>   Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>   Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>   Age< 16   to the left,  agree=0.597, adj=0.138, (0 split)
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
>     class counts:     8    11
>probabilities: 0.421 0.579
> Node number 4: 29 observations
>   predicted class=absent   expected loss=0  P(node) =0.3580247
> class counts:29 0
>probabilities: 1.000 0.000
> Node number 5: 33 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
> class counts:27 6
>probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>   Age< 55   to the left,  improve=1.2467530, (0 missing)
>   Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>   Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>   Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>   Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
> Node number 10: 12 observations
>   predicted class=absent   expected loss=0  P(node) =0.1481481
> class counts:12 0
>probabilities: 1.000 0.000
> Node number 11: 21 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
> class counts:15 6
>probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>   Age< 111  to the right, improve=1.71428600, (0 missing)
>   Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>   Number < 3.5  to the right, improve=0.07142857, (0 missing)
> Node number 22: 14 observations
>   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
> class counts:12 2
>probabilities: 0.857 0.143
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
> class counts: 3 4
>probabilities: 0.429 0.571
> {code}
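For contrast, a rough sketch of what SparkR currently exposes for an ensemble fit on the same data (API as of 2.1.x, assuming the rpart package is installed locally for the kyphosis data; the output is only summarized in the comment):
{code}
library(SparkR)
sparkR.session()

df <- createDataFrame(rpart::kyphosis)
model <- spark.randomForest(df, Kyphosis ~ Age + Number + Start,
                            type = "classification", numTrees = 20)
summary(model)  # formula and per-feature/per-tree level info, but nothing at the node level
{code}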



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19066) SparkR LDA doesn't set optimizer correctly

2017-01-17 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19066:
-
Fix Version/s: 2.1.1

> SparkR LDA doesn't set optimizer correctly
> --
>
> Key: SPARK-19066
> URL: https://issues.apache.org/jira/browse/SPARK-19066
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.1.1, 2.2.0
>
>
> spark.lda passes the optimizer "em" or "online" to the backend. However, 
> LDAWrapper doesn't set the optimizer based on the value from R. Therefore, for 
> optimizer "em", the `isDistributed` field is FALSE when it should be TRUE.
> In addition, the `summary` method should bring back the results related to 
> `DistributedLDAModel`.
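A minimal reproduction sketch of the symptom (the toy data and column name are made up for illustration):
{code}
df <- createDataFrame(data.frame(docs = c("a b c", "a a b", "b c c b"),
                                 stringsAsFactors = FALSE))
model <- spark.lda(df, features = "docs", k = 2, optimizer = "em")

s <- summary(model)
s$isDistributed  # FALSE before the fix, although "em" should yield a DistributedLDAModel
{code}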



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18828) Refactor SparkR build and test scripts

2017-01-16 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-18828.
--
  Resolution: Fixed
Assignee: Felix Cheung
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> Refactor SparkR build and test scripts
> --
>
> Key: SPARK-18828
> URL: https://issues.apache.org/jira/browse/SPARK-18828
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.2.0
>
>
> Since we are building the SparkR source package, we are now seeing the call tree 
> getting more convoluted and more parts getting duplicated.
> We should try to clean this up.
> One issue is the requirement to install SparkR before building the SparkR 
> source package (i.e. R CMD build), because of the loading of SparkR via 
> "library(SparkR)" in the vignettes. When we refactor that part of the 
> vignettes we should be able to further decouple the scripts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19237) SparkR package install stuck when no java is found

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19237:
-
Description: When installing SparkR as an R package (install.packages), it 
will check for a Spark distribution and automatically download and cache it. But 
if there is no Java runtime on the machine, spark-submit will just hang.

> SparkR package install stuck when no java is found
> --
>
> Key: SPARK-19237
> URL: https://issues.apache.org/jira/browse/SPARK-19237
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> When installing SparkR as an R package (install.packages), it will check for 
> a Spark distribution and automatically download and cache it. But if there is 
> no Java runtime on the machine, spark-submit will just hang.
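One possible guard, sketched here with a hypothetical helper name (not the committed fix): check for a Java runtime up front and fail with a clear message instead of handing off to spark-submit:
{code}
stopIfNoJava <- function() {
  # hypothetical helper: fail fast if neither PATH nor JAVA_HOME yields a java binary
  javaBin <- Sys.which("java")
  javaHome <- Sys.getenv("JAVA_HOME")
  if (javaBin == "" && javaHome == "") {
    stop("Java is required to run Spark. Install a JRE/JDK and make sure `java` is on the PATH ",
         "or JAVA_HOME is set before calling sparkR.session().")
  }
  invisible(TRUE)
}
{code}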



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19237) SparkR package stuck when no java is found

2017-01-15 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19237:


 Summary: SparkR package stuck when no java is found
 Key: SPARK-19237
 URL: https://issues.apache.org/jira/browse/SPARK-19237
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19237) SparkR package install stuck when no java is found

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19237:
-
Summary: SparkR package install stuck when no java is found  (was: SparkR 
package stuck when no java is found)

> SparkR package install stuck when no java is found
> --
>
> Key: SPARK-19237
> URL: https://issues.apache.org/jira/browse/SPARK-19237
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19232) SparkR distribution cache location is wrong on Windows

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19232:
-
Description: 
On Linux:

{code}
~/.cache/spark# ls -lart
total 12
drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
{code}

On Windows:
{code}
C:\Users\felix\AppData\Local\spark\spark\Cache
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}

If we follow https://pypi.python.org/pypi/appdirs, appauthor should be "Apache"?


  was:
On Linux:

{code}
~/.cache/spark# ls -lart
total 12
drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
{code}

On Windows:
{code}
C:\Users\felix\AppData\Local\spark\spark\Cache
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}

it should be consistently under Cache\spark or .cache/spark



> SparkR distribution cache location is wrong on Windows
> --
>
> Key: SPARK-19232
> URL: https://issues.apache.org/jira/browse/SPARK-19232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Trivial
>
> On Linux:
> {code}
> ~/.cache/spark# ls -lart
> total 12
> drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
> drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
> drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
> {code}
> On Windows:
> {code}
> C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}
> If we follow https://pypi.python.org/pypi/appdirs, appauthor should be 
> "Apache"?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19232) SparkR distribution cache location is wrong on Windows

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19232:
-
Priority: Trivial  (was: Major)

> SparkR distribution cache location is wrong on Windows
> --
>
> Key: SPARK-19232
> URL: https://issues.apache.org/jira/browse/SPARK-19232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Trivial
>
> On Linux:
> {code}
> ~/.cache/spark# ls -lart
> total 12
> drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
> drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
> drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
> {code}
> On Windows:
> {code}
> C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}
> it should be consistently under Cache\spark or .cache/spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19232) SparkR distribution cache location is wrong on Windows

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19232:
-
Description: 
On Linux:

{code}
~/.cache/spark# ls -lart
total 12
drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
{code}

On Windows:
{code}
C:\Users\felix\AppData\Local\spark\spark\Cache
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}

it should be consistently under Cache\spark or .cache/spark


  was:
On Linux:

{code}
~/.cache/spark# ls -lart
total 12
drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
{code}

On Windows:
{code}
C:\Users\felix\AppData\Local\spark\spark\Cache
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}



> SparkR distribution cache location is wrong on Windows
> --
>
> Key: SPARK-19232
> URL: https://issues.apache.org/jira/browse/SPARK-19232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> On Linux:
> {code}
> ~/.cache/spark# ls -lart
> total 12
> drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
> drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
> drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
> {code}
> On Windows:
> {code}
> C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}
> it should be consistently under Cache\spark or .cache/spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19232) SparkR distribution cache location is wrong on Windows

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19232:
-
Description: 
On Linux:

{code}
~/.cache/spark# ls -lart
total 12
drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
{code}

On Windows:
{code}
C:\Users\felix\AppData\Local\spark\spark\Cache
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}


> SparkR distribution cache location is wrong on Windows
> --
>
> Key: SPARK-19232
> URL: https://issues.apache.org/jira/browse/SPARK-19232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> On Linux:
> {code}
> ~/.cache/spark# ls -lart
> total 12
> drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
> drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
> drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
> {code}
> On Windows:
> {code}
> C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19232) SparkR distribution cache location is wrong on Windows

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19232:
-
Summary: SparkR distribution cache location is wrong on Windows  (was: 
SparkR distribution cache location is wrong)

> SparkR distribution cache location is wrong on Windows
> --
>
> Key: SPARK-19232
> URL: https://issues.apache.org/jira/browse/SPARK-19232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19231) SparkR hangs when there is download or untar failure

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-19231:


Assignee: Felix Cheung

> SparkR hangs when there is download or untar failure
> 
>
> Key: SPARK-19231
> URL: https://issues.apache.org/jira/browse/SPARK-19231
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> When there is any partial download or download error, it is not cleaned up, 
> and sparkR.session will remain stuck with no error message.
> {code}
> > sparkR.session()
> Spark not found in SPARK_HOME:
> Spark not found in the cache directory. Installation will start.
> MirrorUrl not provided.
> Looking for preferred site from apache website...
> Preferred mirror site found: http://www-eu.apache.org/dist/spark
> Downloading spark-2.1.0 for Hadoop 2.7 from:
> - 
> http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
> trying URL 
> 'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
> Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
> downloaded 31.9 MB
>  
> Installing to C:\Users\felix\AppData\Local\spark\spark\Cache
> Error in untar2(tarfile, files, list, exdir) : incomplete block on file
> In addition: Warning message:
> In download.file(remotePath, localPath) :
>   downloaded length 33471940 != reported length 195636829
> > sparkR.session()
> Spark not found in SPARK_HOME:
> spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
> C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7
> Launching java with spark-submit command 
> C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7/bin/spark-submit2.cmd
>sparkr-shell 
> C:\Users\felix\AppData\Local\Temp\RtmpCqNdne\backend_port16d04191e7
> {code}
> {code}
> Directory of C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19232) SparkR distribution cache location is wrong

2017-01-15 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19232:


 Summary: SparkR distribution cache location is wrong
 Key: SPARK-19232
 URL: https://issues.apache.org/jira/browse/SPARK-19232
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung
Assignee: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19231) SparkR hangs when there is download or untar failure

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19231:
-
Description: 
When there is any partial download or download error, it is not cleaned up, and 
sparkR.session will remain stuck with no error message.

{code}
> sparkR.session()
Spark not found in SPARK_HOME:
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: http://www-eu.apache.org/dist/spark
Downloading spark-2.1.0 for Hadoop 2.7 from:
- http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
trying URL 
'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
downloaded 31.9 MB
 
Installing to C:\Users\felixc\AppData\Local\spark\spark\Cache
Error in untar2(tarfile, files, list, exdir) : incomplete block on file

In addition: Warning message:
In download.file(remotePath, localPath) :
  downloaded length 33471940 != reported length 195636829
> sparkR.session()
Spark not found in SPARK_HOME:
spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
C:\Users\felixc\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7
Launching java with spark-submit command 
C:\Users\felixc\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7/bin/spark-submit2.cmd
   sparkr-shell 
C:\Users\felixc\AppData\Local\Temp\RtmpCqNdne\backend_port16d04191e7
{code}

{code}
Directory of C:\Users\felixc\AppData\Local\spark\spark\Cache
 01/13/2017  11:25 AM  .
01/13/2017  11:25 AM  ..
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}
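A sketch of the cleanup idea (the helper name and arguments are illustrative, not the actual install code path): delete the partial tarball and any half-extracted directory when either step fails, so the next sparkR.session() retries the download instead of silently reusing a corrupt file:
{code}
downloadAndUntar <- function(remotePath, localPath, installDir) {
  tryCatch({
    download.file(remotePath, localPath)
    untar(tarfile = localPath, exdir = installDir)
  }, error = function(e) {
    unlink(localPath)                     # remove the partial .tgz
    unlink(installDir, recursive = TRUE)  # remove any partially extracted directory
    stop("Spark download or extraction failed; cached files were removed, please retry: ",
         conditionMessage(e))
  })
}
{code}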


> SparkR hangs when there is download or untar failure
> 
>
> Key: SPARK-19231
> URL: https://issues.apache.org/jira/browse/SPARK-19231
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> When there is any partial download or download error, it is not cleaned up, 
> and sparkR.session will remain stuck with no error message.
> {code}
> > sparkR.session()
> Spark not found in SPARK_HOME:
> Spark not found in the cache directory. Installation will start.
> MirrorUrl not provided.
> Looking for preferred site from apache website...
> Preferred mirror site found: http://www-eu.apache.org/dist/spark
> Downloading spark-2.1.0 for Hadoop 2.7 from:
> - 
> http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
> trying URL 
> 'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
> Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
> downloaded 31.9 MB
>  
> Installing to C:\Users\felixc\AppData\Local\spark\spark\Cache
> Error in untar2(tarfile, files, list, exdir) : incomplete block on file
> In addition: Warning message:
> In download.file(remotePath, localPath) :
>   downloaded length 33471940 != reported length 195636829
> > sparkR.session()
> Spark not found in SPARK_HOME:
> spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
> C:\Users\felixc\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7
> Launching java with spark-submit command 
> C:\Users\felixc\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7/bin/spark-submit2.cmd
>sparkr-shell 
> C:\Users\felixc\AppData\Local\Temp\RtmpCqNdne\backend_port16d04191e7
> {code}
> {code}
> Directory of C:\Users\felixc\AppData\Local\spark\spark\Cache
>  01/13/2017  11:25 AM  .
> 01/13/2017  11:25 AM  ..
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19231) SparkR hangs when there is download or untar failure

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19231:
-
Description: 
When there is any partial download or download error, it is not cleaned up, and 
sparkR.session will remain stuck with no error message.

{code}
> sparkR.session()
Spark not found in SPARK_HOME:
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: http://www-eu.apache.org/dist/spark
Downloading spark-2.1.0 for Hadoop 2.7 from:
- http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
trying URL 
'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
downloaded 31.9 MB
 
Installing to C:\Users\felix\AppData\Local\spark\spark\Cache
Error in untar2(tarfile, files, list, exdir) : incomplete block on file

In addition: Warning message:
In download.file(remotePath, localPath) :
  downloaded length 33471940 != reported length 195636829
> sparkR.session()
Spark not found in SPARK_HOME:
spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7
Launching java with spark-submit command 
C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7/bin/spark-submit2.cmd
   sparkr-shell 
C:\Users\felix\AppData\Local\Temp\RtmpCqNdne\backend_port16d04191e7
{code}

{code}
Directory of C:\Users\felix\AppData\Local\spark\spark\Cache
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}


  was:
When there is any partial download or download error, it is not cleaned up, and 
sparkR.session will remain stuck with no error message.

{code}
> sparkR.session()
Spark not found in SPARK_HOME:
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: http://www-eu.apache.org/dist/spark
Downloading spark-2.1.0 for Hadoop 2.7 from:
- http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
trying URL 
'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
downloaded 31.9 MB
 
Installing to C:\Users\felixc\AppData\Local\spark\spark\Cache
Error in untar2(tarfile, files, list, exdir) : incomplete block on file

In addition: Warning message:
In download.file(remotePath, localPath) :
  downloaded length 33471940 != reported length 195636829
> sparkR.session()
Spark not found in SPARK_HOME:
spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
C:\Users\felixc\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7
Launching java with spark-submit command 
C:\Users\felixc\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7/bin/spark-submit2.cmd
   sparkr-shell 
C:\Users\felixc\AppData\Local\Temp\RtmpCqNdne\backend_port16d04191e7
{code}

{code}
Directory of C:\Users\felixc\AppData\Local\spark\spark\Cache
 01/13/2017  11:25 AM  .
01/13/2017  11:25 AM  ..
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}



> SparkR hangs when there is download or untar failure
> 
>
> Key: SPARK-19231
> URL: https://issues.apache.org/jira/browse/SPARK-19231
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> When there is any partial download or download error, it is not cleaned up, 
> and sparkR.session will remain stuck with no error message.
> {code}
> > sparkR.session()
> Spark not found in SPARK_HOME:
> Spark not found in the cache directory. Installation will start.
> MirrorUrl not provided.
> Looking for preferred site from apache website...
> Preferred mirror site found: http://www-eu.apache.org/dist/spark
> Downloading spark-2.1.0 for Hadoop 2.7 from:
> - 
> http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
> trying URL 
> 'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
> Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
> downloaded 31.9 MB
>  
> Installing to C:\Users\felix\AppData\Local\spark\spark\Cache
> Error in untar2(tarfile, files, list, exdir) : incomplete block on file
> In addition: Warning message:
> In download.file(remotePath, localPath) :
>   downloaded length 33471940 != reported length 195636829
> > sparkR.session()
> Spark not found in SPARK_HOME:
> spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
> 

[jira] [Created] (SPARK-19231) SparkR hangs when there is download or untar failure

2017-01-15 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19231:


 Summary: SparkR hangs when there is download or untar failure
 Key: SPARK-19231
 URL: https://issues.apache.org/jira/browse/SPARK-19231
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?

2017-01-11 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820389#comment-15820389
 ] 

Felix Cheung commented on SPARK-18823:
--

Yap. I'll start on this shortly.

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
> Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 
> 4. Or databricks (community.cloud.databricks.com) .
>Reporter: Vicente Masip
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I really don't know if this is a bug or whether it can be done with some function:
> Sometimes it is very important to assign something to a column whose name has to 
> be accessed through a variable. Normally, I have always done it with double 
> brackets like this, outside of SparkR:
> # df could be the faithful data frame or a data table.
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even by column number
> df[[2]] <- df$eruptions
> The error is not caused by the right-hand side of the "<-" assignment operator. 
> The problem is that I can't assign to a column name using a variable or 
> column number as I do in these examples outside of Spark. It doesn't matter if I am 
> modifying or creating the column. Same problem.
> I have also tried the following, with no results:
> val df2 = withColumn(df,"tmp", df$eruptions)
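For reference, a sketch of what does work with the current API: pass the name as a character variable to withColumn, with a Column expression (wrap constants in lit()) on the right-hand side. Note also that withColumn returns a new SparkDataFrame rather than modifying df in place, which may be why the last attempt above appeared to do nothing:
{code}
library(SparkR)
df <- createDataFrame(faithful)

myname <- "waiting2"
df2 <- withColumn(df, myname, df$waiting * 2)  # column named via a variable
df3 <- withColumn(df, myname, lit(1))          # constant value via lit()
head(select(df2, myname))
{code}
Whether df[[myname]] <- ... itself should be supported is what this JIRA tracks.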



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19133) SparkR glm Gamma family results in error

2017-01-11 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19133:
-
Affects Version/s: 2.0.0
 Target Version/s: 2.0.3, 2.1.1, 2.2.0  (was: 2.2.0)
Fix Version/s: 2.1.1
   2.0.3

> SparkR glm Gamma family results in error
> 
>
> Key: SPARK-19133
> URL: https://issues.apache.org/jira/browse/SPARK-19133
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> > glm(y~1,family=Gamma, data = dy)
> 17/01/09 06:10:47 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper failed
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:167)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
>   at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
>   at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.IllegalArgumentException: glm_e3483764cdf9 parameter 
> family given invalid value Gamma.
>   at org.apache.spark.ml.param.Param.validate(params.scala:77)
>   at org.apache.spark.ml.param.ParamPair.(params.scala:528)
>   at 

[jira] [Resolved] (SPARK-19133) SparkR glm Gamma family results in error

2017-01-10 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19133.
--
  Resolution: Fixed
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> SparkR glm Gamma family results in error
> 
>
> Key: SPARK-19133
> URL: https://issues.apache.org/jira/browse/SPARK-19133
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.2.0
>
>
> > glm(y~1,family=Gamma, data = dy)
> 17/01/09 06:10:47 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper failed
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:167)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
>   at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
>   at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.IllegalArgumentException: glm_e3483764cdf9 parameter 
> family given invalid value Gamma.
>   at org.apache.spark.ml.param.Param.validate(params.scala:77)
>   at org.apache.spark.ml.param.ParamPair.(params.scala:528)
>   at org.apache.spark.ml.param.Param.$minus$greater(params.scala:87)
>   at 

[jira] [Updated] (SPARK-19133) SparkR glm Gamma family results in error

2017-01-09 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19133:
-
Summary: SparkR glm Gamma family results in error  (was: glm Gamma family 
results in error)

> SparkR glm Gamma family results in error
> 
>
> Key: SPARK-19133
> URL: https://issues.apache.org/jira/browse/SPARK-19133
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> > glm(y~1,family=Gamma, data = dy)
> 17/01/09 06:10:47 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper failed
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:167)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
>   at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
>   at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.IllegalArgumentException: glm_e3483764cdf9 parameter 
> family given invalid value Gamma.
>   at org.apache.spark.ml.param.Param.validate(params.scala:77)
>   at org.apache.spark.ml.param.ParamPair.(params.scala:528)
>   at org.apache.spark.ml.param.Param.$minus$greater(params.scala:87)
>   at 

[jira] [Updated] (SPARK-19133) glm Gamma family results in error

2017-01-08 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19133:
-
Description: 
> glm(y~1,family=Gamma, data = dy)
17/01/09 06:10:47 ERROR RBackendHandler: fit on 
org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper failed
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:167)
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108)
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at 
io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: glm_e3483764cdf9 parameter 
family given invalid value Gamma.
at org.apache.spark.ml.param.Param.validate(params.scala:77)
at org.apache.spark.ml.param.ParamPair.<init>(params.scala:528)
at org.apache.spark.ml.param.Param.$minus$greater(params.scala:87)
at org.apache.spark.ml.param.Params$class.set(params.scala:609)
at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:42)
at 
org.apache.spark.ml.regression.GeneralizedLinearRegression.setFamily(GeneralizedLinearRegression.scala:157)
at 
org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper$.fit(GeneralizedLinearRegressionWrapper.scala:85)
at 
org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper.fit(GeneralizedLinearRegressionWrapper.scala)
... 36 more
Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: 
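
For context, a minimal base-R sketch of the suspected mismatch (assuming, not confirmed in this thread, that R's Gamma() family object reports the capitalized name "Gamma" while Spark ML's family param only accepts the lowercase spelling):

{code}
# Illustrative only; the capital-G vs lowercase-g mismatch is an assumption.
fam <- Gamma(link = "inverse")
fam$family            # "Gamma" -- note the capital G
tolower(fam$family)   # "gamma" -- the spelling Spark ML's validation would accept
{code}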

[jira] [Created] (SPARK-19133) glm Gamma family results in error

2017-01-08 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19133:


 Summary: glm Gamma family results in error
 Key: SPARK-19133
 URL: https://issues.apache.org/jira/browse/SPARK-19133
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15237) SparkR corr function documentation

2017-01-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15810760#comment-15810760
 ] 

Felix Cheung commented on SPARK-15237:
--

I think this is better now? Shall we resolve this JIRA?

> SparkR corr function documentation
> --
>
> Key: SPARK-15237
> URL: https://issues.apache.org/jira/browse/SPARK-15237
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Shaul
>Priority: Minor
>  Labels: corr, sparkr
>
> Please review the documentation of the corr function in SparkR; the example 
> given, corr(df$c, df$d), won't run. The correct usage seems to be 
> corr(dataFrame, "someColumn", "OtherColumn"). Is this correct? 
> Thank you.
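
For illustration, a sketch of the two corr() signatures as I understand the SparkR API around 1.6/2.x (treat the exact forms as an assumption; shown with a 2.x-style session):

{code}
df <- createDataFrame(mtcars)
corr(df, "mpg", "wt")                # SparkDataFrame method: returns a numeric
head(agg(df, corr(df$mpg, df$wt)))   # Column method: an aggregate expression
{code}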



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14692) Error While Setting the path for R front end

2017-01-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15810754#comment-15810754
 ] 

Felix Cheung commented on SPARK-14692:
--

Closing. Let us know if this is still an issue. Thanks.

> Error While Setting the path for R front end
> 
>
> Key: SPARK-14692
> URL: https://issues.apache.org/jira/browse/SPARK-14692
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Mac OSX
>Reporter: Niranjan Molkeri`
>
> Trying to set the environment path for SparkR in RStudio. 
> Getting this error. 
> > .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> > library(SparkR)
> Error in library(SparkR) : there is no package called ‘SparkR’
> > sc <- sparkR.init(master="local")
> Error: could not find function "sparkR.init"
> In the directory it points to, there is a directory called SparkR. I 
> don't know how to proceed with this.  
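
A minimal sketch of the setup being attempted, assuming the root cause is that SPARK_HOME is unset or points at a directory that does not contain R/lib/SparkR ("/path/to/spark" is a placeholder for the local Spark 1.6.1 install):

{code}
Sys.setenv(SPARK_HOME = "/path/to/spark")   # hypothetical install location
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "local")
{code}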



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14692) Error While Setting the path for R front end

2017-01-08 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-14692.
--
Resolution: Cannot Reproduce
  Assignee: Felix Cheung

> Error While Setting the path for R front end
> 
>
> Key: SPARK-14692
> URL: https://issues.apache.org/jira/browse/SPARK-14692
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Mac OSX
>Reporter: Niranjan Molkeri`
>Assignee: Felix Cheung
>
> Trying to set the environment path for SparkR in RStudio. 
> Getting this error. 
> > .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> > library(SparkR)
> Error in library(SparkR) : there is no package called ‘SparkR’
> > sc <- sparkR.init(master="local")
> Error: could not find function "sparkR.init"
> In the directory it points to, there is a directory called SparkR. I 
> don't know how to proceed with this.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19126) Join Documentation Improvements

2017-01-08 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19126.
--
  Resolution: Fixed
Assignee: Bill Chambers
   Fix Version/s: 2.2.0
  2.1.1
Target Version/s: 2.1.1, 2.2.0

> Join Documentation Improvements
> ---
>
> Key: SPARK-19126
> URL: https://issues.apache.org/jira/browse/SPARK-19126
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> - Some join types are missing (no mention of anti join)
> - Joins are labelled inconsistently both within each language and between 
> languages.
> - Update according to new join spec for `crossJoin`
> Pull request coming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18011) SparkR serialize "NA" throws exception

2017-01-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15810660#comment-15810660
 ] 

Felix Cheung commented on SPARK-18011:
--

[~wangmiao1981] do you remember this one? I thought at one point you said you 
were close to a fix. Would you be interested in addressing this?

> SparkR serialize "NA" throws exception
> --
>
> Key: SPARK-18011
> URL: https://issues.apache.org/jira/browse/SPARK-18011
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>
> For some versions of R, if a Date has an "NA" field, the backend will throw a 
> negative index exception.
> To reproduce the problem:
> {code}
> > a <- as.Date(c("2016-11-11", "NA"))
> > b <- as.data.frame(a)
> > c <- createDataFrame(b)
> > dim(c)
> 16/10/19 10:31:24 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.NegativeArraySizeException
>   at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
>   at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
>   at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128)
>   at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77)
>   at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.Range.foreach(Range.scala:160)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
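
A small base-R illustration of the input that triggers this; the suggestion that the missing string form of the NA Date is what SerDe.readDate chokes on is an assumption, not something taken from the stack trace above:

{code}
a <- as.Date(c("2016-11-11", "NA"))
is.na(a)          # FALSE  TRUE -- the literal "NA" parses to a missing Date
as.character(a)   # "2016-11-11" NA -- no string representation to serialize
{code}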



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19130) SparkR should support setting and adding new column with singular value implicitly

2017-01-08 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19130:


 Summary: SparkR should support setting and adding new column with 
singular value implicitly
 Key: SPARK-19130
 URL: https://issues.apache.org/jira/browse/SPARK-19130
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung


For parity with frameworks like dplyr.
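
A sketch of the asymmetry this asks to remove; lit() is assumed to be the current workaround, and the commented line is the requested dplyr-like behavior:

{code}
df <- createDataFrame(faithful)
df$flag <- lit(1)   # works today: wrap the scalar in a Column
# df$flag <- 1      # requested: assign a scalar directly
{code}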



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?

2017-01-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15810634#comment-15810634
 ] 

Felix Cheung commented on SPARK-18823:
--

To Shivaram's point, I think this is a bit tricky since we are making the assumption 
that the column data can fit in the memory of a single node (where the R client is 
running). Even then, we would need to handle a potentially large amount of data 
to serialize and distribute, and so on. 

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
> Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 
> 4. Or databricks (community.cloud.databricks.com) .
>Reporter: Vicente Masip
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I really don't know if this is a bug or whether it can be done with some function:
> Sometimes it is very important to assign something to a column whose name has to 
> be accessed through a variable. Normally, outside of SparkR, I have always used 
> double brackets like this:
> # df could be the regular faithful data frame or a data table.
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even by column number
> df[[2]] <- df$eruptions
> The error is not caused by the right side of the "<-" assignment operator. 
> The problem is that I can't assign to a column by name using a variable or by 
> column number as I do in these examples outside of Spark. It doesn't matter if I am 
> modifying or creating a column; same problem.
> I have also tried this, with no results:
> val df2 = withColumn(df,"tmp", df$eruptions)
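
A sketch of a workaround when the new column is a Spark Column expression; the hard case raised in the comment above is a local R vector on the right-hand side, which would have to be shipped to and fit on the driver:

{code}
df <- createDataFrame(faithful)
myname <- "waiting2"
df <- withColumn(df, myname, df$waiting * 2)   # colName can come from a variable
head(select(df, myname))
{code}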



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18570) Consider supporting other R formula operators

2017-01-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15810250#comment-15810250
 ] 

Felix Cheung commented on SPARK-18570:
--

Hi - the code is here: 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala#L74

It would be great if you would start the discussion around what operators to 
actually support. Thanks!

> Consider supporting other R formula operators
> -
>
> Key: SPARK-18570
> URL: https://issues.apache.org/jira/browse/SPARK-18570
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Felix Cheung
>Priority: Minor
>
> Such as
> {code}
> * 
>  X*Y include these variables and the interactions between them
> ^
>  (X + Z + W)^3 include these variables and all interactions up to three-way
> |
>  X | Z conditioning: include x given z
> {code}
> Others include %in% and ` (backtick)
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html
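
For reference, the base-R formula semantics of the operators listed above, shown symbolically; which of these Spark's RFormula should support is exactly the open question of this ticket:

{code}
attr(terms(y ~ X * Z), "term.labels")           # "X" "Z" "X:Z"
attr(terms(y ~ (X + Z + W)^3), "term.labels")   # all interactions up to three-way
{code}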



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12757) Use reference counting to prevent blocks from being evicted during reads

2016-12-29 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786726#comment-15786726
 ] 

Felix Cheung commented on SPARK-12757:
--

ping. Still seeing a lot of these messages on Spark 2.1. Is that a new issue?


> Use reference counting to prevent blocks from being evicted during reads
> 
>
> Key: SPARK-12757
> URL: https://issues.apache.org/jira/browse/SPARK-12757
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> As a pre-requisite to off-heap caching of blocks, we need a mechanism to 
> prevent pages / blocks from being evicted while they are being read. With 
> on-heap objects, evicting a block while it is being read merely leads to 
> memory-accounting problems (because we assume that an evicted block is a 
> candidate for garbage-collection, which will not be true during a read), but 
> with off-heap memory this will lead to either data corruption or segmentation 
> faults.
> To address this, we should add a reference-counting mechanism to track which 
> blocks/pages are being read in order to prevent them from being evicted 
> prematurely. I propose to do this in two phases: first, add a safe, 
> conservative approach in which all BlockManager.get*() calls implicitly 
> increment the reference count of blocks and where tasks' references are 
> automatically freed upon task completion. This will be correct but may have 
> adverse performance impacts because it will prevent legitimate block 
> evictions. In phase two, we should incrementally add release() calls in order 
> to fix the eviction of unreferenced blocks. The latter change may need to 
> touch many different components, which is why I propose to do it separately 
> in order to make the changes easier to reason about and review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18958) SparkR should support toJSON on DataFrame

2016-12-28 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-18958.
--
  Resolution: Fixed
Target Version/s: 2.2.0

> SparkR should support toJSON on DataFrame
> -
>
> Key: SPARK-18958
> URL: https://issues.apache.org/jira/browse/SPARK-18958
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
>
> It makes it easier to interoperate with other components (especially since R does 
> not have JSON support built in).
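
A sketch of the requested API; that toJSON(df) yields a one-column SparkDataFrame of JSON strings is my assumption about the eventual shape, not something stated in this thread:

{code}
df <- createDataFrame(data.frame(x = 1:2, y = c("a", "b")))
j <- toJSON(df)
collect(j)   # one JSON string per row, e.g. {"x":1,"y":"a"}
{code}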



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18903) uiWebUrl is not accessible to SparkR

2016-12-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-18903.
--
  Resolution: Fixed
Assignee: Felix Cheung
Target Version/s: 2.2.0

> uiWebUrl is not accessible to SparkR
> 
>
> Key: SPARK-18903
> URL: https://issues.apache.org/jira/browse/SPARK-18903
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Web UI
>Affects Versions: 2.0.2
>Reporter: Diogo Munaro Vieira
>Assignee: Felix Cheung
>Priority: Minor
>
> Like https://issues.apache.org/jira/browse/SPARK-17437 uiWebUrl is not 
> accessible to SparkR context
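
Illustration only, assuming the fix ends up exposing the URL through a helper such as sparkR.uiWebUrl() (the name is my guess, not taken from this thread):

{code}
sparkR.session(master = "local")
sparkR.uiWebUrl()   # e.g. "http://localhost:4040"
{code}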



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10523) SparkR formula syntax to turn strings/factors into numerics

2016-12-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768165#comment-15768165
 ] 

Felix Cheung commented on SPARK-10523:
--

[~cantdutchthis] I'm curious: do you know why all your clients have switched to 
sparklyr?

> SparkR formula syntax to turn strings/factors into numerics
> ---
>
> Key: SPARK-10523
> URL: https://issues.apache.org/jira/browse/SPARK-10523
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Vincent Warmerdam
>
> In normal (non SparkR) R the formula syntax enables strings or factors to be 
> turned into dummy variables immediately when calling a classifier. This way, 
> the following R pattern is legal and often used:
> {code}
> library(magrittr) 
> df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
> glm(class ~ i, family = "binomial", data = df)
> {code}
> The glm method will know that `class` is a string/factor and handles it 
> appropriately by casting it to a 0/1 array before applying any machine 
> learning. SparkR doesn't do this. 
> {code}
> > ddf <- sqlContext %>% 
>   createDataFrame(df)
> > glm(class ~ i, family = "binomial", data = ddf)
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.IllegalArgumentException: Unsupported type for label: StringType
>   at 
> org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
>   at 
> org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
>   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
>   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
>   at 
> scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
>   at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
>   at 
> org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
>   at 
> org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.refl
> {code}
> This can be fixed by doing a bit of manual labor. SparkR does accept booleans 
> as if they are integers here. 
> {code}
> > ddf <- ddf %>% 
>   withColumn("to_pred", .$class == "a") 
> > glm(to_pred ~ i, family = "binomial", data = ddf)
> {code}
> But this can become quite tedious, especially when you want models 
> that use multiple classes for classification. This is perhaps 
> less relevant for logistic regression (because it is a bit more like a 
> one-off classification approach) but it certainly is relevant if you 
> want to use a formula for a random forest and a column denotes, say, a type of 
> flower from the iris dataset. 
> Is there a good reason why this should not be a feature of formulas in Spark? 
> I am aware of issue 8774, which looks like it is addressing a similar theme 
> but a different issue. 
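
For reference, what base R's formula machinery does with a string/factor label, which is the behavior this issue asks SparkR's formula handling to mirror:

{code}
df <- data.frame(class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
model.matrix(~ class, df)   # "class" expands to a 0/1 dummy column "classb"
{code}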



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18958) SparkR should support toJSON on DataFrame

2016-12-20 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-18958:


 Summary: SparkR should support toJSON on DataFrame
 Key: SPARK-18958
 URL: https://issues.apache.org/jira/browse/SPARK-18958
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung
Assignee: Felix Cheung
Priority: Minor


It makes it easier to interoperate with other components (especially since R does not 
have JSON support built in).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR

2016-12-19 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763165#comment-15763165
 ] 

Felix Cheung commented on SPARK-18924:
--

Thank you for bringing this up. JVM<->R performance has been reported a few 
times and is definitely something I have been tracking, but I haven't gotten around to it.

I don't think rJava would work since it is GPLv2 licensed. Rcpp is also 
GPLv2/v3.

Strategically placed calls to C might be a way to go (cross-platform 
complications aside)? That seems to be the approach for a lot of R packages. I 
recall we have a JIRA on performance tests; do we have a more detailed breakdown of the 
time spent?


> Improve collect/createDataFrame performance in SparkR
> -
>
> Key: SPARK-18924
> URL: https://issues.apache.org/jira/browse/SPARK-18924
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> SparkR has its own SerDe for data serialization between JVM and R.
> The SerDe on the JVM side is implemented in:
> * 
> [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * 
> [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * 
> [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * 
> [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
> The serialization between JVM and R suffers from huge storage and computation 
> overhead. For example, a short round trip of 1 million doubles surprisingly 
> took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(1e6)))))
>user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame to local and continuing with a local R 
> workflow is a use case we should pay attention to. SparkR will never be able 
> to cover all existing features from CRAN packages. It is also unnecessary for 
> Spark to do so because not all features need scalability. 
> Several factors contribute to the serialization overhead:
> 1. The SerDe in R side is implemented using high-level R methods.
> 2. DataFrame columns are not efficiently serialized, primitive type columns 
> in particular.
> 3. Some overhead in the serialization protocol/impl.
> 1) might have been discussed before because R packages like rJava existed before 
> SparkR. I'm not sure whether we would have a license issue in depending on those 
> libraries. Another option is to switch to R's low-level C interface or Rcpp, 
> which again might have license issues. I'm not an expert here. If we have to 
> implement our own, there is still much room for improvement, discussed 
> below.
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, 
> which collects rows to the driver and then constructs columns. However,
> * it ignores column types and results in boxing/unboxing overhead
> * it collects all objects to the driver and results in high GC pressure
> A relatively simple change is to implement a specialized column builder based 
> on column types, primitive types in particular. We need to handle null/NA 
> values properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
> On the R side, we can use `readBin` and `writeBin` to read the entire vector 
> in a single method call. The speed seems reasonable (on the order of GB/s):
> {code}
> > x <- runif(1e7) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>user  system elapsed
>   0.036   0.021   0.059
> > system.time(y <- readBin(r, double(), 1e7))
>user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and formalized. But in 
> general, it should be feasible to obtain 20x or more performance gain.
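
A rough R-side sketch of the column layout proposed above (size, null indexes, non-null values) for a double column; illustrative only, not the actual SerDe wire format:

{code}
encode_col <- function(x) {
  list(size = length(x),
       nullIndexes = which(is.na(x)),
       notNullValues = writeBin(x[!is.na(x)], raw(0)))   # packed with writeBin
}
str(encode_col(c(1.5, NA, 3.0)))
{code}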



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-18 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15759368#comment-15759368
 ] 

Felix Cheung commented on SPARK-18817:
--

Testing a fix; will open a PR shortly.

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15758005#comment-15758005
 ] 

Felix Cheung commented on SPARK-18817:
--

And as a side note, I feel like spark-warehouse, metastore_db, and derby.log 
should all be in a temporary directory that is cleaned out when the app is done, 
much like what the Spark Thrift Server does currently (at least for metastore_db and 
derby.log). Perhaps that is the more correct fix longer term.
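
A sketch of redirecting the warehouse and the embedded Derby metastore into R's tempdir(); whether javax.jdo.option.ConnectionURL can be overridden this way (versus via hive-site.xml or SparkHadoopUtil, as discussed elsewhere in this thread) is an open question, so treat this as illustrative:

{code}
wh <- file.path(tempdir(), "spark-warehouse")
db <- file.path(tempdir(), "metastore_db")
sparkR.session(sparkConfig = list(
  spark.sql.warehouse.dir = wh,
  javax.jdo.option.ConnectionURL =
    paste0("jdbc:derby:;databaseName=", db, ";create=true")))
{code}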

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15757998#comment-15757998
 ] 

Felix Cheung edited comment on SPARK-18817 at 12/18/16 2:04 AM:


Aside from changing the existing shipped behavior, there are a few mentions of 
this behavior in various parts of the documentation that would become wrong and 
would need to be updated.

IMO more importantly, we still have a feature that can be turned on (as 
documented or suggested in the documentation) that would cause files to be written 
without the user explicitly agreeing to it (or understanding it). To me this 
doesn't seem like we would be addressing the root of the issue fully, merely 
side-stepping it?

I've managed to track down the fix to move metastore_db and derby.log, though. 
There are two separate switches to set and it is doable from pure R (I have 
tested that); but I'd recommend doing it in 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L116
 in order to respect any existing value from hive-site.xml if one is given. 

How about we introduce something like spark.sql.default.derby.dir and fix this 
that way?


was (Author: felixcheung):
Aside from changing the existing shipped behavior, there are a few mentions of 
this behavior in various documentation that would become wrong and would need 
to be updated.

IMO more importantly we still have a feature that can be turned on (as 
documented or suggested in documentations) that would cause files to be written 
without the user explicitly agreeing to it (or understanding it). This to me 
doesn't seem like we would be addressing the root of the issue fully, merely 
side-stepping it?

I've managed to track down the fix to move metastore_db and derby.log though. 
There are two separate switches to set that it doable from pure R; but I'd 
recommend doing in 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L116
  in order to respect any existing value from hive-site.xml if given one. 

How about we introduce something like spark.sql.default.derby.dir and fix this 
that way?

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15757998#comment-15757998
 ] 

Felix Cheung commented on SPARK-18817:
--

Aside from changing the existing shipped behavior, there are a few mentions of 
this behavior in various documentation that would become wrong and would need 
to be updated.

IMO more importantly we still have a feature that can be turned on (as 
documented or suggested in documentations) that would cause files to be written 
without the user explicitly agreeing to it (or understanding it). This to me 
doesn't seem like we would be addressing the root of the issue fully, merely 
side-stepping it?

I've managed to track down the fix to move metastore_db and derby.log though. 
There are two separate switches to set that it doable from pure R; but I'd 
recommend doing in 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L116
  in order to respect any existing value from hive-site.xml if given one. 

How about we introduce something like spark.sql.default.derby.dir and fix this 
that way?

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18903) uiWebUrl is not accessible to SparkR

2016-12-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756379#comment-15756379
 ] 

Felix Cheung commented on SPARK-18903:
--

This sounds like a reasonable ask; I'll take a look.

> uiWebUrl is not accessible to SparkR
> 
>
> Key: SPARK-18903
> URL: https://issues.apache.org/jira/browse/SPARK-18903
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Web UI
>Affects Versions: 2.0.2
>Reporter: Diogo Munaro Vieira
>Priority: Minor
>
> Like https://issues.apache.org/jira/browse/SPARK-17437 uiWebUrl is not 
> accessible to SparkR context



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18903) uiWebUrl is not accessible to SparkR

2016-12-16 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18903:
-
Component/s: (was: Java API)

> uiWebUrl is not accessible to SparkR
> 
>
> Key: SPARK-18903
> URL: https://issues.apache.org/jira/browse/SPARK-18903
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Web UI
>Affects Versions: 2.0.2
>Reporter: Diogo Munaro Vieira
>Priority: Minor
>
> Like https://issues.apache.org/jira/browse/SPARK-17437 uiWebUrl is not 
> accessible to SparkR context



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18902) Include Apache License in R source Package

2016-12-16 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-18902.
--
Resolution: Not A Problem

> Include Apache License in R source Package
> --
>
> Key: SPARK-18902
> URL: https://issues.apache.org/jira/browse/SPARK-18902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Shivaram Venkataraman
>
> Per [~srowen]'s email on the dev mailing list
> {quote}
> I don't see an Apache license / notice for the Pyspark or SparkR artifacts. 
> It would be good practice to include this in a convenience binary. I'm not 
> sure if it's strictly mandatory, but something to adjust in any event. I 
> think that's all there is to do for SparkR
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18902) Include Apache License in R source Package

2016-12-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756369#comment-15756369
 ] 

Felix Cheung commented on SPARK-18902:
--

We have the license in the DESCRIPTION file, as required for R packages:
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Licensing

Closing this issue - thanks for validating!

> Include Apache License in R source Package
> --
>
> Key: SPARK-18902
> URL: https://issues.apache.org/jira/browse/SPARK-18902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Shivaram Venkataraman
>
> Per [~srowen]'s email on the dev mailing list
> {quote}
> I don't see an Apache license / notice for the Pyspark or SparkR artifacts. 
> It would be good practice to include this in a convenience binary. I'm not 
> sure if it's strictly mandatory, but something to adjust in any event. I 
> think that's all there is to do for SparkR
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18902) Include Apache License in R source Package

2016-12-16 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung closed SPARK-18902.

Assignee: Felix Cheung

> Include Apache License in R source Package
> --
>
> Key: SPARK-18902
> URL: https://issues.apache.org/jira/browse/SPARK-18902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
>
> Per [~srowen]'s email on the dev mailing list
> {quote}
> I don't see an Apache license / notice for the Pyspark or SparkR artifacts. 
> It would be good practice to include this in a convenience binary. I'm not 
> sure if it's strictly mandatory, but something to adjust in any event. I 
> think that's all there is to do for SparkR
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753462#comment-15753462
 ] 

Felix Cheung commented on SPARK-18817:
--

I ran more tests on this but wasn't seeing derby.log or metastore_db.

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753414#comment-15753414
 ] 

Felix Cheung commented on SPARK-18817:
--

It looks like javax.jdo.option.ConnectionURL can also be set in hive-site.xml?

In that case we should only change javax.jdo.option.ConnectionURL and 
spark.sql.default.warehouse.dir when they are not set in the conf or in hive-site.xml, 
and we need to handle both for a complete fix.



> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753064#comment-15753064
 ] 

Felix Cheung commented on SPARK-18817:
--

Actually, I'm not seeing derby.log or metastore_db in the quick tests I have:

{code}
> createOrReplaceTempView(a, "foo")
> sql("SELECT * from foo")
{code}

[~bdwyer] do you have the steps that create these files?

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753048#comment-15753048
 ] 

Felix Cheung commented on SPARK-18817:
--

I tested this just now; I still see spark-warehouse when enableHiveSupport = FALSE.

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753040#comment-15753040
 ] 

Felix Cheung commented on SPARK-18817:
--

We could, but we did ship 2.0 with it enabled by default.
Perhaps
{code}
enableHiveSupport = !interactive()
{code}
as the default?


Shouldn't derby.log and metastore_db go to the warehouse.dir?


> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


