[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value

2016-05-27 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304267#comment-15304267
 ] 

Shivaram Venkataraman commented on SPARK-15585:
---

[~maropu] Can you also add test cases in Python and R in the PR?

> Don't use null in data source options to indicate default value
> ---
>
> Key: SPARK-15585
> URL: https://issues.apache.org/jira/browse/SPARK-15585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> See email: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html
> We'd need to change DataFrameReader/DataFrameWriter in Python's 
> csv/json/parquet/... functions to put the actual default option values as 
> function parameters, rather than setting them to None. We can then have 
> CSVOptions.getChar (and JSONOptions, etc.) actually return null if the 
> value is null, rather than setting it to the default value.






[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value

2016-05-26 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303523#comment-15303523
 ] 

Shivaram Venkataraman commented on SPARK-15585:
---

I am not sure I completely understand the question - the way the options get 
passed in R [1] is that we create a hash map and fill it in with anything 
passed in by the user. `NULL` is a reserved keyword in R (note that it's in 
all caps), and it gets deserialized / passed as `null` to Scala.

[1] 
https://github.com/apache/spark/blob/c82883239eadc4615a3aba907cd4633cb7aed26e/R/pkg/R/SQLContext.R#L658
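
A minimal base-R sketch of the pattern described above (illustrative only, not SparkR's actual internals): named options collected from `...` go into an environment, R's hash map, so an explicit `NULL` value is kept as an entry and can later be passed on to the JVM as `null`.

{code}
# Illustrative sketch, not SparkR's implementation: collect named options from
# `...` into an environment (a hash map) so that an explicit NULL survives as
# an entry instead of being dropped.
optionsToEnv <- function(...) {
  opts <- list(...)            # list() keeps NULL elements, unlike `$<-`
  env <- new.env(hash = TRUE)
  for (name in names(opts)) {
    assign(name, opts[[name]], envir = env)
  }
  env
}

e <- optionsToEnv(header = "true", nullValue = NULL)
ls(e)                          # "header" "nullValue"
is.null(get("nullValue", e))   # TRUE: the NULL option is preserved
{code}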

> Don't use null in data source options to indicate default value
> ---
>
> Key: SPARK-15585
> URL: https://issues.apache.org/jira/browse/SPARK-15585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> See email: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html
> We'd need to change DataFrameReader/DataFrameWriter in Python's 
> csv/json/parquet/... functions to put the actual default option values as 
> function parameters, rather than setting them to None. We can then have 
> CSVOptions.getChar (and JSONOptions, etc.) actually return null if the 
> value is null, rather than setting it to the default value.






[jira] [Resolved] (SPARK-8603) In Windows,Not able to create a Spark context from R studio

2016-05-26 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-8603.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13165
[https://github.com/apache/spark/pull/13165]

> In Windows,Not able to create a Spark context from R studio 
> 
>
> Key: SPARK-8603
> URL: https://issues.apache.org/jira/browse/SPARK-8603
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
> Environment: Windows, R studio
>Reporter: Prakash Ponshankaarchinnusamy
> Fix For: 2.0.0
>
>   Original Estimate: 0.5m
>  Remaining Estimate: 0.5m
>
> In Windows, creation of a Spark context fails using the code below from RStudio:
> Sys.setenv(SPARK_HOME="C:\\spark\\spark-1.4.0")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sc <- sparkR.init(master="spark://localhost:7077", appName="SparkR")
> Error: JVM is not ready after 10 seconds
> Reason: Wrong file path computed in client.R. The file separator for Windows 
> ["\"] is not respected by the "file.path" function by default.
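
As a hedged base-R illustration of the separator behaviour described above (not the client.R code itself): file.path() joins components with "/" on every platform, so Windows-style separators need an explicit conversion step.

{code}
# Hedged illustration, base R only: file.path() always joins with "/".
file.path("C:\\spark\\spark-1.4.0", "bin", "spark-submit")
# [1] "C:\\spark\\spark-1.4.0/bin/spark-submit"

# One way to obtain platform-native separators (on Windows) is an explicit
# normalization step:
normalizePath(file.path("C:\\spark\\spark-1.4.0", "bin"),
              winslash = "\\", mustWork = FALSE)
{code}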






[jira] [Updated] (SPARK-8603) In Windows,Not able to create a Spark context from R studio

2016-05-26 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-8603:
-
Assignee: Hyukjin Kwon

> In Windows,Not able to create a Spark context from R studio 
> 
>
> Key: SPARK-8603
> URL: https://issues.apache.org/jira/browse/SPARK-8603
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
> Environment: Windows, R studio
>Reporter: Prakash Ponshankaarchinnusamy
>Assignee: Hyukjin Kwon
> Fix For: 2.0.0
>
>   Original Estimate: 0.5m
>  Remaining Estimate: 0.5m
>
> In Windows, creation of a Spark context fails using the code below from RStudio:
> Sys.setenv(SPARK_HOME="C:\\spark\\spark-1.4.0")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sc <- sparkR.init(master="spark://localhost:7077", appName="SparkR")
> Error: JVM is not ready after 10 seconds
> Reason: Wrong file path computed in client.R. The file separator for Windows 
> ["\"] is not respected by the "file.path" function by default.






[jira] [Resolved] (SPARK-10903) Simplify SQLContext method signatures and use a singleton

2016-05-26 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10903.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 9192
[https://github.com/apache/spark/pull/9192]

> Simplify SQLContext method signatures and use a singleton
> -
>
> Key: SPARK-10903
> URL: https://issues.apache.org/jira/browse/SPARK-10903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
> Fix For: 2.0.0
>
>
> Make sqlContext global so that we don't have to always specify it.
> e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)






[jira] [Updated] (SPARK-10903) Simplify SQLContext method signatures and use a singleton

2016-05-26 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-10903:
--
Assignee: Felix Cheung

> Simplify SQLContext method signatures and use a singleton
> -
>
> Key: SPARK-10903
> URL: https://issues.apache.org/jira/browse/SPARK-10903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.0.0
>
>
> Make sqlContext global so that we don't have to always specify it.
> e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)






[jira] [Commented] (SPARK-15439) Failed to run unit test in SparkR

2016-05-26 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302525#comment-15302525
 ] 

Shivaram Venkataraman commented on SPARK-15439:
---

Hmm - I think the check needs to be `R.version$minor >= 3` and similarly the 
major version should be `>=3` ?
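
A hedged sketch of that check: the R.version fields are character strings in base R (e.g. major "3", minor "2.3"), so they need a numeric conversion before comparing against the thresholds proposed above.

{code}
# Hedged sketch of the version check discussed above; the fields are strings,
# so convert before comparing.
major_ok <- as.numeric(R.version$major) >= 3
minor_ok <- as.numeric(R.version$minor) >= 3
if (major_ok && minor_ok) {
  message("Running on R 3.3 or newer")
} else {
  message("Running on an older R release")
}
{code}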

> Failed to run unit test in SparkR
> -
>
> Key: SPARK-15439
> URL: https://issues.apache.org/jira/browse/SPARK-15439
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kai Jiang
>Assignee: Miao Wang
> Fix For: 2.0.0
>
>
> Failed to run ./R/run-tests.sh around a recent commit (May 19, 2016).
> It might be related to permissions. It seems I used `sudo ./R/run-tests.sh` 
> and it worked sometimes. Without permissions, maybe we couldn't access the 
> /tmp directory. However, the SparkR unit testing is still brittle.
> [error 
> message|https://gist.github.com/vectorijk/71f4ff34e3d34a628b8a3013f0ca2aa2]






[jira] [Resolved] (SPARK-15439) Failed to run unit test in SparkR

2016-05-25 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-15439.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13284
[https://github.com/apache/spark/pull/13284]

> Failed to run unit test in SparkR
> -
>
> Key: SPARK-15439
> URL: https://issues.apache.org/jira/browse/SPARK-15439
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kai Jiang
> Fix For: 2.0.0
>
>
> Failed to run ./R/run-tests.sh around a recent commit (May 19, 2016).
> It might be related to permissions. It seems I used `sudo ./R/run-tests.sh` 
> and it worked sometimes. Without permissions, maybe we couldn't access the 
> /tmp directory. However, the SparkR unit testing is still brittle.
> [error 
> message|https://gist.github.com/vectorijk/71f4ff34e3d34a628b8a3013f0ca2aa2]






[jira] [Updated] (SPARK-15439) Failed to run unit test in SparkR

2016-05-25 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-15439:
--
Assignee: Miao Wang

> Failed to run unit test in SparkR
> -
>
> Key: SPARK-15439
> URL: https://issues.apache.org/jira/browse/SPARK-15439
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kai Jiang
>Assignee: Miao Wang
> Fix For: 2.0.0
>
>
> Failed to run ./R/run-tests.sh around a recent commit (May 19, 2016).
> It might be related to permissions. It seems I used `sudo ./R/run-tests.sh` 
> and it worked sometimes. Without permissions, maybe we couldn't access the 
> /tmp directory. However, the SparkR unit testing is still brittle.
> [error 
> message|https://gist.github.com/vectorijk/71f4ff34e3d34a628b8a3013f0ca2aa2]






[jira] [Resolved] (SPARK-12071) Programming guide should explain NULL in JVM translate to NA in R

2016-05-24 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12071.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13268
[https://github.com/apache/spark/pull/13268]

> Programming guide should explain NULL in JVM translate to NA in R
> -
>
> Key: SPARK-12071
> URL: https://issues.apache.org/jira/browse/SPARK-12071
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Felix Cheung
>Priority: Minor
>  Labels: releasenotes, starter
> Fix For: 2.0.0
>
>
> This behavior seems to be new for Spark 1.6.0






[jira] [Resolved] (SPARK-15412) Improve linear & isotonic regression methods PyDocs

2016-05-24 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-15412.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13199
[https://github.com/apache/spark/pull/13199]

> Improve linear & isotonic regression methods PyDocs
> ---
>
> Key: SPARK-15412
> URL: https://issues.apache.org/jira/browse/SPARK-15412
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Very minor, but LinearRegression & IsotonicRegression's PyDocs are missing a 
> link, have a shorter description of boundaries, and aren't using list mode 
> for the types of regularization.






[jira] [Commented] (SPARK-15439) Failed to run unit test in SparkR

2016-05-24 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298579#comment-15298579
 ] 

Shivaram Venkataraman commented on SPARK-15439:
---

I think the `maskedCompletely` failures might be related to some R version or 
package versions. At a high level those tests check how many base R functions 
are masked by SparkR. cc [~felixcheung] who knows more about those unit tests.
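
A rough way to see what those tests inspect (assumes the SparkR package is installed; this is not the test code itself): conflicts() reports which base-R names an attached package masks.

{code}
# Rough illustration, not the actual test: list the names the attached SparkR
# package masks. Assumes SparkR is installed.
library(SparkR)
masked <- conflicts(detail = TRUE)[["package:SparkR"]]
print(masked)    # e.g. drop, intersect, rank, sample, subset, summary, ...
length(masked)   # the count those tests compare against an expected list
{code}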

> Failed to run unit test in SparkR
> -
>
> Key: SPARK-15439
> URL: https://issues.apache.org/jira/browse/SPARK-15439
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kai Jiang
>
> Failed to run ./R/run-tests.sh around a recent commit (May 19, 2016).
> It might be related to permissions. It seems I used `sudo ./R/run-tests.sh` 
> and it worked sometimes. Without permissions, maybe we couldn't access the 
> /tmp directory. However, the SparkR unit testing is still brittle.
> [error 
> message|https://gist.github.com/vectorijk/71f4ff34e3d34a628b8a3013f0ca2aa2]






[jira] [Commented] (SPARK-12071) Programming guide should explain NULL in JVM translate to NA in R

2016-05-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297245#comment-15297245
 ] 

Shivaram Venkataraman commented on SPARK-12071:
---

You can add it under the migration guide section in 
http://spark.apache.org/docs/latest/sparkr.html#migration-guide

> Programming guide should explain NULL in JVM translate to NA in R
> -
>
> Key: SPARK-12071
> URL: https://issues.apache.org/jira/browse/SPARK-12071
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Felix Cheung
>Priority: Minor
>  Labels: releasenotes, starter
>
> This behavior seems to be new for Spark 1.6.0






[jira] [Commented] (SPARK-15439) Failed to run unit test in SparkR

2016-05-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297176#comment-15297176
 ] 

Shivaram Venkataraman commented on SPARK-15439:
---

[~wm624] Any luck in figuring out why this is happening? It would be good to fix 
this before the 2.0 RC goes out.

cc [~sunrui]

> Failed to run unit test in SparkR
> -
>
> Key: SPARK-15439
> URL: https://issues.apache.org/jira/browse/SPARK-15439
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kai Jiang
>
> Failed to run ./R/run-tests.sh around a recent commit (May 19, 2016).
> It might be related to permissions. It seems I used `sudo ./R/run-tests.sh` 
> and it worked sometimes. Without permissions, maybe we couldn't access the 
> /tmp directory. However, the SparkR unit testing is still brittle.
> [error 
> message|https://gist.github.com/vectorijk/71f4ff34e3d34a628b8a3013f0ca2aa2]






[jira] [Commented] (SPARK-12071) Programming guide should explain NULL in JVM translate to NA in R

2016-05-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297162#comment-15297162
 ] 

Shivaram Venkataraman commented on SPARK-12071:
---

[~KrishnaKalyan3] Feel free to work on this issue and open a PR.

> Programming guide should explain NULL in JVM translate to NA in R
> -
>
> Key: SPARK-12071
> URL: https://issues.apache.org/jira/browse/SPARK-12071
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Felix Cheung
>Priority: Minor
>  Labels: releasenotes, starter
>
> This behavior seems to be new for Spark 1.6.0






[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-05-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297159#comment-15297159
 ] 

Shivaram Venkataraman commented on SPARK-13525:
---

If the problem is on line 353 it means that the forked R worker process is not 
connecting to the JVM process. The way to debug this is to add some print 
statements to daemon.R (specifically around lines 
https://github.com/apache/spark/blob/37c617e4f580482b59e1abbe3c0c27c7125cf605/R/pkg/inst/worker/daemon.R#L29)
 and see if the port number matches what is expected and/or if the connection 
is failing due to some other reason. 
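
A hedged sketch of that kind of print-statement debugging (the environment-variable name and connection options here are assumptions for illustration, not a quote from daemon.R):

{code}
# Hedged debugging sketch: report the port the R worker thinks it should use
# and whether the socket connection to the JVM succeeds.
port <- suppressWarnings(as.integer(Sys.getenv("SPARKR_WORKER_PORT", "0")))
cat("daemon: connecting to JVM on port", port, "\n", file = stderr())
con <- tryCatch(
  socketConnection(port = port, open = "rb", blocking = TRUE, timeout = 10),
  error = function(e) {
    cat("daemon: connection failed:", conditionMessage(e), "\n", file = stderr())
    NULL
  }
)
{code}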

> SparkR: java.net.SocketTimeoutException: Accept timed out when running any 
> dataframe function
> -
>
> Key: SPARK-13525
> URL: https://issues.apache.org/jira/browse/SPARK-13525
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shubhanshu Mishra
>  Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues: 
> 1. The head and summary and filter methods are not overridden by spark. Hence 
> I need to call them using `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri Feb 26 16:19:35 2016 
> Attaching package: ‘SparkR’
> The following objects are masked from ‘package:base’:
> colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
> summary, transform
> Launching java with spark-submit command 
> /content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
> sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> > df <- createDataFrame(sqlContext, iris)
> Warning messages:
> 1: In FUN(X[[i]], ...) :
>   Use Sepal_Length instead of Sepal.Length  as column name
> 2: In FUN(X[[i]], ...) :
>   Use Sepal_Width instead of Sepal.Width  as column name
> 3: In FUN(X[[i]], ...) :
>   Use Petal_Length instead of Petal.Length  as column name
> 4: In FUN(X[[i]], ...) :
>   Use Petal_Width instead of Petal.Width  as column name
> > training <- filter(df, df$Species != "setosa")
> Error in filter(df, df$Species != "setosa") : 
>   no method for coercing this S4 class to a vector
> > training <- SparkR::filter(df, df$Species != "setosa")
> > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, 
> > family = "binomial")
> 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
> at java.net.ServerSocket.implAccept(ServerSocket.java:530)
> at java.net.ServerSocket.accept(ServerSocket.java:498)
> at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
> at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> 

[jira] [Commented] (SPARK-15294) Add pivot functionality to SparkR

2016-05-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297146#comment-15297146
 ] 

Shivaram Venkataraman commented on SPARK-15294:
---

[~mhnatiuk] I think the code diff looks pretty good and you can go ahead and 
open a PR for this. Opening a PR should be pretty simple if you follow the 
instructions at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingCodeChanges
 (See the section titled 'Pull Request' specifically).

Regarding whether it should be `sum(df$earnings)` - I'd like to think of it as 
a pointer to the column that should be summed. Ideally we'd get it to work with 
just `earnings` (i.e. without the need for df$), but that has some 
complications we haven't figured out yet.
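
For example, a hedged usage sketch of the API under discussion (it assumes the proposed pivot wrapper is available, a local Spark installation, and made-up column names):

{code}
# Hedged usage sketch; assumes the pivot() wrapper being discussed exists.
library(SparkR)
sparkR.session()   # or the 1.x-style sparkR.init()/sparkRSQL.init() pair
df <- createDataFrame(data.frame(year     = c(2015, 2015, 2016, 2016),
                                 course   = c("R", "Scala", "R", "Scala"),
                                 earnings = c(10, 15, 20, 25),
                                 stringsAsFactors = FALSE))
# df$earnings acts as a pointer to the column summed within each pivoted cell:
result <- agg(pivot(groupBy(df, "year"), "course"), sum(df$earnings))
head(result)
{code}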

> Add pivot functionality to SparkR
> -
>
> Key: SPARK-15294
> URL: https://issues.apache.org/jira/browse/SPARK-15294
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Mikołaj Hnatiuk
>Priority: Minor
>  Labels: pivot
>
> R users are very used to transforming data using functions such as dcast 
> (pkg:reshape2). https://github.com/apache/spark/pull/7841 introduces such 
> functionality to Scala and Python APIs. I'd like to suggest adding this 
> functionality into SparkR API to pivot DataFrames.
> I'd love to do this; however, my knowledge of Scala is still limited, but 
> with proper guidance I can give it a try.






[jira] [Resolved] (SPARK-15202) add dapplyCollect() method for DataFrame in SparkR

2016-05-12 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-15202.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12989
[https://github.com/apache/spark/pull/12989]

> add dapplyCollect() method for DataFrame in SparkR
> --
>
> Key: SPARK-15202
> URL: https://issues.apache.org/jira/browse/SPARK-15202
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
> Fix For: 2.0.0
>
>
> dapplyCollect() applies an R function on each partition of a SparkDataFrame 
> and collects the result back to R as a data.frame.
> The signature of dapplyCollect() is as follows:
> {code}
>   dapplyCollect(df, function(ldf) {...})
> {code}
> R function input: local data.frame from the partition on local node
> R function output: local data.frame
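
A hedged usage sketch of the signature above (assumes a running SparkR session; data and column names are illustrative):

{code}
# Hedged usage sketch of dapplyCollect(); assumes a SparkR session is already
# available.
df <- createDataFrame(data.frame(x = 1:6, g = rep(c("a", "b"), 3),
                                 stringsAsFactors = FALSE))
res <- dapplyCollect(df, function(ldf) {
  # ldf is a local data.frame holding one partition
  data.frame(rows = nrow(ldf), sum_x = sum(ldf$x))
})
res   # a plain local data.frame collected back to the driver
{code}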






[jira] [Updated] (SPARK-15202) add dapplyCollect() method for DataFrame in SparkR

2016-05-12 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-15202:
--
Assignee: Sun Rui

> add dapplyCollect() method for DataFrame in SparkR
> --
>
> Key: SPARK-15202
> URL: https://issues.apache.org/jira/browse/SPARK-15202
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>Assignee: Sun Rui
> Fix For: 2.0.0
>
>
> dapplyCollect() applies an R function on each partition of a SparkDataFrame 
> and collects the result back to R as a data.frame.
> The signature of dapplyCollect() is as follows:
> {code}
>   dapplyCollect(df, function(ldf) {...})
> {code}
> R function input: local data.frame from the partition on local node
> R function output: local data.frame






[jira] [Commented] (SPARK-15294) Add pivot functionality to SparkR

2016-05-12 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281807#comment-15281807
 ] 

Shivaram Venkataraman commented on SPARK-15294:
---

[~mhnatiuk] Thanks for opening this. I don't think we need any Scala changes to 
implement this. Similar to the count method [1], we just need to add an R 
wrapper that calls the pivot method in Scala. The Python implementation is 
similar, as you can see in [2].

Let us know if you want to open a PR for this.

[1] 
https://github.com/apache/spark/blob/470de743ecf3617babd86f50ab203e85aa975d69/R/pkg/R/group.R#L65
[2] 
https://github.com/apache/spark/blob/470de743ecf3617babd86f50ab203e85aa975d69/python/pyspark/sql/group.py#L171
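
A rough sketch of such a wrapper, following the pattern of the count() method linked in [1]; it relies on SparkR internals (callJMethod, the GroupedData class and its sgd slot, groupedData()), so treat the names as assumptions rather than a final implementation:

{code}
# Rough, hedged sketch of a thin R wrapper over the JVM-side pivot method.
library(SparkR)

setGeneric("pivot", function(x, colname) { standardGeneric("pivot") })

setMethod("pivot",
          signature(x = "GroupedData", colname = "character"),
          function(x, colname) {
            # Delegate to the Scala GroupedData.pivot(pivotColumn) overload
            sgd <- SparkR:::callJMethod(x@sgd, "pivot", colname)
            SparkR:::groupedData(sgd)
          })
{code}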

> Add pivot functionality to SparkR
> -
>
> Key: SPARK-15294
> URL: https://issues.apache.org/jira/browse/SPARK-15294
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Mikołaj Hnatiuk
>Priority: Minor
>  Labels: pivot
>
> R users are very used to transforming data using functions such as dcast 
> (pkg:reshape2). https://github.com/apache/spark/pull/7841 introduces such 
> functionality to Scala and Python APIs. I'd like to suggest adding this 
> functionality into SparkR API to pivot DataFrames.
> I'd love to do this; however, my knowledge of Scala is still limited, but 
> with proper guidance I can give it a try.






[jira] [Commented] (SPARK-14995) Add "since" tag in Roxygen documentation for SparkR API methods

2016-05-12 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281746#comment-15281746
 ] 

Shivaram Venkataraman commented on SPARK-14995:
---

Technically the entire SparkR component is still considered Experimental. But yes, 
in the future we can add more tags like Stable, Experimental, etc. into the same 
note field.

> Add "since" tag in Roxygen documentation for SparkR API methods
> ---
>
> Key: SPARK-14995
> URL: https://issues.apache.org/jira/browse/SPARK-14995
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>
> This is a request to add something to the SparkR API like "versionadded" in 
> the PySpark API and "@since" in the Scala/Java API.






[jira] [Commented] (SPARK-14751) SparkR fails on Cassandra map with numeric key

2016-05-12 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281700#comment-15281700
 ] 

Shivaram Venkataraman commented on SPARK-14751:
---

[~sunrui] That sounds fine to me. Can we make it a whitelist of certain 
SQL types that we do this for? The number of SQL types should be limited, and we 
can be sure of the ones that will work. Also, printing a warning would be good.
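
A small, hedged illustration of the whitelist-plus-warning idea (the type names are examples; this is not the SparkR or connector code):

{code}
# Hedged illustration of the suggestion above: only handle map key types that
# are known to work, and warn otherwise.
supported_key_types <- c("string", "integer", "double")

check_map_key_type <- function(key_type) {
  if (!(key_type %in% supported_key_types)) {
    warning(sprintf("map key type '%s' is not on the supported list; %s",
                    key_type, "conversion to R may fail or lose information"))
    return(FALSE)
  }
  TRUE
}

check_map_key_type("integer")    # TRUE
check_map_key_type("timestamp")  # FALSE, with a warning
{code}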

> SparkR fails on Cassandra map with numeric key
> --
>
> Key: SPARK-14751
> URL: https://issues.apache.org/jira/browse/SPARK-14751
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Michał Matłoka
>
> Hi,
> I have created an issue for spark  cassandra connector ( 
> https://datastax-oss.atlassian.net/projects/SPARKC/issues/SPARKC-366 ) but 
> after a bit of digging it seems this is a better place for this issue:
> {code}
> CREATE TABLE test.map (
> id text,
> somemap map,
> PRIMARY KEY (id)
> );
> insert into test.map(id, somemap) values ('a', { 0 : 12 }); 
> {code}
> {code}
>   sqlContext <- sparkRSQL.init(sc)
>   test <-read.df(sqlContext,  source = "org.apache.spark.sql.cassandra",  
> keyspace = "test", table = "map")
>   head(test)
> {code}
> Results in:
> {code}
> 16/04/19 14:47:02 ERROR RBackendHandler: dfToCols on 
> org.apache.spark.sql.api.r.SQLUtils failed
> Error in readBin(con, raw(), stringLen, endian = "big") :
>   invalid 'n' argument
> {code}
> Problem occurs even for int key. For text key it works. Every scenario works 
> under scala & python.
>  






[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark

2016-05-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277473#comment-15277473
 ] 

Shivaram Venkataraman commented on SPARK-12661:
---

Yes - We will do a full round of AMI updates as well to get Python 2.7, Scala 
2.11 and Java 8 on the EC2 images. I plan to do that once we have 2.0 RC1

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for python 2.6
> see discussion : 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html






[jira] [Updated] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"

2016-05-08 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12479:
--
Assignee: Sun Rui

>  sparkR collect on GroupedData  throws R error "missing value where 
> TRUE/FALSE needed"
> --
>
> Key: SPARK-12479
> URL: https://issues.apache.org/jira/browse/SPARK-12479
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Paulo Magalhaes
>Assignee: Sun Rui
> Fix For: 2.0.0
>
>
> sparkR collect on GroupedData  throws "missing value where TRUE/FALSE needed"
> Spark Version: 1.5.1
> R Version: 3.2.2
> I tracked down the root cause of this exception to a specific key for which 
> the hashCode could not be calculated.
> The following code recreates the problem when ran in sparkR:
> hashCode <- getFromNamespace("hashCode","SparkR")
> hashCode("bc53d3605e8a5b7de1e8e271c2317645")
> Error in if (value > .Machine$integer.max) { :
>   missing value where TRUE/FALSE needed
> I went one step further and realised that the problem happens because of the 
> bitwise shift below returning NA.
> bitwShiftL(-1073741824,1)
> where bitwShiftL is an R function. 
> I believe the bitwShiftL function is working as it is supposed to. Therefore, 
> this PR fixes it in the SparkR package: 
> https://github.com/apache/spark/pull/10436
> .
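
For illustration only (a hedged sketch, not necessarily what the linked PR does): one way to keep the shift total is to emulate a 32-bit signed left shift with modular arithmetic, so overflow wraps instead of producing NA.

{code}
# Hedged sketch: emulate a 32-bit signed left shift without producing NA.
shift_left_32 <- function(x, n) {
  r <- (as.numeric(x) * 2^n) %% 2^32   # wrap around 32 bits
  if (r >= 2^31) r - 2^32 else r       # map back into the signed range
}

bitwShiftL(-1073741824L, 1L)   # NA, as reported above
shift_left_32(-1073741824, 1)  # -2147483648, the wrapped 32-bit result
{code}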






[jira] [Resolved] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"

2016-05-08 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12479.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12976
[https://github.com/apache/spark/pull/12976]

>  sparkR collect on GroupedData  throws R error "missing value where 
> TRUE/FALSE needed"
> --
>
> Key: SPARK-12479
> URL: https://issues.apache.org/jira/browse/SPARK-12479
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Paulo Magalhaes
> Fix For: 2.0.0
>
>
> sparkR collect on GroupedData  throws "missing value where TRUE/FALSE needed"
> Spark Version: 1.5.1
> R Version: 3.2.2
> I tracked down the root cause of this exception to a specific key for which 
> the hashCode could not be calculated.
> The following code recreates the problem when ran in sparkR:
> hashCode <- getFromNamespace("hashCode","SparkR")
> hashCode("bc53d3605e8a5b7de1e8e271c2317645")
> Error in if (value > .Machine$integer.max) { :
>   missing value where TRUE/FALSE needed
> I went one step further and realised that the problem happens because of the 
> bitwise shift below returning NA.
> bitwShiftL(-1073741824,1)
> where bitwShiftL is an R function. 
> I believe the bitwShiftL function is working as it is supposed to. Therefore, 
> this PR fixes it in the SparkR package: 
> https://github.com/apache/spark/pull/10436
> .






[jira] [Resolved] (SPARK-11395) Support over and window specification in SparkR

2016-05-05 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-11395.
---
   Resolution: Fixed
 Assignee: Sun Rui
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/10094

> Support over and window specification in SparkR
> ---
>
> Key: SPARK-11395
> URL: https://issues.apache.org/jira/browse/SPARK-11395
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Sun Rui
>Assignee: Sun Rui
> Fix For: 2.0.0
>
>
> 1. implement over() in Column class.
> 2. support window spec 
> (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.WindowSpec)
> 3. support utility functions for defining window in DataFrames. 
> (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.Window)






[jira] [Commented] (SPARK-10043) Add window functions into SparkR

2016-05-05 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273475#comment-15273475
 ] 

Shivaram Venkataraman commented on SPARK-10043:
---

[~sunrui] Can we resolve this issue now ?

> Add window functions into SparkR
> 
>
> Key: SPARK-10043
> URL: https://issues.apache.org/jira/browse/SPARK-10043
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> Add window functions as follows in SparkR. I think we should improve 
> {{collect}} function in SparkR.
> - lead
> - cumuDist
> - denseRank
> - lag
> - ntile
> - percentRank
> - rank
> - rowNumber






[jira] [Commented] (SPARK-15159) Remove usage of HiveContext in SparkR.

2016-05-05 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273396#comment-15273396
 ] 

Shivaram Venkataraman commented on SPARK-15159:
---

Is it being removed or is it being deprecated in 2.0? If it's being removed, 
then we need to make this a priority.

> Remove usage of HiveContext in SparkR.
> --
>
> Key: SPARK-15159
> URL: https://issues.apache.org/jira/browse/SPARK-15159
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>
> HiveContext is to be deprecated in 2.0.  Replace them with 
> SparkSession.withHiveSupport in SparkR






[jira] [Resolved] (SPARK-15091) Fix warnings and a failure in SparkR test cases with testthat version 1.0.1

2016-05-03 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-15091.
---
   Resolution: Fixed
 Assignee: Sun Rui
Fix Version/s: 2.0.0

Issue resolved by https://github.com/apache/spark/pull/12867

> Fix warnings and a failure in SparkR test cases with testthat version 1.0.1
> ---
>
> Key: SPARK-15091
> URL: https://issues.apache.org/jira/browse/SPARK-15091
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>Assignee: Sun Rui
> Fix For: 2.0.0
>
>
> After upgrading "testthat" package to version 1.0.1, new warnings and a new 
> failure were found in SparkR test cases:
> ```
> Warnings 
> ---
> 1. multiple packages don't produce a warning (@test_client.R#35) - `not()` is 
> deprecated.
> 2. sparkJars sparkPackages as comma-separated strings (@test_context.R#141) - 
> `not()` is deprecated.
> 3. spark.survreg (@test_mllib.R#453) - `not()` is deprecated.
> 4. date functions on a DataFrame (@test_sparkSQL.R#1199) - Deprecated: please 
> use `expect_gt()` instead
> 5. date functions on a DataFrame (@test_sparkSQL.R#1200) - Deprecated: please 
> use `expect_gt()` instead
> 6. date functions on a DataFrame (@test_sparkSQL.R#1201) - Deprecated: please 
> use `expect_gt()` instead
> 7. Method as.data.frame as a synonym for collect() (@test_sparkSQL.R#1899) - 
> `not()` is deprecated.
> Failure: showDF() (@test_sparkSQL.R#1513) ---
> `s` produced no output
> ```
> Changes in releases of testthat can be found at 
> https://github.com/hadley/testthat/blob/master/NEWS.md
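
For reference, a hedged sketch of the replacements those deprecation warnings point at (generic testthat >= 1.0 idioms, not the SparkR test code itself):

{code}
# Hedged examples of the newer testthat expectations mentioned above.
library(testthat)

expect_gt(length(1:3), 0)   # replaces the is_more_than()/expect_more_than() style
expect_error(sqrt(4), NA)   # regexp = NA asserts that no error is raised,
                            # replacing expect_that(..., not(throws_error()))
{code}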






[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-04-29 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264954#comment-15264954
 ] 

Shivaram Venkataraman commented on SPARK-6817:
--

I just merged https://issues.apache.org/jira/browse/SPARK-12919, which contains 
the main part of UDFs (`dapply`). I think we'll have a few follow-up PRs during 
the QA period - but let's leave this as a 2.0 target.



> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.






[jira] [Resolved] (SPARK-12919) Implement dapply() on DataFrame in SparkR

2016-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12919.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12493
[https://github.com/apache/spark/pull/12493]

> Implement dapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12919
> URL: https://issues.apache.org/jira/browse/SPARK-12919
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
> Fix For: 2.0.0
>
>
> dapply() applies an R function on each partition of a DataFrame and returns a 
> new DataFrame.
> The function signature is:
> {code}
>   dapply(df, function(localDF) {}, schema = NULL)
> {code}
> R function input: local data.frame from the partition on local node
> R function output: local data.frame
> Schema specifies the Row format of the resulting DataFrame. It must match the 
> R function's output.
> If schema is not specified, each partition of the result DataFrame will be 
> serialized in R into a single byte array. Such resulting DataFrame can be 
> processed by successive calls to dapply() or collect(), but can't be 
> processed by normal DataFrame operations.
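
A hedged usage sketch of the signature above with an explicit schema (assumes a running SparkR session; names and types are illustrative):

{code}
# Hedged usage sketch of dapply() with an explicit output schema.
df <- createDataFrame(data.frame(a = 1:4, b = c(1.5, 2.5, 3.5, 4.5)))
schema <- structType(structField("a", "integer"),
                     structField("b_doubled", "double"))
out <- dapply(df, function(localDF) {
  data.frame(a = localDF$a, b_doubled = localDF$b * 2)
}, schema)
head(out)
{code}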






[jira] [Updated] (SPARK-12919) Implement dapply() on DataFrame in SparkR

2016-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12919:
--
Assignee: Sun Rui

> Implement dapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12919
> URL: https://issues.apache.org/jira/browse/SPARK-12919
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>Assignee: Sun Rui
> Fix For: 2.0.0
>
>
> dapply() applies an R function on each partition of a DataFrame and returns a 
> new DataFrame.
> The function signature is:
> {code}
>   dapply(df, function(localDF) {}, schema = NULL)
> {code}
> R function input: local data.frame from the partition on local node
> R function output: local data.frame
> Schema specifies the Row format of the resulting DataFrame. It must match the 
> R function's output.
> If schema is not specified, each partition of the result DataFrame will be 
> serialized in R into a single byte array. Such resulting DataFrame can be 
> processed by successive calls to dapply() or collect(), but can't be 
> processed by normal DataFrame operations.






[jira] [Commented] (SPARK-14995) Add "since" tag in Roxygen documentation for SparkR API methods

2016-04-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263549#comment-15263549
 ] 

Shivaram Venkataraman commented on SPARK-14995:
---

We could just use one of the existing sections like @note?
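
For instance, a hedged sketch of what that could look like in roxygen (the function name and version below are hypothetical):

{code}
# Hedged roxygen sketch; the function name and version are hypothetical.
#' Do something in SparkR
#'
#' @param x an input value
#' @return the input, unchanged
#' @note someSparkRFunction since 2.0.0
#' @export
someSparkRFunction <- function(x) {
  x
}
{code}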

> Add "since" tag in Roxygen documentation for SparkR API methods
> ---
>
> Key: SPARK-14995
> URL: https://issues.apache.org/jira/browse/SPARK-14995
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>
> This is a request to add something to the SparkR API like "versionadded" in 
> the PySpark API and "@since" in the Scala/Java API.






[jira] [Resolved] (SPARK-10894) Add 'drop' support for DataFrame's subset function

2016-04-28 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10894.
---
Resolution: Fixed

> Add 'drop' support for DataFrame's subset function
> --
>
> Key: SPARK-10894
> URL: https://issues.apache.org/jira/browse/SPARK-10894
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Weiqiang Zhuang
>
> A SparkR DataFrame can be subset to get one or more columns of the dataset. The 
> current '[' implementation does not support 'drop' when it is asked for just one 
> column. This is not consistent with the R syntax:
> x[i, j, ... , drop = TRUE]
> # in R, when drop is FALSE, remain as data.frame
> > class(iris[, "Sepal.Width", drop=F])
> [1] "data.frame"
> # when drop is TRUE (default), drop to be a vector
> > class(iris[, "Sepal.Width", drop=T])
> [1] "numeric"
> > class(iris[,"Sepal.Width"])
> [1] "numeric"
> > df <- createDataFrame(sqlContext, iris)
> # in SparkR, 'drop' argument has no impact
> > class(df[,"Sepal_Width", drop=F])
> [1] "DataFrame"
> attr(,"package")
> [1] "SparkR"
> # should have dropped to be a Column class instead
> > class(df[,"Sepal_Width", drop=T])
> [1] "DataFrame"
> attr(,"package")
> [1] "SparkR"
> > class(df[,"Sepal_Width"])
> [1] "DataFrame"
> attr(,"package")
> [1] "SparkR"
> We should add the 'drop' support.






[jira] [Resolved] (SPARK-10346) SparkR mutate and transform should replace column with same name to match R data.frame behavior

2016-04-28 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10346.
---
Resolution: Fixed

> SparkR mutate and transform should replace column with same name to match R 
> data.frame behavior
> ---
>
> Key: SPARK-10346
> URL: https://issues.apache.org/jira/browse/SPARK-10346
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Felix Cheung
>
> Spark doesn't seem to replace an existing column with the same name in mutate (i.e. 
> mutate(df, age = df$age + 2) returns a DataFrame with 2 columns with the same 
> name 'age'), so we are not doing that for now in transform.
> Though it is clearly stated it should replace column with matching name:
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/transform.html
> "The tags are matched against names(_data), and for those that match, the 
> value replace the corresponding variable in _data, and the others are 
> appended to _data."
> Also the resulting DataFrame might be hard to work with if one is to use 
> select with column names, or to register the table to SQL, and so on, since 
> then 2 columns have the same name.






[jira] [Updated] (SPARK-6817) DataFrame UDFs in R

2016-04-28 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6817:
-
Target Version/s: 2.0.0

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.






[jira] [Resolved] (SPARK-12235) Enhance mutate() to support replace existing columns

2016-04-28 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12235.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10220
[https://github.com/apache/spark/pull/10220]

> Enhance mutate() to support replace existing columns
> 
>
> Key: SPARK-12235
> URL: https://issues.apache.org/jira/browse/SPARK-12235
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
> Fix For: 2.0.0
>
>
> mutate() in the dplyr package supports adding new columns and replacing 
> existing columns. But currently the implementation of mutate() in SparkR 
> supports adding new columns only.
> Also make the behavior of mutate more consistent with that in dplyr.
> 1. Throw an error message when there are duplicated column names in the 
> DataFrame being mutated.
> 2. When there are duplicated column names in the columns specified by the 
> arguments, the last column of the same name takes effect.






[jira] [Updated] (SPARK-12235) Enhance mutate() to support replace existing columns

2016-04-28 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12235:
--
Assignee: Sun Rui

> Enhance mutate() to support replace existing columns
> 
>
> Key: SPARK-12235
> URL: https://issues.apache.org/jira/browse/SPARK-12235
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>Assignee: Sun Rui
> Fix For: 2.0.0
>
>
> mutate() in the dplyr package supports adding new columns and replacing 
> existing columns. But currently the implementation of mutate() in SparkR 
> supports adding new columns only.
> Also make the behavior of mutate more consistent with that in dplyr.
> 1. Throw an error message when there are duplicated column names in the 
> DataFrame being mutated.
> 2. When there are duplicated column names in the columns specified by the 
> arguments, the last column of the same name takes effect.






[jira] [Resolved] (SPARK-13436) Add parameter drop to subsetting operator [

2016-04-27 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-13436.
---
   Resolution: Fixed
 Assignee: Oscar D. Lara Yejas
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/11318

> Add parameter drop to subsetting operator [
> ---
>
> Key: SPARK-13436
> URL: https://issues.apache.org/jira/browse/SPARK-13436
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>Assignee: Oscar D. Lara Yejas
> Fix For: 2.0.0
>
>
> The parameter drop allows returning a vector or a data.frame accordingly if the 
> result of subsetting a data.frame has a single column (see the example below). 
> The same behavior is needed on a DataFrame.
> > head(iris[, 1, drop=F])
>   Sepal.Length
> 1  5.1
> 2  4.9
> 3  4.7
> 4  4.6
> 5  5.0
> 6  5.4
> > head(iris[, 1, drop=T])
> [1] 5.1 4.9 4.7 4.6 5.0 5.4






[jira] [Commented] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-27 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261005#comment-15261005
 ] 

Shivaram Venkataraman commented on SPARK-14831:
---

+1

> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without 
> associating with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes in the last minute. But I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]






[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-27 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260626#comment-15260626
 ] 

Shivaram Venkataraman commented on SPARK-12922:
---

Could you post a WIP pull request using your own column appender ? I am not too 
familiar with the Spark SQL internals but I think [~rxin] or [~davies] will be 
able to provide feedback if we have a PR up.

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: the grouping key values and a local data.frame holding the 
> grouped data.
> R function output: a local data.frame.
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported; users 
> can do map-side combination via dapply().
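
For illustration, a hedged usage sketch of the proposed API (the dataset, the 
column names, and the exact argument order are assumptions, not a final signature):

{code}
# Hypothetical sketch: group a SparkDataFrame of the "faithful" data by "waiting"
# and compute the mean eruption time per group with the proposed gapply()
schema <- structType(structField("waiting", "double"),
                     structField("avg_eruptions", "double"))
result <- gapply(df, "waiting",
                 function(key, x) {
                   data.frame(key, mean(x$eruptions))
                 },
                 schema)
head(result)
{code}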



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-27 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260360#comment-15260360
 ] 

Shivaram Venkataraman commented on SPARK-12922:
---

[~Narine] Any update on this ? Would be great to have this in Spark 2.0

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: the grouping key values and a local data.frame holding the 
> grouped data.
> R function output: a local data.frame.
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported; users 
> can do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13734) SparkR histogram

2016-04-26 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-13734:
--
Assignee: Oscar D. Lara Yejas

> SparkR histogram
> 
>
> Key: SPARK-13734
> URL: https://issues.apache.org/jira/browse/SPARK-13734
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>Assignee: Oscar D. Lara Yejas
>Priority: Minor
> Fix For: 2.0.0
>
>
> Create method histogram() on SparkR to render a histogram of a given Column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13734) SparkR histogram

2016-04-26 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-13734.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11569
[https://github.com/apache/spark/pull/11569]

> SparkR histogram
> 
>
> Key: SPARK-13734
> URL: https://issues.apache.org/jira/browse/SPARK-13734
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>Priority: Minor
> Fix For: 2.0.0
>
>
> Create method histogram() on SparkR to render a histogram of a given Column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14883) Fix wrong R examples and make them up-to-date

2016-04-24 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-14883.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12649
[https://github.com/apache/spark/pull/12649]

> Fix wrong R examples and make them up-to-date
> -
>
> Key: SPARK-14883
> URL: https://issues.apache.org/jira/browse/SPARK-14883
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Examples
>Reporter: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> This issue aims to fix some errors in R examples and make them up-to-date in 
> docs and example modules.
> - Remove the wrong usage of map. We need to use `lapply` in `SparkR` if 
> needed. However, `lapply` is private now. The correct usage will be added 
> later.
> {code}
> -teenNames <- map(teenagers, function(p) { paste("Name:", p$name)})
> ...
> {code}
> - Fix the wrong example in Section `Generic Load/Save Functions` of 
> `docs/sql-programming-guide.md` for consistency.
> {code}
> -df <- loadDF(sqlContext, "people.parquet")
> -saveDF(select(df, "name", "age"), "namesAndAges.parquet")
> +df <- read.df(sqlContext, "examples/src/main/resources/users.parquet")
> +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.parquet")
> {code}
> - Fix datatypes in `sparkr.md`.
> {code}
> -#  |-- age: integer (nullable = true)
> +#  |-- age: long (nullable = true)
> {code}
> {code}
> -## DataFrame[eruptions:double, waiting:double]
> +## SparkDataFrame[eruptions:double, waiting:double]
> {code}
> - Update data results
> {code}
>  head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))
>  ##  waiting count
> -##1  81 13
> -##2  60 6
> -##3  68 1
> +##1  70 4
> +##2  67 1
> +##3  69 2
> {code}
> - Replace deprecated functions: jsonFile -> read.json, parquetFile -> 
> read.parquet
> {code}
> df <- jsonFile(sqlContext, "examples/src/main/resources/people.json")
> Warning message:
> 'jsonFile' is deprecated.
> Use 'read.json' instead.
> See help("Deprecated") 
> {code}
> - Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, 
> saveAsParquetFile -> write.parquet
> - Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and 
> `data-manipulation.R`.
> - Other minor syntax fixes and typos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14883) Fix wrong R examples and make them up-to-date

2016-04-24 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-14883:
--
Assignee: Dongjoon Hyun

> Fix wrong R examples and make them up-to-date
> -
>
> Key: SPARK-14883
> URL: https://issues.apache.org/jira/browse/SPARK-14883
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Examples
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> This issue aims to fix some errors in R examples and make them up-to-date in 
> docs and example modules.
> - Remove the wrong usage of map. We need to use `lapply` in `SparkR` if 
> needed. However, `lapply` is private now. The correct usage will be added 
> later.
> {code}
> -teenNames <- map(teenagers, function(p) { paste("Name:", p$name)})
> ...
> {code}
> - Fix the wrong example in Section `Generic Load/Save Functions` of 
> `docs/sql-programming-guide.md` for consistency.
> {code}
> -df <- loadDF(sqlContext, "people.parquet")
> -saveDF(select(df, "name", "age"), "namesAndAges.parquet")
> +df <- read.df(sqlContext, "examples/src/main/resources/users.parquet")
> +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.parquet")
> {code}
> - Fix datatypes in `sparkr.md`.
> {code}
> -#  |-- age: integer (nullable = true)
> +#  |-- age: long (nullable = true)
> {code}
> {code}
> -## DataFrame[eruptions:double, waiting:double]
> +## SparkDataFrame[eruptions:double, waiting:double]
> {code}
> - Update data results
> {code}
>  head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))
>  ##  waiting count
> -##1  81 13
> -##2  60 6
> -##3  68 1
> +##1  70 4
> +##2  67 1
> +##3  69 2
> {code}
> - Replace deprecated functions: jsonFile -> read.json, parquetFile -> 
> read.parquet
> {code}
> df <- jsonFile(sqlContext, "examples/src/main/resources/people.json")
> Warning message:
> 'jsonFile' is deprecated.
> Use 'read.json' instead.
> See help("Deprecated") 
> {code}
> - Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, 
> saveAsParquetFile -> write.parquet
> - Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and 
> `data-manipulation.R`.
> - Other minor syntax fixes and typos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14594) Improve error messages for RDD API

2016-04-23 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-14594:
--
Assignee: Felix Cheung

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
>Assignee: Felix Cheung
> Fix For: 2.0.0
>
>
> When you have an error in your R code using the RDD API, you always get the 
> following error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14594) Improve error messages for RDD API

2016-04-23 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-14594.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12622
[https://github.com/apache/spark/pull/12622]

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
> Fix For: 2.0.0
>
>
> When you have an error in your R code using the RDD API, you always get the 
> following error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-22 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254551#comment-15254551
 ] 

Shivaram Venkataraman commented on SPARK-14831:
---

1. Agree. I think a valid policy could be that if we are able to support, say, 
most of the functionality in the base R function then we add the overloaded 
method. All methods, though, will have the spark.* variant. We 
can do one pass right now to add the spark.* methods and remove the overloads 
that don't match the base R functionality well enough.

2. We have so far used `read.df` and `write.df` to save and load data frames. I 
think read.model and write.model might work (I can't find an overloaded method 
in R for that), but I'm also fine if we just want to have a separate set of 
commands for models.
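
Purely as an illustration of the naming question above, a sketch with 
hypothetical model I/O verbs (read.model / write.model are not an existing API):

{code}
# Hypothetical sketch; df is assumed to be a SparkDataFrame of the iris data
model <- glm(Sepal_Length ~ Sepal_Width, family = "gaussian", data = df)
write.model(model, "/tmp/glm_model")    # analogous to write.df for data frames
model2 <- read.model("/tmp/glm_model")  # analogous to read.df
{code}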

> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without 
> associating with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes in the last minute. But I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-22 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254447#comment-15254447
 ] 

Shivaram Venkataraman commented on SPARK-14831:
---

Yeah I think there are a couple of factors to consider here

1. Existing R users who want to use SparkR: for this case I think it's valuable 
to have the methods mimic the ordering used by the corresponding R 
function. So we would then have kmeans(data, centers, ...) and glm(formula, 
family, data, ...). I think it's useful to mimic the ordering for two reasons: 
(a) it helps with familiarity, and (b) it ensures we can safely override the 
base R functions as they are now.

2. New users of SparkR / Spark ML: I think internal consistency is 
useful for these users. My take on the SparkR API has always been that it doesn't 
hurt to support multiple ways to do things as long as they don't collide. In 
that scenario, if we want to define a new set of consistent APIs we should adopt 
a new namespace as [~mengxr] indicated. I would suggest `spark.kmeans` and 
`spark.glm` as opposed to `ml.glm` to make it clearer that these are SparkR 
functions (we are also using spark.lapply, for example).
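
For reference, the base R argument orders that option 1 proposes to mirror 
(plain base R, runnable without Spark):

{code}
km  <- kmeans(iris[, 1:4], centers = 3)                                 # data first, then centers
fit <- glm(Sepal.Length ~ Sepal.Width, family = gaussian, data = iris)  # formula, family, data
{code}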

> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without 
> associating with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes in the last minute. But I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13178) RRDD faces with concurrency issue in case of rdd.zip(rdd).count()

2016-04-22 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-13178:
--
Assignee: Sun Rui

> RRDD faces with concurrency issue in case of rdd.zip(rdd).count()
> -
>
> Key: SPARK-13178
> URL: https://issues.apache.org/jira/browse/SPARK-13178
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Xusen Yin
>Assignee: Sun Rui
> Fix For: 2.0.0
>
>
> In Kmeans algorithm, there is a zip operation before taking samples, i.e. 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L210,
>  which can be simplified in the following code:
> {code:title=zip.scala|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> val rdd =  ...
> val rdd2 = rdd.map(x => x)
> rdd.zip(rdd2).count()
> {code}
> However, RRDD fails on this operation with an error of "can only zip rdd with 
> same number of elements" or "stream closed", similar to the JIRA issue: 
> https://issues.apache.org/jira/browse/SPARK-2251
> Inside RRDD, a data stream is used to ingest data from the R side. In the zip 
> operation, zip with self computes each partition twice. So if we zip a 
> HadoopRDD (iris dataset) with itself, we get 
> {code:title=log-from-zip-HadoopRDD|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> we get a pair (6.8, 6.8)
> we get a pair (5.1, 5.1)
> we get a pair (6.7, 6.7)
> we get a pair (4.9, 4.9)
> we get a pair (6.0, 6.0)
> we get a pair (4.7, 4.7)
> we get a pair (5.7, 5.7)
> we get a pair (4.6, 4.6)
> we get a pair (5.5, 5.5)
> we get a pair (5.0, 5.0)
> we get a pair (5.5, 5.5)
> we get a pair (5.4, 5.4)
> we get a pair (5.8, 5.8)
> we get a pair (4.6, 4.6)
> we get a pair (6.0, 6.0)
> we get a pair (5.0, 5.0)
> we get a pair (5.4, 5.4)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair (4.9, 4.9)
> we get a pair (6.7, 6.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.6, 5.6)
> we get a pair (4.8, 4.8)
> we get a pair (5.5, 5.5)
> we get a pair (4.3, 4.3)
> we get a pair (5.5, 5.5)
> we get a pair (5.8, 5.8)
> we get a pair (6.1, 6.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.8, 5.8)
> we get a pair (5.4, 5.4)
> we get a pair (5.0, 5.0)
> we get a pair (5.1, 5.1)
> we get a pair (5.6, 5.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.2, 6.2)
> we get a pair (5.1, 5.1)
> we get a pair (5.1, 5.1)
> we get a pair (4.6, 4.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.8, 5.8)
> we get a pair (5.0, 5.0)
> we get a pair (7.1, 7.1)
> we get a pair (5.0, 5.0)
> we get a pair (6.3, 6.3)
> we get a pair (5.2, 5.2)
> we get a pair (6.5, 6.5)
> we get a pair (5.2, 5.2)
> we get a pair (7.6, 7.6)
> we get a pair (4.7, 4.7)
> we get a pair (4.9, 4.9)
> we get a pair (4.8, 4.8)
> we get a pair (7.3, 7.3)
> we get a pair (5.4, 5.4)
> we get a pair (6.7, 6.7)
> we get a pair (5.2, 5.2)
> we get a pair (7.2, 7.2)
> we get a pair (5.5, 5.5)
> we get a pair (6.5, 6.5)
> we get a pair (4.9, 4.9)
> we get a pair (6.4, 6.4)
> we get a pair (5.0, 5.0)
> we get a pair (6.8, 6.8)
> we get a pair (5.5, 5.5)
> we get a pair (5.7, 5.7)
> we get a pair (4.9, 4.9)
> we get a pair (5.8, 5.8)
> we get a pair (4.4, 4.4)
> we get a pair (6.4, 6.4)
> we get a pair (5.1, 5.1)
> we get a pair (6.5, 6.5)
> we get a pair (5.0, 5.0)
> we get a pair (7.7, 7.7)
> we get a pair (4.5, 4.5)
> we get a pair (7.7, 7.7)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair (5.0, 5.0)
> we get a pair (6.9, 6.9)
> we get a pair (5.1, 5.1)
> we get a pair (5.6, 5.6)
> we get a pair (4.8, 4.8)
> we get a pair (7.7, 7.7)
> we get a pair (6.3, 6.3)
> we get a pair (5.1, 5.1)
> we get a pair (6.7, 6.7)
> we get a pair (4.6, 4.6)
> we get a pair (7.2, 7.2)
> we get a pair (5.3, 5.3)
> we get a pair (6.2, 6.2)
> we get a pair (5.0, 5.0)
> we get a pair (6.1, 6.1)
> we get a pair (7.0, 7.0)
> we get a pair (6.4, 6.4)
> we get a pair (6.4, 6.4)
> we get a pair (7.2, 7.2)
> we get a pair (6.9, 6.9)
> we get a pair (7.4, 7.4)
> we get a pair (5.5, 5.5)
> we get a pair (7.9, 7.9)
> we get a pair (6.5, 6.5)
> we get a pair (6.4, 6.4)
> we get a pair (5.7, 5.7)
> we get a pair (6.3, 6.3)
> we get a pair (6.3, 6.3)
> we get a pair (6.1, 6.1)
> we get a pair (4.9, 4.9)
> we get a pair (7.7, 7.7)
> we get a pair (6.6, 6.6)
> we get a pair (6.3, 6.3)
> we get a pair (5.2, 5.2)
> we get a pair (6.4, 6.4)
> we get a pair (5.0, 5.0)
> we get a pair (6.0, 6.0)
> we get a pair (5.9, 5.9)

[jira] [Resolved] (SPARK-13178) RRDD faces with concurrency issue in case of rdd.zip(rdd).count()

2016-04-22 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-13178.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12606
[https://github.com/apache/spark/pull/12606]

> RRDD faces with concurrency issue in case of rdd.zip(rdd).count()
> -
>
> Key: SPARK-13178
> URL: https://issues.apache.org/jira/browse/SPARK-13178
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Xusen Yin
> Fix For: 2.0.0
>
>
> In Kmeans algorithm, there is a zip operation before taking samples, i.e. 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L210,
>  which can be simplified in the following code:
> {code:title=zip.scala|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> val rdd =  ...
> val rdd2 = rdd.map(x => x)
> rdd.zip(rdd2).count()
> {code}
> However, RRDD fails on this operation with an error of "can only zip rdd with 
> same number of elements" or "stream closed", similar to the JIRA issue: 
> https://issues.apache.org/jira/browse/SPARK-2251
> Inside RRDD, a data stream is used to ingest data from the R side. In the zip 
> operation, zip with self computes each partition twice. So if we zip a 
> HadoopRDD (iris dataset) with itself, we get 
> {code:title=log-from-zip-HadoopRDD|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> we get a pair (6.8, 6.8)
> we get a pair (5.1, 5.1)
> we get a pair (6.7, 6.7)
> we get a pair (4.9, 4.9)
> we get a pair (6.0, 6.0)
> we get a pair (4.7, 4.7)
> we get a pair (5.7, 5.7)
> we get a pair (4.6, 4.6)
> we get a pair (5.5, 5.5)
> we get a pair (5.0, 5.0)
> we get a pair (5.5, 5.5)
> we get a pair (5.4, 5.4)
> we get a pair (5.8, 5.8)
> we get a pair (4.6, 4.6)
> we get a pair (6.0, 6.0)
> we get a pair (5.0, 5.0)
> we get a pair (5.4, 5.4)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair (4.9, 4.9)
> we get a pair (6.7, 6.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.6, 5.6)
> we get a pair (4.8, 4.8)
> we get a pair (5.5, 5.5)
> we get a pair (4.3, 4.3)
> we get a pair (5.5, 5.5)
> we get a pair (5.8, 5.8)
> we get a pair (6.1, 6.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.8, 5.8)
> we get a pair (5.4, 5.4)
> we get a pair (5.0, 5.0)
> we get a pair (5.1, 5.1)
> we get a pair (5.6, 5.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.2, 6.2)
> we get a pair (5.1, 5.1)
> we get a pair (5.1, 5.1)
> we get a pair (4.6, 4.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.8, 5.8)
> we get a pair (5.0, 5.0)
> we get a pair (7.1, 7.1)
> we get a pair (5.0, 5.0)
> we get a pair (6.3, 6.3)
> we get a pair (5.2, 5.2)
> we get a pair (6.5, 6.5)
> we get a pair (5.2, 5.2)
> we get a pair (7.6, 7.6)
> we get a pair (4.7, 4.7)
> we get a pair (4.9, 4.9)
> we get a pair (4.8, 4.8)
> we get a pair (7.3, 7.3)
> we get a pair (5.4, 5.4)
> we get a pair (6.7, 6.7)
> we get a pair (5.2, 5.2)
> we get a pair (7.2, 7.2)
> we get a pair (5.5, 5.5)
> we get a pair (6.5, 6.5)
> we get a pair (4.9, 4.9)
> we get a pair (6.4, 6.4)
> we get a pair (5.0, 5.0)
> we get a pair (6.8, 6.8)
> we get a pair (5.5, 5.5)
> we get a pair (5.7, 5.7)
> we get a pair (4.9, 4.9)
> we get a pair (5.8, 5.8)
> we get a pair (4.4, 4.4)
> we get a pair (6.4, 6.4)
> we get a pair (5.1, 5.1)
> we get a pair (6.5, 6.5)
> we get a pair (5.0, 5.0)
> we get a pair (7.7, 7.7)
> we get a pair (4.5, 4.5)
> we get a pair (7.7, 7.7)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair (5.0, 5.0)
> we get a pair (6.9, 6.9)
> we get a pair (5.1, 5.1)
> we get a pair (5.6, 5.6)
> we get a pair (4.8, 4.8)
> we get a pair (7.7, 7.7)
> we get a pair (6.3, 6.3)
> we get a pair (5.1, 5.1)
> we get a pair (6.7, 6.7)
> we get a pair (4.6, 4.6)
> we get a pair (7.2, 7.2)
> we get a pair (5.3, 5.3)
> we get a pair (6.2, 6.2)
> we get a pair (5.0, 5.0)
> we get a pair (6.1, 6.1)
> we get a pair (7.0, 7.0)
> we get a pair (6.4, 6.4)
> we get a pair (6.4, 6.4)
> we get a pair (7.2, 7.2)
> we get a pair (6.9, 6.9)
> we get a pair (7.4, 7.4)
> we get a pair (5.5, 5.5)
> we get a pair (7.9, 7.9)
> we get a pair (6.5, 6.5)
> we get a pair (6.4, 6.4)
> we get a pair (5.7, 5.7)
> we get a pair (6.3, 6.3)
> we get a pair (6.3, 6.3)
> we get a pair (6.1, 6.1)
> we get a pair (4.9, 4.9)
> we get a pair (7.7, 7.7)
> we get a pair (6.6, 6.6)
> we get a pair (6.3, 6.3)
> we get a pair (5.2, 5.2)
> we get a pair (6.4, 

[jira] [Commented] (SPARK-14751) SparkR fails on Cassandra map with numeric key

2016-04-20 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250385#comment-15250385
 ] 

Shivaram Venkataraman commented on SPARK-14751:
---

I think the problem here is that map types in SQL are converted to environments 
in R and environments in R can only have string keys as far as I know. Is there 
a better way to represent a map in R ?  
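
A minimal base R illustration of the constraint mentioned above (environments 
only accept character keys, which is why numeric map keys break):

{code}
e <- new.env()
assign("1", 12, envir = e)    # keys must be character strings
ls(e)                         # "1"
# assign(1, 12, envir = e)    # errors with "invalid first argument"
{code}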

> SparkR fails on Cassandra map with numeric key
> --
>
> Key: SPARK-14751
> URL: https://issues.apache.org/jira/browse/SPARK-14751
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Michał Matłoka
>
> Hi,
> I have created an issue for spark  cassandra connector ( 
> https://datastax-oss.atlassian.net/projects/SPARKC/issues/SPARKC-366 ) but 
> after a bit of digging it seems this is a better place for this issue:
> {code}
> CREATE TABLE test.map (
> id text,
> somemap map,
> PRIMARY KEY (id)
> );
> insert into test.map(id, somemap) values ('a', { 0 : 12 }); 
> {code}
> {code}
>   sqlContext <- sparkRSQL.init(sc)
>   test <-read.df(sqlContext,  source = "org.apache.spark.sql.cassandra",  
> keyspace = "test", table = "map")
>   head(test)
> {code}
> Results in:
> {code}
> 16/04/19 14:47:02 ERROR RBackendHandler: dfToCols on 
> org.apache.spark.sql.api.r.SQLUtils failed
> Error in readBin(con, raw(), stringLen, endian = "big") :
>   invalid 'n' argument
> {code}
> Problem occurs even for int key. For text key it works. Every scenario works 
> under scala & python.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13178) RRDD faces with concurrency issue in case of rdd.zip(rdd).count()

2016-04-20 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250268#comment-15250268
 ] 

Shivaram Venkataraman commented on SPARK-13178:
---

[~sunrui] [~yinxusen] Now that https://github.com/apache/spark/pull/10947 has 
been merged, is this issue resolved ?

> RRDD faces with concurrency issue in case of rdd.zip(rdd).count()
> -
>
> Key: SPARK-13178
> URL: https://issues.apache.org/jira/browse/SPARK-13178
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Xusen Yin
>
> In Kmeans algorithm, there is a zip operation before taking samples, i.e. 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L210,
>  which can be simplified in the following code:
> {code:title=zip.scala|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> val rdd =  ...
> val rdd2 = rdd.map(x => x)
> rdd.zip(rdd2).count()
> {code}
> However, RRDD fails on this operation with an error of "can only zip rdd with 
> same number of elements" or "stream closed", similar to the JIRA issue: 
> https://issues.apache.org/jira/browse/SPARK-2251
> Inside RRDD, a data stream is used to ingest data from the R side. In the zip 
> operation, zip with self computes each partition twice. So if we zip a 
> HadoopRDD (iris dataset) with itself, we get 
> {code:title=log-from-zip-HadoopRDD|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> we get a pair (6.8, 6.8)
> we get a pair (5.1, 5.1)
> we get a pair (6.7, 6.7)
> we get a pair (4.9, 4.9)
> we get a pair (6.0, 6.0)
> we get a pair (4.7, 4.7)
> we get a pair (5.7, 5.7)
> we get a pair (4.6, 4.6)
> we get a pair (5.5, 5.5)
> we get a pair (5.0, 5.0)
> we get a pair (5.5, 5.5)
> we get a pair (5.4, 5.4)
> we get a pair (5.8, 5.8)
> we get a pair (4.6, 4.6)
> we get a pair (6.0, 6.0)
> we get a pair (5.0, 5.0)
> we get a pair (5.4, 5.4)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair (4.9, 4.9)
> we get a pair (6.7, 6.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.6, 5.6)
> we get a pair (4.8, 4.8)
> we get a pair (5.5, 5.5)
> we get a pair (4.3, 4.3)
> we get a pair (5.5, 5.5)
> we get a pair (5.8, 5.8)
> we get a pair (6.1, 6.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.8, 5.8)
> we get a pair (5.4, 5.4)
> we get a pair (5.0, 5.0)
> we get a pair (5.1, 5.1)
> we get a pair (5.6, 5.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.2, 6.2)
> we get a pair (5.1, 5.1)
> we get a pair (5.1, 5.1)
> we get a pair (4.6, 4.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.8, 5.8)
> we get a pair (5.0, 5.0)
> we get a pair (7.1, 7.1)
> we get a pair (5.0, 5.0)
> we get a pair (6.3, 6.3)
> we get a pair (5.2, 5.2)
> we get a pair (6.5, 6.5)
> we get a pair (5.2, 5.2)
> we get a pair (7.6, 7.6)
> we get a pair (4.7, 4.7)
> we get a pair (4.9, 4.9)
> we get a pair (4.8, 4.8)
> we get a pair (7.3, 7.3)
> we get a pair (5.4, 5.4)
> we get a pair (6.7, 6.7)
> we get a pair (5.2, 5.2)
> we get a pair (7.2, 7.2)
> we get a pair (5.5, 5.5)
> we get a pair (6.5, 6.5)
> we get a pair (4.9, 4.9)
> we get a pair (6.4, 6.4)
> we get a pair (5.0, 5.0)
> we get a pair (6.8, 6.8)
> we get a pair (5.5, 5.5)
> we get a pair (5.7, 5.7)
> we get a pair (4.9, 4.9)
> we get a pair (5.8, 5.8)
> we get a pair (4.4, 4.4)
> we get a pair (6.4, 6.4)
> we get a pair (5.1, 5.1)
> we get a pair (6.5, 6.5)
> we get a pair (5.0, 5.0)
> we get a pair (7.7, 7.7)
> we get a pair (4.5, 4.5)
> we get a pair (7.7, 7.7)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair (5.0, 5.0)
> we get a pair (6.9, 6.9)
> we get a pair (5.1, 5.1)
> we get a pair (5.6, 5.6)
> we get a pair (4.8, 4.8)
> we get a pair (7.7, 7.7)
> we get a pair (6.3, 6.3)
> we get a pair (5.1, 5.1)
> we get a pair (6.7, 6.7)
> we get a pair (4.6, 4.6)
> we get a pair (7.2, 7.2)
> we get a pair (5.3, 5.3)
> we get a pair (6.2, 6.2)
> we get a pair (5.0, 5.0)
> we get a pair (6.1, 6.1)
> we get a pair (7.0, 7.0)
> we get a pair (6.4, 6.4)
> we get a pair (6.4, 6.4)
> we get a pair (7.2, 7.2)
> we get a pair (6.9, 6.9)
> we get a pair (7.4, 7.4)
> we get a pair (5.5, 5.5)
> we get a pair (7.9, 7.9)
> we get a pair (6.5, 6.5)
> we get a pair (6.4, 6.4)
> we get a pair (5.7, 5.7)
> we get a pair (6.3, 6.3)
> we get a pair (6.3, 6.3)
> we get a pair (6.1, 6.1)
> we get a pair (4.9, 4.9)
> we get a pair (7.7, 7.7)
> we get a pair (6.6, 6.6)
> we get a pair (6.3, 6.3)
> we get a pair (5.2, 5.2)
> we get a pair (6.4, 6.4)
> we 

[jira] [Commented] (SPARK-14746) Support transformations in R source code for Dataset/DataFrame

2016-04-20 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250229#comment-15250229
 ] 

Shivaram Venkataraman commented on SPARK-14746:
---

Yeah, I can see that RDD pipe is limited and it would be good to have some 
better support for calling into R from a Scala / Java environment. However, I 
feel this introduces a new paradigm in DataFrame / Spark where one can mix 
and match languages (so far we have built things to be used in Python / R) and 
I don't know if this is a route we want to go down ([~rxin] ?). My other 
reason for considering this lower priority is that it is mainly useful for users who 
are proficient in more than one language, which I think is a smaller set of users.

> Support transformations in R source code for Dataset/DataFrame
> --
>
> Key: SPARK-14746
> URL: https://issues.apache.org/jira/browse/SPARK-14746
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Sun Rui
>
> There actually is a desired scenario mentioned several times on the Spark 
> mailing list: users are writing Scala/Java Spark applications (not 
> SparkR) but want to use R functions in some transformations. Typically this 
> can be achieved by calling pipe() on an RDD. However, there are limitations on 
> pipe(). So we can support applying an R function in source code format to a 
> Dataset/DataFrame (thus SparkR is not needed for serializing an R function).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14742) Redirect spark-ec2 doc to new location

2016-04-20 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250216#comment-15250216
 ] 

Shivaram Venkataraman commented on SPARK-14742:
---

Ah my bad - I forgot that docs/ is actually in the master branch of our GitHub 
repo and not on the ASF site. Can we add the auto-redirect or an 
explicit redirect message in docs/ec2-scripts.html ? 

> Redirect spark-ec2 doc to new location
> --
>
> Key: SPARK-14742
> URL: https://issues.apache.org/jira/browse/SPARK-14742
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> See: https://github.com/amplab/spark-ec2/pull/24#issuecomment-212033453
> We need to redirect this page
> http://spark.apache.org/docs/latest/ec2-scripts.html
> to this page
> https://github.com/amplab/spark-ec2#readme



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13905) Change signature of as.data.frame() to be consistent with the R base package

2016-04-19 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-13905.
---
   Resolution: Fixed
 Assignee: Sun Rui
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/11811

> Change signature of as.data.frame() to be consistent with the R base package
> 
>
> Key: SPARK-13905
> URL: https://issues.apache.org/jira/browse/SPARK-13905
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>Assignee: Sun Rui
> Fix For: 2.0.0
>
>
> change the signature of as.data.frame() to be consistent with that in the R 
> base package to meet R user's convention, as documented at 
> http://www.inside-r.org/r-doc/base/as.data.frame



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12224) R support for JDBC source

2016-04-19 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12224.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10480
[https://github.com/apache/spark/pull/10480]

> R support for JDBC source
> -
>
> Key: SPARK-12224
> URL: https://issues.apache.org/jira/browse/SPARK-12224
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12224) R support for JDBC source

2016-04-19 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12224:
--
Assignee: Felix Cheung  (was: Apache Spark)

> R support for JDBC source
> -
>
> Key: SPARK-12224
> URL: https://issues.apache.org/jira/browse/SPARK-12224
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14037) count(df) is very slow for dataframe constrcuted using SparkR::createDataFrame

2016-04-19 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248416#comment-15248416
 ] 

Shivaram Venkataraman commented on SPARK-14037:
---

Thanks [~samalexg] and [~sunrui] for investigating this issue. I think one 
thing that we could add to our profiling is the size of the data being written 
out of the RRDD. My guess is that the overhead is due to (a) serialization of 
strings being slow in SparkR and (b) string serialization increasing the 
size of the data being written out; since /tmp might not be mounted in 
memory here, we may be running into disk overheads.
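
A hedged sketch of how the data-size part of this hypothesis could be checked 
(the file name and sqlContext are taken from the report; nothing here is a fix):

{code}
r_df <- read.csv(file = "r_df.csv", header = TRUE, sep = ",")
print(object.size(r_df), units = "MB")                    # size of the local R data.frame
system.time(sp_df <- createDataFrame(sqlContext, r_df))   # serialization + write to local tmp
system.time(count(sp_df))                                 # the slow action being investigated
{code}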

> count(df) is very slow for dataframe constrcuted using SparkR::createDataFrame
> --
>
> Key: SPARK-14037
> URL: https://issues.apache.org/jira/browse/SPARK-14037
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Ubuntu 12.04
> RAM : 6 GB
> Spark 1.6.1 Standalone
>Reporter: Samuel Alexander
>  Labels: performance, sparkR
> Attachments: console.log, spark_ui.png, spark_ui_ray.png
>
>
> Any operation on a dataframe created using SparkR::createDataFrame is very 
> slow.
> I have a CSV of size ~ 6MB. Below is the sample content
> 12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter
> 12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter
> 12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter
> 12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter
> 12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter
> 12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter
> 12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter
> 12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter
> 12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter
> 12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter
> I created R data.frame using r_df <- read.csv(file="r_df.csv", head=TRUE, 
> sep=","). And then converted into Spark dataframe using sp_df <- 
> createDataFrame(sqlContext, r_df)
> Now count(sp_df) took more than 30 seconds
> When I load the same CSV using spark-csv like, direct_df <- 
> read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = 
> "com.databricks.spark.csv", inferSchema = "false", header="true")
> count(direct_df) took below 1 sec.
> I know performance has been improved in createDataFrame in Spark 1.6. But 
> other operations, like count(), are very slow.
> How can I get rid of this performance issue? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14594) Improve error messages for RDD API

2016-04-19 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248392#comment-15248392
 ] 

Shivaram Venkataraman commented on SPARK-14594:
---

I think this would be very useful to have if we can get it in before the 2.0 
time frame.

cc [~felixcheung] [~olarayej] [~sunrui]

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Marco Gaido
>
> When you have an error in your R code using the RDD API, you always get the 
> following error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14325) some strange name conflicts in `group_by`

2016-04-19 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248374#comment-15248374
 ] 

Shivaram Venkataraman commented on SPARK-14325:
---

I think the problem might be related to `x` being the name of agg's 
GroupedData argument. Does your code work if you replace `x = "sum"` with some 
other name?
{code}
setMethod("agg",
  signature(x = "GroupedData"),
  function(x, ...) {
{code}
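
A hedged sketch of the workaround being suggested: rename the column so the 
named argument no longer collides with agg()'s formal parameter `x`:

{code}
df2 <- withColumnRenamed(df, "x", "x2")
agg(group_by(df2, df2$userId, df2$type, df2$j), x2 = "sum")
{code}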

> some strange name conflicts in `group_by`
> -
>
> Key: SPARK-14325
> URL: https://issues.apache.org/jira/browse/SPARK-14325
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0, 1.6.1
> Environment: sparkR 1.6.0
>Reporter: Dmitriy Selivanov
>
> group_by shows strange behaviour when trying to aggregate by a column named "x".
> consider following example
> {code}
> df
> # DataFrame[userId:bigint, type:string, j:int, x:int]
> df %>%group_by(df$userId, df$type, df$j) %>% agg(x = "sum")
> #Error in (function (classes, fdef, mtable)  : 
> #  unable to find an inherited method for function ‘agg’ for signature 
> ‘"character"’
> {code}
> after renaming x -> x2 it works just fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14692) Error While Setting the path for R front end

2016-04-19 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248363#comment-15248363
 ] 

Shivaram Venkataraman commented on SPARK-14692:
---

I think SparkR might not have been built? Is this from cloning the source code 
or from downloading a release? 

Also please note that JIRA issues are typically used to track development of 
features / bugs and this question is more suited to the spark users mailing 
list http://spark.apache.org/community.html
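
For reference, a hedged sketch of the usual setup from a source checkout 
(assumes the Spark source lives at /path/to/spark and that SparkR has been 
built first):

{code}
# 1. Build the SparkR package from the source checkout (in a shell):
#      cd /path/to/spark && ./R/install-dev.sh
# 2. Then, in R / RStudio:
Sys.setenv(SPARK_HOME = "/path/to/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "local")
{code}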

> Error While Setting the path for R front end
> 
>
> Key: SPARK-14692
> URL: https://issues.apache.org/jira/browse/SPARK-14692
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Mac OSX
>Reporter: Niranjan Molkeri`
>
> Trying to set Environment path for SparkR in RStudio. 
> Getting this bug. 
> > .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> > library(SparkR)
> Error in library(SparkR) : there is no package called ‘SparkR’
> > sc <- sparkR.init(master="local")
> Error: could not find function "sparkR.init"
> In the directory it points to, there is a directory called SparkR. I 
> don't know how to proceed with this.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14326) Can't specify "long" type in structField

2016-04-19 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248335#comment-15248335
 ] 

Shivaram Venkataraman commented on SPARK-14326:
---

I think it's fine to support this in StructField / the other schema specifications 
that we use to pass information to Scala. I don't think adding an int64 
dependency is necessarily something we should do as per the discussion in 
https://issues.apache.org/jira/browse/SPARK-12360 

> Can't specify "long" type in structField
> 
>
> Key: SPARK-14326
> URL: https://issues.apache.org/jira/browse/SPARK-14326
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Dmitriy Selivanov
>
> tried `long`, `bigint`, `LongType`, `Long`. Nothing works...
> {code}
> schema <- structType(structField("id", "long"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it

2016-03-24 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211394#comment-15211394
 ] 

Shivaram Venkataraman commented on SPARK-14147:
---

cc [~mengxr]

> SparkR - ML predictors return features with vector datatype, however SparkR 
> doesn't support it
> --
>
> Key: SPARK-14147
> URL: https://issues.apache.org/jira/browse/SPARK-14147
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>
> It seems that ML predictors in SparkR return an output which contains 
> features represented by vector datatype, however SparkR doesn't support it 
> and as a result features are being displayed as an environment variable.
> example: 
> prediction <- predict(model, training)
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double, features:vector, prediction:int]
> collect(prediction)
>   Sepal_Length Sepal_Width Petal_Length Petal_Width                   features prediction
> 1          5.1         3.5          1.4         0.2 <environment: 0x10b7a8870>          1
> 2          4.9         3.0          1.4         0.2 <environment: 0x10b79d498>          1
> 3          4.7         3.2          1.3         0.2 <environment: 0x10b7960a8>          1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14074) Do not use install_github in SparkR build

2016-03-22 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207866#comment-15207866
 ] 

Shivaram Venkataraman commented on SPARK-14074:
---

[~sunrui] Would you have a chance to check if the tag 0.3.1 is good enough for 
us ? If so we can switch to that

> Do not use install_github in SparkR build
> -
>
> Key: SPARK-14074
> URL: https://issues.apache.org/jira/browse/SPARK-14074
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> In dev/lint-r.R, `install_github` makes our builds depend on an unstable 
> source. We should use official releases on CRAN instead, even if the released 
> version has fewer features.
> cc: [~shivaram] [~sunrui]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14074) Do not use install_github in SparkR build

2016-03-22 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207152#comment-15207152
 ] 

Shivaram Venkataraman commented on SPARK-14074:
---

I think the latest CRAN release, back when we first started using lint-r, was very old. 
Right now it looks like 0.3.3 is the latest release of lintr -- that's going to 
be missing some of the latest commits from 
https://github.com/jimhester/lintr/commits/master and we'll need to see if we 
are using any of them.

Another approach, which might be better than the current `install_github`, is to 
use `install_github` with a specific tag. That way we don't need to rely on the 
package being updated on CRAN but can use the latest GitHub tag.
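
A hedged sketch of what pinning to a tag might look like in dev/lint-r.R (the 
exact tag name below is an assumption for illustration):

{code}
# Pin lintr to a fixed GitHub tag instead of tracking master
if (!requireNamespace("lintr", quietly = TRUE)) {
  devtools::install_github("jimhester/lintr@v0.3.3")
}
{code}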

> Do not use install_github in SparkR build
> -
>
> Key: SPARK-14074
> URL: https://issues.apache.org/jira/browse/SPARK-14074
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> In dev/lint-r.R, `install_github` makes our builds depend on an unstable 
> source. We should use official releases on CRAN instead, even if the released 
> version has fewer features.
> cc: [~shivaram] [~sunrui]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14006) Builds of 1.6 branch fail R style check

2016-03-18 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201894#comment-15201894
 ] 

Shivaram Venkataraman commented on SPARK-14006:
---

Ah we just updated our style check plugin and that seems to have affected 
things. If this is blocking anything we can comment out the R style checks at 
https://github.com/apache/spark/blob/master/dev/run-tests.py#L557 for now.

I'm at a conference and will try to get this fixed by later this evening

cc [~sunrui]

> Builds of 1.6 branch fail R style check
> ---
>
> Key: SPARK-14006
> URL: https://issues.apache.org/jira/browse/SPARK-14006
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Reporter: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-1.6-test-sbt-hadoop-2.2/152/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13812) Fix SparkR lint-r test errors

2016-03-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-13812.
---
   Resolution: Fixed
 Assignee: Sun Rui
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/11652

> Fix SparkR lint-r test errors
> -
>
> Key: SPARK-13812
> URL: https://issues.apache.org/jira/browse/SPARK-13812
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>Assignee: Sun Rui
> Fix For: 2.0.0
>
>
> After getting updated from GitHub, the lintr package can detect errors that are 
> not detected by previous versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13389) SparkR support first/last with ignore NAs

2016-03-10 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-13389.
---
   Resolution: Fixed
 Assignee: Yanbo Liang
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/11267

> SparkR support first/last with ignore NAs
> -
>
> Key: SPARK-13389
> URL: https://issues.apache.org/jira/browse/SPARK-13389
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>
> SparkR support first/last with ignore NAs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13327) colnames()<- allows invalid column names

2016-03-10 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-13327.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.6.2

Resolved by https://github.com/apache/spark/pull/11220

> colnames()<- allows invalid column names
> 
>
> Key: SPARK-13327
> URL: https://issues.apache.org/jira/browse/SPARK-13327
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>Assignee: Oscar D. Lara Yejas
> Fix For: 1.6.2, 2.0.0
>
>
> colnames<- fails if:
> 1) Given colnames contain .
> 2) Given colnames contain NA
> 3) Given colnames are not character
> 4) Given colnames have a different length than the dataset's (a SparkSQL error is 
> thrown but it is not user-friendly)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13327) colnames()<- allows invalid column names

2016-03-10 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-13327:
--
Assignee: Oscar D. Lara Yejas

> colnames()<- allows invalid column names
> 
>
> Key: SPARK-13327
> URL: https://issues.apache.org/jira/browse/SPARK-13327
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>Assignee: Oscar D. Lara Yejas
>
> colnames<- fails if:
> 1) Given colnames contain .
> 2) Given colnames contain NA
> 3) Given colnames are not character
> 4) Given colnames have a different length than the dataset's (a SparkSQL error 
> is thrown but it is not user friendly)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13262) cannot coerce type 'environment' to vector of type 'list'

2016-02-10 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141877#comment-15141877
 ] 

Shivaram Venkataraman commented on SPARK-13262:
---

Can you paste the code you ran that led to the error? It would be great if there 
is a small reproducible example.

> cannot coerce type 'environment' to vector of type 'list'
> -
>
> Key: SPARK-13262
> URL: https://issues.apache.org/jira/browse/SPARK-13262
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Samuel Alexander
>
> Occasionally getting the following error while constructing a DataFrame in 
> SparkR:
> 16/02/09 13:28:06 WARN RBackendHandler: cannot find matching method class 
> org.apache.spark.sql.api.r.SQLUtils.dfToCols. Candidates are:
> Error in as.vector(x, "list") : 
>   cannot coerce type 'environment' to vector of type 'list'
> Restarting SparkR fixed the error.
> What is the cause for this issue? How can we solve it? 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13256) Strange Error

2016-02-09 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-13256.
---
Resolution: Incomplete

Please use the Spark user mailing list / Stack Overflow to ask questions about 
your error. JIRA is used to track development issues in Spark.

> Strange Error
> -
>
> Key: SPARK-13256
> URL: https://issues.apache.org/jira/browse/SPARK-13256
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
> Environment: CDH 5.5.1
> Spark 1.6.0
>Reporter: 邱承
>Priority: Critical
>
> always get the following error:
> Error in if (returnStatus != 0) { : argument is of length zero



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13178) RRDD faces with concurrency issue in case of rdd.zip(rdd).count()

2016-02-03 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131436#comment-15131436
 ] 

Shivaram Venkataraman commented on SPARK-13178:
---

Hmm, this is tricky to debug -- a higher-level question: why do we need to 
implement this using RRDD and zip on it? The RRDD class is deprecated and 
going to go away soon. I thought the KMeans effort would only require wrapping 
the Scala algorithm?

> RRDD faces with concurrency issue in case of rdd.zip(rdd).count()
> -
>
> Key: SPARK-13178
> URL: https://issues.apache.org/jira/browse/SPARK-13178
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Xusen Yin
>
> In the KMeans algorithm, there is a zip operation before taking samples, i.e. 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L210,
>  which can be simplified to the following code:
> {code:title=zip.scala|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> val rdd =  ...
> val rdd2 = rdd.map(x => x)
> rdd.zip(rdd2).count()
> {code}
> However, RRDD fails on this operation with an error of "can only zip rdd with 
> same number of elements" or "stream closed", similar to the JIRA issue: 
> https://issues.apache.org/jira/browse/SPARK-2251
> Inside RRDD, a data stream is used to ingest data from the R side. In the zip 
> operation, zip with self computes each partition twice. So if we zip a 
> HadoopRDD (iris dataset) with itself, we get 
> {code:title=log-from-zip-HadoopRDD|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> we get a pair (6.8, 6.8)
> we get a pair (5.1, 5.1)
> we get a pair (6.7, 6.7)
> we get a pair (4.9, 4.9)
> we get a pair (6.0, 6.0)
> we get a pair (4.7, 4.7)
> we get a pair (5.7, 5.7)
> we get a pair (4.6, 4.6)
> we get a pair (5.5, 5.5)
> we get a pair (5.0, 5.0)
> we get a pair (5.5, 5.5)
> we get a pair (5.4, 5.4)
> we get a pair (5.8, 5.8)
> we get a pair (4.6, 4.6)
> we get a pair (6.0, 6.0)
> we get a pair (5.0, 5.0)
> we get a pair (5.4, 5.4)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair (4.9, 4.9)
> we get a pair (6.7, 6.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.6, 5.6)
> we get a pair (4.8, 4.8)
> we get a pair (5.5, 5.5)
> we get a pair (4.3, 4.3)
> we get a pair (5.5, 5.5)
> we get a pair (5.8, 5.8)
> we get a pair (6.1, 6.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.8, 5.8)
> we get a pair (5.4, 5.4)
> we get a pair (5.0, 5.0)
> we get a pair (5.1, 5.1)
> we get a pair (5.6, 5.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.2, 6.2)
> we get a pair (5.1, 5.1)
> we get a pair (5.1, 5.1)
> we get a pair (4.6, 4.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.8, 5.8)
> we get a pair (5.0, 5.0)
> we get a pair (7.1, 7.1)
> we get a pair (5.0, 5.0)
> we get a pair (6.3, 6.3)
> we get a pair (5.2, 5.2)
> we get a pair (6.5, 6.5)
> we get a pair (5.2, 5.2)
> we get a pair (7.6, 7.6)
> we get a pair (4.7, 4.7)
> we get a pair (4.9, 4.9)
> we get a pair (4.8, 4.8)
> we get a pair (7.3, 7.3)
> we get a pair (5.4, 5.4)
> we get a pair (6.7, 6.7)
> we get a pair (5.2, 5.2)
> we get a pair (7.2, 7.2)
> we get a pair (5.5, 5.5)
> we get a pair (6.5, 6.5)
> we get a pair (4.9, 4.9)
> we get a pair (6.4, 6.4)
> we get a pair (5.0, 5.0)
> we get a pair (6.8, 6.8)
> we get a pair (5.5, 5.5)
> we get a pair (5.7, 5.7)
> we get a pair (4.9, 4.9)
> we get a pair (5.8, 5.8)
> we get a pair (4.4, 4.4)
> we get a pair (6.4, 6.4)
> we get a pair (5.1, 5.1)
> we get a pair (6.5, 6.5)
> we get a pair (5.0, 5.0)
> we get a pair (7.7, 7.7)
> we get a pair (4.5, 4.5)
> we get a pair (7.7, 7.7)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair (5.0, 5.0)
> we get a pair (6.9, 6.9)
> we get a pair (5.1, 5.1)
> we get a pair (5.6, 5.6)
> we get a pair (4.8, 4.8)
> we get a pair (7.7, 7.7)
> we get a pair (6.3, 6.3)
> we get a pair (5.1, 5.1)
> we get a pair (6.7, 6.7)
> we get a pair (4.6, 4.6)
> we get a pair (7.2, 7.2)
> we get a pair (5.3, 5.3)
> we get a pair (6.2, 6.2)
> we get a pair (5.0, 5.0)
> we get a pair (6.1, 6.1)
> we get a pair (7.0, 7.0)
> we get a pair (6.4, 6.4)
> we get a pair (6.4, 6.4)
> we get a pair (7.2, 7.2)
> we get a pair (6.9, 6.9)
> we get a pair (7.4, 7.4)
> we get a pair (5.5, 5.5)
> we get a pair (7.9, 7.9)
> we get a pair (6.5, 6.5)
> we get a pair (6.4, 6.4)
> we get a pair (5.7, 5.7)
> we get a pair (6.3, 6.3)
> we get a pair (6.3, 6.3)
> we get a pair (6.1, 6.1)
> we get a pair (4.9, 4.9)
> we 

[jira] [Commented] (SPARK-13178) RRDD faces with concurrency issue in case of rdd.zip(rdd).count()

2016-02-03 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131556#comment-15131556
 ] 

Shivaram Venkataraman commented on SPARK-13178:
---

Ah, I see - so the problem is that createDataFrame is returning this RRDD, which 
on zipping leads to the problem. Is the problem just about closing the stream 
twice? If that is the case we should probably fix that. 

cc [~sunrui] 

> RRDD faces with concurrency issue in case of rdd.zip(rdd).count()
> -
>
> Key: SPARK-13178
> URL: https://issues.apache.org/jira/browse/SPARK-13178
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Xusen Yin
>
> In the KMeans algorithm, there is a zip operation before taking samples, i.e. 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L210,
>  which can be simplified to the following code:
> {code:title=zip.scala|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> val rdd =  ...
> val rdd2 = rdd.map(x => x)
> rdd.zip(rdd2).count()
> {code}
> However, RRDD fails on this operation with an error of "can only zip rdd with 
> same number of elements" or "stream closed", similar to the JIRA issue: 
> https://issues.apache.org/jira/browse/SPARK-2251
> Inside RRDD, a data stream is used to ingest data from the R side. In the zip 
> operation, zip with self computes each partition twice. So if we zip a 
> HadoopRDD (iris dataset) with itself, we get 
> {code:title=log-from-zip-HadoopRDD|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> we get a pair (6.8, 6.8)
> we get a pair (5.1, 5.1)
> we get a pair (6.7, 6.7)
> we get a pair (4.9, 4.9)
> we get a pair (6.0, 6.0)
> we get a pair (4.7, 4.7)
> we get a pair (5.7, 5.7)
> we get a pair (4.6, 4.6)
> we get a pair (5.5, 5.5)
> we get a pair (5.0, 5.0)
> we get a pair (5.5, 5.5)
> we get a pair (5.4, 5.4)
> we get a pair (5.8, 5.8)
> we get a pair (4.6, 4.6)
> we get a pair (6.0, 6.0)
> we get a pair (5.0, 5.0)
> we get a pair (5.4, 5.4)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair (4.9, 4.9)
> we get a pair (6.7, 6.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.6, 5.6)
> we get a pair (4.8, 4.8)
> we get a pair (5.5, 5.5)
> we get a pair (4.3, 4.3)
> we get a pair (5.5, 5.5)
> we get a pair (5.8, 5.8)
> we get a pair (6.1, 6.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.8, 5.8)
> we get a pair (5.4, 5.4)
> we get a pair (5.0, 5.0)
> we get a pair (5.1, 5.1)
> we get a pair (5.6, 5.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.2, 6.2)
> we get a pair (5.1, 5.1)
> we get a pair (5.1, 5.1)
> we get a pair (4.6, 4.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.8, 5.8)
> we get a pair (5.0, 5.0)
> we get a pair (7.1, 7.1)
> we get a pair (5.0, 5.0)
> we get a pair (6.3, 6.3)
> we get a pair (5.2, 5.2)
> we get a pair (6.5, 6.5)
> we get a pair (5.2, 5.2)
> we get a pair (7.6, 7.6)
> we get a pair (4.7, 4.7)
> we get a pair (4.9, 4.9)
> we get a pair (4.8, 4.8)
> we get a pair (7.3, 7.3)
> we get a pair (5.4, 5.4)
> we get a pair (6.7, 6.7)
> we get a pair (5.2, 5.2)
> we get a pair (7.2, 7.2)
> we get a pair (5.5, 5.5)
> we get a pair (6.5, 6.5)
> we get a pair (4.9, 4.9)
> we get a pair (6.4, 6.4)
> we get a pair (5.0, 5.0)
> we get a pair (6.8, 6.8)
> we get a pair (5.5, 5.5)
> we get a pair (5.7, 5.7)
> we get a pair (4.9, 4.9)
> we get a pair (5.8, 5.8)
> we get a pair (4.4, 4.4)
> we get a pair (6.4, 6.4)
> we get a pair (5.1, 5.1)
> we get a pair (6.5, 6.5)
> we get a pair (5.0, 5.0)
> we get a pair (7.7, 7.7)
> we get a pair (4.5, 4.5)
> we get a pair (7.7, 7.7)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair (5.0, 5.0)
> we get a pair (6.9, 6.9)
> we get a pair (5.1, 5.1)
> we get a pair (5.6, 5.6)
> we get a pair (4.8, 4.8)
> we get a pair (7.7, 7.7)
> we get a pair (6.3, 6.3)
> we get a pair (5.1, 5.1)
> we get a pair (6.7, 6.7)
> we get a pair (4.6, 4.6)
> we get a pair (7.2, 7.2)
> we get a pair (5.3, 5.3)
> we get a pair (6.2, 6.2)
> we get a pair (5.0, 5.0)
> we get a pair (6.1, 6.1)
> we get a pair (7.0, 7.0)
> we get a pair (6.4, 6.4)
> we get a pair (6.4, 6.4)
> we get a pair (7.2, 7.2)
> we get a pair (6.9, 6.9)
> we get a pair (7.4, 7.4)
> we get a pair (5.5, 5.5)
> we get a pair (7.9, 7.9)
> we get a pair (6.5, 6.5)
> we get a pair (6.4, 6.4)
> we get a pair (5.7, 5.7)
> we get a pair (6.3, 6.3)
> we get a pair (6.3, 6.3)
> we get a pair (6.1, 6.1)
> we get a pair (4.9, 4.9)
> we get a pair (7.7, 7.7)

[jira] [Updated] (SPARK-12629) SparkR: DataFrame's saveAsTable method has issues with the signature and HiveContext

2016-02-02 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12629:
--
Fix Version/s: 1.6.1

> SparkR: DataFrame's saveAsTable method has issues with the signature and 
> HiveContext 
> -
>
> Key: SPARK-12629
> URL: https://issues.apache.org/jira/browse/SPARK-12629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Assignee: Narine Kokhlikyan
> Fix For: 1.6.1, 2.0.0
>
>
> There are several issues with the DataFrame's saveAsTable method in SparkR. 
> Here is a summary of some of them. Hope this will help to fix the issues.
> 1. According to SparkR's saveAsTable(...) documentation, we should be able to call 
> "saveAsTable(df, "myfile")" to store the DataFrame. However, this signature does 
> not work: it seems that "source" and "mode" are required by the signature.
> 2. Within saveAsTable(...) the method retrieves the SQL context and tries to 
> create/initialize the source as parquet, but this also fails because, judging 
> from the error messages, the context has to be a HiveContext.
> 3. In general the method fails when I try to call it with sqlContext.
> 4. Also, SQL's DataFrame.saveAsTable is deprecated; we could use 
> df.write.saveAsTable(...) instead ...
> [~shivaram] [~sunrui] [~felixcheung]
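
For reference, a hedged sketch of the call shapes being discussed; the table name, source, and mode values are illustrative and assume a HiveContext-backed SparkR session:

{code:r}
# Two-argument form the documentation suggests, which the reporter says fails:
# saveAsTable(df, "myfile")

# Fully specified form matching the current signature:
saveAsTable(df, tableName = "myfile", source = "parquet", mode = "overwrite")

# Writing through the data source API instead of a managed table:
write.df(df, path = "myfile", source = "parquet", mode = "overwrite")
{code}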



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13119) SparkR Ser/De fail to handle "columns(df)"

2016-02-01 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127093#comment-15127093
 ] 

Shivaram Venkataraman commented on SPARK-13119:
---

cc [~sunrui] Thanks [~yinxusen] -- I'll try to take a look at this later today

> SparkR Ser/De fail to handle "columns(df)"
> --
>
> Key: SPARK-13119
> URL: https://issues.apache.org/jira/browse/SPARK-13119
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Xusen Yin
>
> When I want to extract the names of the columns of a DataFrame for 
> https://issues.apache.org/jira/browse/SPARK-13011, the Ser/De of SparkR fails 
> to handle the column names, as illustrated in the test code below:
> {code:title=test_Serde.R|theme=FadeToGrey|linenumbers=true|language=R|firstline=0001|collapse=false}
> test_that("SerDe of primitive types", {
>   sqlContext <- sparkRSQL.init(sc)
>   df <- suppressWarnings(createDataFrame(sqlContext, iris))
>   names <- columns(df)
>   x <- callJStatic("SparkRHandler", "echo", names)
>   expect_equal(x, names)
>   expect_equal(class(x), class(names))
> })
> {code}
> We can get the following error:
> {code:title=stack-trace|theme=FadeToGrey|linenumbers=true|language=R|firstline=0001|collapse=false}
> 1. Error: SerDe of primitive types 
> -
> (converted from warning) the condition has length > 1 and only the first 
> element will be used
> 1: withCallingHandlers(eval(code, new_test_environment), error = 
> capture_calls, message = function(c) invokeRestart("muffleMessage"))
> 2: eval(code, new_test_environment)
> 3: eval(expr, envir, enclos)
> 4: callJStatic("SparkRHandler", "echo", names) at test_Serde.R:42
> 5: invokeJava(isStatic = TRUE, className, methodName, ...)
> 6: writeArgs(rc, args)
> 7: writeObject(con, a)
> 8: .signalSimpleWarning("the condition has length > 1 and only the first 
> element will be used",
>quote(if (is.na(object)) {
>object <- NULL
>type <- "NULL"
>}))
> 9: withRestarts({
>.Internal(.signalCondition(simpleWarning(msg, call), msg, call))
>.Internal(.dfltWarn(msg, call))
>}, muffleWarning = function() NULL)
> 10: withOneRestart(expr, restarts[[1L]])
> 11: doWithOneRestart(return(expr), restart) 
> {code}
> It occurs because the result of "class(columns(df))" is "character". Ser/De 
> uses the result to check the type of the object and to select among ser/de 
> methods. However, "columns(df)" is not a length-one "character" value (it is a 
> character vector), so the ser/de fails.
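
A minimal sketch of the kind of guard that avoids the warning: only treat a value as missing when it is a length-one vector. The function names below are hypothetical stand-ins for SparkR's internal writeObject dispatch, not its actual code:

{code:r}
# Treat only length-one atomic NA values as nulls; a character vector such as
# columns(df) (length > 1) falls through to the normal type dispatch.
isScalarNA <- function(object) {
  is.atomic(object) && length(object) == 1 && is.na(object)
}

writeObjectSketch <- function(con, object) {
  if (is.null(object) || isScalarNA(object)) {
    type <- "NULL"            # serialized as a null on the JVM side
  } else {
    type <- class(object)[1]  # e.g. "character" for columns(df)
  }
  # ... the real implementation would now write the type tag and value to con ...
  invisible(type)
}
{code}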



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5331) Spark workers can't find tachyon master as spark-ec2 doesn't set spark.tachyonStore.url

2016-01-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-5331.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Fixed by https://github.com/mesos/spark-ec2/pull/125

> Spark workers can't find tachyon master as spark-ec2 doesn't set 
> spark.tachyonStore.url
> ---
>
> Key: SPARK-5331
> URL: https://issues.apache.org/jira/browse/SPARK-5331
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
> Environment: Running on EC2 via modified spark-ec2 scripts (to get 
> dependencies right so tachyon starts)
> Using tachyon 0.5.0 built against hadoop 2.4.1
> Spark 1.2.0 built against tachyon 0.5.0 and hadoop 0.4.1
> Tachyon configured using the template in 0.5.0 but updated with slave list 
> and master variables etc..
>Reporter: Florian Verhein
> Fix For: 1.4.0
>
>
> ps -ef | grep Tachyon 
> shows Tachyon running on the master (and the slave) node with correct setting:
> -Dtachyon.master.hostname=ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com
> However from stderr log on worker running the SparkTachyonPi example:
> 15/01/20 06:00:56 INFO CacheManager: Partition rdd_0_0 not found, computing it
> 15/01/20 06:00:56 INFO : Trying to connect master @ localhost/127.0.0.1:19998
> 15/01/20 06:00:56 ERROR : Failed to connect (1) to master 
> localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
> 15/01/20 06:00:57 ERROR : Failed to connect (2) to master 
> localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
> 15/01/20 06:00:58 ERROR : Failed to connect (3) to master 
> localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
> 15/01/20 06:00:59 ERROR : Failed to connect (4) to master 
> localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
> 15/01/20 06:01:00 ERROR : Failed to connect (5) to master 
> localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
> 15/01/20 06:01:01 WARN TachyonBlockManager: Attempt 1 to create tachyon dir 
> null failed
> java.io.IOException: Failed to connect to master localhost/127.0.0.1:19998 
> after 5 attempts
>   at tachyon.client.TachyonFS.connect(TachyonFS.java:293)
>   at tachyon.client.TachyonFS.getFileId(TachyonFS.java:1011)
>   at tachyon.client.TachyonFS.exist(TachyonFS.java:633)
>   at 
> org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:117)
>   at 
> org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:106)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.storage.TachyonBlockManager.createTachyonDirs(TachyonBlockManager.scala:106)
>   at 
> org.apache.spark.storage.TachyonBlockManager.<init>(TachyonBlockManager.scala:57)
>   at 
> org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:94)
>   at 
> org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:88)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:773)
>   at 
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
>   at 
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: tachyon.org.apache.thrift.TException: Failed to connect to master 
> localhost/127.0.0.1:19998 after 5 attempts
>   at tachyon.master.MasterClient.connect(MasterClient.java:178)
>  

[jira] [Resolved] (SPARK-10462) spark-ec2 not creating ephemeral volumes

2016-01-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10462.
---
Resolution: Won't Fix

According to http://www.ec2instances.info/?filter=c4.2xlarge these machines 
don't have any ephemeral disks so there is nothing to mount. Closing this here, 
and we can open a new issue on amplab/spark-ec2 if required

> spark-ec2 not creating ephemeral volumes
> 
>
> Key: SPARK-10462
> URL: https://issues.apache.org/jira/browse/SPARK-10462
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.5.0
>Reporter: Joseph E. Gonzalez
>
> When trying to launch an ec2 cluster with the following:
> ```
> ./ec2/spark-ec2 -r us-west-2 -k mykey -i mykey.pem \
>   --hadoop-major-version=yarn \
>   --spot-price=1.0 \
>   -t c4.2xlarge -s 2 \
>   launch test-dato-yarn
> ```
> None of the nodes had an ephemeral volume and the /mnt was mounted to the 
> root 8G file-system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9494) 'spark-ec2 launch' fails with anaconda python 3.4

2016-01-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-9494.
--
   Resolution: Duplicate
Fix Version/s: 1.6.0

> 'spark-ec2 launch' fails with anaconda python 3.4
> -
>
> Key: SPARK-9494
> URL: https://issues.apache.org/jira/browse/SPARK-9494
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.4.1
> Environment: OSX, Anaconda, Python 3.4
>Reporter: Stuart Owen
>Priority: Minor
> Fix For: 1.6.0
>
>
> Command I used to launch:
> {code:none}
> $SPARK_HOME/ec2/spark-ec2 \
> -k spark \
> -i ~/keys/spark.pem \
> -s $NUM_SLAVES \
> --copy-aws-credentials \
> --region=us-east-1 \
> --instance-type=m3.2xlarge \
> --spot-price=0.1 \
> launch $CLUSTER_NAME
> {code}
> Execution log:
> {code:none}
> /Users/stuart/Applications/anaconda/lib/python3.4/imp.py:32: 
> PendingDeprecationWarning: the imp module is deprecated in favour of 
> importlib; see the module's documentation for alternative uses
>   PendingDeprecationWarning)
> Setting up security groups...
> Searching for existing cluster july-event-fix in region us-east-1...
> Spark AMI: ami-35b1885c
> Launching instances...
> Traceback (most recent call last):
>   File 
> "/Users/stuart/Applications/spark-1.4.0-bin-hadoop2.6/ec2/spark_ec2.py", line 
> 1455, in <module>
> main()
>   File 
> "/Users/stuart/Applications/spark-1.4.0-bin-hadoop2.6/ec2/spark_ec2.py", line 
> 1447, in main
> real_main()
>   File 
> "/Users/stuart/Applications/spark-1.4.0-bin-hadoop2.6/ec2/spark_ec2.py", line 
> 1276, in real_main
> (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
>   File 
> "/Users/stuart/Applications/spark-1.4.0-bin-hadoop2.6/ec2/spark_ec2.py", line 
> 566, in launch_cluster
> name = '/dev/sd' + string.letters[i + 1]
> AttributeError: 'module' object has no attribute 'letters'
> /Users/stuart/Applications/anaconda/lib/python3.4/imp.py:32: 
> PendingDeprecationWarning: the imp module is deprecated in favour of 
> importlib; see the module's documentation for alternative uses
>   PendingDeprecationWarning)
> ERROR: Could not find a master for cluster july-event-fix in region us-east-1.
> sys:1: ResourceWarning: unclosed <socket.socket family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, 
> laddr=('192.168.1.2', 55678), raddr=('207.171.162.181', 443)>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8980) Setup cluster with spark-ec2 scripts as non-root user

2016-01-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-8980.
--
Resolution: Won't Fix

Now tracked by https://github.com/amplab/spark-ec2/issues/1

> Setup cluster with spark-ec2 scripts as non-root user
> -
>
> Key: SPARK-8980
> URL: https://issues.apache.org/jira/browse/SPARK-8980
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 1.4.0
>Reporter: Mathieu D
>Priority: Minor
>
> The spark-ec2 script installs everything as root, which is not a best practice.
> Suggestion: use a sudoer instead (ec2-user, which is available in the AMI, is one).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster

2016-01-29 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124613#comment-15124613
 ] 

Shivaram Venkataraman commented on SPARK-5629:
--

I manually cleaned up some of the issues and pointed them to open issues on 
spark-ec2. For some of the others we should just ping the issue and see 
if it is still relevant. Finally, I think some of the S3 reading issues 
aren't spark-ec2 issues but more an issue with jets3t etc. 

> Add spark-ec2 action to return info about an existing cluster
> -
>
> Key: SPARK-5629
> URL: https://issues.apache.org/jira/browse/SPARK-5629
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> You can launch multiple clusters using spark-ec2. At some point, you might 
> just want to get some information about an existing cluster.
> Use cases include:
> * Wanting to check something about your cluster in the EC2 web console.
> * Wanting to feed information about your cluster to another tool (e.g. as 
> described in [SPARK-5627]).
> So, in addition to the [existing 
> actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]:
> * {{launch}}
> * {{destroy}}
> * {{login}}
> * {{stop}}
> * {{start}}
> * {{get-master}}
> * {{reboot-slaves}}
> We add a new action, {{describe}}, which describes an existing cluster if 
> given a cluster name, and all clusters if not.
> Some examples:
> {code}
> # describes all clusters launched by spark-ec2
> spark-ec2 describe
> {code}
> {code}
> # describes cluster-1
> spark-ec2 describe cluster-1
> {code}
> In combination with the proposal in [SPARK-5627]:
> {code}
> # describes cluster-3 in a machine-readable way (e.g. JSON)
> spark-ec2 describe cluster-3 --machine-readable
> {code}
> Parallels in similar tools include:
> * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju
> * [{{starcluster 
> listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node]
>  from MIT StarCluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9688) Improve spark-ec2 script to handle users that are not root

2016-01-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-9688.
--
Resolution: Won't Fix

Now tracked at https://github.com/amplab/spark-ec2/issues/1

> Improve spark-ec2 script to handle users that are not root
> --
>
> Key: SPARK-9688
> URL: https://issues.apache.org/jira/browse/SPARK-9688
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 1.4.0, 1.4.1
> Environment: All
>Reporter: Karina Uribe
>  Labels: EC2, aws-ec2, security
>   Original Estimate: 252h
>  Remaining Estimate: 252h
>
> Hi, 
> I was trying to use the spark-ec2 script from Spark to create a new Spark 
> cluster with a user different than root (--user=ec2-user). Unfortunately the 
> part of the script that attempts to copy the templates into the target 
> machines fail because it tries to rsync /etc/* and /root/* 
> This is the full traceback
> rsync: recv_generator: mkdir "/root/spark-ec2" failed: Permission denied (13)
> *** Skipping any contents from this failed directory ***
> sent 95 bytes  received 17 bytes  224.00 bytes/sec
> total size is 1444  speedup is 12.89
> rsync error: some files/attrs were not transferred (see previous errors) 
> (code 23) at main.c(1039) [sender=3.0.6]
> Traceback (most recent call last):
>   File "/home/ec2-user/spark-1.4.0/ec2/spark_ec2.py", line 1455, in 
> main()
>   File "/home/ec2-user/spark-1.4.0/ec2/spark_ec2.py", line 1447, in main
> real_main()
>   File "/home/ec2-user/spark-1.4.0/ec2/spark_ec2.py", line 1283, in real_main
> setup_cluster(conn, master_nodes, slave_nodes, opts, True)
>   File "/home/ec2-user/spark-1.4.0/ec2/spark_ec2.py", line 785, in 
> setup_cluster
> modules=modules
>   File "/home/ec2-user/spark-1.4.0/ec2/spark_ec2.py", line 1049, in 
> deploy_files
> subprocess.check_call(command)
>   File "/usr/lib64/python2.7/subprocess.py", line 540, in check_call
> raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '['rsync', '-rv', '-e', 'ssh -o 
> StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i 
> /home/ec2-user/.ssh/sparkclusterkey_us_east.pem', 
> '/tmp/tmpT4Iw54/', u'ec2-u...@ec2-52-2-96-193.compute-1.amazonaws.com:/']' 
> returned non-zero exit status 23
> Is there a workaround for this? I want to improve security of our operations 
> by avoiding user root on the instances. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12526) `ifelse`, `when`, `otherwise` unable to take Column as value

2016-01-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122631#comment-15122631
 ] 

Shivaram Venkataraman commented on SPARK-12526:
---

Yeah, this fix was checked in after 1.6.0 was cut -- so it's a known bug in 
1.6.0. I think we might have a 1.6.1 soon and this should be a part of that.

> `ifelse`, `when`, `otherwise` unable to take Column as value
> 
>
> Key: SPARK-12526
> URL: https://issues.apache.org/jira/browse/SPARK-12526
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Sen Fang
>Assignee: Sen Fang
> Fix For: 1.6.1, 2.0.0
>
>
> When passing a Column to {{ifelse}}, {{when}}, {{otherwise}}, it will error 
> out with
> {code}
> attempt to replicate an object of type 'environment'
> {code}
> The problem lies in the use of the base R {{ifelse}} function, which is a 
> vectorized version of the {{if ... else ...}} idiom, but is unable to 
> replicate a Column's underlying Java object reference, as it is an environment.
> Considering {{callJMethod}} was never designed to be vectorized, the safe 
> option is to replace {{ifelse}} with {{if ... else ...}} instead (a minimal 
> sketch follows after the examples below). However, technically this is 
> inconsistent with base R's ifelse, which is meant to be vectorized.
> I can send a PR for review first and discuss further if there is any scenario 
> at all in which `ifelse`, `when`, `otherwise` would be used in a vectorized way.
> A dummy example is:
> {code}
> ifelse(lit(1) == lit(1), lit(2), lit(3))
> {code}
> A concrete example might be:
> {code}
> ifelse(df$mpg > 0, df$mpg, 0)
> {code}
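
A minimal sketch of the idea behind the fix, not SparkR's actual implementation: a Column carries a Java object reference stored as an environment, which base R's vectorized {{ifelse}} tries to replicate and fails on, while plain {{if ... else ...}} does not:

{code:r}
# Hypothetical helper that unwraps a Column argument before it is handed to the
# JVM; note the scalar if/else -- base ifelse() would try to replicate x@jc,
# which is an environment, and error out.
unwrapIfColumn <- function(x) {
  if (inherits(x, "Column")) x@jc else x
}
{code}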



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12903) Add covar_samp and covar_pop for SparkR

2016-01-26 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12903.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10829
[https://github.com/apache/spark/pull/10829]

> Add covar_samp and covar_pop for SparkR
> ---
>
> Key: SPARK-12903
> URL: https://issues.apache.org/jira/browse/SPARK-12903
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Yanbo Liang
> Fix For: 2.0.0
>
>
> Add covar_samp and covar_pop for SparkR
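
A hedged usage sketch of the two new functions from the R side; the use of mtcars and the sqlContext-based session are illustrative:

{code:r}
# Assumes a SparkR session with sqlContext available.
df <- createDataFrame(sqlContext, mtcars)
head(agg(df,
         covar_samp(df$mpg, df$wt),   # sample covariance
         covar_pop(df$mpg, df$wt)))   # population covariance
{code}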



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12903) Add covar_samp and covar_pop for SparkR

2016-01-26 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12903:
--
Assignee: Yanbo Liang

> Add covar_samp and covar_pop for SparkR
> ---
>
> Key: SPARK-12903
> URL: https://issues.apache.org/jira/browse/SPARK-12903
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>
> Add covar_samp and covar_pop for SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12629) SparkR: DataFrame's saveAsTable method has issues with the signature and HiveContext

2016-01-22 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12629:
--
Assignee: Narine Kokhlikyan

> SparkR: DataFrame's saveAsTable method has issues with the signature and 
> HiveContext 
> -
>
> Key: SPARK-12629
> URL: https://issues.apache.org/jira/browse/SPARK-12629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Assignee: Narine Kokhlikyan
>
> There are several issues with the DataFrame's saveAsTable method in SparkR. 
> Here is a summary of some of them. Hope this will help to fix the issues.
> 1. According to SparkR's saveAsTable(...) documentation, we should be able to call 
> "saveAsTable(df, "myfile")" to store the DataFrame. However, this signature does 
> not work: it seems that "source" and "mode" are required by the signature.
> 2. Within saveAsTable(...) the method retrieves the SQL context and tries to 
> create/initialize the source as parquet, but this also fails because, judging 
> from the error messages, the context has to be a HiveContext.
> 3. In general the method fails when I try to call it with sqlContext.
> 4. Also, SQL's DataFrame.saveAsTable is deprecated; we could use 
> df.write.saveAsTable(...) instead ...
> [~shivaram] [~sunrui] [~felixcheung]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12629) SparkR: DataFrame's saveAsTable method has issues with the signature and HiveContext

2016-01-22 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12629.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by https://github.com/apache/spark/pull/10580

> SparkR: DataFrame's saveAsTable method has issues with the signature and 
> HiveContext 
> -
>
> Key: SPARK-12629
> URL: https://issues.apache.org/jira/browse/SPARK-12629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Assignee: Narine Kokhlikyan
> Fix For: 2.0.0
>
>
> There are several issues with the DataFrame's saveAsTable method in SparkR. 
> Here is a summary of some of them. Hope this will help to fix the issues.
> 1. According to SparkR's saveAsTable(...) documentation, we should be able to call 
> "saveAsTable(df, "myfile")" to store the DataFrame. However, this signature does 
> not work: it seems that "source" and "mode" are required by the signature.
> 2. Within saveAsTable(...) the method retrieves the SQL context and tries to 
> create/initialize the source as parquet, but this also fails because, judging 
> from the error messages, the context has to be a HiveContext.
> 3. In general the method fails when I try to call it with sqlContext.
> 4. Also, SQL's DataFrame.saveAsTable is deprecated; we could use 
> df.write.saveAsTable(...) instead ...
> [~shivaram] [~sunrui] [~felixcheung]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster

2016-01-21 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111056#comment-15111056
 ] 

Shivaram Venkataraman commented on SPARK-5629:
--

Yes - though I think it's beneficial to see if the ticket is still valid and, if 
it is, we can open a corresponding issue at github.com/amplab/spark-ec2. Then 
we can leave a marker here saying where this issue is being followed up.

> Add spark-ec2 action to return info about an existing cluster
> -
>
> Key: SPARK-5629
> URL: https://issues.apache.org/jira/browse/SPARK-5629
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> You can launch multiple clusters using spark-ec2. At some point, you might 
> just want to get some information about an existing cluster.
> Use cases include:
> * Wanting to check something about your cluster in the EC2 web console.
> * Wanting to feed information about your cluster to another tool (e.g. as 
> described in [SPARK-5627]).
> So, in addition to the [existing 
> actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]:
> * {{launch}}
> * {{destroy}}
> * {{login}}
> * {{stop}}
> * {{start}}
> * {{get-master}}
> * {{reboot-slaves}}
> We add a new action, {{describe}}, which describes an existing cluster if 
> given a cluster name, and all clusters if not.
> Some examples:
> {code}
> # describes all clusters launched by spark-ec2
> spark-ec2 describe
> {code}
> {code}
> # describes cluster-1
> spark-ec2 describe cluster-1
> {code}
> In combination with the proposal in [SPARK-5627]:
> {code}
> # describes cluster-3 in a machine-readable way (e.g. JSON)
> spark-ec2 describe cluster-3 --machine-readable
> {code}
> Parallels in similar tools include:
> * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju
> * [{{starcluster 
> listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node]
>  from MIT StarCluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12204) Implement drop method for DataFrame in SparkR

2016-01-20 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12204.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10201
[https://github.com/apache/spark/pull/10201]

> Implement drop method for DataFrame in SparkR
> -
>
> Key: SPARK-12204
> URL: https://issues.apache.org/jira/browse/SPARK-12204
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12204) Implement drop method for DataFrame in SparkR

2016-01-20 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12204:
--
Assignee: Sun Rui

> Implement drop method for DataFrame in SparkR
> -
>
> Key: SPARK-12204
> URL: https://issues.apache.org/jira/browse/SPARK-12204
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>Assignee: Sun Rui
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12910) Support for specifying version of R to use while creating sparkR libraries

2016-01-20 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12910:
--
Assignee: Shubhanshu Mishra

> Support for specifying version of R to use while creating sparkR libraries
> --
>
> Key: SPARK-12910
> URL: https://issues.apache.org/jira/browse/SPARK-12910
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
> Environment: Linux
>Reporter: Shubhanshu Mishra
>Assignee: Shubhanshu Mishra
>Priority: Minor
>  Labels: installation, sparkR
> Fix For: 2.0.0
>
>
> When we use `$SPARK_HOME/R/install-dev.sh` it uses the default system R. 
> However, a user might have locally installed their own version of R. There 
> should be a way to specify which R version to use. 
> I have fixed this in my code using the following patch:
> {code:bash}
> $ git diff HEAD
> diff --git a/R/README.md b/R/README.md
> index 005f56d..99182e5 100644
> --- a/R/README.md
> +++ b/R/README.md
> @@ -1,6 +1,15 @@
>  # R on Spark
>  
>  SparkR is an R package that provides a light-weight frontend to use Spark 
> from R.
> +### Installing sparkR
> +
> +Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be 
> done by running the script `$SPARK_HOME/R/install-dev.sh`.
> +By default the above script uses the system wide installation of R. However, 
> this can be changed to any user installed location of R by giving the full 
> path of the `$R_HOME` as the first argument to the install-dev.sh script.
> +Example: 
> +```
> +# where /home/username/R is where R is installed and /home/username/R/bin 
> contains the files R and RScript
> +./install-dev.sh /home/username/R 
> +```
>  
>  ### SparkR development
>  
> diff --git a/R/install-dev.sh b/R/install-dev.sh
> index 4972bb9..a8efa86 100755
> --- a/R/install-dev.sh
> +++ b/R/install-dev.sh
> @@ -35,12 +35,19 @@ LIB_DIR="$FWDIR/lib"
>  mkdir -p $LIB_DIR
>  
>  pushd $FWDIR > /dev/null
> +if [ ! -z "$1" ]
> +  then
> +R_HOME="$1/bin"
> +   else
> +R_HOME="$(dirname $(which R))"
> +fi
> +echo "USING R_HOME = $R_HOME"
>  
>  # Generate Rd files if devtools is installed
> -Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
> +"$R_HOME/"Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
>  
>  # Install SparkR to $LIB_DIR
> -R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
> +"$R_HOME/"R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
>  
>  # Zip the SparkR package so that it can be distributed to worker nodes on 
> YARN
>  cd $LIB_DIR
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12910) Support for specifying version of R to use while creating sparkR libraries

2016-01-20 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12910:
--
Shepherd:   (was: Shivram Mani)

> Support for specifying version of R to use while creating sparkR libraries
> --
>
> Key: SPARK-12910
> URL: https://issues.apache.org/jira/browse/SPARK-12910
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
> Environment: Linux
>Reporter: Shubhanshu Mishra
>Priority: Minor
>  Labels: installation, sparkR
> Fix For: 2.0.0
>
>
> When we use `$SPARK_HOME/R/install-dev.sh` it uses the default system R. 
> However, a user might have locally installed their own version of R. There 
> should be a way to specify which R version to use. 
> I have fixed this in my code using the following patch:
> {code:bash}
> $ git diff HEAD
> diff --git a/R/README.md b/R/README.md
> index 005f56d..99182e5 100644
> --- a/R/README.md
> +++ b/R/README.md
> @@ -1,6 +1,15 @@
>  # R on Spark
>  
>  SparkR is an R package that provides a light-weight frontend to use Spark 
> from R.
> +### Installing sparkR
> +
> +Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be 
> done by running the script `$SPARK_HOME/R/install-dev.sh`.
> +By default the above script uses the system wide installation of R. However, 
> this can be changed to any user installed location of R by giving the full 
> path of the `$R_HOME` as the first argument to the install-dev.sh script.
> +Example: 
> +```
> +# where /home/username/R is where R is installed and /home/username/R/bin 
> contains the files R and RScript
> +./install-dev.sh /home/username/R 
> +```
>  
>  ### SparkR development
>  
> diff --git a/R/install-dev.sh b/R/install-dev.sh
> index 4972bb9..a8efa86 100755
> --- a/R/install-dev.sh
> +++ b/R/install-dev.sh
> @@ -35,12 +35,19 @@ LIB_DIR="$FWDIR/lib"
>  mkdir -p $LIB_DIR
>  
>  pushd $FWDIR > /dev/null
> +if [ ! -z "$1" ]
> +  then
> +R_HOME="$1/bin"
> +   else
> +R_HOME="$(dirname $(which R))"
> +fi
> +echo "USING R_HOME = $R_HOME"
>  
>  # Generate Rd files if devtools is installed
> -Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
> +"$R_HOME/"Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
>  
>  # Install SparkR to $LIB_DIR
> -R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
> +"$R_HOME/"R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
>  
>  # Zip the SparkR package so that it can be distributed to worker nodes on 
> YARN
>  cd $LIB_DIR
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12910) Support for specifying version of R to use while creating sparkR libraries

2016-01-20 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12910.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10836
[https://github.com/apache/spark/pull/10836]

> Support for specifying version of R to use while creating sparkR libraries
> --
>
> Key: SPARK-12910
> URL: https://issues.apache.org/jira/browse/SPARK-12910
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
> Environment: Linux
>Reporter: Shubhanshu Mishra
>Priority: Minor
>  Labels: installation, sparkR
> Fix For: 2.0.0
>
>
> When we use `$SPARK_HOME/R/install-dev.sh` it uses the default system R. 
> However, a user might have locally installed their own version of R. There 
> should be a way to specify which R version to use. 
> I have fixed this in my code using the following patch:
> {code:bash}
> $ git diff HEAD
> diff --git a/R/README.md b/R/README.md
> index 005f56d..99182e5 100644
> --- a/R/README.md
> +++ b/R/README.md
> @@ -1,6 +1,15 @@
>  # R on Spark
>  
>  SparkR is an R package that provides a light-weight frontend to use Spark 
> from R.
> +### Installing sparkR
> +
> +Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be 
> done by running the script `$SPARK_HOME/R/install-dev.sh`.
> +By default the above script uses the system wide installation of R. However, 
> this can be changed to any user installed location of R by giving the full 
> path of the `$R_HOME` as the first argument to the install-dev.sh script.
> +Example: 
> +```
> +# where /home/username/R is where R is installed and /home/username/R/bin 
> contains the files R and RScript
> +./install-dev.sh /home/username/R 
> +```
>  
>  ### SparkR development
>  
> diff --git a/R/install-dev.sh b/R/install-dev.sh
> index 4972bb9..a8efa86 100755
> --- a/R/install-dev.sh
> +++ b/R/install-dev.sh
> @@ -35,12 +35,19 @@ LIB_DIR="$FWDIR/lib"
>  mkdir -p $LIB_DIR
>  
>  pushd $FWDIR > /dev/null
> +if [ ! -z "$1" ]
> +  then
> +R_HOME="$1/bin"
> +   else
> +R_HOME="$(dirname $(which R))"
> +fi
> +echo "USING R_HOME = $R_HOME"
>  
>  # Generate Rd files if devtools is installed
> -Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
> +"$R_HOME/"Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
>  
>  # Install SparkR to $LIB_DIR
> -R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
> +"$R_HOME/"R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
>  
>  # Zip the SparkR package so that it can be distributed to worker nodes on 
> YARN
>  cd $LIB_DIR
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6825) Data sources implementation to support `sequenceFile`

2016-01-20 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15109446#comment-15109446
 ] 

Shivaram Venkataraman commented on SPARK-6825:
--

No - you can go ahead and work on it.

> Data sources implementation to support `sequenceFile`
> -
>
> Key: SPARK-6825
> URL: https://issues.apache.org/jira/browse/SPARK-6825
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
>
> SequenceFiles are a widely used input format and right now they are not 
> supported in SparkR. 
> It would be good to add support for SequenceFiles by implementing a new data 
> source that can create a DataFrame from a SequenceFile. However as 
> SequenceFiles can have arbitrary types, we probably need to map them to 
> User-defined types in SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12168) Need test for conflicted function in R

2016-01-19 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12168:
--
Assignee: Felix Cheung

> Need test for conflicted function in R
> --
>
> Key: SPARK-12168
> URL: https://issues.apache.org/jira/browse/SPARK-12168
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently it is hard to know if a function in the base or stats packages is 
> masked when adding a new function in SparkR.
> Having an automated test would make it easier to track such changes.
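
A hedged sketch of what such an automated check could look like, assuming testthat and an attached SparkR package; the expected list below is illustrative and would have to be kept in sync with the real set of masked names:

{code:r}
library(testthat)

# Functions we knowingly mask from base/stats; update this list deliberately
# whenever a new SparkR API shadows one of them.
expected_masked <- c("cov", "filter", "lag", "na.omit", "predict", "sd", "var")

test_that("no unexpected base/stats functions are masked by SparkR", {
  conf <- conflicts(detail = TRUE)[["package:SparkR"]]
  masked <- intersect(conf, c(ls("package:base"), ls("package:stats")))
  expect_true(all(masked %in% expected_masked))
})
{code}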



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12168) Need test for conflicted function in R

2016-01-19 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12168.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10171
[https://github.com/apache/spark/pull/10171]

> Need test for conflicted function in R
> --
>
> Key: SPARK-12168
> URL: https://issues.apache.org/jira/browse/SPARK-12168
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently it is hard to know if a function in the base or stats packages is 
> masked when adding a new function in SparkR.
> Having an automated test would make it easier to track such changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12337) Implement dropDuplicates() method of DataFrame in SparkR

2016-01-19 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12337:
--
Assignee: Sun Rui

> Implement dropDuplicates() method of DataFrame in SparkR
> 
>
> Key: SPARK-12337
> URL: https://issues.apache.org/jira/browse/SPARK-12337
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>Assignee: Sun Rui
> Fix For: 2.0.0
>
>
> distinct() and unique() drop duplicated rows considering all columns, while 
> dropDuplicates() can drop duplicated rows considering only selected columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12635) More efficient (column batch) serialization for Python/R

2016-01-19 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107860#comment-15107860
 ] 

Shivaram Venkataraman commented on SPARK-12635:
---

Just to clarify a couple of things - we should probably move this out to a new 
JIRA issue.
- The main purpose for creating the SerDe library in SparkR was to enable 
inter-process communication (IPC) between R and the JVM that is flexible, works 
on multiple platforms, and works without needing too many dependencies. By IPC, 
I mean having the ability to call methods on the JVM from R. The reason for 
implementing this in Spark was that we need the flexibility for either R or the 
JVM to come up first (as opposed to an embedded JVM) and also to make installing / 
deploying Spark easier.
- Using the same SerDe mechanism for collect is just a natural extension, and as 
Spark is primarily tuned for distributed operation we haven't profiled / 
benchmarked the collect performance so far. So your benchmarks are very useful 
and provide a baseline that we can improve on.
- In terms of future improvements I see two things: (a) better benchmarks and 
profiling of the serialization costs -- we will also need to do this for the 
UDF work, as we will be similarly transferring data from the JVM to R and back; 
(b) designing or using a faster serialization for batch transfers like collect 
and UDFs (a minimal timing sketch for (a) follows below). 
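
A minimal sketch of the kind of baseline measurement meant in (a), assuming a running SparkR session with sqlContext; the data size and column types are arbitrary:

{code:r}
# Time how long it takes to move ~100k rows from the JVM back into R through the
# current SerDe path; rerun with different sizes and column types to build a baseline.
local_data <- data.frame(x = runif(1e5), y = sample(letters, 1e5, replace = TRUE))
df <- createDataFrame(sqlContext, local_data)
print(system.time(collected <- collect(df)))
{code}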

> More efficient (column batch) serialization for Python/R
> 
>
> Key: SPARK-12635
> URL: https://issues.apache.org/jira/browse/SPARK-12635
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SparkR, SQL
>Reporter: Reynold Xin
>
> Serialization between Scala / Python / R is pretty slow. Python and R both 
> work pretty well with a column batch interface (e.g. numpy arrays). Technically 
> we should be able to just pass column batches around with minimal 
> serialization (maybe even zero-copy memory).
> Note that this depends on some internal refactoring to use a column batch 
> interface in Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


