[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-07 Thread davies

Github user davies commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146258846
  
Merging into master, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-07 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/8869



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146078080
  
Merged build started.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146078070
  
 Merged build triggered.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146080129
  
[Test build #43320 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43320/console) for PR 8869 at commit [`e73c8f3`](https://github.com/apache/spark/commit/e73c8f3a01a68a4ea839aa3925581e67744cfddd).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146080183
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43320/
Test PASSed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146080181
  
Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146078664
  
[Test build #43320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43320/consoleFull) for PR 8869 at commit [`e73c8f3`](https://github.com/apache/spark/commit/e73c8f3a01a68a4ea839aa3925581e67744cfddd).



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread felixcheung

Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/8869#discussion_r41348491
  
--- Diff: R/pkg/R/stats.R ---
@@ -0,0 +1,102 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# DataFrameStatFunctions.R - Statistic functions for DataFrames.
+
+setOldClass("jobj")
+
+#' crosstab
+#'
+#' Computes a pair-wise frequency table of the given columns. Also known as a contingency
+#' table. The number of distinct values for each column should be less than 1e4. At most 1e6
+#' non-zero pair frequencies will be returned.
+#'
+#' @param col1 name of the first column. Distinct items will make the first item of each row.
+#' @param col2 name of the second column. Distinct items will make the column names of the output.
+#' @return a local R data.frame representing the contingency table. The first column of each row
+#' will be the distinct values of `col1` and the column names will be the distinct values
+#' of `col2`. The name of the first column will be `$col1_$col2`. Pairs that have no
+#' occurrences will have zero as their counts.
+#'
+#' @rdname statfunctions
+#' @name crosstab
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- jsonFile(sqlCtx, "/path/to/file.json")
+#' ct <- crosstab(df, "title", "gender")
+#' }
+setMethod("crosstab",
+  signature(x = "DataFrame", col1 = "character", col2 = "character"),
+  function(x, col1, col2) {
+statFunctions <- callJMethod(x@sdf, "stat")
+sct <- callJMethod(statFunctions, "crosstab", col1, col2)
+collect(dataFrame(sct))
+  })
+
+#' cov
+#'
+#' Calculate the sample covariance of two numerical columns of a DataFrame.
+#'
+#' @param x A SparkSQL DataFrame
+#' @param col1 the name of the first column
+#' @param col2 the name of the second column
+#' @return the covariance of the two columns.
+#'
+#' @rdname statfunctions
+#' @name cov
+#' @export
+#' @examples
+#'\dontrun{
+#' df <- jsonFile(sqlCtx, "/path/to/file.json")
+#' cov <- cov(df, "title", "gender")
+#' }
+setMethod("cov",
+  signature(x = "DataFrame", col1 = "character", col2 = "character"),
+  function(x, col1, col2) {
+statFunctions <- callJMethod(x@sdf, "stat")
+callJMethod(statFunctions, "cov", col1, col2)
+  })
+
+#' corr
+#'
+#' Calculates the correlation of two columns of a DataFrame.
+#' Currently only supports the Pearson Correlation Coefficient.
+#' For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
+#' 
+#' @param x A SparkSQL DataFrame
+#' @param col1 the name of the first column
+#' @param col2 the name of the second column
+#' @param method Optional. A character specifying the method for calculating the correlation.
+#'   only "pearson" is allowed now.
+#' @return The Pearson Correlation Coefficient as a Double.
+#'
+#' @rdname statfunctions
+#' @name corr
+#' @export
+#' @examples
+#'\dontrun{
+#' df <- jsonFile(sqlCtx, "/path/to/file.json")
+#' corr <- corr(df, "title", "gender")
+#' corr <- corr(df, "title", "gender", "pearson")
--- End diff --

Would it be better to say
`corr <- corr(df, "title", "gender", method = "pearson")`?
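
A small sketch of why naming the argument reads better, using base R's `cov` and `cor` from the `stats` package rather than SparkR, so it runs without a Spark context. The vectors `x` and `y` are made up for illustration and are not from the PR. In `stats::cor` the third positional parameter is `use`, not `method`, so the method is passed by name there; the same habit keeps the SparkR `corr` example unambiguous.

```r
# Illustrative vectors (assumed for this sketch; not taken from the PR).
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Sample covariance: sum of products of deviations, divided by n - 1.
n <- length(x)
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
cov_builtin <- cov(x, y)

# Pearson correlation: covariance scaled by both sample standard deviations.
corr_manual <- cov_manual / (sd(x) * sd(y))

# Passing the method by name makes the intent explicit and does not
# depend on the position of `method` in the signature.
corr_named <- cor(x, y, method = "pearson")

# "pearson" is also the default method of stats::cor.
corr_default <- cor(x, y)

print(c(cov = cov_builtin, corr = corr_named))
```

The manual and built-in values agree, which matches the documented behaviour above: `cov` is the sample covariance and `corr` is the Pearson coefficient.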



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread felixcheung

Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/8869#discussion_r41348453
  
--- Diff: R/pkg/R/stats.R ---
@@ -0,0 +1,102 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# DataFrameStatFunctions.R - Statistic functions for DataFrames.
+
+setOldClass("jobj")
+
+#' crosstab
+#'
+#' Computes a pair-wise frequency table of the given columns. Also known as a contingency
+#' table. The number of distinct values for each column should be less than 1e4. At most 1e6
+#' non-zero pair frequencies will be returned.
+#'
+#' @param col1 name of the first column. Distinct items will make the first item of each row.
+#' @param col2 name of the second column. Distinct items will make the column names of the output.
+#' @return a local R data.frame representing the contingency table. The first column of each row
+#' will be the distinct values of `col1` and the column names will be the distinct values
+#' of `col2`. The name of the first column will be `$col1_$col2`. Pairs that have no
+#' occurrences will have zero as their counts.
+#'
+#' @rdname statfunctions
+#' @name crosstab
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- jsonFile(sqlCtx, "/path/to/file.json")
--- End diff --

perhaps a good time to update `sqlCtx` to `sqlContext`



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread felixcheung

Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/8869#discussion_r41348531
  
--- Diff: R/pkg/R/stats.R ---
@@ -0,0 +1,102 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# DataFrameStatFunctions.R - Statistic functions for DataFrames.
--- End diff --

This header comment should say `stats.R`, to match the file name.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146071562
  
 Merged build triggered.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146071597
  
Merged build started.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146072487
  
[Test build #43316 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43316/consoleFull) for PR 8869 at commit [`b05c443`](https://github.com/apache/spark/commit/b05c44386a7bbc14bb05f5eb11844fda2bc84623).



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146072680
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43316/
Test FAILed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146072678
  
Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146072671
  
[Test build #43316 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43316/console) for PR 8869 at commit [`b05c443`](https://github.com/apache/spark/commit/b05c44386a7bbc14bb05f5eb11844fda2bc84623).
 * This patch **fails R style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread sun-rui

Github user sun-rui commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146048835
  
rebased to master



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146048920
  
Merged build started.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146048909
  
 Merged build triggered.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146049134
  
[Test build #43311 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43311/consoleFull) for PR 8869 at commit [`ac2fd32`](https://github.com/apache/spark/commit/ac2fd32a660a14c2b20a82bf7207c4805f46a9be).



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146049566
  
[Test build #43311 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43311/console) for PR 8869 at commit [`ac2fd32`](https://github.com/apache/spark/commit/ac2fd32a660a14c2b20a82bf7207c4805f46a9be).
 * This patch **fails R style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146049577
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43311/
Test FAILed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-146049574
  
Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread davies

Github user davies commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-145942409
  
LGTM



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-06 Thread shivaram

Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-145932636
  
@davies any other comments?
@sun-rui Could you bring this up to date with the master branch?



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-03 Thread sun-rui

Github user sun-rui commented on a diff in the pull request:

https://github.com/apache/spark/pull/8869#discussion_r41087117
  
--- Diff: R/pkg/R/DataFrameStatFunctions.R ---
@@ -0,0 +1,102 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# DataFrameStatFunctions.R - Statistic functions for DataFrames.
+
+setOldClass("jobj")
+
+#' crosstab
+#'
+#' Computes a pair-wise frequency table of the given columns. Also known as a contingency
+#' table. The number of distinct values for each column should be less than 1e4. At most 1e6
+#' non-zero pair frequencies will be returned.
+#'
+#' @param col1 name of the first column. Distinct items will make the first item of each row.
+#' @param col2 name of the second column. Distinct items will make the column names of the output.
+#' @return a local R data.frame representing the contingency table. The first column of each row
+#' will be the distinct values of `col1` and the column names will be the distinct values
+#' of `col2`. The name of the first column will be `$col1_$col2`. Pairs that have no
+#' occurrences will have zero as their counts.
+#'
+#' @rdname statfunctions
+#' @name crosstab
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- jsonFile(sqlCtx, "/path/to/file.json")
+#' ct <- crosstab(df, "title", "gender")
+#' }
+setMethod("crosstab",
+  signature(x = "DataFrame", col1 = "character", col2 = "character"),
+  function(x, col1, col2) {
+statFunctions <- callJMethod(x@sdf, "stat")
+sct <- callJMethod(statFunctions, "crosstab", col1, col2)
+collect(dataFrame(sct))
+  })
+
+#' cov
+#'
+#' Calculate the sample covariance of two numerical columns of a DataFrame.
+#'
+#' @param x A SparkSQL DataFrame
+#' @param col1 the name of the first column
+#' @param col2 the name of the second column
+#' @return the covariance of the two columns.
+#'
+#' @rdname statfunctions
+#' @name cov
+#' @export
+#' @examples
+#'\dontrun{
+#' df <- jsonFile(sqlCtx, "/path/to/file.json")
+#' cov <- cov(df, "title", "gender")
+#' }
+setMethod("cov",
+  signature(x = "DataFrame", col1 = "character", col2 = "character"),
--- End diff --

yeah, I agree.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-145249335
  
 Build triggered.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-145249343
  
Build started.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-03 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-145249449
  
[Test build #43212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43212/consoleFull) for PR 8869 at commit [`302af26`](https://github.com/apache/spark/commit/302af267195b1bcb7e3171f26afae29993025de5).



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-03 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-145261664
  
[Test build #43212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43212/console) for PR 8869 at commit [`302af26`](https://github.com/apache/spark/commit/302af267195b1bcb7e3171f26afae29993025de5).
 * This patch **passes all tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-145261693
  
Build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-145261694
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43212/
Test PASSed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-01 Thread davies

Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/8869#discussion_r40967930
  
--- Diff: R/pkg/DESCRIPTION ---
@@ -23,6 +23,7 @@ Collate:
 'column.R'
 'group.R'
 'DataFrame.R'
+'DataFrameStatFunctions.R'
--- End diff --

Can we use a shorter name, like stats.R?



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-10-01 Thread davies

Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/8869#discussion_r40968690
  
--- Diff: R/pkg/R/DataFrameStatFunctions.R ---
@@ -0,0 +1,102 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# DataFrameStatFunctions.R - Statistic functions for DataFrames.
+
+setOldClass("jobj")
+
+#' crosstab
+#'
+#' Computes a pair-wise frequency table of the given columns. Also known as a contingency
+#' table. The number of distinct values for each column should be less than 1e4. At most 1e6
+#' non-zero pair frequencies will be returned.
+#'
+#' @param col1 name of the first column. Distinct items will make the first item of each row.
+#' @param col2 name of the second column. Distinct items will make the column names of the output.
+#' @return a local R data.frame representing the contingency table. The first column of each row
+#'         will be the distinct values of `col1` and the column names will be the distinct values
+#'         of `col2`. The name of the first column will be `$col1_$col2`. Pairs that have no
+#'         occurrences will have zero as their counts.
+#'
+#' @rdname statfunctions
+#' @name crosstab
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- jsonFile(sqlCtx, "/path/to/file.json")
+#' ct <- crosstab(df, "title", "gender")
+#' }
+setMethod("crosstab",
+          signature(x = "DataFrame", col1 = "character", col2 = "character"),
+          function(x, col1, col2) {
+            statFunctions <- callJMethod(x@sdf, "stat")
+            sct <- callJMethod(statFunctions, "crosstab", col1, col2)
+            collect(dataFrame(sct))
+          })
+
+#' cov
+#'
+#' Calculate the sample covariance of two numerical columns of a DataFrame.
+#'
+#' @param x A SparkSQL DataFrame
+#' @param col1 the name of the first column
+#' @param col2 the name of the second column
+#' @return the covariance of the two columns.
+#'
+#' @rdname statfunctions
+#' @name cov
+#' @export
+#' @examples
+#'\dontrun{
+#' df <- jsonFile(sqlCtx, "/path/to/file.json")
+#' cov <- cov(df, "title", "gender")
+#' }
+setMethod("cov",
+          signature(x = "DataFrame", col1 = "character", col2 = "character"),
--- End diff --

It would be great if we could have the same signature as the R API. Given that
a Spark DataFrame is very different from an R data.frame, this will be hard;
maybe we could support only a small subset of what the R API can do. If the two
can't be compatible, it's clearer to use a different name than to confuse
users.

Does this sound reasonable?
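
For reference, a minimal sketch of two call forms that base R's `cov` already supports, which a column-name-only signature would not cover. It uses only base R and the built-in `longley` dataset; nothing here is SparkR code, and the variable names are chosen purely for illustration.

```r
# Base R only; longley is a built-in dataset, not a Spark DataFrame.
gnp <- longley$GNP
employed <- longley$Employed

# Form 1: two numeric vectors give a single number.
pair_cov <- stats::cov(gnp, employed)

# Form 2: a data frame gives the full covariance matrix.
cov_matrix <- stats::cov(longley[, c("GNP", "Employed")])
```

The off-diagonal entry of `cov_matrix` is the same quantity as `pair_cov`, which is the part of the R behaviour that a two-column-name method can reproduce.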



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-29 Thread sun-rui

Github user sun-rui commented on a diff in the pull request:

https://github.com/apache/spark/pull/8869#discussion_r40639291
  
--- Diff: R/pkg/R/DataFrameStatFunctions.R ---
@@ -0,0 +1,102 @@
+setMethod("cov",
+          signature(x = "DataFrame", col1 = "character", col2 = "character"),
--- End diff --

@NarineK, thank you for your comments. Your suggestion needs extensions to
the Scala DataFrame API. I would prefer that you submit a new JIRA to the
community. @shivaram, what do you think?



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-28 Thread NarineK

Github user NarineK commented on a diff in the pull request:

https://github.com/apache/spark/pull/8869#discussion_r40581218
  
--- Diff: R/pkg/R/DataFrameStatFunctions.R ---
@@ -0,0 +1,102 @@
+setMethod("cov",
+          signature(x = "DataFrame", col1 = "character", col2 = "character"),
--- End diff --

Hi there,
I have some points about correlation and covariance.
1. R calls the method 'cor', not 'corr', so if we want the same syntax as R,
we might want to use 'cor'.
2. The actual syntax for cor (cov has a similar one) is:
cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))
where x is a data frame and y can be another data frame, a vector, or a matrix.
In R I can get something like this:
cor(longley)
             GNP.deflator       GNP   Unemployed
GNP.deflator    1.0000000 0.9915892
GNP             0.9915892 1.0000000
Unemployed      0.6206334 0.6042609
Armed.Forces    0.4647442 0.4464368
Population      0.9791634 0.9910901
Year            0.9911492 0.9952735
Employed        0.9708985 0.9835516

I wonder if we can get this in SparkR too.
I see at least 2 options here:
1. we make K calls to the DataFrame API, one for each column pair, or
2. we extend the Scala DataFrame API so that it also accepts a list of columns
...
I can help you with this if you think that it makes sense and we want to
add it.

Thanks,
Narine
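
A minimal sketch of the first option, assuming only a statistic that handles one column pair at a time. It runs on a local R data.frame with base R's `cor` standing in for the per-pair call; `pairwise_matrix` is a hypothetical helper, not part of SparkR, and how the per-pair call would behave against a Spark DataFrame is not shown here.

```r
# Hypothetical helper (not part of SparkR): fill a symmetric matrix by
# calling a two-column statistic once per unordered column pair.
pairwise_matrix <- function(df, pair_fn) {
  cols <- names(df)
  k <- length(cols)
  m <- matrix(NA_real_, nrow = k, ncol = k, dimnames = list(cols, cols))
  for (i in seq_len(k)) {
    for (j in seq_len(i)) {
      # One call per pair; the result is mirrored across the diagonal.
      value <- pair_fn(df[[cols[i]]], df[[cols[j]]])
      m[i, j] <- value
      m[j, i] <- value
    }
  }
  m
}

# Local stand-in for the per-pair call; longley is a built-in R dataset.
pairwise <- pairwise_matrix(longley, function(a, b) stats::cor(a, b))
```

Because the matrix is symmetric, K columns need K(K+1)/2 pair calls rather than K^2, which is the cost the second option (a single call taking a list of columns) would aim to avoid.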



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-28 Thread felixcheung

Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/8869#discussion_r40624158
  
--- Diff: R/pkg/R/DataFrameStatFunctions.R ---
@@ -0,0 +1,102 @@
+setMethod("cov",
+          signature(x = "DataFrame", col1 = "character", col2 = "character"),
--- End diff --

Link on the function name: 
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cor.html



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-26 Thread shivaram

Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/8869#discussion_r40490731
  
--- Diff: R/pkg/R/DataFrameStatFunctions.R ---
@@ -0,0 +1,102 @@
+setMethod("cov",
+          signature(x = "DataFrame", col1 = "character", col2 = "character"),
--- End diff --

It would be cool if we also had versions that take Columns instead of just
strings.
@rxin, is there any reason all the stat functions take only string column
names in Scala?



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-142464859
  
 Merged build triggered.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-142464871
  
Merged build started.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-22 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-142466128
  
[Test build #42875 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42875/consoleFull) for PR 8869 at commit [`d35c3f5`](https://github.com/apache/spark/commit/d35c3f56be1785cd5e3217bf0f53f7ba42504b7c).



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-22 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-142467524
  
[Test build #42875 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42875/console) for PR 8869 at commit [`d35c3f5`](https://github.com/apache/spark/commit/d35c3f56be1785cd5e3217bf0f53f7ba42504b7c).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-142467569
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42875/
Test PASSed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-142467567
  
Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-142276456
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42832/
Test FAILed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-22 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-142276448
  
[Test build #42832 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42832/console) for PR 8869 at commit [`038be09`](https://github.com/apache/spark/commit/038be09ee04de625bb10d1d4a13f495dcc774ac3).
 * This patch **fails R style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-142274109
  
Merged build started.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-22 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-142275868
  
[Test build #42832 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42832/consoleFull) for PR 8869 at commit [`038be09`](https://github.com/apache/spark/commit/038be09ee04de625bb10d1d4a13f495dcc774ac3).



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-22 Thread sun-rui

GitHub user sun-rui opened a pull request:

https://github.com/apache/spark/pull/8869

[SPARK-10752][SPARKR] Implement corr() and cov in DataFrameStatFunctions.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sun-rui/spark SPARK-10752

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/8869.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #8869


commit c54002c295278fb2b7c80df11c0d1305adb3aa9e
Author: Sun Rui 
Date:   2015-09-22T12:21:25Z

[SPARK-10752][SPARKR] Implement corr() and cov in DataFrameStatFunctions.

commit 038be09ee04de625bb10d1d4a13f495dcc774ac3
Author: Sun Rui 
Date:   2015-09-22T12:28:25Z

Remove crosstab() from DataFrame.R.





[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-142274093
  
Merged build triggered.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

2015-09-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8869#issuecomment-142276452
  
Merged build finished. Test FAILed.


