[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030350 --- Diff: docs/sparkr.md --- @@ -450,6 +450,42 @@ print(model.summaries) {% endhighlight %} +### Eager execution + +If the eager execution is enabled, the data will be returned to R client immediately when the `SparkDataFrame` is created. Eager execution can be enabled by setting the configuration property `spark.sql.repl.eagerEval.enabled` to `true` when the `SparkSession` is started up. + + +{% highlight r %} + +# Start up spark session with eager execution enabled +sparkR.session(master = "local[*]", sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true")) + +df <- createDataFrame(faithful) + +# Instead of displaying the SparkDataFrame class, displays the data returned +df + +##+---------+-------+ +##|eruptions|waiting| +##+---------+-------+ +##| 3.6| 79.0| +##| 1.8| 54.0| +##|3.333| 74.0| +##|2.283| 62.0| +##|4.533| 85.0| +##|2.883| 55.0| +##| 4.7| 88.0| +##| 3.6| 85.0| +##| 1.95| 51.0| +##| 4.35| 85.0| +##+---------+-------+ +##only showing top 10 rows + +{% endhighlight %} + + +Note that the `SparkSession` created by `sparkR` shell does not have eager execution enabled. You can stop the current session and start up a new session like above to enable. --- End diff -- actually I think the suggestion should be to set that in the `sparkR` command line as spark conf? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
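A hedged sketch of that suggestion follows. The `--conf` flag is the standard way to pass Spark configs to the launcher scripts, and `sparkR.conf()` is SparkR's accessor for runtime configs; treat the exact invocation as an assumption, not the documented recipe.

```r
# Assumed shell invocation (not from the diff): set the conf when launching the shell
#   ./bin/sparkR --conf spark.sql.repl.eagerEval.enabled=true
# Once inside the shell, the setting can be confirmed from R:
sparkR.conf("spark.sql.repl.eagerEval.enabled")  # returns the current value
```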
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030512 --- Diff: R/pkg/tests/fulltests/test_eager_execution.R --- @@ -0,0 +1,58 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +library(testthat) + +context("Show SparkDataFrame when eager execution is enabled.") + +test_that("eager execution is not enabled", { + # Start Spark session without eager execution enabled + sparkSession <- if (windows_with_hadoop()) { +sparkR.session(master = sparkRTestMaster) + } else { +sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE) + } + + df <- suppressWarnings(createDataFrame(iris)) --- End diff -- use a different dataset that does not require `suppressWarnings` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
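A minimal sketch of that suggestion: `faithful` has dot-free column names, so `createDataFrame` should not emit the name-sanitization warning that `iris` (e.g. `Sepal.Length`) triggers.

```r
df <- createDataFrame(faithful)  # no suppressWarnings needed
expect_is(df, "SparkDataFrame")
# schema line printed by show() in the non-eager case
expect_output(show(df), "eruptions:double, waiting:double")
```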
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030211 --- Diff: docs/sparkr.md --- @@ -450,6 +450,42 @@ print(model.summaries) {% endhighlight %} +### Eager execution + +If the eager execution is enabled, the data will be returned to R client immediately when the `SparkDataFrame` is created. Eager execution can be enabled by setting the configuration property `spark.sql.repl.eagerEval.enabled` to `true` when the `SparkSession` is started up. + + +{% highlight r %} + +# Start up spark session with eager execution enabled +sparkR.session(master = "local[*]", sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true")) + +df <- createDataFrame(faithful) + +# Instead of displaying the SparkDataFrame class, displays the data returned --- End diff -- we could also start here by saying "similar to R `data.frame`"... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030277 --- Diff: docs/sparkr.md --- @@ -450,6 +450,42 @@ print(model.summaries) {% endhighlight %} +### Eager execution + +If the eager execution is enabled, the data will be returned to R client immediately when the `SparkDataFrame` is created. Eager execution can be enabled by setting the configuration property `spark.sql.repl.eagerEval.enabled` to `true` when the `SparkSession` is started up. + + +{% highlight r %} + +# Start up spark session with eager execution enabled +sparkR.session(master = "local[*]", sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true")) + +df <- createDataFrame(faithful) + +# Instead of displaying the SparkDataFrame class, displays the data returned +df + +##+---------+-------+ +##|eruptions|waiting| +##+---------+-------+ +##| 3.6| 79.0| +##| 1.8| 54.0| +##|3.333| 74.0| +##|2.283| 62.0| +##|4.533| 85.0| +##|2.883| 55.0| +##| 4.7| 88.0| +##| 3.6| 85.0| +##| 1.95| 51.0| +##| 4.35| 85.0| +##+---------+-------+ +##only showing top 10 rows + +{% endhighlight %} + + +Note that the `SparkSession` created by `sparkR` shell does not have eager execution enabled. You can stop the current session and start up a new session like above to enable. --- End diff -- change to `Note that the `SparkSession` created by `sparkR` shell by default does not ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219029847 --- Diff: docs/sparkr.md --- @@ -450,6 +450,42 @@ print(model.summaries) {% endhighlight %} +### Eager execution --- End diff -- should be `` I think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030474 --- Diff: R/pkg/tests/fulltests/test_eager_execution.R --- @@ -0,0 +1,58 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +library(testthat) + +context("Show SparkDataFrame when eager execution is enabled.") + +test_that("eager execution is not enabled", { --- End diff -- I'm neutral, should these tests be in test_sparkSQL.R? it takes longer to run with many test files --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030085 --- Diff: docs/sparkr.md --- @@ -450,6 +450,42 @@ print(model.summaries) {% endhighlight %} +### Eager execution + +If the eager execution is enabled, the data will be returned to R client immediately when the `SparkDataFrame` is created. Eager execution can be enabled by setting the configuration property `spark.sql.repl.eagerEval.enabled` to `true` when the `SparkSession` is started up. + + +{% highlight r %} + +# Start up spark session with eager execution enabled +sparkR.session(master = "local[*]", sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true")) + +df <- createDataFrame(faithful) --- End diff -- perhaps a more complete example - like `summarize(groupBy(df, df$waiting), count = n(df$waiting))` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030537 --- Diff: R/pkg/tests/fulltests/test_eager_execution.R --- @@ -0,0 +1,58 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +library(testthat) + +context("Show SparkDataFrame when eager execution is enabled.") + +test_that("eager execution is not enabled", { + # Start Spark session without eager execution enabled + sparkSession <- if (windows_with_hadoop()) { +sparkR.session(master = sparkRTestMaster) + } else { +sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE) + } + + df <- suppressWarnings(createDataFrame(iris)) + expect_is(df, "SparkDataFrame") + expected <- "Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string" + expect_output(show(df), expected) + + # Stop Spark session + sparkR.session.stop() +}) + +test_that("eager execution is enabled", { + # Start Spark session without eager execution enabled + sparkSession <- if (windows_with_hadoop()) { +sparkR.session(master = sparkRTestMaster, + sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true")) + } else { +sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, + sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true")) + } + + df <- suppressWarnings(createDataFrame(iris)) --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22379 I think maybe someone should review the SQL stuff more? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22227#discussion_r217953294 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -1803,6 +1803,18 @@ test_that("string operators", { collect(select(df4, split_string(df4$a, "")))[1, 1], list(list("a.b@c.d 1", "b")) ) + expect_equal( +collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1], +list(list("a", "b@c.d 1\\b")) + ) + expect_equal( +collect(select(df4, split_string(df4$a, "b", -2)))[1, 1], +list(list("a.", "@c.d 1\\", "")) + ) + expect_equal( +collect(select(df4, split_string(df4$a, "b", 0)))[1, 1], --- End diff -- for context, we've had some cases in the past where the wrong value was passed for a parameter - so let's at least get one with and one without any optional parameter --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
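A sketch of that pairing: one call relying on the default and one passing the optional `limit` explicitly, so an argument passed in the wrong position would surface as a difference between the two.

```r
# default (no limit) and explicit limit = -1 should agree
expect_equal(
  collect(select(df4, split_string(df4$a, "b")))[1, 1],
  collect(select(df4, split_string(df4$a, "b", -1)))[1, 1]
)
```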
spark git commit: [MINOR][DOCS] Axe deprecated doc refs
Repository: spark Updated Branches: refs/heads/branch-2.4 60af706b4 -> 1cb1e4301 [MINOR][DOCS] Axe deprecated doc refs Continuation of #22370. Summary of discussion there: There is some inconsistency in the R manual w.r.t. supercedent functions linking back to deprecated functions. - `createOrReplaceTempView` and `createTable` both link back to functions which are deprecated (`registerTempTable` and `createExternalTable`, respectively) - `sparkR.session` and `dropTempView` do _not_ link back to deprecated functions This PR takes the view that it is preferable _not_ to link back to deprecated functions, and removes these references from `?createOrReplaceTempView` and `?createTable`. As `registerTempTable` was included in the `SparkDataFrame functions` `family` of functions, other documentation pages which included a link to `?registerTempTable` will similarly be altered. Author: Michael Chirico Author: Michael Chirico Closes #22393 from MichaelChirico/axe_deprecated_doc_refs. (cherry picked from commit a1dd78255a3ae023820b2f245cd39f0c57a32fb1) Signed-off-by: Felix Cheung Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1cb1e430 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1cb1e430 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1cb1e430 Branch: refs/heads/branch-2.4 Commit: 1cb1e43012e57e649d77524f8ff2de231f52c66a Parents: 60af706 Author: Michael Chirico Authored: Sun Sep 16 12:57:44 2018 -0700 Committer: Felix Cheung Committed: Sun Sep 16 12:58:04 2018 -0700 -- R/pkg/R/DataFrame.R | 1 - R/pkg/R/catalog.R | 1 - 2 files changed, 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1cb1e430/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 4f2d4c7..458deca 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -503,7 +503,6 @@ setMethod("createOrReplaceTempView", #' @param x A SparkDataFrame #' @param tableName A character vector containing the name of the table #' -#' @family SparkDataFrame functions #' @seealso \link{createOrReplaceTempView} #' @rdname registerTempTable-deprecated #' @name registerTempTable http://git-wip-us.apache.org/repos/asf/spark/blob/1cb1e430/R/pkg/R/catalog.R -- diff --git a/R/pkg/R/catalog.R b/R/pkg/R/catalog.R index baf4d86..c2d0fc3 100644 --- a/R/pkg/R/catalog.R +++ b/R/pkg/R/catalog.R @@ -69,7 +69,6 @@ createExternalTable <- function(x, ...) { #' @param ... additional named parameters as options for the data source. #' @return A SparkDataFrame. #' @rdname createTable -#' @seealso \link{createExternalTable} #' @examples #'\dontrun{ #' sparkR.session() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] spark issue #22393: [MINOR][DOCS] Axe deprecated doc refs
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22393 thx. merged to master/2.4 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [MINOR][DOCS] Axe deprecated doc refs
Repository: spark Updated Branches: refs/heads/master bfcf74260 -> a1dd78255 [MINOR][DOCS] Axe deprecated doc refs Continuation of #22370. Summary of discussion there: There is some inconsistency in the R manual w.r.t. supercedent functions linking back to deprecated functions. - `createOrReplaceTempView` and `createTable` both link back to functions which are deprecated (`registerTempTable` and `createExternalTable`, respectively) - `sparkR.session` and `dropTempView` do _not_ link back to deprecated functions This PR takes the view that it is preferable _not_ to link back to deprecated functions, and removes these references from `?createOrReplaceTempView` and `?createTable`. As `registerTempTable` was included in the `SparkDataFrame functions` `family` of functions, other documentation pages which included a link to `?registerTempTable` will similarly be altered. Author: Michael Chirico Author: Michael Chirico Closes #22393 from MichaelChirico/axe_deprecated_doc_refs. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a1dd7825 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a1dd7825 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a1dd7825 Branch: refs/heads/master Commit: a1dd78255a3ae023820b2f245cd39f0c57a32fb1 Parents: bfcf742 Author: Michael Chirico Authored: Sun Sep 16 12:57:44 2018 -0700 Committer: Felix Cheung Committed: Sun Sep 16 12:57:44 2018 -0700 -- R/pkg/R/DataFrame.R | 1 - R/pkg/R/catalog.R | 1 - 2 files changed, 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a1dd7825/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 4f2d4c7..458deca 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -503,7 +503,6 @@ setMethod("createOrReplaceTempView", #' @param x A SparkDataFrame #' @param tableName A character vector containing the name of the table #' -#' @family SparkDataFrame functions #' @seealso \link{createOrReplaceTempView} #' @rdname registerTempTable-deprecated #' @name registerTempTable http://git-wip-us.apache.org/repos/asf/spark/blob/a1dd7825/R/pkg/R/catalog.R -- diff --git a/R/pkg/R/catalog.R b/R/pkg/R/catalog.R index baf4d86..c2d0fc3 100644 --- a/R/pkg/R/catalog.R +++ b/R/pkg/R/catalog.R @@ -69,7 +69,6 @@ createExternalTable <- function(x, ...) { #' @param ... additional named parameters as options for the data source. #' @return A SparkDataFrame. #' @rdname createTable -#' @seealso \link{createExternalTable} #' @examples #'\dontrun{ #' sparkR.session() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] spark issue #22393: [MINOR][DOCS] Axe deprecated doc refs
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22393 yes please - please double-check that the doc created looks correct - there is no automatic test for that --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21515: [SPARK-24372][build] Add scripts to help with preparing ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21515 UID already exists? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22227#discussion_r217901635 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -1803,6 +1803,10 @@ test_that("string operators", { collect(select(df4, split_string(df4$a, "")))[1, 1], list(list("a.b@c.d 1", "b")) ) + expect_equal( +collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1], +list(list("a", "b@c.d 1\\b")) --- End diff -- let's add a test for `limit = 0` or `limit = -1` too - while it's the default value, do any of the test cases change behavior for limit = -1? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r217901558 --- Diff: R/pkg/NAMESPACE --- @@ -275,6 +275,7 @@ exportMethods("%<=>%", "format_number", "format_string", "from_json", + "from_csv", --- End diff -- please sort this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r217901588 --- Diff: R/pkg/R/functions.R --- @@ -2202,6 +2208,24 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType") column(jc) }) +#' @details +#' \code{from_csv}: Parses a column containing a CSV string into a Column of \code{structType} +#' with the specified \code{schema}. +#' If the string is unparseable, the Column will contain the value NA. +#' +#' @rdname column_collection_functions +#' @aliases from_csv from_csv,Column,character-method +#' --- End diff -- newline with `#'` is significant in ROxygen, please remove this line --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22393: [MINOR][DOCS] Axe deprecated doc refs
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22393 could you check the doc output manually for registerTempTable and createTable? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22379 see comment above. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r216875875 --- Diff: R/pkg/R/functions.R --- @@ -3720,3 +3720,22 @@ setMethod("current_timestamp", jc <- callJStatic("org.apache.spark.sql.functions", "current_timestamp") column(jc) }) + +#' @details +#' \code{from_csv}: Parses a column containing a CSV string into a Column of \code{structType} +#' with the specified \code{schema}. +#' If the string is unparseable, the Column will contain the value NA. +#' +#' @rdname column_collection_functions +#' @param schema a DDL-formatted string +#' @aliases from_csv from_csv,Column,character-method +#' +#' @note from_csv since 3.0.0 +setMethod("from_csv", signature(x = "Column", schema = "character"), + function(x, schema, ...) { --- End diff -- here https://github.com/apache/spark/blob/d2bfd9430f05d006accdecb6a62ed659fbd6a2f8/R/pkg/R/functions.R#L199 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r216875804 --- Diff: R/pkg/R/functions.R --- @@ -3720,3 +3720,22 @@ setMethod("current_timestamp", jc <- callJStatic("org.apache.spark.sql.functions", "current_timestamp") column(jc) }) + +#' @details +#' \code{from_csv}: Parses a column containing a CSV string into a Column of \code{structType} +#' with the specified \code{schema}. +#' If the string is unparseable, the Column will contain the value NA. +#' +#' @rdname column_collection_functions +#' @param schema a DDL-formatted string +#' @aliases from_csv from_csv,Column,character-method +#' +#' @note from_csv since 3.0.0 +setMethod("from_csv", signature(x = "Column", schema = "character"), + function(x, schema, ...) { --- End diff -- no no, this will break - I am referring to finding the original doc `@rdname column_collection_functions` that has `...` already documented, and then adding this in --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
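A hedged sketch of what that would look like: the shared `@rdname column_collection_functions` block already documents `...` once, and the `from_csv` note would be folded into it. The wording below is hypothetical, not the actual doc text.

```r
#' @param ... additional argument(s). In \code{to_json}, \code{from_json} and
#'   \code{from_csv}, this contains additional named properties to control
#'   how the column is converted.
```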
[GitHub] spark issue #22376: [SPARK-25021][K8S][BACKPORT] Add spark.executor.pyspark....
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22376 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21710 I think we missed the window before the branch, I'll review in a few days --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22192 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21649: [SPARK-23648][R][SQL]Adds more types for hint in ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21649#discussion_r216539767 --- Diff: R/pkg/R/DataFrame.R --- @@ -3939,7 +3929,15 @@ setMethod("hint", signature(x = "SparkDataFrame", name = "character"), function(x, name, ...) { parameters <- list(...) -stopifnot(all(sapply(parameters, isTypeAllowedForSqlHint))) +stopifnot(all(sapply(parameters, function(x) { --- End diff -- If I recall, let's not have an inner scope with the same variable name `x` as in the outer scope? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
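A sketch of the rename: same check, but the inner function's argument no longer shadows the outer `x` (the SparkDataFrame). The allowed types (character, numeric, or lists of those) follow the discussion elsewhere in this PR, so treat the exact predicate as an assumption.

```r
stopifnot(all(sapply(parameters, function(p) {
  # a hint parameter may be a scalar, or a list of scalars
  is.character(p) || is.numeric(p) ||
    (is.list(p) && all(sapply(p, function(q) is.character(q) || is.numeric(q))))
})))
```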
[GitHub] spark pull request #22370: don't link to deprecated function
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22370#discussion_r216539411 --- Diff: R/pkg/R/catalog.R --- @@ -69,7 +69,6 @@ createExternalTable <- function(x, ...) { #' @param ... additional named parameters as options for the data source. #' @return A SparkDataFrame. #' @rdname createTable -#' @seealso \link{createExternalTable} --- End diff -- `registerTempTable` is because of the `@family` tag, so it's a bit different. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r216538924 --- Diff: R/pkg/R/functions.R --- @@ -3720,3 +3720,22 @@ setMethod("current_timestamp", jc <- callJStatic("org.apache.spark.sql.functions", "current_timestamp") column(jc) }) + +#' @details +#' \code{from_csv}: Parses a column containing a CSV string into a Column of \code{structType} +#' with the specified \code{schema}. +#' If the string is unparseable, the Column will contain the value NA. +#' +#' @rdname column_collection_functions +#' @param schema a DDL-formatted string +#' @aliases from_csv from_csv,Column,character-method +#' +#' @note from_csv since 3.0.0 +setMethod("from_csv", signature(x = "Column", schema = "character"), + function(x, schema, ...) { --- End diff -- can you add to the doc for `...` (in column_collection_functions) to indicate the usable options for this function, if there is anything new? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22357 If I recall, the parquet reader can have filter pushdown? Only not so in the Spark parquet data source? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22376: [SPARK-25021][K8S][BACKPORT] Add spark.executor.pyspark....
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22376 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22370: don't link to deprecated function
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22370 I don't feel strongly either way. I do think this is very minor since there are still many other ways to get to the doc page for createExternalTable (eg the index page) or via ? search within R etc. I am not sure how much difference this would make and we already have a) code spewing out a warning when called b) it clearly documented as Deprecated in the doc page title. Should you find other deprecations that are not documented, we would gladly have your help documenting them. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21649: [SPARK-23648][R][SQL]Adds more types for hint in SparkR
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21649 Right - I think we could inline it or simplify it further. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22372: [SPARK-25385][BUILD] Upgrade Hadoop 3.1 jackson version ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22372 do we have jenkins tests for the 3.1 profile? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22371: [SPARK-25386][CORE] Don't need to synchronize the IndexS...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22371 + @srowen @squito @JoshRosen --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCode...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22358#discussion_r216165218 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -398,10 +398,10 @@ object SQLConf { "`parquet.compression` is specified in the table-specific options/properties, the " + "precedence would be `compression`, `parquet.compression`, " + "`spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, " + - "snappy, gzip, lzo, brotli, lz4, zstd.") + "snappy, gzip, lzo, lz4.") .stringConf .transform(_.toLowerCase(Locale.ROOT)) -.checkValues(Set("none", "uncompressed", "snappy", "gzip", "lzo", "lz4", "brotli", "zstd")) --- End diff -- I thought if you remove it from here the user would not be able to use zstd or brotli even if it is installed/enabled/available? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.memory lim...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22298 +1 for 2.4 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21649: [SPARK-23648][R][SQL]Adds more types for hint in ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21649#discussion_r216122842 --- Diff: R/pkg/R/DataFrame.R --- @@ -3905,6 +3905,16 @@ setMethod("rollup", groupedData(sgd) }) +isTypeAllowedForSqlHint <- function(x) { + if (is.character(x) || is.numeric(x)) { +TRUE + } else if (is.list(x)) { +all(sapply(x, (function(y) is.character(y) || is.numeric(y --- End diff -- also, if it is a `list`, could we clarify whether it is supposed to work with multiple hints of different types in that list (this might be "unique" to R), for example ``` > x <- list("a", 3) > all(sapply(x, function(y) { is.character(y) || is.numeric(y) } )) [1] TRUE > x <- list("a", NA) > all(sapply(x, function(y) { is.character(y) || is.numeric(y) } )) [1] FALSE ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21649: [SPARK-23648][R][SQL]Adds more types for hint in ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21649#discussion_r216122804 --- Diff: R/pkg/R/DataFrame.R --- @@ -3905,6 +3905,16 @@ setMethod("rollup", groupedData(sgd) }) +isTypeAllowedForSqlHint <- function(x) { + if (is.character(x) || is.numeric(x)) { +TRUE + } else if (is.list(x)) { +all(sapply(x, (function(y) is.character(y) || is.numeric(y --- End diff -- I looked into this more deeply; I think this style seems a bit odd. As a nit, I think this should be `all(sapply(x, function(y) { is.character(y) || is.numeric(y) } ))` - I think it's more readable this way. Also see L2458 for an example https://github.com/apache/spark/blob/aec391c9dcb6362874736e663d435f9dd8400125/R/pkg/R/DataFrame.R#L2458 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22144 hey, this looks important, could someone review this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22362: [SPARK-25372][YARN][K8S] Deprecate and generalize...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22362#discussion_r216122659 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala --- @@ -199,8 +199,8 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S numExecutors = Option(numExecutors) .getOrElse(sparkProperties.get("spark.executor.instances").orNull) queue = Option(queue).orElse(sparkProperties.get("spark.yarn.queue")).orNull -keytab = Option(keytab).orElse(sparkProperties.get("spark.yarn.keytab")).orNull -principal = Option(principal).orElse(sparkProperties.get("spark.yarn.principal")).orNull +keytab = Option(keytab).orElse(sparkProperties.get("spark.kerberos.keytab")).orNull --- End diff -- agreed, shouldn't the "old" config still work? `spark.yarn.keytab` etc --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22227#discussion_r216122621 --- Diff: R/pkg/R/functions.R --- @@ -3404,19 +3404,24 @@ setMethod("collect_set", #' Equivalent to \code{split} SQL function. #' #' @rdname column_string_functions +#' @param limit determines the size of the returned array. If `limit` is positive, +#'size of the array will be at most `limit`. If `limit` is negative, the --- End diff -- you can't use backtick in R doc --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
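A sketch of the fix - Rd markup uses `\code{}` instead of backticks. The trailing sentence is paraphrased, since the quoted diff is truncated at that point.

```r
#' @param limit determines the size of the returned array. If \code{limit} is
#'   positive, the size of the array will be at most \code{limit}. If
#'   \code{limit} is negative, the returned array can have any size.
```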
[GitHub] spark issue #22335: [SPARK-25091][SQL] reduce the storage memory in Executor...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22335 please fix the description for this PR - the top part contains the truncated title --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22145: [SPARK-25152][K8S] Enable SparkR Integration Tests for K...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22145 what's the latest on this, btw? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22274: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes for R sql...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22274 merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes for R sql tests (timestamp comparison)
Repository: spark Updated Branches: refs/heads/master 64bbd134e -> 39d3d6cc9 [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes for R sql tests (timestamp comparison) ## What changes were proposed in this pull request? The "date function on DataFrame" test fails consistently on my laptop. In this PR i am fixing it by changing the way we compare the two timestamp values. With this change i am able to run the tests clean. ## How was this patch tested? Fixed the failing test. Author: Dilip Biswal Closes #22274 from dilipbiswal/r-sql-test-fix2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/39d3d6cc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/39d3d6cc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/39d3d6cc Branch: refs/heads/master Commit: 39d3d6cc965bd09b1719d245e672b013b8cee6f7 Parents: 64bbd13 Author: Dilip Biswal Authored: Mon Sep 3 00:38:08 2018 -0700 Committer: Felix Cheung Committed: Mon Sep 3 00:38:08 2018 -0700 -- R/pkg/tests/fulltests/test_sparkSQL.R | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/39d3d6cc/R/pkg/tests/fulltests/test_sparkSQL.R -- diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R b/R/pkg/tests/fulltests/test_sparkSQL.R index 17e4a97..5c07a02 100644 --- a/R/pkg/tests/fulltests/test_sparkSQL.R +++ b/R/pkg/tests/fulltests/test_sparkSQL.R @@ -1870,9 +1870,9 @@ test_that("date functions on a DataFrame", { expect_equal(collect(select(df2, minute(df2$b)))[, 1], c(34, 24)) expect_equal(collect(select(df2, second(df2$b)))[, 1], c(0, 34)) expect_equal(collect(select(df2, from_utc_timestamp(df2$b, "JST")))[, 1], - c(as.POSIXlt("2012-12-13 21:34:00 UTC"), as.POSIXlt("2014-12-15 10:24:34 UTC"))) + c(as.POSIXct("2012-12-13 21:34:00 UTC"), as.POSIXct("2014-12-15 10:24:34 UTC"))) expect_equal(collect(select(df2, to_utc_timestamp(df2$b, "JST")))[, 1], - c(as.POSIXlt("2012-12-13 03:34:00 UTC"), as.POSIXlt("2014-12-14 16:24:34 UTC"))) + c(as.POSIXct("2012-12-13 03:34:00 UTC"), as.POSIXct("2014-12-14 16:24:34 UTC"))) expect_gt(collect(select(df2, unix_timestamp()))[1, 1], 0) expect_gt(collect(select(df2, unix_timestamp(df2$b)))[1, 1], 0) expect_gt(collect(select(df2, unix_timestamp(lit("2015-01-01"), "yyyy-MM-dd")))[1, 1], 0) @@ -3652,7 +3652,8 @@ test_that("catalog APIs, currentDatabase, setCurrentDatabase, listDatabases", { expect_equal(currentDatabase(), "default") expect_error(setCurrentDatabase("default"), NA) expect_error(setCurrentDatabase("zxwtyswklpf"), -"Error in setCurrentDatabase : analysis error - Database 'zxwtyswklpf' does not exist") + paste0("Error in setCurrentDatabase : analysis error - Database ", + "'zxwtyswklpf' does not exist")) dbs <- collect(listDatabases()) expect_equal(names(dbs), c("name", "description", "locationUri")) expect_equal(which(dbs[, 1] == "default"), 1) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
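For context on the fix, a minimal R sketch: `collect()` returns timestamps as `POSIXct`, while `as.POSIXlt()` builds a list-based object, so the two sides of the old comparison had different classes.

```r
t <- "2012-12-13 21:34:00"
class(as.POSIXct(t, tz = "UTC"))  # "POSIXct" "POSIXt"
class(as.POSIXlt(t, tz = "UTC"))  # "POSIXlt" "POSIXt"
```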
[GitHub] spark issue #22274: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes for R sql...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22274 possible - but since this passes for you and in jenkins/appveyor, your change seems to work both ways, which is good enough for me --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.mem...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22298#discussion_r214550394 --- Diff: examples/src/main/python/worker_memory_check.py --- @@ -0,0 +1,47 @@ +# --- End diff -- I think the concern here is shipping a test as an example - this is the place where devs will be looking for examples on how to use pyspark, and having a memory test there is a bit strange. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22274: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes for R sql...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22274 Interesting. Maybe something to do with a newer R release - I scanned through the release notes but didn't find anything that might be related. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22291: [SPARK-25007][R]Add array_intersect/array_except/array_u...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22291 merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-25007][R] Add array_intersect/array_except/array_union/shuffle to SparkR
Repository: spark Updated Branches: refs/heads/master a3dccd24c -> a481794ca [SPARK-25007][R] Add array_intersect/array_except/array_union/shuffle to SparkR ## What changes were proposed in this pull request? Add the R version of array_intersect/array_except/array_union/shuffle ## How was this patch tested? Add test in test_sparkSQL.R Author: Huaxin Gao Closes #22291 from huaxingao/spark-25007. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a481794c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a481794c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a481794c Branch: refs/heads/master Commit: a481794ca9a5edb87982679cd0e95146f668fe78 Parents: a3dccd2 Author: Huaxin Gao Authored: Sun Sep 2 00:06:19 2018 -0700 Committer: Felix Cheung Committed: Sun Sep 2 00:06:19 2018 -0700 -- R/pkg/NAMESPACE | 4 ++ R/pkg/R/functions.R | 59 +- R/pkg/R/generics.R| 16 R/pkg/tests/fulltests/test_sparkSQL.R | 19 ++ 4 files changed, 97 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a481794c/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 0fd0848..96ff389 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -204,6 +204,8 @@ exportMethods("%<=>%", "approxQuantile", "array_contains", "array_distinct", + "array_except", + "array_intersect", "array_join", "array_max", "array_min", @@ -212,6 +214,7 @@ exportMethods("%<=>%", "array_repeat", "array_sort", "arrays_overlap", + "array_union", "arrays_zip", "asc", "ascii", @@ -355,6 +358,7 @@ exportMethods("%<=>%", "shiftLeft", "shiftRight", "shiftRightUnsigned", + "shuffle", "sd", "sign", "signum", http://git-wip-us.apache.org/repos/asf/spark/blob/a481794c/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 2929a00..d157acc 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -208,7 +208,7 @@ NULL #' # Dataframe used throughout this doc #' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars)) #' tmp <- mutate(df, v1 = create_array(df$mpg, df$cyl, df$hp)) -#' head(select(tmp, array_contains(tmp$v1, 21), size(tmp$v1))) +#' head(select(tmp, array_contains(tmp$v1, 21), size(tmp$v1), shuffle(tmp$v1))) #' head(select(tmp, array_max(tmp$v1), array_min(tmp$v1), array_distinct(tmp$v1))) #' head(select(tmp, array_position(tmp$v1, 21), array_repeat(df$mpg, 3), array_sort(tmp$v1))) #' head(select(tmp, flatten(tmp$v1), reverse(tmp$v1), array_remove(tmp$v1, 21))) @@ -223,6 +223,8 @@ NULL #' head(select(tmp3, element_at(tmp3$v3, "Valiant"))) #' tmp4 <- mutate(df, v4 = create_array(df$mpg, df$cyl), v5 = create_array(df$cyl, df$hp)) #' head(select(tmp4, concat(tmp4$v4, tmp4$v5), arrays_overlap(tmp4$v4, tmp4$v5))) +#' head(select(tmp4, array_except(tmp4$v4, tmp4$v5), array_intersect(tmp4$v4, tmp4$v5))) +#' head(select(tmp4, array_union(tmp4$v4, tmp4$v5))) #' head(select(tmp4, arrays_zip(tmp4$v4, tmp4$v5), map_from_arrays(tmp4$v4, tmp4$v5))) #' head(select(tmp, concat(df$mpg, df$cyl, df$hp))) #' tmp5 <- mutate(df, v6 = create_array(df$model, df$model)) @@ -3025,6 +3027,34 @@ setMethod("array_distinct", }) #' @details +#' \code{array_except}: Returns an array of the elements in the first array but not in the second +#' array, without duplicates. The order of elements in the result is not determined. 
+#' +#' @rdname column_collection_functions +#' @aliases array_except array_except,Column-method +#' @note array_except since 2.4.0 +setMethod("array_except", + signature(x = "Column", y = "Column"), + function(x, y) { +jc <- callJStatic("org.apache.spark.sql.functions", "array_except", x@jc, y@jc) +column(jc) + }) + +#' @details +#' \code{array_intersect}: Returns an array of the elements in the intersection of the given two +#' arrays, without duplicates. +#' +#' @rdname column_collection_functions +#' @aliases array_intersect array_intersect,Column-method +#' @note array_intersect since 2.4.0 +setMethod("array_intersect", + signature(x = "Column", y = "Column"), + function(x, y) { +jc <- callJStatic("org.apache.spark.sql.functions", "array_intersect", x@jc, y@jc) +column(jc)
zeppelin git commit: [ZEPPELIN-3753] Fix indent with TAB
Repository: zeppelin Updated Branches: refs/heads/master 26b554d64 -> 57601f819 [ZEPPELIN-3753] Fix indent with TAB ### What is this PR for? Now when you select multiline text and press TAB, text replaces with "\t" char. With this PR text just shift right if TAB have been pressed. ### What type of PR is it? Bug Fix ### What is the Jira issue? [ZEPPELIN-3753](https://issues.apache.org/jira/projects/ZEPPELIN/issues/ZEPPELIN-3753) ### Questions: * Does the licenses files need update? No * Is there breaking changes for older versions? No * Does this needs documentation? No Author: oxygen311 Closes #3168 from oxygen311/DW-18011 and squashes the following commits: 941b832 [oxygen311] Fix indent with TAB Project: http://git-wip-us.apache.org/repos/asf/zeppelin/repo Commit: http://git-wip-us.apache.org/repos/asf/zeppelin/commit/57601f81 Tree: http://git-wip-us.apache.org/repos/asf/zeppelin/tree/57601f81 Diff: http://git-wip-us.apache.org/repos/asf/zeppelin/diff/57601f81 Branch: refs/heads/master Commit: 57601f819977063d622e3acbcc2f2b8710087697 Parents: 26b554d Author: oxygen311 Authored: Wed Aug 29 17:33:51 2018 +0300 Committer: Felix Cheung Committed: Sat Sep 1 23:50:32 2018 -0700 -- zeppelin-web/src/app/notebook/paragraph/paragraph.controller.js | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/zeppelin/blob/57601f81/zeppelin-web/src/app/notebook/paragraph/paragraph.controller.js -- diff --git a/zeppelin-web/src/app/notebook/paragraph/paragraph.controller.js b/zeppelin-web/src/app/notebook/paragraph/paragraph.controller.js index 9a766de..1a1569a 100644 --- a/zeppelin-web/src/app/notebook/paragraph/paragraph.controller.js +++ b/zeppelin-web/src/app/notebook/paragraph/paragraph.controller.js @@ -930,7 +930,7 @@ function ParagraphCtrl($scope, $rootScope, $route, $window, $routeParams, $locat $scope.editor.execCommand('startAutocomplete'); } else { ace.config.loadModule('ace/ext/language_tools', function() { - $scope.editor.insertSnippet('\t'); + $scope.editor.indent(); }); } },
[GitHub] spark issue #22274: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes for R sql...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22274 maybe also your laptop's system time zone? could you also check that? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r214530177 --- Diff: python/pyspark/sql/session.py --- @@ -252,6 +252,16 @@ def newSession(self): """ return self.__class__(self._sc, self._jsparkSession.newSession()) +@since(2.4) +def getActiveSession(self): +""" +Returns the active SparkSession for the current thread, returned by the builder. +>>> s = spark.getActiveSession() +>>> spark._jsparkSession.getDefaultSession().get().equals(s.get()) --- End diff -- ..and probably shouldn't access `_jsparkSession` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.mem...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22298#discussion_r214530079 --- Diff: examples/src/main/python/worker_memory_check.py --- @@ -0,0 +1,47 @@ +# --- End diff -- shouldn't this be in python tests (and get it to run only on certain cluster managers)? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22227#discussion_r214529571 --- Diff: python/pyspark/sql/functions.py --- @@ -1669,20 +1669,36 @@ def repeat(col, n): return Column(sc._jvm.functions.repeat(_to_java_column(col), n)) -@since(1.5) +@since(2.4) @ignore_unicode_prefix -def split(str, pattern): -""" -Splits str around pattern (pattern is a regular expression). - -.. note:: pattern is a string represent the regular expression. - ->>> df = spark.createDataFrame([('ab12cd',)], ['s',]) ->>> df.select(split(df.s, '[0-9]+').alias('s')).collect() -[Row(s=[u'ab', u'cd'])] -""" -sc = SparkContext._active_spark_context -return Column(sc._jvm.functions.split(_to_java_column(str), pattern)) +def split(str, regex, limit=-1): --- End diff -- yes, `regex` is the part breaking.. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3168: [ZEPPELIN-3753] Fix indent with TAB
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3168 merging if no more comments ---
[GitHub] spark issue #21743: [SPARK-24767][Launcher] Propagate MDC to spark-submit th...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21743 also, I don't recall anywhere in spark that depends/sets MDC... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18877: [SPARK-17742][core] Handle child process exit in SparkLa...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/18877 yes @danelkotev `asfgit closed this in cba826d on Aug 15, 2017` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22227#discussion_r214244981 --- Diff: python/pyspark/sql/functions.py --- @@ -1669,20 +1669,36 @@ def repeat(col, n): return Column(sc._jvm.functions.repeat(_to_java_column(col), n)) -@since(1.5) +@since(2.4) @ignore_unicode_prefix -def split(str, pattern): -""" -Splits str around pattern (pattern is a regular expression). - -.. note:: pattern is a string represent the regular expression. - ->>> df = spark.createDataFrame([('ab12cd',)], ['s',]) ->>> df.select(split(df.s, '[0-9]+').alias('s')).collect() -[Row(s=[u'ab', u'cd'])] -""" -sc = SparkContext._active_spark_context -return Column(sc._jvm.functions.split(_to_java_column(str), pattern)) +def split(str, regex, limit=-1): --- End diff -- this would be a breaking API change I believe for python --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22227#discussion_r214244918 --- Diff: R/pkg/R/functions.R --- @@ -3410,13 +3410,14 @@ setMethod("collect_set", #' \dontrun{ #' head(select(df, split_string(df$Sex, "a"))) #' head(select(df, split_string(df$Class, "\\d"))) +#' head(select(df, split_string(df$Class, "\\d", 2))) #' # This is equivalent to the following SQL expression #' head(selectExpr(df, "split(Class, 'd')"))} #' @note split_string 2.3.0 setMethod("split_string", signature(x = "Column", pattern = "character"), - function(x, pattern) { -jc <- callJStatic("org.apache.spark.sql.functions", "split", x@jc, pattern) + function(x, pattern, limit = -1) { +jc <- callJStatic("org.apache.spark.sql.functions", "split", x@jc, pattern, limit) --- End diff -- you should have `as.integer(limit)` instead. Could we add a test in R? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
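A sketch folding in both suggestions - coercing `limit` before the JVM call, plus an R-side test (the expected values are taken from this PR's test diff quoted earlier):

```r
setMethod("split_string",
          signature(x = "Column", pattern = "character"),
          function(x, pattern, limit = -1) {
            # coerce so the JVM-side split(Column, String, Int) overload matches
            jc <- callJStatic("org.apache.spark.sql.functions", "split",
                              x@jc, pattern, as.integer(limit))
            column(jc)
          })

# and in test_sparkSQL.R:
expect_equal(
  collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1],
  list(list("a", "b@c.d 1\\b"))
)
```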
[GitHub] spark pull request #22274: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes fo...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22274#discussion_r214244580 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -3633,7 +3633,8 @@ test_that("catalog APIs, currentDatabase, setCurrentDatabase, listDatabases", { expect_equal(currentDatabase(), "default") expect_error(setCurrentDatabase("default"), NA) expect_error(setCurrentDatabase("zxwtyswklpf"), -"Error in setCurrentDatabase : analysis error - Database 'zxwtyswklpf' does not exist") + paste("Error in setCurrentDatabase : analysis error - Database", --- End diff -- I'd use paste0 instead to make clear about the implicit space that should be after `Database`, ie. `paste0("Error in setCurrentDatabase : analysis error - Database ", "'zxwtyswklpf' does not exist")` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22291: [SPARK-25007][R]Add array_intersect/array_except/...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22291#discussion_r214244359 --- Diff: R/pkg/R/generics.R --- @@ -799,10 +807,18 @@ setGeneric("array_sort", function(x) { standardGeneric("array_sort") }) #' @name NULL setGeneric("arrays_overlap", function(x, y) { standardGeneric("arrays_overlap") }) +#' @rdname column_collection_functions +#' @name NULL +setGeneric("array_union", function(x, y) { standardGeneric("array_union") }) + #' @rdname column_collection_functions #' @name NULL setGeneric("arrays_zip", function(x, ...) { standardGeneric("arrays_zip") }) +#' @rdname column_collection_functions +#' @name NULL +setGeneric("shuffle", function(x) { standardGeneric("shuffle") }) --- End diff -- this should go below - this part of the list should be sorted alphabetically --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22226: [SPARK-25252][SQL] Support arrays of any types by...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22226#discussion_r214243115 --- Diff: R/pkg/R/functions.R --- @@ -1697,8 +1697,8 @@ setMethod("to_date", }) #' @details -#' \code{to_json}: Converts a column containing a \code{structType}, array of \code{structType}, -#' a \code{mapType} or array of \code{mapType} into a Column of JSON string. +#' \code{to_json}: Converts a column containing a \code{structType}, a \code{mapType} +#' or an array into a Column of JSON string. --- End diff -- it should. Could we add some tests for this in R? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
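A minimal sketch of such a test; the expected JSON string is an assumption based on the new array support, not an output taken from the PR.

```r
df <- createDataFrame(data.frame(id = 1))
tmp <- mutate(df, v = create_array(lit("a"), lit("b")))
expect_equal(collect(select(tmp, to_json(tmp$v)))[[1]], "[\"a\",\"b\"]")
```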
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20146 seems like this was a thumbs-up from @WeichenXu123 @jkbradley? @dbtsai? ---
[GitHub] zeppelin issue #3158: [ZEPPELIN-3740] Adopt `google-java-format` and `fmt-ma...
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3158 ok ---
[GitHub] spark issue #22192: [SPARK-24918] Executor Plugin API
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22192 Jenkins, ok to test ---
[GitHub] zeppelin issue #3158: [ZEPPELIN-3740] Adopt `google-java-format` and `fmt-ma...
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3158 I see. Might be good to get some consensus first - we seem to be doing quite a few style changes in the last few months, which would make maintenance or backporting harder, for example. ---
[GitHub] zeppelin issue #3158: [ZEPPELIN-3740] Adopt `google-java-format` and `fmt-ma...
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3158 what's wrong with `maven-checkstyle-plugin`? ---
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20838 Or that Bryan opens a PR on your branch? That usually would be easier to get *this* PR through, just my 2c. ---
[GitHub] spark pull request #22161: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes fo...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22161#discussion_r211487544

--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -3613,11 +3613,11 @@ test_that("Collect on DataFrame when NAs exists at the top of a timestamp column
 test_that("catalog APIs, currentDatabase, setCurrentDatabase, listDatabases", {
   expect_equal(currentDatabase(), "default")
   expect_error(setCurrentDatabase("default"), NA)
-  expect_error(setCurrentDatabase("foo"),
-               "Error in setCurrentDatabase : analysis error - Database 'foo' does not exist")
+  expect_error(setCurrentDatabase("zxwtyswklpf"),
+               "Error in setCurrentDatabase : analysis error - Database 'zxwtyswklpf' does not exist")
   dbs <- collect(listDatabases())
   expect_equal(names(dbs), c("name", "description", "locationUri"))
-  expect_equal(dbs[[1]], "default")
+  expect_equal(which(dbs[, 1] == "default"), 1)
--- End diff --

I wonder if there is a better way to ensure the default database is named "default", perhaps? This checks that exactly one database is named 'default' - I guess that's ok...

---
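One alternative along the lines the comment hints at - asserting membership rather than position, so the check also passes when extra databases exist (a sketch, not the committed code):

```r
dbs <- collect(listDatabases())
# Assert that a database named "default" exists, without assuming it is
# the only match or that it appears in the first row
expect_true("default" %in% dbs$name)
```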
[GitHub] zeppelin issue #3153: [ZEPPELIN-3738] Fix enabling JMX in ZeppelinServer
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3153 LGTM ---
[GitHub] spark issue #21584: [SPARK-24433][K8S] Initial R Bindings for SparkR on K8s
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21584 LGTM ---
[GitHub] spark issue #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL support...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22107 merged to master ---
spark git commit: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL support in R
Repository: spark
Updated Branches:
  refs/heads/master c1ffb3c10 -> 162326c0e

[SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL support in R

## What changes were proposed in this pull request?

[SPARK-21274](https://issues.apache.org/jira/browse/SPARK-21274) added support for EXCEPT ALL and INTERSECT ALL. This PR adds the support in R.

## How was this patch tested?

Added test in test_sparkSQL.R

Author: Dilip Biswal

Closes #22107 from dilipbiswal/SPARK-25117.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/162326c0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/162326c0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/162326c0

Branch: refs/heads/master
Commit: 162326c0ee8419083ebd1669796abd234773e9b6
Parents: c1ffb3c
Author: Dilip Biswal
Authored: Fri Aug 17 00:04:04 2018 -0700
Committer: Felix Cheung
Committed: Fri Aug 17 00:04:04 2018 -0700

--
 R/pkg/NAMESPACE                       |  2 +
 R/pkg/R/DataFrame.R                   | 59 +-
 R/pkg/R/generics.R                    |  6 +++
 R/pkg/tests/fulltests/test_sparkSQL.R | 19 ++
 4 files changed, 85 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/162326c0/R/pkg/NAMESPACE
--
diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index adfd387..0fd0848 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -117,6 +117,7 @@ exportMethods("arrange",
               "dropna",
               "dtypes",
               "except",
+              "exceptAll",
               "explain",
               "fillna",
               "filter",
@@ -131,6 +132,7 @@ exportMethods("arrange",
               "hint",
               "insertInto",
               "intersect",
+              "intersectAll",
               "isLocal",
               "isStreaming",
               "join",

http://git-wip-us.apache.org/repos/asf/spark/blob/162326c0/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 471ada1..4f2d4c7 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -2848,6 +2848,35 @@ setMethod("intersect",
             dataFrame(intersected)
           })
 
+#' intersectAll
+#'
+#' Return a new SparkDataFrame containing rows in both this SparkDataFrame
+#' and another SparkDataFrame while preserving the duplicates.
+#' This is equivalent to \code{INTERSECT ALL} in SQL. Also as standard in
+#' SQL, this function resolves columns by position (not by name).
+#'
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the intersect all operation.
+#' @family SparkDataFrame functions
+#' @aliases intersectAll,SparkDataFrame,SparkDataFrame-method
+#' @rdname intersectAll
+#' @name intersectAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' intersectAllDF <- intersectAll(df1, df2)
+#' }
+#' @note intersectAll since 2.4.0
+setMethod("intersectAll",
+          signature(x = "SparkDataFrame", y = "SparkDataFrame"),
+          function(x, y) {
+            intersected <- callJMethod(x@sdf, "intersectAll", y@sdf)
+            dataFrame(intersected)
+          })
+
 #' except
 #'
 #' Return a new SparkDataFrame containing rows in this SparkDataFrame
 #'
@@ -2867,7 +2896,6 @@ setMethod("intersect",
 #' df2 <- read.json(path2)
 #' exceptDF <- except(df, df2)
 #' }
-#' @rdname except
 #' @note except since 1.4.0
 setMethod("except",
           signature(x = "SparkDataFrame", y = "SparkDataFrame"),
@@ -2876,6 +2904,35 @@ setMethod("except",
             dataFrame(excepted)
           })
 
+#' exceptAll
+#'
+#' Return a new SparkDataFrame containing rows in this SparkDataFrame
+#' but not in another SparkDataFrame while preserving the duplicates.
+#' This is equivalent to \code{EXCEPT ALL} in SQL. Also as standard in
+#' SQL, this function resolves columns by position (not by name).
+#'
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the except all operation.
+#' @family SparkDataFrame functions
+#' @aliases exceptAll,SparkDataFrame,SparkDataFrame-method
+#' @rdname exceptAll
+#' @name exceptAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' exceptAllDF <- exceptAll(df1, df2)
+#' }
+#' @note exceptAll since 2.4.0
+setMethod("exceptAll",
+          signature(x = "SparkDataFrame", y = "SparkDataFrame"),
+          function(x, y) {
+            excepted <- callJMethod(x@sdf, "exceptAll", y@sdf)
+            dataFrame(excepted)
+          })
+
 #' Save the contents of SparkDataFrame to a data source.
 #'
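For context, a small usage sketch of the two new verbs (toy data; row order in the collected result is not guaranteed):

```r
df1 <- createDataFrame(data.frame(x = c(1, 1, 2, 3)))
df2 <- createDataFrame(data.frame(x = c(1, 2)))

# EXCEPT ALL is a multiset difference: one of the two 1s survives, plus 3
collect(exceptAll(df1, df2))      # rows: 1, 3

# INTERSECT ALL keeps each value up to its minimum count on both sides
collect(intersectAll(df1, df2))   # rows: 1, 2
```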
zeppelin git commit: [ZEPPELIN-3701].Missing first several '0' and losing digital accuracy in result table
Repository: zeppelin
Updated Branches:
  refs/heads/master 09d44d504 -> 1267e33a0

[ZEPPELIN-3701].Missing first several '0' and losing digital accuracy in result table

### What is this PR for?
Improvements:
- Data like '00058806' will be displayed correctly instead of '58806'.
- Data like '5880658806' will be displayed correctly instead of '5.880659E9'.

### What type of PR is it?
[Refactoring]

### Todos
* [ ] - Task

### What is the Jira issue?
* https://issues.apache.org/jira/browse/ZEPPELIN-3701

### How should this be tested?
* CI pass

### Screenshots (if appropriate)

### Questions:
* Do the license files need updating? No
* Are there breaking changes for older versions? No
* Does this need documentation? No

Author: heguozi

Closes #3132 from Deegue/master and squashes the following commits:

f539a9a [heguozi] add '+' validation
09fc45d [heguozi] hardcoding fixed
a5f9a8a [heguozi] [ZEPPELIN-3701].Missing first several '0' and losing digital accuracy in result table

Project: http://git-wip-us.apache.org/repos/asf/zeppelin/repo
Commit: http://git-wip-us.apache.org/repos/asf/zeppelin/commit/1267e33a
Tree: http://git-wip-us.apache.org/repos/asf/zeppelin/tree/1267e33a
Diff: http://git-wip-us.apache.org/repos/asf/zeppelin/diff/1267e33a

Branch: refs/heads/master
Commit: 1267e33a0ce1bfc7b38bddaa066f89a5f98e8857
Parents: 09d44d5
Author: heguozi
Authored: Mon Aug 13 18:52:50 2018 +0800
Committer: Felix Cheung
Committed: Thu Aug 16 23:49:03 2018 -0700

--
 zeppelin-web/src/app/tabledata/tabledata.js | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/zeppelin/blob/1267e33a/zeppelin-web/src/app/tabledata/tabledata.js
--
diff --git a/zeppelin-web/src/app/tabledata/tabledata.js b/zeppelin-web/src/app/tabledata/tabledata.js
index 1f01bca..67c47be 100644
--- a/zeppelin-web/src/app/tabledata/tabledata.js
+++ b/zeppelin-web/src/app/tabledata/tabledata.js
@@ -36,6 +36,7 @@ export default class TableData extends Dataset {
     let textRows = paragraphResult.msg.split('\n');
     let comment = '';
     let commentRow = false;
+    const float64MaxDigits = 16;
 
     for (let i = 0; i < textRows.length; i++) {
       let textRow = textRows[i];
@@ -60,8 +61,10 @@
           columnNames.push({name: col, index: j, aggr: 'sum'});
         } else {
           let valueOfCol;
-          if (!isNaN(valueOfCol = parseFloat(col)) && isFinite(col)) {
-            col = valueOfCol;
+          if (!(col[0] === '0' || col[0] === '+' || col.length > float64MaxDigits)) {
+            if (!isNaN(valueOfCol = parseFloat(col)) && isFinite(col)) {
+              col = valueOfCol;
+            }
           }
           cols.push(col);
           cols2.push({key: (columnNames[i]) ? columnNames[i].name : undefined, value: col});
[GitHub] spark issue #21221: [SPARK-23429][CORE] Add executor memory metrics to heart...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21221 Jenkins, retest this please ---
[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21221#discussion_r210492311

--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -216,8 +217,7 @@ private[spark] class Executor(
 
   def stop(): Unit = {
     env.metricsSystem.report()
-    heartbeater.shutdown()
-    heartbeater.awaitTermination(10, TimeUnit.SECONDS)
+    heartbeater.stop()
--- End diff --

future: wrap in `try { ... } catch { case NonFatal(e) => ... }`?

---
[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21221#discussion_r210492513

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -69,6 +69,11 @@ package object config {
       .bytesConf(ByteUnit.KiB)
       .createWithDefaultString("100k")
 
+  private[spark] val EVENT_LOG_STAGE_EXECUTOR_METRICS =
+    ConfigBuilder("spark.eventLog.logStageExecutorMetrics.enabled")
+      .booleanConf
+      .createWithDefault(true)
--- End diff --

should this be "false" for now until we can test this out more, just to be on the safe side?

---
[GitHub] spark pull request #21835: [SPARK-24779]Add sequence / map_concat / map_from...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21835#discussion_r210489980

--- Diff: R/pkg/R/functions.R ---
@@ -3320,7 +3321,7 @@ setMethod("explode",
 #' @aliases sequence sequence,Column-method
 #' @note sequence since 2.4.0
 setMethod("sequence",
-          signature(x = "Column", y = "Column"),
+          signature(),
--- End diff --

sorry, I didn't see the reply. Yes, we should try to keep `sequence` callable, but we shouldn't have to dispatch to it manually; it is better to rely on R's internal type/call routing. It's a bit hard to explain, but check out `attach` (`setGeneric("attach")`) or `str` (`setGeneric("str")`) if you see what I mean. We should also avoid an empty `signature()`.

---
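A minimal sketch of the pattern being referenced, using `str` as the example base function (the method body below is purely illustrative): calling `setGeneric` on an existing function promotes it to an S4 generic whose default method is the base implementation, so dispatch routes Spark objects to the Spark method and everything else to base R, with no empty `signature()` and no manual routing.

```r
# Promote base::str to an S4 generic; the base implementation becomes
# the default method, so plain R objects keep their old behavior
setGeneric("str")

# Register a class-specific method; S4 dispatch picks it for
# SparkDataFrame arguments automatically
setMethod("str", signature(object = "SparkDataFrame"),
          function(object) {
            cat("SparkDataFrame with columns:",
                paste(columns(object), collapse = ", "), "\n")
          })
```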
[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22107#discussion_r210488842

--- Diff: R/pkg/R/DataFrame.R ---
@@ -2848,6 +2848,35 @@ setMethod("intersect",
             dataFrame(intersected)
           })
 
+#' intersectAll
+#'
+#' Return a new SparkDataFrame containing rows in both this SparkDataFrame
+#' and another SparkDataFrame while preserving the duplicates.
+#' This is equivalent to \code{INTERSECT ALL} in SQL. Also as standard in
+#' SQL, this function resolves columns by position (not by name).
+#'
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the intersect all operation.
+#' @family SparkDataFrame functions
+#' @aliases intersectAll,SparkDataFrame,SparkDataFrame-method
+#' @rdname intersectAll
+#' @name intersectAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' intersectAllDF <- intersectAll(df1, df2)
+#' }
+#' @rdname intersectAll
+#' @note intersectAll since 2.4.0
+setMethod("intersectAll",
+          signature(x = "SparkDataFrame", y = "SparkDataFrame"),
+          function(x, y) {
+            intersected <- callJMethod(x@sdf, "intersectAll", y@sdf)
+            dataFrame(intersected)
+          })
--- End diff --

add extra empty line after code

---
[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22107#discussion_r210488890

--- Diff: R/pkg/R/DataFrame.R ---
@@ -2876,6 +2905,37 @@ setMethod("except",
             dataFrame(excepted)
           })
 
+#' exceptAll
+#'
+#' Return a new SparkDataFrame containing rows in this SparkDataFrame
+#' but not in another SparkDataFrame while preserving the duplicates.
+#' This is equivalent to \code{EXCEPT ALL} in SQL. Also as standard in
+#' SQL, this function resolves columns by position (not by name).
+#'
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the except all operation.
+#' @family SparkDataFrame functions
+#' @aliases exceptAll,SparkDataFrame,SparkDataFrame-method
+#' @rdname exceptAll
+#' @name exceptAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' exceptAllDF <- exceptAll(df1, df2)
+#' }
+#' @rdname exceptAll
+#' @note exceptAll since 2.4.0
+setMethod("exceptAll",
+          signature(x = "SparkDataFrame", y = "SparkDataFrame"),
+          function(x, y) {
+            excepted <- callJMethod(x@sdf, "exceptAll", y@sdf)
+            dataFrame(excepted)
+          })
+
--- End diff --

nit: remove one of the two empty lines

---
[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22107#discussion_r210488754

--- Diff: R/pkg/R/DataFrame.R ---
@@ -2848,6 +2848,35 @@ setMethod("intersect",
             dataFrame(intersected)
           })
 
+#' intersectAll
+#'
+#' Return a new SparkDataFrame containing rows in both this SparkDataFrame
+#' and another SparkDataFrame while preserving the duplicates.
+#' This is equivalent to \code{INTERSECT ALL} in SQL. Also as standard in
+#' SQL, this function resolves columns by position (not by name).
+#'
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the intersect all operation.
+#' @family SparkDataFrame functions
+#' @aliases intersectAll,SparkDataFrame,SparkDataFrame-method
+#' @rdname intersectAll
+#' @name intersectAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' intersectAllDF <- intersectAll(df1, df2)
+#' }
+#' @rdname intersectAll
--- End diff --

ditto here

---
[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22107#discussion_r210488641

--- Diff: R/pkg/R/DataFrame.R ---
@@ -2876,6 +2905,37 @@ setMethod("except",
             dataFrame(excepted)
           })
 
+#' exceptAll
+#'
+#' Return a new SparkDataFrame containing rows in this SparkDataFrame
+#' but not in another SparkDataFrame while preserving the duplicates.
+#' This is equivalent to \code{EXCEPT ALL} in SQL. Also as standard in
+#' SQL, this function resolves columns by position (not by name).
+#'
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the except all operation.
+#' @family SparkDataFrame functions
+#' @aliases exceptAll,SparkDataFrame,SparkDataFrame-method
+#' @rdname exceptAll
+#' @name exceptAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' exceptAllDF <- exceptAll(df1, df2)
+#' }
+#' @rdname exceptAll
--- End diff --

this is a bug in `except`; there should only be one `@rdname` for each

---
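For illustration, a corrected roxygen header with a single `@rdname` tag - a sketch of the requested fix, not the merged patch:

```r
# Roxygen header for exceptAll with exactly one @rdname tag;
# duplicate tags are redundant and can confuse roxygen's Rd grouping
#' @family SparkDataFrame functions
#' @aliases exceptAll,SparkDataFrame,SparkDataFrame-method
#' @rdname exceptAll
#' @name exceptAll
#' @note exceptAll since 2.4.0
setMethod("exceptAll",
          signature(x = "SparkDataFrame", y = "SparkDataFrame"),
          function(x, y) {
            dataFrame(callJMethod(x@sdf, "exceptAll", y@sdf))
          })
```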
[GitHub] zeppelin issue #3139: [ZEPPELIN-3712] Add `maxConnLifetime` parameter to JDB...
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3139 LGTM ---
[GitHub] spark issue #22095: [SPARK-23984][K8S] Changed Python Version config to be c...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22095 @mccheah btw, please add a comment (say "merged to master") after you merge a PR - just a convention in this project. FYI. thx. ---
[GitHub] spark issue #22095: [SPARK-23984][K8S] Changed Python Version config to be c...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22095 @mccheah @foxish ---
[GitHub] spark issue #22071: [SPARK-25088][CORE][MESOS][DOCS] Update Rest Server docs...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22071 In this case maybe OK. Perhaps just release-note this, and only if there's another 2.2.x or 2.1.x release? ---
[GitHub] zeppelin issue #3087: [ZEPPELIN-3644]: Adding SPARQL query language support ...
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3087 This is just for syntax highlighting; there is no interpreter code here. Also, even for syntax highlighting, the ACE editor should be set with the language of choice - this PR has neither of those. ---
[GitHub] zeppelin issue #3132: [ZEPPELIN-3701].Missing first several '0' and losing d...
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3132 merging if no more comments ---
[GitHub] zeppelin issue #3136: ZEPPELIN-3699. Remove the logic of converting single r...
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3136 Paragraph or REST API. Though it looks like it will break all existing saved notebooks, since it changes the persisted JSON. Is there a way to make them compatible? ---
[GitHub] spark issue #22109: [SPARK-25120][CORE][HistoryServer]Fix the problem of Eve...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22109 @vanzin @squito ---
[GitHub] spark pull request #22084: [SPARK-25026][BUILD] Binary releases should conta...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22084#discussion_r209507960

--- Diff: dev/make-distribution.sh ---
@@ -188,6 +190,23 @@ if [ -f "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar
   cp "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar "$DISTDIR/yarn"
 fi
 
+# Only copy external jars if built
+if [ -f "$SPARK_HOME"/external/avro/target/spark-avro_${SCALA_VERSION}-${VERSION}.jar ]; then
+  cp "$SPARK_HOME"/external/avro/target/spark-avro_${SCALA_VERSION}-${VERSION}.jar "$DISTDIR/external/jars/"
+fi
+if [ -f "$SPARK_HOME"/external/kafka-0-10/target/spark-streaming-kafka-0-10_${SCALA_VERSION}-${VERSION}.jar ]; then
+  cp "$SPARK_HOME"/external/kafka-0-10/target/spark-streaming-kafka-0-10_${SCALA_VERSION}-${VERSION}.jar "$DISTDIR/external/jars/"
--- End diff --

agree - not kinesis or ganglia

---
[GitHub] spark pull request #22081: [SPARK-23654][BUILD] remove jets3t as a dependenc...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22081#discussion_r209443568

--- Diff: pom.xml ---
@@ -984,24 +987,15 @@
-        <groupId>net.java.dev.jets3t</groupId>
-        <artifactId>jets3t</artifactId>
-        <version>${jets3t.version}</version>
+        <groupId>javax.activation</groupId>
+        <artifactId>activation</artifactId>
+        <version>1.1.1</version>
--- End diff --

this changes from `<jets3t.version>0.9.4</jets3t.version>`?

---
[GitHub] zeppelin issue #3118: [zeppelin-3693] Option to toggle chart settings of par...
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3118 I'd agree, this seems like the intent of the report mode. Maybe you can add an option to report mode instead, to keep the frame for the chart? ---
[GitHub] spark issue #21027: [SPARK-23943][MESOS][DEPLOY] Improve observability of Me...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21027 ok to test ---
[GitHub] spark issue #22071: [SPARK-25088][CORE][MESOS][DOCS] Update Rest Server docs...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22071 @tnachen ---
[GitHub] spark issue #22072: [SPARK-25081][Core]Nested spill in ShuffleExternalSorter...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22072 jenkins retest this please ---
[GitHub] spark issue #22072: [SPARK-25081][Core]Nested spill in ShuffleExternalSorter...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22072

```
* checking CRAN incoming feasibility ...Error in .check_package_CRAN_incoming(pkgdir) :
  dims [product 26] do not match the length of object [0]
```

---
[GitHub] zeppelin issue #3107: [ZEPPELIN-3646] Add note for updating user permissions
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3107 I think there is significant risk that some users are just running all "sample" notebooks to check them out, not fully aware that some might be modifying system state. Agreed with the suggestions above. ---
[GitHub] spark pull request #21927: [SPARK-24820][SPARK-24821][Core] Fail fast when s...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21927#discussion_r208123913 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1946,4 +1990,11 @@ private[spark] object DAGScheduler { // Number of consecutive stage attempts allowed before a stage is aborted val DEFAULT_MAX_CONSECUTIVE_STAGE_ATTEMPTS = 4 + + // Error message when running a barrier stage that have unsupported RDD chain pattern. + val ERROR_MESSAGE_RUN_BARRIER_WITH_UNSUPPORTED_RDD_CHAIN_PATTERN = +"[SPARK-24820][SPARK-24821]: Barrier execution mode does not allow the following pattern of " + + "RDD chain within a barrier stage:\n1. Ancestor RDDs that have different number of " + + "partitions from the resulting RDD (eg. union()/coalesce()/first()/PartitionPruningRDD);\n" + --- End diff -- collect() is expensive though? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org