[GitHub] spark pull request #23252: [SPARK-26239] File-based secret key loading for S...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23252#discussion_r240022921 --- Diff: core/src/test/scala/org/apache/spark/SecurityManagerSuite.scala --- @@ -440,12 +473,27 @@ class SecurityManagerSuite extends SparkFunSuite with ResetSystemProperties { intercept[IllegalArgumentException] { mgr.getSecretKey() } + case FILE => +val secretFile = createTempSecretFile() +conf.set(AUTH_SECRET_FILE, secretFile.getAbsolutePath) +mgr.initializeAuth() +assert(encodeFileAsBase64(secretFile) === mgr.getSecretKey()) } } } ) } } + private def encodeFileAsBase64(secretFile: File) = { +Base64.getEncoder.encodeToString(Files.readAllBytes(secretFile.toPath)) + } + + private def createTempSecretFile(contents: String = "test-secret"): File = { +val secretDir = Utils.createTempDir("temp-secrets") +val secretFile = new File(secretDir, "temp-secret.txt") +Files.write(secretFile.toPath, contents.getBytes(StandardCharsets.UTF_8)) +secretFile --- End diff -- ah it's fine --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metrics shoul...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23224 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to fix cod...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23256 ideally, but really not for this PR --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23256#discussion_r239997109 --- Diff: R/pkg/tests/fulltests/test_mllib_fpm.R --- @@ -84,19 +84,20 @@ test_that("spark.fpGrowth", { }) test_that("spark.prefixSpan", { -df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))), - list(list(list(1L), list(3L, 2L), list(1L, 2L))), - list(list(list(1L, 2L), list(5L))), - list(list(list(6L, schema = c("sequence")) -result1 <- spark.findFrequentSequentialPatterns(df, minSupport = 0.5, maxPatternLength = 5L, -maxLocalProjDBSize = 3200L) - -expected_result <- createDataFrame(list(list(list(list(1L)), 3L), -list(list(list(3L)), 2L), -list(list(list(2L)), 3L), -list(list(list(1L, 2L)), 3L), -list(list(list(1L), list(3L)), 2L)), -schema = c("sequence", "freq")) - }) + df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))), + list(list(list(1L), list(3L, 2L), list(1L, 2L))), + list(list(list(1L, 2L), list(5L))), + list(list(list(6L, +schema = c("sequence")) + result <- spark.findFrequentSequentialPatterns(df, minSupport = 0.5, maxPatternLength = 5L, + maxLocalProjDBSize = 3200L) + + expected_result <- createDataFrame(list(list(list(list(1L)), 3L), list(list(list(3L)), 2L), + list(list(list(2L)), 3L), list(list(list(1L, 2L)), 3L), + list(list(list(1L), list(3L)), 2L)), + schema = c("sequence", "freq")) + + expect_equivalent(expected_result, result) --- End diff -- this is an important fix.. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23218 do we need to relnote jvm compatibility? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23252: [SPARK-26239] File-based secret key loading for S...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23252#discussion_r239705869 --- Diff: core/src/test/scala/org/apache/spark/SecurityManagerSuite.scala --- @@ -440,12 +473,27 @@ class SecurityManagerSuite extends SparkFunSuite with ResetSystemProperties { intercept[IllegalArgumentException] { mgr.getSecretKey() } + case FILE => +val secretFile = createTempSecretFile() +conf.set(AUTH_SECRET_FILE, secretFile.getAbsolutePath) +mgr.initializeAuth() +assert(encodeFileAsBase64(secretFile) === mgr.getSecretKey()) } } } ) } } + private def encodeFileAsBase64(secretFile: File) = { +Base64.getEncoder.encodeToString(Files.readAllBytes(secretFile.toPath)) + } + + private def createTempSecretFile(contents: String = "test-secret"): File = { +val secretDir = Utils.createTempDir("temp-secrets") +val secretFile = new File(secretDir, "temp-secret.txt") +Files.write(secretFile.toPath, contents.getBytes(StandardCharsets.UTF_8)) +secretFile --- End diff -- can this secret be recovered on disk or we trust tempDir ACL is sufficient? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22305: [SPARK-24561][SQL][Python] User-defined window aggregati...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22305 I can help if this looks good to @ueshin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23184#discussion_r238120855 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala --- @@ -225,4 +225,10 @@ private[sql] object SQLUtils extends Logging { } sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray } + + def createArrayType(elementType: DataType): ArrayType = DataTypes.createArrayType(elementType) --- End diff -- yea, it's really minor, but really no one except you is working on the R APIs :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23184#discussion_r238120812 --- Diff: R/pkg/R/functions.R --- @@ -2254,40 +2255,48 @@ setMethod("date_format", signature(y = "Column", x = "character"), column(jc) }) +setClassUnion("characterOrstructTypeOrColumn", c("character", "structType", "Column")) --- End diff -- yes --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r238087240 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/FPGrowthExample.scala --- @@ -64,4 +64,3 @@ object FPGrowthExample { spark.stop() } } -// scalastyle:on println --- End diff -- yes, println is not used --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23161: [SPARK-26189][R]Fix unionAll doc in SparkR
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23161 merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23161: [SPARK-26189][R]Fix unionAll doc in SparkR
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23161 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23184#discussion_r238055143 --- Diff: R/pkg/R/functions.R --- @@ -2254,40 +2255,48 @@ setMethod("date_format", signature(y = "Column", x = "character"), column(jc) }) +setClassUnion("characterOrstructTypeOrColumn", c("character", "structType", "Column")) --- End diff -- we should probably try to pull all the setClassUnion in one place. (to avoid conflict or duplication) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23184#discussion_r238055087 --- Diff: R/pkg/R/functions.R --- @@ -2254,40 +2255,48 @@ setMethod("date_format", signature(y = "Column", x = "character"), column(jc) }) +setClassUnion("characterOrstructTypeOrColumn", c("character", "structType", "Column")) + #' @details #' \code{from_json}: Parses a column containing a JSON string into a Column of \code{structType} #' with the specified \code{schema} or array of \code{structType} if \code{as.json.array} is set #' to \code{TRUE}. If the string is unparseable, the Column will contain the value NA. #' #' @rdname column_collection_functions #' @param as.json.array indicating if input string is JSON array of objects or a single object. -#' @aliases from_json from_json,Column,characterOrstructType-method +#' @aliases from_json from_json,Column,characterOrstructTypeOrColumn-method #' @examples #' #' \dontrun{ #' df2 <- sql("SELECT named_struct('date', cast('2000-01-01' as date)) as d") #' df2 <- mutate(df2, d2 = to_json(df2$d, dateFormat = 'dd/MM/')) #' schema <- structType(structField("date", "string")) #' head(select(df2, from_json(df2$d2, schema, dateFormat = 'dd/MM/'))) - #' df2 <- sql("SELECT named_struct('name', 'Bob') as people") #' df2 <- mutate(df2, people_json = to_json(df2$people)) #' schema <- structType(structField("name", "string")) #' head(select(df2, from_json(df2$people_json, schema))) -#' head(select(df2, from_json(df2$people_json, "name STRING")))} +#' head(select(df2, from_json(df2$people_json, "name STRING"))) +#' head(select(df2, from_json(df2$people_json, schema_of_json(head(df2)$people_json} #' @note from_json since 2.2.0 -setMethod("from_json", signature(x = "Column", schema = "characterOrstructType"), +setMethod("from_json", signature(x = "Column", schema = "characterOrstructTypeOrColumn"), function(x, schema, as.json.array = FALSE, ...) { if (is.character(schema)) { - schema <- structType(schema) + jschema <- structType(schema)$jobj +} else if (class(schema) == "structType") { + jschema <- schema$jobj +} else { + jschema <- schema@jc } if (as.json.array) { - jschema <- callJStatic("org.apache.spark.sql.types.DataTypes", - "createArrayType", - schema$jobj) -} else { - jschema <- schema$jobj + # This case is R-specifically different. Unlike Scala and Python side, --- End diff -- what if `as.json.array` is TRUE but schema is also set? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23184#discussion_r238055126 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala --- @@ -225,4 +225,10 @@ private[sql] object SQLUtils extends Logging { } sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray } + + def createArrayType(elementType: DataType): ArrayType = DataTypes.createArrayType(elementType) --- End diff -- as mentioned before, I kinda have to disagree with you here... I'd prefer less stuff in r/SQLUtils, and instead, call the scala/jvm method directly from R... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23184#discussion_r238055173 --- Diff: R/pkg/R/functions.R --- @@ -202,8 +202,9 @@ NULL #' \itemize{ #' \item \code{from_json}: a structType object to use as the schema to use #' when parsing the JSON string. Since Spark 2.3, the DDL-formatted string is -#' also supported for the schema. -#' \item \code{from_csv}: a DDL-formatted string +#' also supported for the schema. Since Spark 3.0, \code{schema_of_json} or +#' a string literal can also be accepted. --- End diff -- that's true...? they are the same? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22939 Error looks reasonable... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23161: [SPARK-26189][R]Fix unionAll doc in SparkR
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23161#discussion_r237383462 --- Diff: R/pkg/R/DataFrame.R --- @@ -2732,13 +2732,24 @@ setMethod("union", dataFrame(unioned) }) -#' Return a new SparkDataFrame containing the union of rows -#' -#' This is an alias for `union`. +#' Return a new SparkDataFrame containing the union of rows. +#' This is an alias for \code{union}. --- End diff -- actually, we do need a newline - L2735 is the title as this is a new page. could you build the roxygen2 doc to double check? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23161: [SPARK-26189][R]Fix unionAll doc in SparkR
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23161#discussion_r236972877 --- Diff: R/pkg/R/DataFrame.R --- @@ -2732,14 +2732,24 @@ setMethod("union", dataFrame(unioned) }) -#' Return a new SparkDataFrame containing the union of rows -#' -#' This is an alias for `union`. +#' Return a new SparkDataFrame containing the union of rows. +#' This is an alias for \code{union}. #' -#' @rdname union -#' @name unionAll +#' @param x a SparkDataFrame. +#' @param y a SparkDataFrame. +#' @return A SparkDataFrame containing the result of the unionAll operation. +#' @family SparkDataFrame functions #' @aliases unionAll,SparkDataFrame,SparkDataFrame-method -#' @note unionAll since 1.4.0 +#' @rdname unionAll +#' @name unionAll --- End diff -- I'd add a `@seealso` to union --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23161: [SPARK-26189][R]Fix unionAll doc in SparkR
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23161#discussion_r236973169 --- Diff: R/pkg/R/DataFrame.R --- @@ -2732,14 +2732,24 @@ setMethod("union", dataFrame(unioned) }) -#' Return a new SparkDataFrame containing the union of rows -#' -#' This is an alias for `union`. +#' Return a new SparkDataFrame containing the union of rows. +#' This is an alias for \code{union}. #' -#' @rdname union -#' @name unionAll +#' @param x a SparkDataFrame. +#' @param y a SparkDataFrame. +#' @return A SparkDataFrame containing the result of the unionAll operation. +#' @family SparkDataFrame functions #' @aliases unionAll,SparkDataFrame,SparkDataFrame-method -#' @note unionAll since 1.4.0 +#' @rdname unionAll +#' @name unionAll +#' @examples +#'\dontrun{ +#' sparkR.session() +#' df1 <- read.json(path) +#' df2 <- read.json(path2) +#' unionAllDF <- unionAll(df1, df2) +#' } +#' @note unionAll since 3.0.0 --- End diff -- this should be `1.4.0` as above --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23025#discussion_r236970732 --- Diff: R/pkg/R/DataFrame.R --- @@ -767,6 +767,14 @@ setMethod("repartition", #' using \code{spark.sql.shuffle.partitions} as number of partitions.} #'} #' +#' At least one partition-by expression must be specified. --- End diff -- L761 is significant also, but correct. Essentially: 1. the first line of the block is the title (L760); 2. the second text block, after an "empty line", is the description (L762); 3. the third, after another "empty line", is the "details" note, which is stashed all the way at the bottom of the doc page. So generally you want the important part of the description on top and not in the "details" section, because it is easily missed. This is the most common pattern in this code base. There's another, where multiple functions are documented together as a group, e.g. the collection SQL functions (in functions.R). Other finer control is possible as well but not used today in this code base. Similarly, L829 is good; L831 is a bit fuzzy - I'd personally prefer without L831, to keep the whole text in the description section of the doc. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
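To make the three-section layout described above concrete, here is a minimal, hypothetical roxygen2 block; the function name and wording are made up for illustration and are not taken from DataFrame.R:

```r
# Minimal sketch of a roxygen2 block: the first line is the title, the
# paragraph after one empty #' line is the description, and the paragraph
# after a second empty #' line is rendered in the "Details" section near
# the bottom of the generated help page (so it is easily missed).

#' Do something with a data frame (this first line becomes the title)
#'
#' This paragraph becomes the description; keep the important part of the
#' documentation here so it shows up at the top of the generated page.
#'
#' This paragraph becomes the "Details" section, stashed at the bottom of
#' the doc page.
#' @param x a data frame.
#' @return \code{x}, unchanged.
doSomething <- function(x) {
  x
}
```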
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r236771417 --- Diff: docs/ml-clustering.md --- @@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details. + +## Power Iteration Clustering (PIC) + +Power Iteration Clustering (PIC) is a scalable graph clustering algorithm --- End diff -- could you open a separate PR with just this file (minus R) and FPGrowthExample.scala on branch-2.4? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r236770223 --- Diff: R/pkg/R/functions.R --- @@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType") column(jc) }) +#' @details +#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format. +#' +#' @rdname column_collection_functions +#' @aliases schema_of_json schema_of_json,characterOrColumn-method +#' @examples +#' +#' \dontrun{ +#' json <- '{"name":"Bob"}' +#' df <- sql("SELECT * FROM range(1)") +#' head(select(df, schema_of_json(json)))} +#' @note schema_of_json since 3.0.0 +setMethod("schema_of_json", signature(x = "characterOrColumn"), + function(x, ...) { +if (class(x) == "character") { + col <- callJStatic("org.apache.spark.sql.functions", "lit", x) +} else { + col <- x@jc --- End diff -- maybe to think about the design of API in R and Scala and else where - what does it look like when the user passes in a column that is not a literal string? probably worthwhile to follow up separately. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23145: [MINOR][Docs][WIP] Fix Typos
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23145#discussion_r236765511 --- Diff: docs/index.md --- @@ -67,7 +67,7 @@ Example applications are also provided in Python. For example, ./bin/spark-submit examples/src/main/python/pi.py 10 Spark also provides an experimental [R API](sparkr.html) since 1.4 (only DataFrames APIs included). --- End diff -- let's remove `experimental`, WDYT? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support fo...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23098#discussion_r236764795 --- Diff: R/pkg/R/sparkR.R --- @@ -269,7 +269,7 @@ sparkR.sparkContext <- function( #' sparkR.session("yarn-client", "SparkR", "/home/spark", #'list(spark.executor.memory="4g"), #'c("one.jar", "two.jar", "three.jar"), -#'c("com.databricks:spark-avro_2.11:2.0.1")) +#'c("com.databricks:spark-avro_2.12:2.0.1")) --- End diff -- yes, dummy name is completely fine with me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23131: [SPARK-25908][SQL][FOLLOW-UP] Add back unionAll
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23131#discussion_r236763355 --- Diff: R/pkg/R/DataFrame.R --- @@ -2732,6 +2732,20 @@ setMethod("union", dataFrame(unioned) }) +#' Return a new SparkDataFrame containing the union of rows +#' +#' This is an alias for `union`. --- End diff -- also backtick doesn't format with roxygen2. this should be ``` This is an alias for \code{union}. ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23025#discussion_r236762465 --- Diff: R/pkg/R/DataFrame.R --- @@ -767,6 +767,14 @@ setMethod("repartition", #' using \code{spark.sql.shuffle.partitions} as number of partitions.} #'} #' +#' At least one partition-by expression must be specified. --- End diff -- this won't be formatted correctly in the R doc because the "empty line" is significant. L769 should be removed to ensure the text ends up in the description --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23131: [SPARK-25908][SQL][FOLLOW-UP] Add back unionAll
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23131#discussion_r236760822 --- Diff: R/pkg/R/DataFrame.R --- @@ -2732,6 +2732,20 @@ setMethod("union", dataFrame(unioned) }) +#' Return a new SparkDataFrame containing the union of rows +#' +#' This is an alias for `union`. --- End diff -- If the goal is for this to be like the other *All functions, this should go into a separate doc page, plus seealso, example, etc. The way this was written, as it was a deprecated function, this doc page merges with union's - as it is, none of the text above will show up, and unionAll will not be listed in the method index. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23145: [MINOR][Docs] "a R interpreter" -> "an R interpre...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23145#discussion_r236546043 --- Diff: docs/index.md --- @@ -67,7 +67,7 @@ Example applications are also provided in Python. For example, ./bin/spark-submit examples/src/main/python/pi.py 10 Spark also provides an experimental [R API](sparkr.html) since 1.4 (only DataFrames APIs included). --- End diff -- tbh, I'm not sure this should be called "an experimental [R API]" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23089: [SPARK-26120][TESTS][SS][SPARKR]Fix a streaming query le...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23089 Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22939 Sorry for the delay, will do another pass in 1 or 2 days --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r234432181 --- Diff: R/pkg/R/mllib_clustering.R --- @@ -610,3 +616,57 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"), function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + +#' PowerIterationClustering +#' +#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to +#' return a cluster assignment for each input vertex. +#' +# Run the PIC algorithm and returns a cluster assignment for each input vertex. +#' @param data A SparkDataFrame. +#' @param k The number of clusters to create. +#' @param initMode Param for the initialization algorithm. +#' @param maxIter Param for maximum number of iterations. +#' @param srcCol Param for the name of the input column for source vertex IDs. +#' @param dstCol Name of the input column for destination vertex IDs. +#' @param weightCol Param for weight column name. If this is not set or \code{NULL}, +#' we treat all instance weights as 1.0. +#' @param ... additional argument(s) passed to the method. +#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id. +#' The schema of it will be: +#' \code{id: Long} +#' \code{cluster: Int} +#' @rdname spark.powerIterationClustering +#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method +#' @examples +#' \dontrun{ +#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0), +#' list(1L, 2L, 1.0), list(3L, 4L, 1.0), +#' list(4L, 0L, 0.1)), schema = c("src", "dst", "weight")) +#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight") +#' showDF(clusters) +#' } +#' @note spark.assignClusters(SparkDataFrame) since 3.0.0 +setMethod("spark.assignClusters", + signature(data = "SparkDataFrame"), + function(data, k = 2L, initMode = "random", maxIter = 20L, srcCol = "src", +dstCol = "dst", weightCol = NULL) { --- End diff -- I think we try to avoid srcCol dstCol in R (I think there are other R ml APIs like that) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r234432019 --- Diff: R/pkg/R/mllib_clustering.R --- @@ -610,3 +616,57 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"), function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + +#' PowerIterationClustering +#' +#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to +#' return a cluster assignment for each input vertex. +#' +# Run the PIC algorithm and returns a cluster assignment for each input vertex. +#' @param data A SparkDataFrame. +#' @param k The number of clusters to create. +#' @param initMode Param for the initialization algorithm. +#' @param maxIter Param for maximum number of iterations. +#' @param srcCol Param for the name of the input column for source vertex IDs. +#' @param dstCol Name of the input column for destination vertex IDs. +#' @param weightCol Param for weight column name. If this is not set or \code{NULL}, +#' we treat all instance weights as 1.0. +#' @param ... additional argument(s) passed to the method. +#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id. +#' The schema of it will be: +#' \code{id: Long} +#' \code{cluster: Int} +#' @rdname spark.powerIterationClustering +#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method +#' @examples +#' \dontrun{ +#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0), +#' list(1L, 2L, 1.0), list(3L, 4L, 1.0), +#' list(4L, 0L, 0.1)), schema = c("src", "dst", "weight")) +#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight") +#' showDF(clusters) +#' } +#' @note spark.assignClusters(SparkDataFrame) since 3.0.0 +setMethod("spark.assignClusters", + signature(data = "SparkDataFrame"), + function(data, k = 2L, initMode = "random", maxIter = 20L, srcCol = "src", --- End diff -- set valid values for initMode and check for it - eg. https://github.com/apache/spark/pull/23072/files#diff-d9f92e07db6424e2527a7f9d7caa9013R355 and `match.arg(initMode)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
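For reference, a minimal sketch of the `match.arg` pattern suggested above; the function name is hypothetical, and the allowed values "random" and "degree" are simply the two that appear in the PIC examples in this thread:

```r
# Restrict initMode to a fixed set of values; match.arg() errors clearly on
# anything else and picks the first value when the argument is not supplied.
assignClustersSketch <- function(initMode = c("random", "degree")) {
  initMode <- match.arg(initMode)
  initMode
}

assignClustersSketch()           # "random" (first listed value is the default)
assignClustersSketch("degree")   # "degree"
# assignClustersSketch("other")  # error: 'arg' should be one of "random", "degree"
```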
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r234432049 --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd --- @@ -968,6 +970,17 @@ predicted <- predict(model, df) head(predicted) ``` + Power Iteration Clustering + +Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. `spark.assignClusters` method runs the PIC algorithm and returns a cluster assignment for each input vertex. + +```{r} +df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0), + list(1L, 2L, 1.0), list(3L, 4L, 1.0), + list(4L, 0L, 0.1)), schema = c("src", "dst", "weight")) +head(spark.assignClusters(df, initMode="degree", weightCol="weight")) --- End diff -- spacing: `initMode = "degree", weightCol = "weight"` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23073: [SPARK-26104] expose pci info to task scheduler
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23073#discussion_r234431864 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/ExecutorData.scala --- @@ -27,12 +27,14 @@ import org.apache.spark.rpc.{RpcAddress, RpcEndpointRef} * @param executorHost The hostname that this executor is running on * @param freeCores The current number of cores available for work on the executor * @param totalCores The total number of cores available to the executor + * @param pcis The external devices avaliable to the executor --- End diff -- available --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23073: [SPARK-26104] expose pci info to task scheduler
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23073 please put the ^ comment into the PR description (because comments are not included in the commit message once the PR is merged) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23012 Yea, there are some problems with some packages we depend on that are not installable from CRAN (e.g. too old), so it will be hard to move to a new version of R and a new installation. So to clarify, with the deprecation as is, we still test on R 3.1? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23012 Hey Shane, I don’t think we are saying to test multiple R versions at all. In fact quite the opposite, just the new(er) version at some point in the future. (We don’t have a better solution for packages though. There’s another PR for the R arrow package, for example) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23012 I think it's easier to say unsupported if we are not testing it in Jenkins or AppVeyor. I don't know if we have any coverage at release for older R versions anyway, so it's better to say unsupported than deprecate. But agreed, maybe the way to do this is to deprecate without updating R in Jenkins. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23017: [WIP][SPARK-26015][K8S] Set a default UID for Spark on K...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23017 noted test issue. let's kick off test though --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23017: [WIP][SPARK-26015][K8S] Set a default UID for Spark on K...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23017 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23007: [SPARK-26010][R] fix vignette eval with Java 11
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23007 merged to master/2.4 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22866: WIP [SPARK-12172][SPARKR] Remove internal-only RDD metho...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22866 thx, but DO NOT MERGE - there's some nasty bug I'm still investigating.. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23012#discussion_r232881732 --- Diff: R/pkg/R/sparkR.R --- @@ -283,6 +283,10 @@ sparkR.session <- function( enableHiveSupport = TRUE, ...) { + if (utils::compareVersion(paste0(R.version$major, ".", R.version$minor), "3.4.0") == -1) { +warning("R prior to version 3.4 is deprecated as of Spark 3.0.") + } --- End diff -- ditto `Support for R prior to version 3.4 is deprecated since Spark 3.0.0` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23012#discussion_r232882419 --- Diff: docs/index.md --- @@ -31,7 +31,8 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy locally on one machine --- all you need is to have `java` installed on your system `PATH`, or the `JAVA_HOME` environment variable pointing to a Java installation. -Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{site.SPARK_VERSION}} +Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. R prior to version 3.4 is deprecated as of Spark 3.0. --- End diff -- `R prior to version 3.4 support is deprecated as of Spark 3.0.0.` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23012#discussion_r232882178 --- Diff: docs/index.md --- @@ -31,7 +31,8 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy locally on one machine --- all you need is to have `java` installed on your system `PATH`, or the `JAVA_HOME` environment variable pointing to a Java installation. -Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{site.SPARK_VERSION}} +Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. R prior to version 3.4 is deprecated as of Spark 3.0. --- End diff -- with all the other changes, we haven't listed all deprecation here, or have we? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23012#discussion_r232881594 --- Diff: R/WINDOWS.md --- @@ -3,7 +3,7 @@ To build SparkR on Windows, the following steps are required 1. Install R (>= 3.1) and [Rtools](http://cran.r-project.org/bin/windows/Rtools/). Make sure to -include Rtools and R in `PATH`. +include Rtools and R in `PATH`. Note that R prior to version 3.4 is deprecated as of Spark 3.0. --- End diff -- I really would prefer "unsupported" but if we go with this it should say `Note that support for R prior to version 3.4 is deprecated as of Spark 3.0.0.` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23012 FYI, this is unused code; I’m going to remove it: https://github.com/apache/spark/blob/master/R/pkg/src-native/string_hash_code.c --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23012 Also I think the warning should be in .First in general.R --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/23012 I think this should say unsupported (i.e. it could still work) instead of deprecated. Also, the compareVersion should check both major and minor, i.e. 3.4.0. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
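A small sketch of the check being suggested, assuming it stays based on `utils::compareVersion`; the warning text follows the wording proposed elsewhere in this thread:

```r
# Build the running R version as "major.minor" (R.version$minor already
# includes the patch, e.g. "4.1" -> "3.4.1") and compare it against the
# full "3.4.0" threshold; compareVersion() returns -1 when the first
# version is older than the second.
rVersion <- paste0(R.version$major, ".", R.version$minor)
if (utils::compareVersion(rVersion, "3.4.0") == -1) {
  warning("Support for R prior to version 3.4 is deprecated since Spark 3.0.0.")
}
```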
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r232500194 --- Diff: R/pkg/R/functions.R --- @@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType") column(jc) }) +#' @details +#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format. +#' +#' @rdname column_collection_functions +#' @aliases schema_of_json schema_of_json,characterOrColumn-method +#' @examples +#' +#' \dontrun{ +#' json <- '{"name":"Bob"}' +#' df <- sql("SELECT * FROM range(1)") +#' head(select(df, schema_of_json(json)))} +#' @note schema_of_json since 3.0.0 +setMethod("schema_of_json", signature(x = "characterOrColumn"), + function(x, ...) { +if (class(x) == "character") { + col <- callJStatic("org.apache.spark.sql.functions", "lit", x) +} else { + col <- x@jc --- End diff -- just that I thought the shortcut syntax in Scala is nicer looking than `lit("string")` in R --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232500065 --- Diff: R/pkg/R/SQLContext.R --- @@ -172,36 +257,72 @@ getDefaultSqlSource <- function() { createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, numPartitions = NULL) { sparkSession <- getSparkSession() - + arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true" + shouldUseArrow <- FALSE + firstRow <- NULL if (is.data.frame(data)) { - # Convert data into a list of rows. Each row is a list. - - # get the names of columns, they will be put into RDD - if (is.null(schema)) { -schema <- names(data) - } +# get the names of columns, they will be put into RDD +if (is.null(schema)) { + schema <- names(data) +} - # get rid of factor type - cleanCols <- function(x) { -if (is.factor(x)) { - as.character(x) -} else { - x -} +# get rid of factor type +cleanCols <- function(x) { + if (is.factor(x)) { +as.character(x) + } else { +x } +} +data[] <- lapply(data, cleanCols) + +args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) +if (arrowEnabled) { + shouldUseArrow <- tryCatch({ +stopifnot(length(data) > 0) +dataHead <- head(data, 1) +checkTypeRequirementForArrow(data, schema) +fileName <- writeToTempFileInArrow(data, numPartitions) +tryCatch( + jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", + "readArrowStreamFromFile", + sparkSession, + fileName), +finally = { + file.remove(fileName) +}) + +firstRow <- do.call(mapply, append(args, dataHead))[[1]] +TRUE + }, + error = function(e) { +warning(paste0("createDataFrame attempted Arrow optimization because ", + "'spark.sql.execution.arrow.enabled' is set to true; however, ", + "failed, attempting non-optimization. Reason: ", + e)) +return(FALSE) --- End diff -- nit: just `FALSE` is good --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232499902 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,91 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + requireNamespace1 <- requireNamespace + + # For some reasons, Arrow R API requires to load 'defer_parent' which is from 'withr' package. + # This is a workaround to avoid this error. Otherwise, we should directly load 'withr' + # package, which CRAN complains about. + defer_parent <- function(x, ...) { + if (requireNamespace1("withr", quietly = TRUE)) { + defer_parent <- get("defer_parent", envir = asNamespace("withr"), inherits = FALSE) + defer_parent(x, ...) +} else { + stop("'withr' package should be installed.") +} + } + + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +numPartitions <- if (!is.null(numPartitions)) { + numToInt(numPartitions) +} else { + 1 --- End diff -- future: consolidate the default here and inside makeSplits --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232499848 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -307,6 +307,64 @@ test_that("create DataFrame from RDD", { unsetHiveContext() }) +test_that("createDataFrame Arrow optimization", { + skip_if_not_installed("arrow") + skip_if_not_installed("withr") --- End diff -- are we going to ask shane to install arrow/withr on the Jenkins machines? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232499794 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -307,6 +307,64 @@ test_that("create DataFrame from RDD", { unsetHiveContext() }) +test_that("createDataFrame Arrow optimization", { + skip_if_not_installed("arrow") + skip_if_not_installed("withr") + + conf <- callJMethod(sparkSession, "conf") + arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] + + callJMethod(conf, "set", "spark.sql.execution.arrow.enabled", "false") + tryCatch({ --- End diff -- does this fail? or just a way to inject a finally? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
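For context on the question above: in R, `tryCatch` can be used with only a `finally` argument as a way to guarantee cleanup without handling the error. A tiny illustration (not the actual test code):

```r
# The finally block runs whether the body succeeds or fails; since no error
# handler is given, a failure inside the body still propagates to the caller.
tryCatch({
  message("run the test body")
}, finally = {
  message("restore the original conf value")
})
```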
[GitHub] spark pull request #22993: [SPARK-24421][BUILD][CORE] Accessing sun.misc.Cle...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22993#discussion_r232499645 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java --- @@ -67,6 +67,59 @@ unaligned = _unaligned; } + // Access fields and constructors once and store them, for performance: + + private static final Constructor DBB_CONSTRUCTOR; + private static final Field DBB_CLEANER_FIELD; + static { +try { + Class cls = Class.forName("java.nio.DirectByteBuffer"); + Constructor constructor = cls.getDeclaredConstructor(Long.TYPE, Integer.TYPE); + constructor.setAccessible(true); + Field cleanerField = cls.getDeclaredField("cleaner"); + cleanerField.setAccessible(true); + DBB_CONSTRUCTOR = constructor; + DBB_CLEANER_FIELD = cleanerField; +} catch (ClassNotFoundException | NoSuchMethodException | NoSuchFieldException e) { + throw new IllegalStateException(e); +} + } + + private static final Method CLEANER_CREATE_METHOD; + static { +// The implementation of Cleaner changed from JDK 8 to 9 +int majorVersion = Integer.parseInt(System.getProperty("java.version").split("\\.")[0]); --- End diff -- looks like it could be `java.version=1.8.0_192` or `java.version=11.0.1` ie. first integer or second integer (1.8 => 8?) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23007: [SPARK-26010] fix vignette eval with Java 11
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/23007 [SPARK-26010] fix vignette eval with Java 11 ## What changes were proposed in this pull request? changes in vignette only to disable eval ## How was this patch tested? Jenkins You can merge this pull request into a Git repository by running: $ git pull https://github.com/felixcheung/spark rjavavervig Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23007.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23007 commit dc4ba6cd813d91622032508e6696b6f56fb19bd9 Author: Felix Cheung Date: 2018-11-11T18:30:11Z fix --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22993: [SPARK-24421][BUILD][CORE] Accessing sun.misc.Cle...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22993#discussion_r232477875 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java --- @@ -67,6 +67,59 @@ unaligned = _unaligned; } + // Access fields and constructors once and store them, for performance: + + private static final Constructor DBB_CONSTRUCTOR; + private static final Field DBB_CLEANER_FIELD; + static { +try { + Class cls = Class.forName("java.nio.DirectByteBuffer"); + Constructor constructor = cls.getDeclaredConstructor(Long.TYPE, Integer.TYPE); + constructor.setAccessible(true); + Field cleanerField = cls.getDeclaredField("cleaner"); + cleanerField.setAccessible(true); + DBB_CONSTRUCTOR = constructor; + DBB_CLEANER_FIELD = cleanerField; +} catch (ClassNotFoundException | NoSuchMethodException | NoSuchFieldException e) { + throw new IllegalStateException(e); +} + } + + private static final Method CLEANER_CREATE_METHOD; + static { +// The implementation of Cleaner changed from JDK 8 to 9 +int majorVersion = Integer.parseInt(System.getProperty("java.version").split("\\.")[0]); --- End diff -- is there a defined fixed format for this? we are doing some java version check and found very different format from different JDK sources (Oracle vs OpenJDK vs IBM ...) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22993: [SPARK-24421][BUILD][CORE] Accessing sun.misc.Cleaner in...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22993 what settings do we need to allow `illegal reflective access`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22989: [SPARK-25986][Build] Banning throw new OutOfMemoryErrors
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22989 and catching Error or Throwable.. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22989: [SPARK-25986][Build] Banning throw new OutOfMemor...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22989#discussion_r232477783 --- Diff: scalastyle-config.xml --- @@ -240,6 +240,18 @@ This file is divided into 3 sections: ]]> + +throw new OutOfMemoryError +
[GitHub] spark issue #22977: [BUILD] Bump previousSparkVersion in MimaBuild.scala to ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22977 right, I mean both this and that should be part of the process "post-release" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232477365 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { --- End diff -- that will be good. circumventing CRAN check for method name is... problematic.. (there are other more hacky way too, but ..) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232477325 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, x } } + data[] <- lapply(data, cleanCols) - # drop factors and wrap lists - data <- setNames(lapply(data, cleanCols), NULL) + args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) + if (arrowEnabled) { +shouldUseArrow <- tryCatch({ + stopifnot(length(data) > 0) + dataHead <- head(data, 1) + # Currenty Arrow optimization does not support POSIXct and raw for now. + # Also, it does not support explicit float type set by users. It leads to + # incorrect conversion. We will fall back to the path without Arrow optimization. + if (any(sapply(dataHead, function(x) is(x, "POSIXct" { +stop("Arrow optimization with R DataFrame does not support POSIXct type yet.") + } + if (any(sapply(dataHead, is.raw))) { +stop("Arrow optimization with R DataFrame does not support raw type yet.") + } + if (inherits(schema, "structType")) { +if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) { + stop("Arrow optimization with R DataFrame does not support FloatType type yet.") --- End diff -- maybe out of bit range? 53 bit https://stat.ethz.ch/R-manual/R-patched/library/base/html/double.html --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
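The "53 bit" reference above is to the significand of R's double type (the linked R manual page); R has no 32-bit float, so explicit FloatType schemas cannot round-trip exactly. A quick illustration of that limit, unrelated to the Spark code itself:

```r
# R numerics are 64-bit doubles with a 53-bit significand, so exact integer
# representation stops at 2^53.
.Machine$double.digits   # 53
(2^53 + 1) == 2^53       # TRUE  - 2^53 + 1 is not representable, rounds back
(2^52 + 1) == 2^52       # FALSE - still exact below the 53-bit limit
```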
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232477271 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, x } } + data[] <- lapply(data, cleanCols) - # drop factors and wrap lists - data <- setNames(lapply(data, cleanCols), NULL) + args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) + if (arrowEnabled) { +shouldUseArrow <- tryCatch({ + stopifnot(length(data) > 0) + dataHead <- head(data, 1) + # Currenty Arrow optimization does not support POSIXct and raw for now. + # Also, it does not support explicit float type set by users. It leads to + # incorrect conversion. We will fall back to the path without Arrow optimization. + if (any(sapply(dataHead, function(x) is(x, "POSIXct" { +stop("Arrow optimization with R DataFrame does not support POSIXct type yet.") + } + if (any(sapply(dataHead, is.raw))) { +stop("Arrow optimization with R DataFrame does not support raw type yet.") + } + if (inherits(schema, "structType")) { +if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) { + stop("Arrow optimization with R DataFrame does not support FloatType type yet.") +} + } + firstRow <- do.call(mapply, append(args, dataHead))[[1]] + fileName <- writeToTempFileInArrow(data, numPartitions) + tryCatch( +jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", + "readArrowStreamFromFile", + sparkSession, + fileName), + finally = { +file.remove(fileName) --- End diff -- yes, just more consistent. I also don't know for sure why all other instances are calling unlink --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232477257 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, x } } + data[] <- lapply(data, cleanCols) - # drop factors and wrap lists - data <- setNames(lapply(data, cleanCols), NULL) + args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) + if (arrowEnabled) { +shouldUseArrow <- tryCatch({ + stopifnot(length(data) > 0) + dataHead <- head(data, 1) + # Currenty Arrow optimization does not support POSIXct and raw for now. + # Also, it does not support explicit float type set by users. It leads to + # incorrect conversion. We will fall back to the path without Arrow optimization. + if (any(sapply(dataHead, function(x) is(x, "POSIXct" { --- End diff -- LG, I tested a few cases too --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232477171 --- Diff: R/pkg/R/SQLContext.R --- @@ -172,10 +221,10 @@ getDefaultSqlSource <- function() { createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, numPartitions = NULL) { sparkSession <- getSparkSession() - + arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true" --- End diff -- ok --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232477155 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { + numPartitions <- if (!is.null(numPartitions)) { +numToInt(numPartitions) + } else { +1 + } + fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp") + chunk <- as.integer(ceiling(nrow(rdf) / numPartitions)) + rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)]) --- End diff -- ah, any idea why it was done that way in Python? this seems to be different from sc.parallelize, which is what https://github.com/apache/spark/blob/c3b4a94a91d66c172cf332321d3a78dba29ef8f0/R/pkg/R/context.R#L152 does --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232477131 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala --- @@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging { } sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray } + + /** + * R callable function to read a file in Arrow stream format and create a `RDD` + * using each serialized ArrowRecordBatch as a partition. + */ + def readArrowStreamFromFile( + sparkSession: SparkSession, + filename: String): JavaRDD[Array[Byte]] = { +ArrowConverters.readArrowStreamFromFile(sparkSession.sqlContext, filename) --- End diff -- hmm, I see your point... but I'm guessing there could be hundreds of these wrappers we would need to add if we set this as a practice. a few problems I see with these wrappers: 1. they are extra work to add or maintain; 2. many are very simple and don't add much value; 3. many get abandoned over the years - they are no longer called but never removed. but I kinda see your way; let's keep this one and review any new ones in the future. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22997: SPARK-25999: make-distribution.sh failure with --r and -...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22997 btw, please see the page https://spark.apache.org/contributing.html and particularly "Pull Request" on the format. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22997: SPARK-25999: make-distribution.sh failure with --r and -...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22997 thx, but I'm not sure about this approach. this step will now cause hadoop jar to be packaged into the release tarball of hadoop-provided, which is undoing the point of hadoop-provided. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22977: [BUILD] Bump previousSparkVersion in MimaBuild.scala to ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22977 I think there is also a Hive metastore test that downloads a Spark release jar? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22967#discussion_r232178323 --- Diff: docs/sparkr.md --- @@ -133,7 +133,7 @@ specifying `--packages` with `spark-submit` or `sparkR` commands, or if initiali {% highlight r %} -sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0") +sparkR.session() --- End diff -- yes! (although, let's not use spark here - don't want to encourage naming packages with spark in the name) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232170936 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { + numPartitions <- if (!is.null(numPartitions)) { +numToInt(numPartitions) + } else { +1 + } + fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp") + chunk <- as.integer(ceiling(nrow(rdf) / numPartitions)) + rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)]) --- End diff -- `1 : ceiling`? `1 : nrow`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232176721 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, x } } + data[] <- lapply(data, cleanCols) - # drop factors and wrap lists - data <- setNames(lapply(data, cleanCols), NULL) + args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) + if (arrowEnabled) { +shouldUseArrow <- tryCatch({ + stopifnot(length(data) > 0) + dataHead <- head(data, 1) + # Currenty Arrow optimization does not support POSIXct and raw for now. + # Also, it does not support explicit float type set by users. It leads to + # incorrect conversion. We will fall back to the path without Arrow optimization. + if (any(sapply(dataHead, function(x) is(x, "POSIXct" { +stop("Arrow optimization with R DataFrame does not support POSIXct type yet.") + } + if (any(sapply(dataHead, is.raw))) { +stop("Arrow optimization with R DataFrame does not support raw type yet.") + } + if (inherits(schema, "structType")) { +if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) { + stop("Arrow optimization with R DataFrame does not support FloatType type yet.") +} + } + firstRow <- do.call(mapply, append(args, dataHead))[[1]] + fileName <- writeToTempFileInArrow(data, numPartitions) + tryCatch( +jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", + "readArrowStreamFromFile", + sparkSession, + fileName), + finally = { +file.remove(fileName) + }) + TRUE +}, +error = function(e) { + message(paste0("WARN: createDataFrame attempted Arrow optimization because ", --- End diff -- ? https://stat.ethz.ch/R-manual/R-devel/library/base/html/warning.html --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
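The link points at base R's warning(); a hypothetical rewrite of the fallback branch using warning() instead of a "WARN:"-prefixed message() might look like this (sketch only, the exact wording is not from the PR):

    error = function(e) {
      warning(paste0("createDataFrame attempted Arrow optimization because ",
                     "'spark.sql.execution.arrow.enabled' is set to true; ",
                     "falling back to the non-Arrow path. Error: ", conditionMessage(e)))
      FALSE
    }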
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232172687 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { + numPartitions <- if (!is.null(numPartitions)) { +numToInt(numPartitions) + } else { +1 + } + fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp") + chunk <- as.integer(ceiling(nrow(rdf) / numPartitions)) + rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)]) + stream_writer <- NULL + for (rdf_slice in rdf_slices) { +batch <- record_batch(rdf_slice) +if (is.null(stream_writer)) { + # We should avoid private calls like 'close_on_exit' (CRAN disallows) but looks + # there's no exposed API for it. Here's a workaround but ideally this should + # be removed. + close_on_exit <- get("close_on_exit", envir = asNamespace("arrow"), inherits = FALSE) --- End diff -- actually, I think if you use withr here it will call close_on_exit for you? (but when?) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232169938 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { + numPartitions <- if (!is.null(numPartitions)) { +numToInt(numPartitions) + } else { +1 --- End diff -- should this go by default with default parallelism ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232173367 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, x } } + data[] <- lapply(data, cleanCols) - # drop factors and wrap lists - data <- setNames(lapply(data, cleanCols), NULL) + args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) + if (arrowEnabled) { +shouldUseArrow <- tryCatch({ + stopifnot(length(data) > 0) + dataHead <- head(data, 1) + # Currenty Arrow optimization does not support POSIXct and raw for now. + # Also, it does not support explicit float type set by users. It leads to + # incorrect conversion. We will fall back to the path without Arrow optimization. + if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) { --- End diff -- can you check - I think `is` / `is.x` doesn't do the right thing when you take head(df, 1) and one of the fields is `NA` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
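A quick base-R check of the concern (illustrative only): a data.frame column keeps its POSIXct class even when the value in the first row is NA, but a bare logical NA carries no class and is not detected.

    df <- data.frame(t = as.POSIXct(c(NA, "2018-11-09 10:00:00")))
    sapply(head(df, 1), function(x) is(x, "POSIXct"))   # TRUE: the column class survives the NA
    is(NA, "POSIXct")                                   # FALSE: a plain NA is just logical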
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232173043 --- Diff: R/pkg/R/SQLContext.R --- @@ -172,10 +221,10 @@ getDefaultSqlSource <- function() { createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, numPartitions = NULL) { sparkSession <- getSparkSession() - + arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true" --- End diff -- is it always in the conf - I think you can also pass in a default value to sparkR.conf --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
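sparkR.conf() accepts a default value, so the lookup would not have to assume the key is always present; a minimal sketch of the suggestion:

    arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled", "false")[[1]] == "true"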
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232167634 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { --- End diff -- require1 sounds a bit like a hack though... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232172546 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { + numPartitions <- if (!is.null(numPartitions)) { +numToInt(numPartitions) + } else { +1 + } + fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp") + chunk <- as.integer(ceiling(nrow(rdf) / numPartitions)) + rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)]) + stream_writer <- NULL + for (rdf_slice in rdf_slices) { +batch <- record_batch(rdf_slice) +if (is.null(stream_writer)) { + # We should avoid private calls like 'close_on_exit' (CRAN disallows) but looks + # there's no exposed API for it. Here's a workaround but ideally this should + # be removed. + close_on_exit <- get("close_on_exit", envir = asNamespace("arrow"), inherits = FALSE) --- End diff -- so is this an API missing in Arrow? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232177132 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, x } } + data[] <- lapply(data, cleanCols) - # drop factors and wrap lists - data <- setNames(lapply(data, cleanCols), NULL) + args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) + if (arrowEnabled) { +shouldUseArrow <- tryCatch({ --- End diff -- refactor this into a method? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232167926 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala --- @@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging { } sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray } + + /** + * R callable function to read a file in Arrow stream format and create a `RDD` + * using each serialized ArrowRecordBatch as a partition. + */ + def readArrowStreamFromFile( + sparkSession: SparkSession, + filename: String): JavaRDD[Array[Byte]] = { +ArrowConverters.readArrowStreamFromFile(sparkSession.sqlContext, filename) --- End diff -- what's the advantage of adding this wrapper here - I've been thinking to eliminate most of these if possible - and just use callJMethod on `ArrowConverters` say? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
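The alternative being weighed is to call the existing Scala object directly from R rather than adding a SQLUtils wrapper. A hypothetical sketch using SparkR's callJStatic (the class path and whether its access modifier permits this from the R backend would need to be verified; it is not something the PR does):

    jrdd <- callJStatic("org.apache.spark.sql.execution.arrow.ArrowConverters",
                        "readArrowStreamFromFile",
                        callJMethod(sparkSession, "sqlContext"),
                        fileName)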
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232176774 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, x } } + data[] <- lapply(data, cleanCols) - # drop factors and wrap lists - data <- setNames(lapply(data, cleanCols), NULL) + args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) + if (arrowEnabled) { +shouldUseArrow <- tryCatch({ + stopifnot(length(data) > 0) + dataHead <- head(data, 1) + # Currenty Arrow optimization does not support POSIXct and raw for now. + # Also, it does not support explicit float type set by users. It leads to + # incorrect conversion. We will fall back to the path without Arrow optimization. + if (any(sapply(dataHead, function(x) is(x, "POSIXct" { +stop("Arrow optimization with R DataFrame does not support POSIXct type yet.") + } + if (any(sapply(dataHead, is.raw))) { +stop("Arrow optimization with R DataFrame does not support raw type yet.") + } + if (inherits(schema, "structType")) { +if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) { + stop("Arrow optimization with R DataFrame does not support FloatType type yet.") +} + } + firstRow <- do.call(mapply, append(args, dataHead))[[1]] + fileName <- writeToTempFileInArrow(data, numPartitions) + tryCatch( +jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", + "readArrowStreamFromFile", + sparkSession, + fileName), + finally = { +file.remove(fileName) --- End diff -- unlink? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
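For reference, the practical difference the one-word question points at: unlink() is quiet when the path is already gone (and can also remove directories), while file.remove() returns FALSE with a warning.

    fileName <- tempfile()
    unlink(fileName)        # no-op, no warning, even though the file never existed
    file.remove(fileName)   # FALSE, with a warning about the missing file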
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232171176 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { + numPartitions <- if (!is.null(numPartitions)) { +numToInt(numPartitions) + } else { +1 + } + fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp") + chunk <- as.integer(ceiling(nrow(rdf) / numPartitions)) + rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)]) + stream_writer <- NULL + for (rdf_slice in rdf_slices) { +batch <- record_batch(rdf_slice) +if (is.null(stream_writer)) { + # We should avoid private calls like 'close_on_exit' (CRAN disallows) but looks + # there's no exposed API for it. Here's a workaround but ideally this should + # be removed. + close_on_exit <- get("close_on_exit", envir = asNamespace("arrow"), inherits = FALSE) --- End diff -- how are these slices computed? is it similar to https://github.com/apache/spark/blob/c3b4a94a91d66c172cf332321d3a78dba29ef8f0/R/pkg/R/context.R#L152 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
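To make the question concrete, here is what the quoted split/rep expression produces for a 10-row data.frame and 3 partitions (illustrative numbers only, plain base R):

    rdf <- data.frame(x = 1:10)
    numPartitions <- 3
    chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))             # 4
    ids <- rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)] # 1 1 1 1 2 2 2 2 3 3
    rdf_slices <- split(rdf, ids)
    sapply(rdf_slices, nrow)                                            # slice sizes: 4 4 2
    # parallelize() in R/pkg/R/context.R computes its slice boundaries differently,
    # which is the inconsistency the comment is comparing against.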
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232167480 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) --- End diff -- ok --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232167110 --- Diff: R/pkg/R/SQLContext.R --- @@ -215,14 +278,16 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, } if (is.null(schema) || (!inherits(schema, "structType") && is.null(names(schema { -row <- firstRDD(rdd) +if (is.null(firstRow)) { + firstRow <- firstRDD(rdd) --- End diff -- I <3 4 digits! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r232166370 --- Diff: R/pkg/R/functions.R --- @@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType") column(jc) }) +#' @details +#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format. +#' +#' @rdname column_collection_functions +#' @aliases schema_of_json schema_of_json,characterOrColumn-method +#' @examples +#' +#' \dontrun{ +#' json <- '{"name":"Bob"}' +#' df <- sql("SELECT * FROM range(1)") +#' head(select(df, schema_of_json(json)))} +#' @note schema_of_json since 3.0.0 +setMethod("schema_of_json", signature(x = "characterOrColumn"), + function(x, ...) { +if (class(x) == "character") { + col <- callJStatic("org.apache.spark.sql.functions", "lit", x) +} else { + col <- x@jc --- End diff -- hm.. why not just support string then? it's kinda very odd usage in R `schema_of_csv(lit("Amsterdam,2018")))` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
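The two call shapes under discussion, as they would read in a SparkR session (based on the example in the quoted roxygen docs):

    json <- '{"name":"Bob"}'
    df <- sql("SELECT * FROM range(1)")
    head(select(df, schema_of_json(json)))        # plain string, the form argued for here
    head(select(df, schema_of_json(lit(json))))   # Column wrapping a literal, the odd-looking form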
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22967#discussion_r231819016 --- Diff: docs/sparkr.md --- @@ -133,7 +133,7 @@ specifying `--packages` with `spark-submit` or `sparkR` commands, or if initiali {% highlight r %} -sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0") +sparkR.session() --- End diff -- I'd not worry about this example too much - this could be `com.databricks:spark-avro_2.12:3.0.0` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22932 Does it have different values for the new native ORC writer and the old Hive ORC writer? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22948: [SPARK-25944][R][BUILD] AppVeyor change to latest...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22948#discussion_r231598045 --- Diff: dev/appveyor-install-dependencies.ps1 --- @@ -115,7 +115,7 @@ $env:Path += ";$env:HADOOP_HOME\bin" Pop-Location # == R -$rVer = "3.4.1" +$rVer = "3.5.1" $rToolsVer = "3.4.0" --- End diff -- is there update to rtool? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22921#discussion_r231596680 --- Diff: R/pkg/R/functions.R --- @@ -1663,9 +1692,24 @@ setMethod("toDegrees", #' @aliases toRadians toRadians,Column-method #' @note toRadians since 1.4.0 setMethod("toRadians", +signature(x = "Column"), --- End diff -- fix indentation? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22921#discussion_r231403827 --- Diff: R/pkg/R/functions.R --- @@ -319,6 +319,27 @@ setMethod("acos", column(jc) }) +#' @details +#' \code{approx_count_distinct}: Returns the approximate number of distinct items in a group. +#' +#' @rdname column_aggregate_functions +#' @aliases approx_count_distinct approx_count_distinct,Column-method +#' @examples +#' +#' \dontrun{ +#' head(select(df, approx_count_distinct(df$gear))) +#' head(select(df, approx_count_distinct(df$gear, 0.02))) +#' head(select(df, countDistinct(df$gear, df$cyl))) +#' head(select(df, n_distinct(df$gear))) +#' head(distinct(select(df, "gear")))} --- End diff -- we only need one set - they both are `@rdname column_aggregate_functions` so will duplicate all other examples --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r231403096 --- Diff: R/pkg/R/functions.R --- @@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType") column(jc) }) +#' @details +#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format. +#' +#' @rdname column_collection_functions +#' @aliases schema_of_json schema_of_json,characterOrColumn-method +#' @examples +#' +#' \dontrun{ +#' json <- '{"name":"Bob"}' +#' df <- sql("SELECT * FROM range(1)") +#' head(select(df, schema_of_json(json)))} +#' @note schema_of_json since 3.0.0 +setMethod("schema_of_json", signature(x = "characterOrColumn"), + function(x, ...) { +if (class(x) == "character") { + col <- callJStatic("org.apache.spark.sql.functions", "lit", x) +} else { + col <- x@jc --- End diff -- you are saying this `select(df, schema_of_csv(df$schemaCol))` is not allowed? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r231402726 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + stopifnot(require("arrow", quietly = TRUE)) + stopifnot(require("withr", quietly = TRUE)) + numPartitions <- if (!is.null(numPartitions)) { +numToInt(numPartitions) + } else { +1 + } + fileName <- tempfile() --- End diff -- might need to give it a dir prefix to use - the tempfile default is not CRAN compliant and possibly some ACL issue --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r231402235 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + stopifnot(require("arrow", quietly = TRUE)) --- End diff -- btw, is it worthwhile to check the arrow package version? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
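A sketch of what such a version guard could look like (the minimum version number is a placeholder, not something stated in the PR):

    if (requireNamespace("arrow", quietly = TRUE) &&
        utils::packageVersion("arrow") < "0.12.0") {
      stop("SparkR Arrow optimization requires the 'arrow' R package version 0.12.0 or newer; ",
           "found ", utils::packageVersion("arrow"), ".")
    }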
[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r231402297 --- Diff: R/pkg/R/SQLContext.R --- @@ -172,15 +196,17 @@ getDefaultSqlSource <- function() { createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, numPartitions = NULL) { sparkSession <- getSparkSession() - + conf <- callJMethod(sparkSession, "conf") + arrowEnabled <- tolower(callJMethod(conf, "get", "spark.sql.execution.arrow.enabled")) == "true" --- End diff -- I think you can use sparkR.conf --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r231402063 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + stopifnot(require("arrow", quietly = TRUE)) + stopifnot(require("withr", quietly = TRUE)) --- End diff -- is it possible to not depend on this withr? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r231401994 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + stopifnot(require("arrow", quietly = TRUE)) --- End diff -- perhaps best to add a clearer error message? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
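A hedged sketch of a clearer failure when the package is missing (the wording is illustrative, not the PR's):

    if (!requireNamespace("arrow", quietly = TRUE)) {
      stop("'spark.sql.execution.arrow.enabled' is set to true, but the 'arrow' R package ",
           "is not installed; install it or disable the Arrow optimization.")
    }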
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r231025592 --- Diff: R/pkg/R/functions.R --- @@ -205,11 +205,18 @@ NULL #' also supported for the schema. #' \item \code{from_csv}: a DDL-formatted string #' } -#' @param ... additional argument(s). In \code{to_json}, \code{to_csv} and \code{from_json}, -#'this contains additional named properties to control how it is converted, accepts -#'the same options as the JSON/CSV data source. Additionally \code{to_json} supports -#'the "pretty" option which enables pretty JSON generation. In \code{arrays_zip}, -#'this contains additional Columns of arrays to be merged. +#' @param ... additional argument(s). +#' \itemize{ +#' \item \code{to_json}, \code{from_json} and \code{schema_of_json}: this contains +#' additional named properties to control how it is converted and accepts the +#' same options as the JSON data source. +#' \item \code{to_json}: it supports the "pretty" option which enables pretty --- End diff -- I know it's there before but I'd like to suggest to give an example - doc or code example below. it's a bit different from python/scala I think --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
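An example of the kind of snippet being asked for, showing the "pretty" option on to_json in SparkR (sketch; the column and struct are made up for illustration):

    df <- sql("SELECT named_struct('name', 'Bob', 'age', 30) AS person")
    head(select(df, to_json(df$person)))                 # {"name":"Bob","age":30}
    head(select(df, to_json(df$person, pretty = TRUE)))  # multi-line, indented JSON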