[GitHub] spark issue #22977: [BUILD] Bump previousSparkVersion in MimaBuild.scala to ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22977 right, I mean both this and that should be part of the process "post-release" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232477365 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { --- End diff -- that will be good. circumventing CRAN check for method name is... problematic.. (there are other more hacky way too, but ..) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232477325 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, x } } + data[] <- lapply(data, cleanCols) - # drop factors and wrap lists - data <- setNames(lapply(data, cleanCols), NULL) + args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) + if (arrowEnabled) { +shouldUseArrow <- tryCatch({ + stopifnot(length(data) > 0) + dataHead <- head(data, 1) + # Currenty Arrow optimization does not support POSIXct and raw for now. + # Also, it does not support explicit float type set by users. It leads to + # incorrect conversion. We will fall back to the path without Arrow optimization. + if (any(sapply(dataHead, function(x) is(x, "POSIXct" { +stop("Arrow optimization with R DataFrame does not support POSIXct type yet.") + } + if (any(sapply(dataHead, is.raw))) { +stop("Arrow optimization with R DataFrame does not support raw type yet.") + } + if (inherits(schema, "structType")) { +if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) { + stop("Arrow optimization with R DataFrame does not support FloatType type yet.") --- End diff -- maybe out of bit range? 53 bit https://stat.ethz.ch/R-manual/R-patched/library/base/html/double.html --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20788: [SPARK-23647][PYTHON][SQL] Adds more types for hint in p...
Github user DylanGuedes commented on the issue: https://github.com/apache/spark/pull/20788 @HyukjinKwon Sure! I'll be working on this, then. Do you have any specific scenarios in mind, so that I could start with them? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232477271 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, x } } + data[] <- lapply(data, cleanCols) - # drop factors and wrap lists - data <- setNames(lapply(data, cleanCols), NULL) + args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) + if (arrowEnabled) { +shouldUseArrow <- tryCatch({ + stopifnot(length(data) > 0) + dataHead <- head(data, 1) + # Currenty Arrow optimization does not support POSIXct and raw for now. + # Also, it does not support explicit float type set by users. It leads to + # incorrect conversion. We will fall back to the path without Arrow optimization. + if (any(sapply(dataHead, function(x) is(x, "POSIXct" { +stop("Arrow optimization with R DataFrame does not support POSIXct type yet.") + } + if (any(sapply(dataHead, is.raw))) { +stop("Arrow optimization with R DataFrame does not support raw type yet.") + } + if (inherits(schema, "structType")) { +if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) { + stop("Arrow optimization with R DataFrame does not support FloatType type yet.") +} + } + firstRow <- do.call(mapply, append(args, dataHead))[[1]] + fileName <- writeToTempFileInArrow(data, numPartitions) + tryCatch( +jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", + "readArrowStreamFromFile", + sparkSession, + fileName), + finally = { +file.remove(fileName) --- End diff -- yes, just more consistent. I also don't know for sure why all other instances are calling unlink --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232477257 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, x } } + data[] <- lapply(data, cleanCols) - # drop factors and wrap lists - data <- setNames(lapply(data, cleanCols), NULL) + args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) + if (arrowEnabled) { +shouldUseArrow <- tryCatch({ + stopifnot(length(data) > 0) + dataHead <- head(data, 1) + # Currenty Arrow optimization does not support POSIXct and raw for now. + # Also, it does not support explicit float type set by users. It leads to + # incorrect conversion. We will fall back to the path without Arrow optimization. + if (any(sapply(dataHead, function(x) is(x, "POSIXct" { --- End diff -- LG, I tested a few cases too --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232477171 --- Diff: R/pkg/R/SQLContext.R --- @@ -172,10 +221,10 @@ getDefaultSqlSource <- function() { createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, numPartitions = NULL) { sparkSession <- getSparkSession() - + arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true" --- End diff -- ok --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232477155 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { + numPartitions <- if (!is.null(numPartitions)) { +numToInt(numPartitions) + } else { +1 + } + fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp") + chunk <- as.integer(ceiling(nrow(rdf) / numPartitions)) + rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)]) --- End diff -- ah, any idea that was done that way in python? this seems to be different from sc.paralleize which is what https://github.com/apache/spark/blob/c3b4a94a91d66c172cf332321d3a78dba29ef8f0/R/pkg/R/context.R#L152 is done --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232477131 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala --- @@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging { } sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray } + + /** + * R callable function to read a file in Arrow stream format and create a `RDD` + * using each serialized ArrowRecordBatch as a partition. + */ + def readArrowStreamFromFile( + sparkSession: SparkSession, + filename: String): JavaRDD[Array[Byte]] = { +ArrowConverters.readArrowStreamFromFile(sparkSession.sqlContext, filename) --- End diff -- hmm, I see your point... but there could be hundreds of these wrapper we need add if we set as a practice, I'm guessing. a few problems with these wrappers I see: 1. they are extra work to add or maintain 2. many are very simple, not much value add 3. many get abandoned over the years - they are not called and not removed but I kinda see your way, let's keep this one and review any new one in the future. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22997: SPARK-25999: make-distribution.sh failure with --r and -...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22997 btw, please see the page https://spark.apache.org/contributing.html and particularly "Pull Request" on the format. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22997: SPARK-25999: make-distribution.sh failure with --r and -...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22997 thx, but I'm not sure about this approach. this step will now cause hadoop jar to be packaged into the release tarball of hadoop-provided, which is undoing the point of hadoop-provided. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23004 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98689/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23004 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23004 **[Test build #98689 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98689/testReport)** for PR 23004 at commit [`7bbdb07`](https://github.com/apache/spark/commit/7bbdb0713056f387e49cf3921a226554e9af5557). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232476906 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { + numPartitions <- if (!is.null(numPartitions)) { +numToInt(numPartitions) + } else { +1 + } + fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp") + chunk <- as.integer(ceiling(nrow(rdf) / numPartitions)) + rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)]) --- End diff -- Let me try to reuse the R side slicing logic. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232475881 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, x } } + data[] <- lapply(data, cleanCols) - # drop factors and wrap lists - data <- setNames(lapply(data, cleanCols), NULL) + args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) + if (arrowEnabled) { +shouldUseArrow <- tryCatch({ + stopifnot(length(data) > 0) + dataHead <- head(data, 1) + # Currenty Arrow optimization does not support POSIXct and raw for now. + # Also, it does not support explicit float type set by users. It leads to + # incorrect conversion. We will fall back to the path without Arrow optimization. + if (any(sapply(dataHead, function(x) is(x, "POSIXct" { --- End diff -- Looks okay in this case specifically: ```r > any(sapply(head(data.frame(list(list(a=NA))), 1), is.raw)) [1] FALSE > any(sapply(head(data.frame(list(list(a=NA))), 1), function(x) is(x, "POSIXct"))) [1] FALSE > any(sapply(head(data.frame(list(list(a=1))), 1), is.raw)) [1] FALSE > any(sapply(head(data.frame(list(list(a="a"))), 1), function(x) is(x, "POSIXct"))) [1] FALSE > any(sapply(head(data.frame(list(list(a=raw(1, 1), is.raw)) [1] TRUE > any(sapply(head(data.frame(list(list(a=as.POSIXct("2000-01-01", 1), function(x) is(x, "POSIXct"))) [1] TRUE ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232475777 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, x } } + data[] <- lapply(data, cleanCols) - # drop factors and wrap lists - data <- setNames(lapply(data, cleanCols), NULL) + args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) + if (arrowEnabled) { +shouldUseArrow <- tryCatch({ + stopifnot(length(data) > 0) + dataHead <- head(data, 1) + # Currenty Arrow optimization does not support POSIXct and raw for now. + # Also, it does not support explicit float type set by users. It leads to + # incorrect conversion. We will fall back to the path without Arrow optimization. + if (any(sapply(dataHead, function(x) is(x, "POSIXct" { +stop("Arrow optimization with R DataFrame does not support POSIXct type yet.") + } + if (any(sapply(dataHead, is.raw))) { +stop("Arrow optimization with R DataFrame does not support raw type yet.") + } + if (inherits(schema, "structType")) { +if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) { + stop("Arrow optimization with R DataFrame does not support FloatType type yet.") +} + } + firstRow <- do.call(mapply, append(args, dataHead))[[1]] + fileName <- writeToTempFileInArrow(data, numPartitions) + tryCatch( +jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", + "readArrowStreamFromFile", + sparkSession, + fileName), + finally = { +file.remove(fileName) --- End diff -- I believe either way is fine. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232475752 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { --- End diff -- Yea, I at least managed to get rid of this hack itself. Will push soon. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21766: [SPARK-24803][SQL] add support for numeric
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21766 @wangtao605 Do you mind documenting our behavior in our Spark SQL doc? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17648: [SPARK-19851] Add support for EVERY and ANY (SOME) aggre...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17648 @ptkool Thanks for your contribution! This feature will be available in the next release. Spark 3.0 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21679: [SPARK-24695] [SQL]: To add support to return Calendar i...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21679 For your information, @cloud-fan and I are discussing how we expose calendar interval data types. Will post the proposal later. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21766: [SPARK-24803][SQL] add support for numeric
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21766 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17169: [SPARK-19714][ML] Bucketizer.handleInvalid docs i...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17169 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20846: [SPARK-5498][SQL][FOLLOW] add schema to table par...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20846 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20163 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22813: [SPARK-25818][CORE] WorkDirCleanup should only re...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22813 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21514: [SPARK-22860] [SPARK-24621] [Core] [WebUI] - hide...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21514 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15899: [SPARK-18466] added withFilter method to RDD
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15899 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17648: [SPARK-19851] Add support for EVERY and ANY (SOME...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17648 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21679: [SPARK-24695] [SQL]: To add support to return Cal...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21679 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21994: [SPARK-24529][Build][test-maven][follow-up] Add s...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21994 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18697: [SPARK-16683][SQL] Repeated joins to same table c...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18697 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23001: [INFRA] Close stale PRs
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23001 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22463: remove annotation @Experimental
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22463 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21868: [SPARK-24906][SQL] Adaptively enlarge split / par...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21868 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21322: [SPARK-24225][CORE] Support closing AutoClosable ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21322 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21402: SPARK-24355 Spark external shuffle server improve...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21402 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21161: [SPARK-21645]left outer join synchronize the cond...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21161 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19434: [SPARK-21785][SQL]Support create table from a par...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19434 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18080: [Spark-20771][SQL] Make weekofyear more intuitive
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18080 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18636: added support word2vec training with additional d...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18636 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21257 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22005: [SPARK-16817][CORE][WIP] Use Alluxio to improve s...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22005 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22539: [SPARK-25517][SQL] Detect/Infer date type in CSV ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22539 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19691: [SPARK-14922][SPARK-17732][SQL]ALTER TABLE DROP P...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19691 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PART...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17176 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23001: [INFRA] Close stale PRs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23001 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22994 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22994 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98688/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22994 **[Test build #98688 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98688/testReport)** for PR 22994 at commit [`1f5c2f5`](https://github.com/apache/spark/commit/1f5c2f5815a46546d7ea878f6b732c8d5638943b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22994 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98687/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22994 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22994 **[Test build #98687 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98687/testReport)** for PR 22994 at commit [`56329bc`](https://github.com/apache/spark/commit/56329bc9d9d28252032fe6fef8da2ffbb1ed0f9e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22976: [SPARK-25974][SQL]Optimizes Generates bytecode fo...
Github user heary-cao commented on a diff in the pull request: https://github.com/apache/spark/pull/22976#discussion_r232474090 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -133,7 +126,6 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR returnType = "int", makeSplitFunction = { body => s""" --- End diff -- thanks, I have update it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23004 **[Test build #98689 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98689/testReport)** for PR 23004 at commit [`7bbdb07`](https://github.com/apache/spark/commit/7bbdb0713056f387e49cf3921a226554e9af5557). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23004 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4921/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23004 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232473761 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, x } } + data[] <- lapply(data, cleanCols) - # drop factors and wrap lists - data <- setNames(lapply(data, cleanCols), NULL) + args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) + if (arrowEnabled) { +shouldUseArrow <- tryCatch({ + stopifnot(length(data) > 0) + dataHead <- head(data, 1) + # Currenty Arrow optimization does not support POSIXct and raw for now. + # Also, it does not support explicit float type set by users. It leads to + # incorrect conversion. We will fall back to the path without Arrow optimization. + if (any(sapply(dataHead, function(x) is(x, "POSIXct" { +stop("Arrow optimization with R DataFrame does not support POSIXct type yet.") + } + if (any(sapply(dataHead, is.raw))) { +stop("Arrow optimization with R DataFrame does not support raw type yet.") + } + if (inherits(schema, "structType")) { +if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) { + stop("Arrow optimization with R DataFrame does not support FloatType type yet.") --- End diff -- I suspect that it happens when `numeric` (which is like `1.0`) is casted into float type. I think it's related with casting behaviour. Let me take a look and file a JIRA there in Arrow side but if you don't mind I will focus on matching exact type cases for now ... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/23004 [SPARK-26004][SQL] InMemoryTable support StartsWith predicate push down ## What changes were proposed in this pull request? [SPARK-24638](https://issues.apache.org/jira/browse/SPARK-24638) adds support for Parquet file `StartsWith` predicate push down. `InMemoryTable` can also support this feature. ## How was this patch tested? unit tests and benchmark tests benchmark test result: ``` Pushdown benchmark for StringStartsWith Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz StringStartsWith filter: (value like '10%'): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative InMemoryTable Vectorized12068 / 14198 1.3 767.3 1.0X InMemoryTable Vectorized (Pushdown) 5457 / 8662 2.9 347.0 2.2X Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz StringStartsWith filter: (value like '1000%'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative InMemoryTable Vectorized 5246 / 5355 3.0 333.5 1.0X InMemoryTable Vectorized (Pushdown) 2185 / 2346 7.2 138.9 2.4X Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz StringStartsWith filter: (value like '786432%'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative InMemoryTable Vectorized 5112 / 5312 3.1 325.0 1.0X InMemoryTable Vectorized (Pushdown) 2292 / 2522 6.9 145.7 2.2X ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-26004 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23004.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23004 commit 7bbdb0713056f387e49cf3921a226554e9af5557 Author: Yuming Wang Date: 2018-11-11T03:56:36Z InMemoryTable support StartsWith predicate push down --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232473723 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, x } } + data[] <- lapply(data, cleanCols) - # drop factors and wrap lists - data <- setNames(lapply(data, cleanCols), NULL) + args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) + if (arrowEnabled) { +shouldUseArrow <- tryCatch({ --- End diff -- Yup, let me try. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232473716 --- Diff: R/pkg/R/SQLContext.R --- @@ -172,10 +221,10 @@ getDefaultSqlSource <- function() { createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, numPartitions = NULL) { sparkSession <- getSparkSession() - + arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true" --- End diff -- Yea,I checked that it always has the default value. I initially left the default value but took it out after double checking. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232473705 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { + numPartitions <- if (!is.null(numPartitions)) { +numToInt(numPartitions) + } else { +1 + } + fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp") + chunk <- as.integer(ceiling(nrow(rdf) / numPartitions)) + rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)]) + stream_writer <- NULL + for (rdf_slice in rdf_slices) { +batch <- record_batch(rdf_slice) +if (is.null(stream_writer)) { + # We should avoid private calls like 'close_on_exit' (CRAN disallows) but looks + # there's no exposed API for it. Here's a workaround but ideally this should + # be removed. + close_on_exit <- get("close_on_exit", envir = asNamespace("arrow"), inherits = FALSE) --- End diff -- Hm, possibly yea. Let me try it. In this way, we could get rid of `require`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232473697 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { + numPartitions <- if (!is.null(numPartitions)) { +numToInt(numPartitions) + } else { +1 + } + fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp") + chunk <- as.integer(ceiling(nrow(rdf) / numPartitions)) + rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)]) --- End diff -- This resembles PySpark side logic: https://github.com/apache/spark/blob/d367bdcf521f564d2d7066257200be26b27ea926/python/pyspark/sql/session.py#L554-L556 Let me check the difference between them --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232473669 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { + numPartitions <- if (!is.null(numPartitions)) { +numToInt(numPartitions) + } else { +1 --- End diff -- We should; however, it follows the original code path's behaviour. I matched it as the same so that we can compare the performances in the same conditions. If you don't mind, I will fix both in a separate PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232473643 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala --- @@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging { } sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray } + + /** + * R callable function to read a file in Arrow stream format and create a `RDD` + * using each serialized ArrowRecordBatch as a partition. + */ + def readArrowStreamFromFile( + sparkSession: SparkSession, + filename: String): JavaRDD[Array[Byte]] = { +ArrowConverters.readArrowStreamFromFile(sparkSession.sqlContext, filename) --- End diff -- Hmhmhm .. yea. What I was trying to do is to add SQL related codes called in R from JVM, into here when they are not official APIs in order to avoid, we change the internal APIs within Scala, and it causes R test failure. I was trying to do the similar things within PySpark side. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232473611 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { --- End diff -- Yup .. it is .. looks we shouldn't have this error from a cursory look in R API of Arrow. Maybe this can be gone when I use official R Arrow release version. Let me check it later. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23001: [INFRA] Close stale PRs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23001 I took a quick pass. Mind adding those please?: #22539 #22539 #21868 #21514 #21402 #21322 #21257 #20163 #19691 #18697 #18636 #17176 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18697: [SPARK-16683][SQL] Repeated joins to same table can leak...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18697 Let't close this then. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with returnType=S...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20163 Let's leave this closed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for all OpenCv...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20168 @tomasatdatabricks, mind updating this? Lately I happened to take a look for this few times. I will try to review. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for R...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20503 ping @ashashwat to update --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22060: [DO NOT MERGE][TEST ONLY] Add once-policy rule ch...
Github user maryannxue closed the pull request at: https://github.com/apache/spark/pull/22060 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22060: [DO NOT MERGE][TEST ONLY] Add once-policy rule check
Github user maryannxue commented on the issue: https://github.com/apache/spark/pull/22060 Thank you for reminding me, @HyukjinKwon! And thanks to @mgaido91's contribution, this has been fixed already. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20788: [SPARK-23647][PYTHON][SQL] Adds more types for hint in p...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20788 @DylanGuedes let's add tests. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21257 ping @zheh12 to address comments. I am going to suggest to close this for now while I am identifying PRs to close now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21322: [SPARK-24225][CORE] Support closing AutoClosable objects...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21322 Shall we leave this PR closed and start it from a design doc? Let me suggest to close this for now while I am looking through old PRs. @JeetKunDoug, please feel free to create a clone of this PR if there's any reason to keep this open that I missed. No objection. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21402: SPARK-24355 Spark external shuffle server improvement to...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21402 @redsanket you should close it by yourself. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21514: [SPARK-22860] [SPARK-24621] [Core] [WebUI] - hide key pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21514 I'm going to suggest to close this. The review comments were not addressed more then few months and there's not quite a great point to keep inactive PRs. Feel free to take over this if any of you here is interested in this. Or @tooptoop4, please recreate a PR after addressing review commnets here. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21514: [SPARK-22860] [SPARK-24621] [Core] [WebUI] - hide key pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21514 ping @tooptoop4 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21868: [SPARK-24906][SQL] Adaptively enlarge split / partition ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21868 I think we should fix this. Basically the dynamic estimation logic is too flaky, and I think we need this for the current status. Let's don't add it for now. While I am revisiting old PRs, I am trying to suggest to close PRs that look not likely to be merged. Let me suggest to close this for now but please feel free to recreate a PR if you strongly this is needed in Spark. No objection. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22060: [DO NOT MERGE][TEST ONLY] Add once-policy rule check
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22060 hey @maryannxue, where are we here? Let's close this if it's going to be inactive a couple of weeks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22144 So, 2.4 is out. Where are we? Rereading the comments above, looks we should: 1. Find the root cause 2. Officially drop it if the workaround is not easy 3. Fix it if the workaround is simple 4. add a test If not, I would just document that we dropped this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22184 @seancxmao, so this behaviour changes description is only valid when we upgrade spark 2.3 to 2.4? Then we can add it in `Upgrading From Spark SQL 2.3 to 2.4`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21363 I am going to suggest to close this since it's being active more then few weeks. It should be good to fix. Let me leave some cc's who might be interested in this just FYI. Feel free to take over this when you guys find some time or are interested in this. Adding @xuanyuanking, @mgaido91, @viirya, @MaxGekk, @softmanu who I could think of for now. Feel free to ignore my cc if you guys are busy or having more important fixes you guys are working on. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22993: [SPARK-24421][BUILD][CORE] Accessing sun.misc.Cleaner in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22993 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98686/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22993: [SPARK-24421][BUILD][CORE] Accessing sun.misc.Cleaner in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22993 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22993: [SPARK-24421][BUILD][CORE] Accessing sun.misc.Cleaner in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22993 **[Test build #98686 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98686/testReport)** for PR 22993 at commit [`2db719c`](https://github.com/apache/spark/commit/2db719c2896c96793dfc96791bbadb2bc8a3f4ab). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22994 **[Test build #98688 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98688/testReport)** for PR 22994 at commit [`1f5c2f5`](https://github.com/apache/spark/commit/1f5c2f5815a46546d7ea878f6b732c8d5638943b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22994 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4920/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22994 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22994 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22994 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4919/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22994 **[Test build #98687 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98687/testReport)** for PR 22994 at commit [`56329bc`](https://github.com/apache/spark/commit/56329bc9d9d28252032fe6fef8da2ffbb1ed0f9e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22994 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4918/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22994 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user shaneknapp commented on the issue: https://github.com/apache/spark/pull/22994 test this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...
Github user shaneknapp commented on the issue: https://github.com/apache/spark/pull/22994 test this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23001: [INFRA] Close stale PRs
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/23001 Thank you for pinging me. Please add #15899 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22993: [SPARK-24421][BUILD][CORE] Accessing sun.misc.Cleaner in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22993 **[Test build #98686 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98686/testReport)** for PR 22993 at commit [`2db719c`](https://github.com/apache/spark/commit/2db719c2896c96793dfc96791bbadb2bc8a3f4ab). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22993: [SPARK-24421][BUILD][CORE] Accessing sun.misc.Cleaner in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22993 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4917/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org