[GitHub] spark issue #22979: [SPARK-25977][SQL] Parsing decimals from CSV using local...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22979 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r232208151 --- Diff: R/pkg/R/functions.R --- @@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType") column(jc) }) +#' @details +#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format. +#' +#' @rdname column_collection_functions +#' @aliases schema_of_json schema_of_json,characterOrColumn-method +#' @examples +#' +#' \dontrun{ +#' json <- '{"name":"Bob"}' +#' df <- sql("SELECT * FROM range(1)") +#' head(select(df, schema_of_json(json)))} +#' @note schema_of_json since 3.0.0 +setMethod("schema_of_json", signature(x = "characterOrColumn"), + function(x, ...) { +if (class(x) == "character") { + col <- callJStatic("org.apache.spark.sql.functions", "lit", x) +} else { + col <- x@jc --- End diff -- BTW, `lit` usage already works in many APIs although it looks a bit odd .. should be okay. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r232207412 --- Diff: R/pkg/R/functions.R --- @@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType") column(jc) }) +#' @details +#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format. +#' +#' @rdname column_collection_functions +#' @aliases schema_of_json schema_of_json,characterOrColumn-method +#' @examples +#' +#' \dontrun{ +#' json <- '{"name":"Bob"}' +#' df <- sql("SELECT * FROM range(1)") +#' head(select(df, schema_of_json(json)))} +#' @note schema_of_json since 3.0.0 +setMethod("schema_of_json", signature(x = "characterOrColumn"), + function(x, ...) { +if (class(x) == "character") { + col <- callJStatic("org.apache.spark.sql.functions", "lit", x) +} else { + col <- x@jc --- End diff -- It's just left as column for now .. in case we allow other more cases .. Yea, it's a bit odd that we should `schema_of_csv(lit("Amsterdam,2018")))`. Maybe later columns or other expressions might have to be supported - in that case we can just fix the documentation in R side .. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232202773 --- Diff: R/pkg/R/SQLContext.R --- @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, x } } + data[] <- lapply(data, cleanCols) - # drop factors and wrap lists - data <- setNames(lapply(data, cleanCols), NULL) + args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE) + if (arrowEnabled) { +shouldUseArrow <- tryCatch({ + stopifnot(length(data) > 0) + dataHead <- head(data, 1) + # Currenty Arrow optimization does not support POSIXct and raw for now. + # Also, it does not support explicit float type set by users. It leads to + # incorrect conversion. We will fall back to the path without Arrow optimization. + if (any(sapply(dataHead, function(x) is(x, "POSIXct" { +stop("Arrow optimization with R DataFrame does not support POSIXct type yet.") + } + if (any(sapply(dataHead, is.raw))) { +stop("Arrow optimization with R DataFrame does not support raw type yet.") + } + if (inherits(schema, "structType")) { +if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) { + stop("Arrow optimization with R DataFrame does not support FloatType type yet.") +} + } + firstRow <- do.call(mapply, append(args, dataHead))[[1]] + fileName <- writeToTempFileInArrow(data, numPartitions) + tryCatch( +jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", + "readArrowStreamFromFile", + sparkSession, + fileName), + finally = { +file.remove(fileName) + }) + TRUE +}, +error = function(e) { + message(paste0("WARN: createDataFrame attempted Arrow optimization because ", --- End diff -- ooops --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22880: [SPARK-25407][SQL] Ensure we pass a compatible pruned sc...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22880 Let me take a look on this weekends. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22985: [SPARK-25510][SQL][TEST][FOLLOW-UP] Remove BenchmarkWith...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22985 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22939 Hey @felixcheung, it should be ready for another look. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22979: [SPARK-25977][SQL] Parsing decimals from CSV using local...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22979 Also let me leave a cc for @srowen. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22951: [SPARK-25945][SQL] Support locale while parsing date/tim...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22951 Actually let me leave a cc for @srowen. I remember we talked about it before. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22951: [SPARK-25945][SQL] Support locale while parsing date/tim...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22951 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232115966 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) + +# Currently arrow requires withr; otherwise, write APIs don't work. +# Direct 'require' is not recommended by CRAN. Here's a workaround. +require1 <- require +if (require1("withr", quietly = TRUE)) { --- End diff -- It's actually a bit odd that I need to manually require this package. Otherwise, it complains, for instance, here https://github.com/apache/arrow/blob/d3ec69069649013229366ebe01e22f389597dc19/r/R/on_exit.R#L20 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21363 Looks difficult because the behaviours themselves are different. One possibility is a fallback and the other possibility is configuration. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22954 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3206: [ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3206 Hey all ~ could this get in by any change maybe? ---
[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22954 I have finished most of todos except waiting for R API of Arrow 0.12.0 and fixing some changes accordingly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [WIP] Enables Arrow optimization from R DataFrame...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r231992763 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace + # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works + # around by avoiding direct requireNamespace. + requireNamespace1 <- requireNamespace + if (requireNamespace1("arrow", quietly = TRUE)) { +record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE) +record_batch_stream_writer <- get( + "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE) +file_output_stream <- get( + "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE) +write_record_batch <- get( + "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE) --- End diff -- These workarounds will be removed when Arrow 0.12.0 is released. I did it to make CARN passed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22979: [SPARK-25977][SQL] Parsing decimals from CSV using local...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22979 Looks good. Will take a closer look. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22951: [SPARK-25945][SQL] Support locale while parsing date/tim...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22951 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22977: [BUILD] Bump previousSparkVersion in MimaBuild.scala to ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22977 Actually, similar changes were being made at https://github.com/apache/spark/pull/22967 FYI. cc @dbtsai --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22973: [SPARK-25972][PYTHON] Missed JSON options in streaming.p...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22973 Yea .. actually that's documented in https://spark.apache.org/contributing.html . Strictly it should be` PYTHON` > The PR title should be of the form [SPARK-][COMPONENT] Title, where SPARK- is the relevant JIRA number, COMPONENT is one of the PR categories shown at spark-prs.appspot.com and Title may be the JIRAâs title or a more specific title describing the PR itself. ![screen shot 2018-11-08 at 7 19 07 pm](https://user-images.githubusercontent.com/6477701/48195647-2e353b80-e38b-11e8-8ac0-73a458ee00c0.png) I tried to be clear for myself so I took a look before but .. not a big deal at all :D. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22954 adding @falaki and @mengxr as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r231847272 --- Diff: R/pkg/R/SQLContext.R --- @@ -215,14 +278,16 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0, } if (is.null(schema) || (!inherits(schema, "structType") && is.null(names(schema { -row <- firstRDD(rdd) +if (is.null(firstRow)) { + firstRow <- firstRDD(rdd) --- End diff -- Note that this PR optimizes the original code path as well here - when the input is local R DataFrame, here we avoid `firstRDD` operation. In the master branch, the benchmark shows: ``` Exception in thread "dispatcher-event-loop-6" java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:3236) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877) ``` So, technically this PR improves If I try this with `10.csv` (79MB) record, it takes longer to cut it short: **Current master**: ``` Time difference of 8.502607 secs ``` **With this PR, but without Arrow** ``` Time difference of 5.143395 secs ``` **With this PR, but with Arrow** ``` Time difference of 0.6981369 secs ``` So, technically this PR improves more **1200%** --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22954 @felixcheung! performance improvement was **955%** ! I described the benchmark I took in PR description. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r231815154 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + stopifnot(require("arrow", quietly = TRUE)) --- End diff -- hmmm .. yea I will add the version check later when R API of Arrow is officially released to CRAN. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r231814143 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + stopifnot(require("arrow", quietly = TRUE)) --- End diff -- Hmhmhmhm for checking arrow version itself, it will be but I think it's fine. R API for arrow will likely be from 0.12.0 and we will support 0.12.0+ from now on. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22973: [SPARK-25972][SQL] Missed JSON options in streaming.py
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22973 Oh, let's name it `[PYTHON]` BTW. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20944 Let's revert this then if this only targeted to fix the test. We can bring this back later when it's needed - tho, yea . This caused a specific case failure in Livy' when restarting Hive enabled Spark session. @gatorsmile, I will revert this but I don't mind getting this back again if it actually fixes a usecase since I either don't know how exactly this fixes the case above and how it causes to make the Livy case failed - one thing clear is this specific commit is the cause. Please revert me action if I missed an actual usecase fixed by this, I am okay. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22967#discussion_r231783302 --- Diff: docs/sparkr.md --- @@ -133,7 +133,7 @@ specifying `--packages` with `spark-submit` or `sparkR` commands, or if initiali {% highlight r %} -sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0") +sparkR.session() --- End diff -- Let me try to take a look as well this weekends. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22967#discussion_r231783339 --- Diff: docs/sparkr.md --- @@ -133,7 +133,7 @@ specifying `--packages` with `spark-submit` or `sparkR` commands, or if initiali {% highlight r %} -sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0") +sparkR.session() --- End diff -- adding @JoshRosen --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22967#discussion_r231783212 --- Diff: docs/sparkr.md --- @@ -133,7 +133,7 @@ specifying `--packages` with `spark-submit` or `sparkR` commands, or if initiali {% highlight r %} -sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0") +sparkR.session() --- End diff -- I am not an expert but just know a bit. The mima change look right from a cursory look. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22967#discussion_r231781880 --- Diff: docs/sparkr.md --- @@ -133,7 +133,7 @@ specifying `--packages` with `spark-submit` or `sparkR` commands, or if initiali {% highlight r %} -sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0") +sparkR.session() --- End diff -- I mean it's related with using external package, it looks so but Avro is kind of internal source now .. so it's out of date. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22967#discussion_r231781676 --- Diff: docs/sparkr.md --- @@ -133,7 +133,7 @@ specifying `--packages` with `spark-submit` or `sparkR` commands, or if initiali {% highlight r %} -sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0") +sparkR.session() --- End diff -- oh, but the problem is other packages probably wouldn't have _2.12 distribution. hm, I think this can be left as was for now. At least I am going to release spark-xml before Spark 3.0.0 anyway. I can try to include 2.12 distribution as well and fix it here later. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22967#discussion_r231780839 --- Diff: docs/sparkr.md --- @@ -133,7 +133,7 @@ specifying `--packages` with `spark-submit` or `sparkR` commands, or if initiali {% highlight r %} -sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0") +sparkR.session() --- End diff -- Eh, @dbtsai, I think you can just switch this to other datasources like `spark-redshift` or `spark-xml`, and fix the description above `you can find data source connectors for popular file formats like Avro`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22960: [SPARK-25955][TEST] Porting JSON tests for CSV functions
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22960 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22958: [SPARK-25952][SQL] Passing actual schema to JacksonParse...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22958 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22958: [SPARK-25952][SQL] Passing actual schema to JacksonParse...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22958 @MaxGekk, BTW, can you call `verifyColumnNameOfCorruptRecord` here and datasource as well for JSON and CSV? Of course in a separate PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22958: [SPARK-25952][SQL] Passing actual schema to JacksonParse...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22958 For CSV, looks we are already doing so: https://github.com/apache/spark/blob/76813cfa1e2607ea3b669a79e59b568e96395b2e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala#L109-L111 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r231777190 --- Diff: sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out --- @@ -93,7 +93,7 @@ Partition Values [ds=2017-08-01, hr=10] Location [not included in comparison]sql/core/spark-warehouse/t/ds=2017-08-01/hr=10 Created Time [not included in comparison] Last Access [not included in comparison] -Partition Statistics 1121 bytes, 3 rows +Partition Statistics 1229 bytes, 3 rows --- End diff -- Hmmm .. yea, I think we should avoid .. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22951: [SPARK-25945][SQL] Support locale while parsing d...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22951#discussion_r231776739 --- Diff: python/pyspark/sql/readwriter.py --- @@ -349,7 +353,7 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None, maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None, columnNameOfCorruptRecord=None, multiLine=None, charToEscapeQuoteEscaping=None, -samplingRatio=None, enforceSchema=None, emptyValue=None): +samplingRatio=None, enforceSchema=None, emptyValue=None, locale=None): --- End diff -- Let's add `emptyValue` in `streaming.py` in the same separate PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22951: [SPARK-25945][SQL] Support locale while parsing d...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22951#discussion_r231776568 --- Diff: python/pyspark/sql/readwriter.py --- @@ -267,7 +270,8 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None, mode=mode, columnNameOfCorruptRecord=columnNameOfCorruptRecord, dateFormat=dateFormat, timestampFormat=timestampFormat, multiLine=multiLine, allowUnquotedControlChars=allowUnquotedControlChars, lineSep=lineSep, -samplingRatio=samplingRatio, dropFieldIfAllNull=dropFieldIfAllNull, encoding=encoding) +samplingRatio=samplingRatio, dropFieldIfAllNull=dropFieldIfAllNull, encoding=encoding, --- End diff -- @MaxGekk, let's also add `dropFieldIfAllNull` and `encoding` in `sql/streaming.py` in a separate PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22951: [SPARK-25945][SQL] Support locale while parsing d...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22951#discussion_r231776396 --- Diff: python/pyspark/sql/readwriter.py --- @@ -267,7 +270,8 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None, mode=mode, columnNameOfCorruptRecord=columnNameOfCorruptRecord, dateFormat=dateFormat, timestampFormat=timestampFormat, multiLine=multiLine, allowUnquotedControlChars=allowUnquotedControlChars, lineSep=lineSep, -samplingRatio=samplingRatio, dropFieldIfAllNull=dropFieldIfAllNull, encoding=encoding) +samplingRatio=samplingRatio, dropFieldIfAllNull=dropFieldIfAllNull, encoding=encoding, +locale=locale) --- End diff -- @MaxGekk, looks `sql/streaming.py` is missed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22951: [SPARK-25945][SQL] Support locale while parsing d...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22951#discussion_r231775987 --- Diff: python/pyspark/sql/readwriter.py --- @@ -446,6 +450,9 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non If None is set, it uses the default value, ``1.0``. :param emptyValue: sets the string representation of an empty value. If None is set, it uses the default value, empty string. +:param locale: sets a locale as language tag in IETF BCP 47 format. If None is set, + it uses the default value, ``en-US``. For instance, ``locale`` is used while + parsing dates and timestamps. --- End diff -- I think ideally we should apply to decimal parsing too actually. But yea we can leave it separate. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22951: [SPARK-25945][SQL] Support locale while parsing date/tim...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22951 OMG, what does `Ð½Ð¾Ñ 2018` mean BTW? haha --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r231775020 --- Diff: sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out --- @@ -93,7 +93,7 @@ Partition Values [ds=2017-08-01, hr=10] Location [not included in comparison]sql/core/spark-warehouse/t/ds=2017-08-01/hr=10 Created Time [not included in comparison] Last Access [not included in comparison] -Partition Statistics 1121 bytes, 3 rows +Partition Statistics 1229 bytes, 3 rows --- End diff -- Hm, does it mean that basically the tests will be failed or fixed for official releases (since it doesn't have `-SNAPSHOT`)? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21679: [SPARK-24695] [SQL]: To add support to return Calendar i...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21679 I think we should close this for now then. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22948: [SPARK-25944][R][BUILD] AppVeyor change to latest...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22948#discussion_r231762531 --- Diff: dev/appveyor-install-dependencies.ps1 --- @@ -115,7 +115,7 @@ $env:Path += ";$env:HADOOP_HOME\bin" Pop-Location # == R -$rVer = "3.4.1" +$rVer = "3.5.1" $rToolsVer = "3.4.0" --- End diff -- Oh~ latest looks 3.5. Let me try to update aswell shortly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22963: [SPARK-25962][BUILD][PYTHON] Specify minimum versions fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22963 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22963: [SPARK-25962][BUILD][PYTHON] Specify minimum versions fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22963 Thanks, @srowen and @dongjoon-hyun. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20944 To me, it sounds we made a fix because it was difficult to figure out exactly what's going on internally. It's okay if it's difficult to reproduce but it can be reproduce in production; however, the problem is that the fix of this caused another problem. I am asking those to see why and how important this fix was. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20944 I understood the reproducer step in JIRA but how and why it matters? Did it cause an actual problem in your production environment? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] incubator-livy issue #121: [LIVY-518][BUILD] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/incubator-livy/pull/121 Thanks guys! ---
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20944 ping @wangyum, I'm going to revert this today if there's no response today. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22938: [SPARK-25935][SQL] Prevent null rows from JSON parser
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22938 @attilapiros, mind showing rough small test codes for it please? just want to see if this is something we should fix or not. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22938: [SPARK-25935][SQL] Prevent null rows from JSON parser
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22938 Yea, looks fine in general. Will take a look within this week or weekends. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r231476884 --- Diff: R/pkg/R/functions.R --- @@ -205,11 +205,18 @@ NULL #' also supported for the schema. #' \item \code{from_csv}: a DDL-formatted string #' } -#' @param ... additional argument(s). In \code{to_json}, \code{to_csv} and \code{from_json}, -#'this contains additional named properties to control how it is converted, accepts -#'the same options as the JSON/CSV data source. Additionally \code{to_json} supports -#'the "pretty" option which enables pretty JSON generation. In \code{arrays_zip}, -#'this contains additional Columns of arrays to be merged. +#' @param ... additional argument(s). +#' \itemize{ +#' \item \code{to_json}, \code{from_json} and \code{schema_of_json}: this contains +#' additional named properties to control how it is converted and accepts the +#' same options as the JSON data source. +#' \item \code{to_json}: it supports the "pretty" option which enables pretty --- End diff -- OK. I added an example --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22963: [SPARK-25962][BUILD][PYTHON] Specify minimum versions fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22963 cc @srowen, @holdenk and @rekhajoshm --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22963: [SPARK-25962][BUILD][PYTHON] Specify minimum vers...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22963#discussion_r231463118 --- Diff: dev/lint-python --- @@ -87,27 +91,46 @@ else rm "$PYCODESTYLE_REPORT_PATH" fi -# stop the build if there are Python syntax errors or undefined names -flake8 . --count --select=E901,E999,F821,F822,F823 --max-line-length=100 --show-source --statistics --- End diff -- BTW, it was never a good idea to use external executable program directly without any check. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22963: [SPARK-25962][BUILD][PYTHON] Specify minimum vers...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22963#discussion_r231462182 --- Diff: dev/lint-python --- @@ -26,9 +26,13 @@ PYCODESTYLE_REPORT_PATH="$SPARK_ROOT_DIR/dev/pycodestyle-report.txt" PYDOCSTYLE_REPORT_PATH="$SPARK_ROOT_DIR/dev/pydocstyle-report.txt" PYLINT_REPORT_PATH="$SPARK_ROOT_DIR/dev/pylint-report.txt" PYLINT_INSTALL_INFO="$SPARK_ROOT_DIR/dev/pylint-info.txt" + PYDOCSTYLEBUILD="pydocstyle" -EXPECTED_PYDOCSTYLEVERSION="3.0.0" -PYDOCSTYLEVERSION=$(python -c 'import pkg_resources; print(pkg_resources.get_distribution("pydocstyle").version)' 2> /dev/null) +MINIMUM_PYDOCSTYLEVERSION="3.0.0" + +FLAKE8BUILD="flake8" +MINIMUM_FLAKE8="3.4.0" + SPHINXBUILD=${SPHINXBUILD:=sphinx-build} --- End diff -- BTW, i tested flake8 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0, 3.5.0 and 3.6.0 while I am here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22963: [SPARK-25962][BUILD][PYTHON] Specify minimum vers...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22963#discussion_r231461879 --- Diff: dev/lint-python --- @@ -26,9 +26,13 @@ PYCODESTYLE_REPORT_PATH="$SPARK_ROOT_DIR/dev/pycodestyle-report.txt" PYDOCSTYLE_REPORT_PATH="$SPARK_ROOT_DIR/dev/pydocstyle-report.txt" PYLINT_REPORT_PATH="$SPARK_ROOT_DIR/dev/pylint-report.txt" PYLINT_INSTALL_INFO="$SPARK_ROOT_DIR/dev/pylint-info.txt" + PYDOCSTYLEBUILD="pydocstyle" -EXPECTED_PYDOCSTYLEVERSION="3.0.0" -PYDOCSTYLEVERSION=$(python -c 'import pkg_resources; print(pkg_resources.get_distribution("pydocstyle").version)' 2> /dev/null) +MINIMUM_PYDOCSTYLEVERSION="3.0.0" + +FLAKE8BUILD="flake8" +MINIMUM_FLAKE8="3.4.0" + SPHINXBUILD=${SPHINXBUILD:=sphinx-build} --- End diff -- @shaneknapp and @cloud-fan, it seems flake8 does work but indeed the problem was compatibility with pycodestyle within flake8 itself. Our script uses pycodestyle 2.4.0 but it's manually downloaded. So, Jenkins's pycodestyle might be different or old and this might be the actual root cause. I locally tested with latest pycodestyle and flake 3.6.0 and it worked. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22963: [SPARK-25962][BUILD][PYTHON] Specify minimum vers...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22963#discussion_r231460594 --- Diff: dev/lint-python --- @@ -26,9 +26,13 @@ PYCODESTYLE_REPORT_PATH="$SPARK_ROOT_DIR/dev/pycodestyle-report.txt" PYDOCSTYLE_REPORT_PATH="$SPARK_ROOT_DIR/dev/pydocstyle-report.txt" PYLINT_REPORT_PATH="$SPARK_ROOT_DIR/dev/pylint-report.txt" PYLINT_INSTALL_INFO="$SPARK_ROOT_DIR/dev/pylint-info.txt" + PYDOCSTYLEBUILD="pydocstyle" -EXPECTED_PYDOCSTYLEVERSION="3.0.0" -PYDOCSTYLEVERSION=$(python -c 'import pkg_resources; print(pkg_resources.get_distribution("pydocstyle").version)' 2> /dev/null) +MINIMUM_PYDOCSTYLEVERSION="3.0.0" + +FLAKE8BUILD="flake8" +MINIMUM_FLAKE8="3.5.0" --- End diff -- @shaneknapp and @cloud-fan, it seems flake8 does work but indeed the problem was compatibility with pycodestyle within flake8 itself. Our script uses pycodestyle 2.4.0 but it's manually downloaded. So, Jenkins's pycodestyle might be different or old and this might be the actual root cause. I locally tested with latest pycodestyle and flake 3.6.0 and it worked. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22963: [SPARK-25962][BUILD][PYTHON] Specify minimum vers...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22963#discussion_r231460041 --- Diff: dev/lint-python --- @@ -87,27 +91,46 @@ else rm "$PYCODESTYLE_REPORT_PATH" fi -# stop the build if there are Python syntax errors or undefined names -flake8 . --count --select=E901,E999,F821,F822,F823 --max-line-length=100 --show-source --statistics -flake8_status="${PIPESTATUS[0]}" +# Check by flake8 +if hash "$FLAKE8BUILD" 2> /dev/null; then --- End diff -- I manually tested each branch. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22963: [SPARK-25962][BUILD][PYTHON]Specify minimum versi...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/22963 [SPARK-25962][BUILD][PYTHON]Specify minimum versions for both pydocstyle and flake8 in 'lint-python' script ## What changes were proposed in this pull request? This PR explicitly specifies `flake8` and `pydocstyle` versions. ## How was this patch tested? Manually tested. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-25962 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22963.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22963 commit e59724316ef5a37b0c9b8cb3311c4867d0e5e7af Author: hyukjinkwon Date: 2018-11-07T10:47:22Z Specify minimum versions for both pydocstyle and flake8 in 'lint-python' script --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22954 For encryption stuff, I will try to handle that as well (maybe as a followup(?)) so that we support it even when that's enabled. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22954 Thanks, @felixcheung. I will address those comments during cleaning up. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r231414561 --- Diff: R/pkg/R/SQLContext.R --- @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() { l[["spark.sql.sources.default"]] } +writeToTempFileInArrow <- function(rdf, numPartitions) { + stopifnot(require("arrow", quietly = TRUE)) + stopifnot(require("withr", quietly = TRUE)) --- End diff -- It's actually dependent on Arrow's API calls .. Those APIs were only added in few weeks ago. I will make it clean when R API of Arrow is close to release (at 0.12.0) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22960: [SPARK-25955][TEST] Porting JSON tests for CSV fu...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22960#discussion_r231413853 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala --- @@ -86,4 +86,82 @@ class CsvFunctionsSuite extends QueryTest with SharedSQLContext { checkAnswer(df.select(to_csv($"a", options)), Row("26/08/2015 18:00") :: Nil) } + + test("from_csv uses DDL strings for defining a schema - java") { +val df = Seq("""1,"haa"""").toDS() +checkAnswer( + df.select( +from_csv($"value", lit("a INT, b STRING"), new java.util.HashMap[String, String]())), + Row(Row(1, "haa")) :: Nil) + } + + test("roundtrip to_csv -> from_csv") { +val df = Seq(Tuple1(Tuple1(1)), Tuple1(null)).toDF("struct") +val schema = df.schema(0).dataType.asInstanceOf[StructType] +val options = Map.empty[String, String] +val readback = df.select(to_csv($"struct").as("csv")) + .select(from_csv($"csv", schema, options).as("struct")) + +checkAnswer(df, readback) + } + + test("roundtrip from_csv -> to_csv") { +val df = Seq(Some("1"), None).toDF("csv") +val schema = new StructType().add("a", IntegerType) +val options = Map.empty[String, String] +val readback = df.select(from_csv($"csv", schema, options).as("struct")) + .select(to_csv($"struct").as("csv")) + +checkAnswer(df, readback) + } + + test("infers schemas of a CSV string and pass to to from_csv") { +val in = Seq("""0.123456789,987654321,"San Francisco"""").toDS() +val options = Map.empty[String, String].asJava +val out = in.select(from_csv('value, schema_of_csv("0.1,1,a"), options) as "parsed") +val expected = StructType(Seq(StructField( + "parsed", + StructType(Seq( +StructField("_c0", DoubleType, true), +StructField("_c1", IntegerType, true), +StructField("_c2", StringType, true)) + +assert(out.schema == expected) + } + + test("Support to_csv in SQL") { --- End diff -- I think we can just get rid of it. I can't imagine both functions are specifically broken alone in `selectExpr`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22951: [SPARK-25945][SQL] Support locale while parsing date/tim...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22951 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3206: [ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3206 Thank you guys! ---
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20944 Please describe manual tests and how it relates to actual usecase. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r231404180 --- Diff: R/pkg/R/functions.R --- @@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType") column(jc) }) +#' @details +#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format. +#' +#' @rdname column_collection_functions +#' @aliases schema_of_json schema_of_json,characterOrColumn-method +#' @examples +#' +#' \dontrun{ +#' json <- '{"name":"Bob"}' +#' df <- sql("SELECT * FROM range(1)") +#' head(select(df, schema_of_json(json)))} +#' @note schema_of_json since 3.0.0 +setMethod("schema_of_json", signature(x = "characterOrColumn"), + function(x, ...) { +if (class(x) == "character") { + col <- callJStatic("org.apache.spark.sql.functions", "lit", x) +} else { + col <- x@jc --- End diff -- Yup.. only literal works but columns don't work. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] incubator-livy issue #121: [WIP][LIVY-518][BUILD] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/incubator-livy/pull/121 @vanzin, looks https://github.com/apache/spark/commit/a75571b46f813005a6d4b076ec39081ffab11844#diff-f697551d2f00bfb336406b6fe6b516fe causes the test failure. At the very least, I can see the unittests pass without that specific commit. I suspect somehow different derby is picked up and it behaves differently, which causes a test failure related - which is pretty a Spark issue or misuse of restarting Hive session. The test case fixes are actually more correct because it enables what it needs. ---
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20944 Sorry, why was this change required? I don't see https://github.com/apache/spark/pull/20944#issuecomment-379525776 is addressed Can you elaborate please? Why do we make `org.apache.derby` as shared? Ideally, minor or maintenance versions of `derby` can be dumped up, and they shouldn't be shared unless there's a strong reason to keep it shared, for instance, making class resolution failed. How did you reproduce this and why the unit test is not added? I found an actual issue while working on Apache Livy Spark 2.4 support. I am still investigating how it relates with the test failures but at the very list I see this specific commit matters since Apache Livy unittests pass without this specific commit. Adding @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15899: [SPARK-18466] added withFilter method to RDD
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15899 +1 for the decision and closing it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] incubator-livy issue #121: [WIP][LIVY-518][BUILD] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/incubator-livy/pull/121 @vanzin, weird. ``` $ ./bin/spark-shell scala> sql("CREATE TABLE tblA(a int)") scala> spark.stop() ``` ``` $ rm -fr metastore_db $ rm -fr spark-warehouse $ rm -fr derby.log ``` ``` scala> import org.apache.spark.sql.SparkSession scala> val spark = SparkSession.builder().enableHiveSupport().getOrCreate() scala> sql("CREATE TABLE tblA(a int)") ``` These steps reproduce the same error all in Spark 2.3.0, 2.3.1 and 2.4.0 - I remember we never properly supported to stop and start Hive enabled support Spark session in this way. SparkR tests were failed due to the same reason. Hmm .. I am taking a look. ---
[GitHub] spark issue #22951: [SPARK-25945][SQL] Support locale while parsing date/tim...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22951 Looks good. I or someone else should take a closer look before getting this in. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22956: [SPARK-25950][SQL] from_csv should respect to spark.sql....
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22956 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22956: [SPARK-25950][SQL] from_csv should respect to spa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22956#discussion_r231370599 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala --- @@ -92,8 +93,14 @@ case class CsvToStructs( } } + val nameOfCorruptRecord = SQLConf.get.getConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD) --- End diff -- Yea, I think so. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for parsing...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22590 I wonder how important it is. I know `spark-csv` at Databricks supported different quote modes and that's gone when we ported that into Spark - the root cause was due to replacing the library from apache-common into univocity. After few years, I only saw one request about reviving the quote mode proposed here - so I suspect how important it is. Basically, @MaxGekk described my stand correctly. Can we investigate a way to set the arbitrary parse settings options? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for parsing...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22590 They should be documented in API doc like `DataFrameReader.scala`. For site, we should avoid doc duplication - It's a general issue to document options. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22954 So far, the regressions tests are passed and newly added test for R optimization is verified locally. Let me fix CRAN test and some nits. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22956: [SPARK-25950][SQL] from_csv should respect to spark.sql....
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22956 Looks good. I or someone else should take a closer look before getting this in. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22960: [SPARK-25955][TEST] Porting JSON tests for CSV fu...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22960#discussion_r231344120 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala --- @@ -86,4 +86,82 @@ class CsvFunctionsSuite extends QueryTest with SharedSQLContext { checkAnswer(df.select(to_csv($"a", options)), Row("26/08/2015 18:00") :: Nil) } + + test("from_csv uses DDL strings for defining a schema - java") { +val df = Seq("""1,"haa"""").toDS() +checkAnswer( + df.select( +from_csv($"value", lit("a INT, b STRING"), new java.util.HashMap[String, String]())), + Row(Row(1, "haa")) :: Nil) + } + + test("roundtrip to_csv -> from_csv") { +val df = Seq(Tuple1(Tuple1(1)), Tuple1(null)).toDF("struct") +val schema = df.schema(0).dataType.asInstanceOf[StructType] +val options = Map.empty[String, String] +val readback = df.select(to_csv($"struct").as("csv")) + .select(from_csv($"csv", schema, options).as("struct")) + +checkAnswer(df, readback) + } + + test("roundtrip from_csv -> to_csv") { +val df = Seq(Some("1"), None).toDF("csv") +val schema = new StructType().add("a", IntegerType) +val options = Map.empty[String, String] +val readback = df.select(from_csv($"csv", schema, options).as("struct")) + .select(to_csv($"struct").as("csv")) + +checkAnswer(df, readback) + } + + test("infers schemas of a CSV string and pass to to from_csv") { +val in = Seq("""0.123456789,987654321,"San Francisco"""").toDS() +val options = Map.empty[String, String].asJava +val out = in.select(from_csv('value, schema_of_csv("0.1,1,a"), options) as "parsed") +val expected = StructType(Seq(StructField( + "parsed", + StructType(Seq( +StructField("_c0", DoubleType, true), +StructField("_c1", IntegerType, true), +StructField("_c2", StringType, true)) + +assert(out.schema == expected) + } + + test("Support to_csv in SQL") { --- End diff -- @MaxGekk, wouldn't the tests in `csv-functions.sql` be enough for SQL support test? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] incubator-livy pull request #121: [LIVY-518][BUILD] Support Spark 2.4
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/incubator-livy/pull/121#discussion_r23134 --- Diff: rsc/src/test/java/org/apache/livy/rsc/TestSparkClient.java --- @@ -271,7 +275,7 @@ public void call(LivyClient client) throws Exception { @Test public void testSparkSQLJob() throws Exception { -runTest(true, new TestFunction() { +runTest(true, false, new TestFunction() { --- End diff -- Okay, let me take a look and get back to you soon. ---
[GitHub] incubator-livy pull request #121: [LIVY-518][BUILD] Support Spark 2.4
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/incubator-livy/pull/121#discussion_r231342303 --- Diff: rsc/src/test/java/org/apache/livy/rsc/TestSparkClient.java --- @@ -271,7 +275,7 @@ public void call(LivyClient client) throws Exception { @Test public void testSparkSQLJob() throws Exception { -runTest(true, new TestFunction() { +runTest(true, false, new TestFunction() { --- End diff -- @vanzin, thanks. I will investigate it within few days ... but would you mind if I do that as a followup job? I still suspect if it's specific to tests and shouldn't affect the functionailities within Livy. ---
[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22954 Adding @yanghaogn as well fyi --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22954 Let me leave a cc @felixcheung, @BryanCutler FYI. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/22954 [DO-NOT-MERGE][POC] Enables Arrow optimization from R DataFrame to Spark DataFrame ## What changes were proposed in this pull request? This PR is not for merging it but targets to demonstrates the feasibility (with reusing PyArrow code path at its best) and performance improvement when converting R dataframes to Spark's dataframe. This can be tested as below: ```bash $ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true ``` ```r collect(createDataFrame(mtcars)) ``` **Requirements:** - R 3.5.x - Arrow package 0.12+ (not released yet) - CRAN released (ARROW-3204) - withr package **TODOs:** - [ ] Performance measurement - [ ] TDB ## How was this patch tested? Small test was added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark r-arrow-createdataframe Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22954.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22954 commit 90011a5ff48f2c5fa5fae0e2573fcdaa85d44976 Author: hyukjinkwon Date: 2018-11-06T02:38:37Z [POC] Enables Arrow optimization from R DataFrame to Spark DataFrame --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r231029629 --- Diff: R/pkg/R/functions.R --- @@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType") column(jc) }) +#' @details +#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format. +#' +#' @rdname column_collection_functions +#' @aliases schema_of_json schema_of_json,characterOrColumn-method +#' @examples +#' +#' \dontrun{ +#' json <- '{"name":"Bob"}' +#' df <- sql("SELECT * FROM range(1)") +#' head(select(df, schema_of_json(json)))} +#' @note schema_of_json since 3.0.0 +setMethod("schema_of_json", signature(x = "characterOrColumn"), + function(x, ...) { +if (class(x) == "character") { + col <- callJStatic("org.apache.spark.sql.functions", "lit", x) +} else { + col <- x@jc --- End diff -- Yea .. that was discussed at https://github.com/apache/spark/pull/22775. The usecase of `schema_of_csv` or `schema_of_json` will usually be like .. copying and pasting one record from the actual data manually. That's disallowed for now conservatively. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22940: [MINOR][R] Rename SQLUtils name to RSQLUtils in R
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22940 Hm .. okay. let me close this for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22940: [MINOR][R] Rename SQLUtils name to RSQLUtils in R
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/22940 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22940: [MINOR][R] Rename SQLUtils name to RSQLUtils in R
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22940 Hmmm yea but like .. some of classes similar with this case have been renamed time to time, for instancem `json InferSchema` -> `json JSONInferSchema` when CSV datasource was added (because there was `csv InferSchema`, which renamed `csv ... CSVInferSchema` as well). Every time I see `SQLUtils`, it's pretty confusing and somehow a bit annoying. The only reason I didn't change this so far was that it needs to change here and there .. This is somewhat what I guess we should eventually change tho. I thought it's better to do given it's 3.0.0. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22305: [SPARK-24561][SQL][Python] User-defined window aggregati...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22305 Let me try to take a look this weekends. Sorry it's been delayed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3206: [ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3206 Ah, yeap. sure. ---
[GitHub] zeppelin issue #3206: [ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3206 This should be ready for a look as is. I already tested Spark 2.4.0. ---
[GitHub] incubator-livy issue #121: [WIP][LIVY-518][BUILD] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/incubator-livy/pull/121 Yup, let me rebase and get rid of [WIP] tag. ---
[GitHub] spark issue #22953: [SPARK-25946] [BUILD] Upgrade ASM to 7.x to support JDK1...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22953 Looks good to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Documents '-I' option (from Scala R...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22919 Merged to master and branch-2.4. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22940: [MINOR][R] Rename SQLUtils name to RSQLUtils in R
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22940 hmhm .. it's trivial and yea it is a logical change. I happened to take a look some codes around here lately, and the name `SQLUtils` actually annoyed me few times :(. I will leave it to @felixcheung. I don;t mind closing this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22948: [SPARK-25944][R][BUILD] AppVeyor change to latest R vers...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22948 cc @felixcheung --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22948: [SPARK-25944][R][BUILD] AppVeyor change to latest...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/22948 [SPARK-25944][R][BUILD] AppVeyor change to latest R version (3.5.1) ## What changes were proposed in this pull request? R 3.5.1 is released 2018-07-02. This PR targets to changes R version from 3.4.1 to 3.5.1. ## How was this patch tested? AppVeyor You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-25944 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22948.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22948 commit 472d339254e86d14ed420320821bec26abe3062e Author: hyukjinkwon Date: 2018-11-05T12:19:54Z AppVeyor change to latest R version (3.5.1) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org