[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22919#discussion_r230701471

--- Diff: bin/spark-shell ---
@@ -32,7 +32,10 @@ if [ -z "${SPARK_HOME}" ]; then
   source "$(dirname "$0")"/find-spark-home
 fi
 
-export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"
+export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]
+
+Scala REPL options:
+  -I <file>                   preload <file>, enforcing line-by-line interpretation"
--- End diff --

Yea .. but `-i` doesn't handle implicits like toDF or symbols, which are pretty basic ones. I think we'd better avoid documenting it for now.
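For reference, the kind of preload script that trips this up is tiny - the snippet below is the `test.scala` from the PR description later in this thread, and the `toDF` call is exactly the implicit-dependent line that breaks under paste-style interpretation:

```scala
// test.scala -- depends on the implicit conversions pre-imported in
// spark-shell (spark.implicits._); under :paste-style interpretation the
// case class and the toDF call are compiled as one unit and toDF fails.
spark.version
case class Record(key: Int, value: String)
spark.sparkContext.parallelize((1 to 2).map(i => Record(i, s"val_$i"))).toDF.show
```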
[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22913

We map `POSIXct` to `TimestampType` in R. I think it makes more sense to not support this. Historically, Parquet has also chosen not to support some types - for instance, `UINT_*`.
[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22913#discussion_r230698731

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
@@ -71,6 +71,7 @@ object ArrowUtils {
     case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
     case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
     case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+    case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
--- End diff --

Yea, but a date is a date. Otherwise, users should use timestamp types if they don't want to truncate. I don't think we should switch the type to cover its capability.
[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22919#discussion_r230689462

--- Diff: bin/spark-shell ---
@@ -32,7 +32,10 @@ if [ -z "${SPARK_HOME}" ]; then
   source "$(dirname "$0")"/find-spark-home
 fi
 
-export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"
+export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]
+
+Scala REPL options:
+  -I <file>                   preload <file>, enforcing line-by-line interpretation"
--- End diff --

Oh haha. Sorry. That's in `SparkSubmitArguments.printUsageAndExit`. That's why I left https://github.com/apache/spark/pull/22919#discussion_r230554455
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22939

Let me clean up and deal with other comments.
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r230688803

--- Diff: R/pkg/R/functions.R ---
@@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType")
           column(jc)
         })
 
+#' @details
+#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format.
+#'
+#' @rdname column_collection_functions
+#' @aliases schema_of_json schema_of_json,characterOrColumn-method
+#' @examples
+#'
+#' \dontrun{
+#' json <- '{"name":"Bob"}'
+#' df <- sql("SELECT * FROM range(1)")
+#' head(select(df, schema_of_json(json)))}
+#' @note schema_of_json since 3.0.0
+setMethod("schema_of_json", signature(x = "characterOrColumn"),
+          function(x, ...) {
+            if (class(x) == "character") {
+              col <- callJStatic("org.apache.spark.sql.functions", "lit", x)
+            } else {
+              col <- x@jc
--- End diff --

That's actually related to the Scala API. There are too many overloaded versions of functions in `functions.scala`, so we're trying to reduce them. Column is preferred over other specific types because Column can cover the other expression cases. In Python or R, they can be easily supported, so both the other types and Column are supported. To cut it short: for consistency with the Scala API.
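As a rough sketch of that Column-first pattern (simplified from the shape of `functions.scala`; the options-taking variants are omitted here), a `String` overload just promotes the literal with `lit()` instead of adding one overload per specific type:

```scala
// One Column-typed overload covers arbitrary expressions; the String
// overload simply wraps the literal in a Column via lit().
def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))
def schema_of_json(json: String): Column = schema_of_json(lit(json))
```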
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r230686967

--- Diff: R/pkg/R/functions.R ---
@@ -205,11 +205,18 @@ NULL
 #' also supported for the schema.
 #' \item \code{from_csv}: a DDL-formatted string
 #' }
-#' @param ... additional argument(s). In \code{to_json}, \code{to_csv} and \code{from_json},
-#'            this contains additional named properties to control how it is converted, accepts
-#'            the same options as the JSON/CSV data source. Additionally \code{to_json} supports
-#'            the "pretty" option which enables pretty JSON generation. In \code{arrays_zip},
-#'            this contains additional Columns of arrays to be merged.
+#' @param ... additional argument(s).
+#' \itemize{
+#' \item \code{to_json}, \code{from_json} and \code{schema_of_json}: this contains
+#'       additional named properties to control how it is converted and accepts the
+#'       same options as the JSON data source.
+#' \item \code{to_json}: it supports the "pretty" option which enables pretty
--- End diff --

Yup.
[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22919#discussion_r230675018

--- Diff: bin/spark-shell ---
@@ -32,7 +32,10 @@ if [ -z "${SPARK_HOME}" ]; then
   source "$(dirname "$0")"/find-spark-home
 fi
 
-export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"
+export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]
+
+Scala REPL options:
+  -I <file>                   preload <file>, enforcing line-by-line interpretation"
--- End diff --

I tested the other options and this one looks like the only valid one, as I described in the PR description.
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22939

Will make another PR after this gets merged to allow the cases below:

```r
df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_json(df$people))
head(select(df, from_json(df$people_json, schema_of_json(head(df)$people_json))))
```
```
  from_json(people_json)
1                    Bob
```
```r
df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_csv(df$people))
head(select(df, from_csv(df$people_json, schema_of_csv(head(df)$people_json))))
```
```
  from_csv(people_json)
1                   Bob
```
[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22905#discussion_r230616072

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -306,7 +306,15 @@ case class FileSourceScanExec(
         withOptPartitionCount
       }
 
-    withSelectedBucketsCount
+    val withOptColumnCount = relation.fileFormat match {
+      case columnar: ColumnarFileFormat =>
+        val sqlConf = relation.sparkSession.sessionState.conf
+        val columnCount = columnar.columnCountForSchema(sqlConf, requiredSchema)
+        withSelectedBucketsCount + ("ColumnCount" -> columnCount.toString)
--- End diff --

I was wondering how important it is to know whether the columns are pruned or not. By that logic, other logs should be put in the metadata too. For instance, we're not even showing the actual filters (not Catalyst ones - I mean the actual pushed filters that are going to be applied at each source implementation level, such as the filters from `ParquetFilters.createFilter`) on the Spark side.
[GitHub] spark issue #18581: [SPARK-21289][SQL][ML] Supports custom line separator fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18581

What you see is what you get. It's not yet finished. See also https://github.com/apache/spark/pull/20877#issuecomment-429182740
[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22914

retest this please
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22939

Makes sense. Let me separate it tomorrow.
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22939

retest this please
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r230586828

--- Diff: R/pkg/R/functions.R ---
@@ -202,14 +202,18 @@ NULL
 #' \itemize{
 #' \item \code{from_json}: a structType object to use as the schema to use
 #'       when parsing the JSON string. Since Spark 2.3, the DDL-formatted string is
-#'       also supported for the schema.
-#' \item \code{from_csv}: a DDL-formatted string
+#'       also supported for the schema. Since Spark 3.0, \code{schema_of_json} or
+#'       a DDL-formatted string literal can also be accepted.
+#' \item \code{from_csv}: a structType object, DDL-formatted string, \code{schema_of_csv}
+#'       or DDL-formatted string literal
 #' }
-#' @param ... additional argument(s). In \code{to_json}, \code{to_csv} and \code{from_json},
-#'            this contains additional named properties to control how it is converted, accepts
-#'            the same options as the JSON/CSV data source. Additionally \code{to_json} supports
-#'            the "pretty" option which enables pretty JSON generation. In \code{arrays_zip},
-#'            this contains additional Columns of arrays to be merged.
+#' @param ... additional argument(s). In \code{to_json}, \code{from_json} and
--- End diff --

Yea, I think so. Let me try.
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22938#discussion_r230581932

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -552,13 +552,19 @@ case class JsonToStructs(
 
   // This converts parsed rows to the desired output by the given schema.
   @transient
-  lazy val converter = nullableSchema match {
-    case _: StructType =>
-      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
-    case _: ArrayType =>
-      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
-    case _: MapType =>
-      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
+  lazy val converter = (rows: Iterator[InternalRow]) => {
+    if (rows.hasNext) {
+      val result = rows.next()
+      // JSON's parser produces one record only.
+      assert(!rows.hasNext)
+      nullableSchema match {
+        case _: StructType => result
--- End diff --

@MaxGekk, this looks like it's going to type-dispatch here per record.
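For context, the removed code in the diff above already had the shape that avoids this - match on the schema once and return a specialized closure. Condensed (imports and the surrounding class omitted):

```scala
// Dispatch on nullableSchema once; each branch yields a closure that runs
// per record without re-matching on the schema type.
lazy val converter: Iterator[InternalRow] => Any = nullableSchema match {
  case _: StructType => rows => if (rows.hasNext) rows.next() else null
  case _: ArrayType  => rows => if (rows.hasNext) rows.next().getArray(0) else null
  case _: MapType    => rows => if (rows.hasNext) rows.next().getMap(0) else null
}
```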
[GitHub] spark issue #22938: [SPARK-25935][SQL] Prevent null rows from JSON parser
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22938

ok to test
[GitHub] spark issue #22940: [MINOR][R] Rename SQLUtils name to RSQLUtils in R
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22940

@felixcheung, I don't feel super strongly about it but it bugged me. WDYT?
[GitHub] spark pull request #22940: [MINOR][R] Rename SQLUtils name to RSQLUtils in R
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/22940

[MINOR][R] Rename SQLUtils name to RSQLUtils in R

## What changes were proposed in this pull request?

This PR proposes to rename `SQLUtils` to `RSQLUtils` in R. This bugged me before, but I didn't propose the change because of backporting concerns. Since we're going to 3.0.0, the backporting concern is likely small. The name `SQLUtils` sounds quite confusing to me. This basically targets to match the Python-side utils:

`org.apache.spark.sql.api.python.PythonSQLUtils`
`org.apache.spark.api.python.PythonUtils`

**Before:**

`org.apache.spark.sql.api.r.SQLUtils`
`org.apache.spark.api.r.RUtils`

**After:**

`org.apache.spark.sql.api.r.RSQLUtils`
`org.apache.spark.api.r.RUtils`

## How was this patch tested?

Existing tests should cover it.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark match-names

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22940.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22940

commit fa4b6f91f0f0b69d54142132be1d7f07cc3dd9b9
Author: hyukjinkwon
Date: 2018-11-04T09:37:43Z

    Match names to SQLUtils in R
[GitHub] spark issue #22938: [SPARK-25935][SQL] Prevent null rows from JSON parser
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22938

add to whitelist
[GitHub] spark issue #22938: [SPARK-25935][SQL] Prevent null rows from JSON parser
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22938

ok to test
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22939

cc @felixcheung and @MaxGekk
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/22939

[SPARK-25446][R] Add schema_of_json() and schema_of_csv() to R

## What changes were proposed in this pull request?

This PR proposes to expose `schema_of_json` and `schema_of_csv` on the R side.

**`schema_of_json`**:

```r
> json <- '{"name":"Bob"}'
> df <- sql("SELECT * FROM range(1)")
> head(select(df, schema_of_json(json)))
  schema_of_json({"name":"Bob"})
1            struct<name:string>
```

**`schema_of_csv`**:

```r
> csv <- "Amsterdam,2018"
> df <- sql("SELECT * FROM range(1)")
> head(select(df, schema_of_csv(csv)))
  schema_of_csv(Amsterdam,2018)
1    struct<_c0:string,_c1:int>
```

This is useful when it's used with [to|from]_[csv|json]:

```r
> df <- sql("SELECT named_struct('name', 'Bob') as people")
> df <- mutate(df, people_json = to_json(df$people))
> head(select(df, from_json(df$people_json, schema_of_json(head(df)$people_json))))
  from_json(people_json)
1                    Bob
```

```r
> df <- sql("SELECT named_struct('name', 'Bob') as people")
> df <- mutate(df, people_json = to_csv(df$people))
> head(select(df, from_csv(df$people_json, schema_of_csv(head(df)$people_json))))
  from_csv(people_json)
1                   Bob
```

## How was this patch tested?

Manually tested, unit tests added, documentation manually built and verified.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-25446

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22939.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22939

commit c4a78fc0b14876a857bdd9b2f8f094744dd76c04
Author: hyukjinkwon
Date: 2018-11-04T08:46:20Z

    Add schema_of_json() and schema_of_csv() to R
[GitHub] spark issue #22934: [INFRA] Close stale PRs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22934

Merged to master.
[GitHub] spark issue #22626: [SPARK-25638][SQL] Adding new function - to_csv()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22626

Merged to master.
[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22626#discussion_r230577431

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala ---
@@ -174,3 +176,68 @@ case class SchemaOfCsv(
   override def prettyName: String = "schema_of_csv"
 }
 
+/**
+ * Converts a [[StructType]] to a CSV output string.
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(expr[, options]) - Returns a CSV string with a given struct value",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(named_struct('a', 1, 'b', 2));
+       1,2
+      > SELECT _FUNC_(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat', 'dd/MM/yyyy'));
+       "26/08/2015"
+  """,
+  since = "3.0.0")
+// scalastyle:on line.size.limit
+case class StructsToCsv(
+    options: Map[String, String],
+    child: Expression,
+    timeZoneId: Option[String] = None)
+  extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
+  override def nullable: Boolean = true
+
+  def this(options: Map[String, String], child: Expression) = this(options, child, None)
+
+  // Used in `FunctionRegistry`
+  def this(child: Expression) = this(Map.empty, child, None)
+
+  def this(child: Expression, options: Expression) =
+    this(
+      options = ExprUtils.convertToMapData(options),
+      child = child,
+      timeZoneId = None)
+
+  @transient
+  lazy val writer = new CharArrayWriter()
+
+  @transient
+  lazy val inputSchema: StructType = child.dataType match {
+    case st: StructType => st
+    case other =>
+      throw new IllegalArgumentException(s"Unsupported input type ${other.catalogString}")
+  }
+
+  @transient
+  lazy val gen = new UnivocityGenerator(
+    inputSchema, writer, new CSVOptions(options, columnPruning = true, timeZoneId.get))
--- End diff --

nit: We wouldn't need `lazy val writer` then, but just `new CharArrayWriter()` here.
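A sketch of what the nit suggests, assuming nothing besides the generator needs the writer (which holds if the string is read back via the `writeToString` helper discussed elsewhere in this thread):

```scala
// The writer need not be a field if the generator is its only consumer.
@transient
lazy val gen = new UnivocityGenerator(
  inputSchema, new CharArrayWriter(), new CSVOptions(options, columnPruning = true, timeZoneId.get))
```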
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Documents '-I' option (from Scala R...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22919

Let me get this into master and branch-2.4 in a few days if there are no more comments.
[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22914

retest this please
[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22919#discussion_r230554455

--- Diff: bin/spark-shell2.cmd ---
@@ -20,7 +20,13 @@ rem
 
 rem Figure out where the Spark framework is installed
 call "%~dp0find-spark-home.cmd"
 
-set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options]
+set LF=^
+
+
+rem two empty lines are required
+set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options]^%LF%%LF%^%LF%%LF%^
+Scala REPL options:^%LF%%LF%^
--- End diff --

Script-specific information looks like it belongs in `_SPARK_CMD_USAGE` - this seems a more appropriate place than somewhere in `SparkSubmitArguments.printUsageAndExit`.
[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22914

retest this please
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Documents '-I' option (from Scala R...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22919

Should be ready for a review.
[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22919#discussion_r230554256

--- Diff: bin/spark-shell2.cmd ---
@@ -20,7 +20,13 @@ rem
 
 rem Figure out where the Spark framework is installed
 call "%~dp0find-spark-home.cmd"
 
-set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options]
+set LF=^
+
+
+rem two empty lines are required
+set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options]^%LF%%LF%^%LF%%LF%^
+Scala REPL options:^%LF%%LF%^
--- End diff --

There seems to be no cleverer way than this to set newlines in variables in batch files.
[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...
GitHub user HyukjinKwon reopened a pull request: https://github.com/apache/spark/pull/22919

[SPARK-25906][SHELL] Documents '-I' option (from Scala REPL) in spark-shell

## What changes were proposed in this pull request?

It looks like we mistakenly changed the `-i` option's behaviour in `spark-shell`. This PR targets to restore the previous option and behaviour. The root cause seems to be https://github.com/scala/scala/commit/99dad60d984d3f72338f3bad4c4fe905090edd51. That commit changed what `-i` means: the `-i` option was replaced by `-I`. The _newly replaced_ option `-i` at Scala 2.11.12 works like `:paste` (previously it worked like `:load`). `:paste` does not appear to work with implicits - at least I verified Spark 2.4.0, 2.3.2, 2.0.0, and the current master:

```bash
scala> :paste test.scala
Pasting file test.scala...
<console>:19: error: value toDF is not a member of org.apache.spark.rdd.RDD[Record]
Error occurred in an application involving default arguments.
  spark.sparkContext.parallelize((1 to 2).map(i => Record(i, s"val_$i"))).toDF.show
                                                                           ^
```

Note that `./bin/spark-shell --help` does not describe this option, so I guess it's not explicitly documented (I guess?); however, it's best not to break it. The changes are only two lines. In particular, we should backport this to branch-2.4.

## How was this patch tested?

Manually tested. With the input below:

```bash
$ cat test.scala
spark.version
case class Record(key: Int, value: String)
spark.sparkContext.parallelize((1 to 2).map(i => Record(i, s"val_$i"))).toDF.show
```

**Spark 2.3.2:**

```scala
$ bin/spark-shell -i test.scala
...
+---+-----+
|key|value|
+---+-----+
|  1|val_1|
|  2|val_2|
+---+-----+
```

**Before:**

```scala
$ bin/spark-shell -i test.scala
...
test.scala:17: error: value toDF is not a member of org.apache.spark.rdd.RDD[Record]
Error occurred in an application involving default arguments.
  spark.sparkContext.parallelize((1 to 2).map(i => Record(i, s"val_$i"))).toDF.show
                                                                           ^
```

**After:**

```scala
$ ./bin/spark-shell -i test.scala
...
+---+-----+
|key|value|
+---+-----+
|  1|val_1|
|  2|val_2|
+---+-----+
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-25906

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22919.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22919

commit 5f3cb87c8798e72cc6852e71c02ffc2077c748d7
Author: hyukjinkwon
Date: 2018-11-03T12:48:25Z

    Documents '-I' option (from Scala REPL) in spark-shell
[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/22919
[GitHub] spark issue #22921: [SPARK-25908][CORE][SQL] Remove old deprecated items in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22921

Looks okay to me too, but I'd also leave this open for a few more days.
[GitHub] spark pull request #22908: [MINOR][SQL] Replace all TreeNode's node name in ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22908#discussion_r230544923

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -56,7 +56,7 @@ case class DataSourceV2Relation(
 
   override def pushedFilters: Seq[Expression] = Seq.empty
 
-  override def simpleString: String = "RelationV2 " + metadataString
+  override def simpleString: String = s"$nodeName " + metadataString
--- End diff --

I'd follow this comment, actually.
[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22913#discussion_r230544838

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
@@ -71,6 +71,7 @@ object ArrowUtils {
     case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
     case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
     case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+    case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
--- End diff --

Wait .. is it correct to map it to `TimestampType`? It looks like this is why `Date` with `MILLISECOND` was not added.
[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22913

Can we add a test in `ArrowUtilsSuite.scala`?
[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22626#discussion_r230544775

--- Diff: sql/core/src/test/resources/sql-tests/inputs/csv-functions.sql ---
@@ -15,3 +15,10 @@ CREATE TEMPORARY VIEW csvTable(csvField, a) AS SELECT * FROM VALUES ('1,abc', 'a
 SELECT schema_of_csv(csvField) FROM csvTable;
 -- Clean up
 DROP VIEW IF EXISTS csvTable;
+-- to_csv
+select to_csv(named_struct('a', 1, 'b', 2));
+select to_csv(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat', 'dd/MM/yyyy'));
+-- Check if errors handled
+select to_csv(named_struct('a', 1, 'b', 2), named_struct('mode', 'PERMISSIVE'));
+select to_csv(named_struct('a', 1, 'b', 2), map('mode', 1));
--- End diff --

This one too, since the exception is from `convertToMapData`. We only need one test - this one or the one right above. One of them can be removed.
[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22626#discussion_r230544760

--- Diff: sql/core/src/test/resources/sql-tests/inputs/csv-functions.sql ---
@@ -15,3 +15,10 @@ CREATE TEMPORARY VIEW csvTable(csvField, a) AS SELECT * FROM VALUES ('1,abc', 'a
 SELECT schema_of_csv(csvField) FROM csvTable;
 -- Clean up
 DROP VIEW IF EXISTS csvTable;
+-- to_csv
+select to_csv(named_struct('a', 1, 'b', 2));
+select to_csv(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat', 'dd/MM/yyyy'));
+-- Check if errors handled
+select to_csv(named_struct('a', 1, 'b', 2), named_struct('mode', 'PERMISSIVE'));
+select to_csv(named_struct('a', 1, 'b', 2), map('mode', 1));
+select to_csv();
--- End diff --

I think we don't have to test this since it's not specific to this expression.
[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22626#discussion_r230544717

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala ---
@@ -174,3 +176,66 @@ case class SchemaOfCsv(
   override def prettyName: String = "schema_of_csv"
 }
 
+/**
+ * Converts a [[StructType]] to a CSV output string.
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(expr[, options]) - Returns a CSV string with a given struct value",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(named_struct('a', 1, 'b', 2));
+       1,2
+      > SELECT _FUNC_(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat', 'dd/MM/yyyy'));
+       "26/08/2015"
+  """,
+  since = "3.0.0")
+// scalastyle:on line.size.limit
+case class StructsToCsv(
+    options: Map[String, String],
+    child: Expression,
+    timeZoneId: Option[String] = None)
+  extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
+  override def nullable: Boolean = true
+
+  def this(options: Map[String, String], child: Expression) = this(options, child, None)
+
+  // Used in `FunctionRegistry`
+  def this(child: Expression) = this(Map.empty, child, None)
+
+  def this(child: Expression, options: Expression) =
+    this(
+      options = ExprUtils.convertToMapData(options),
+      child = child,
+      timeZoneId = None)
+
+  @transient
+  lazy val writer = new CharArrayWriter()
+
+  @transient
+  lazy val inputSchema: StructType = child.dataType match {
+    case st: StructType => st
+    case other =>
+      throw new IllegalArgumentException(s"Unsupported input type ${other.catalogString}")
+  }
+
+  @transient
+  lazy val gen = new UnivocityGenerator(
+    inputSchema, writer, new CSVOptions(options, columnPruning = true, timeZoneId.get))
+
+  // This converts rows to the CSV output according to the given schema.
+  @transient
+  lazy val converter: Any => UTF8String = {
+    (row: Any) => UTF8String.fromString(gen.writeToString(row.asInstanceOf[InternalRow]))
+  }
+
+  override def dataType: DataType = StringType
+
+  override def withTimeZone(timeZoneId: String): TimeZoneAwareExpression =
+    copy(timeZoneId = Option(timeZoneId))
+
+  override def nullSafeEval(value: Any): Any = converter(value)
+
+  override def inputTypes: Seq[AbstractDataType] = TypeCollection(StructType) :: Nil
--- End diff --

I think we can use `StructType :: Nil`.
[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22626#discussion_r230544667

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala ---
@@ -174,3 +176,66 @@ case class SchemaOfCsv(
   override def prettyName: String = "schema_of_csv"
 }
 
+/**
+ * Converts a [[StructType]] to a CSV output string.
+ */
...
+  @transient
+  lazy val writer = new CharArrayWriter()
+
+  @transient
+  lazy val gen = new UnivocityGenerator(
+    inputSchema, writer, new CSVOptions(options, columnPruning = true, timeZoneId.get))
+
+  // This converts rows to the CSV output according to the given schema.
+  @transient
+  lazy val converter: Any => UTF8String = {
+    (row: Any) => UTF8String.fromString(gen.writeToString(row.asInstanceOf[InternalRow]))
--- End diff --

@MaxGekk, can we use the data from `writer`, like `writer.toString` and `writer.reset()`, as `to_json` does? It looks like we are going to avoid the header (which is fine). If we explicitly set `header` to `false` in this expression, it looks like we don't need to add `writeToString` to `UnivocityGenerator`.
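The `to_json`-style pattern referred to here looks roughly like the following. This is the shape `StructsToJson` uses with its `JacksonGenerator`; whether `UnivocityGenerator` exposes the same `write`/`flush` pair is an assumption in this sketch:

```scala
// Write through the shared CharArrayWriter, then read the buffer back and
// reset it for the next record (write/flush assumed, mirroring to_json).
@transient
lazy val converter: Any => UTF8String = { (row: Any) =>
  gen.write(row.asInstanceOf[InternalRow])
  gen.flush()
  val csv = writer.toString
  writer.reset()
  UTF8String.fromString(csv)
}
```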
[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22913

ok to test
[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22914

retest this please
[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22913

cc @BryanCutler
[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22626#discussion_r230544556

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala ---
@@ -45,7 +45,6 @@ class CsvFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(Row(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0"))))
   }
 
-
--- End diff --

Ah, I prefer not to include unrelated changes, but it's okay.
[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22626#discussion_r230544492

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala ---
@@ -15,18 +15,17 @@
  * limitations under the License.
  */
 
-package org.apache.spark.sql.execution.datasources.csv
+package org.apache.spark.sql.catalyst.csv
 
 import java.io.Writer
 
 import com.univocity.parsers.csv.CsvWriter
 
 import org.apache.spark.sql.catalyst.InternalRow
-import org.apache.spark.sql.catalyst.csv.CSVOptions
 import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.types._
 
-private[csv] class UnivocityGenerator(
+private[sql] class UnivocityGenerator(
--- End diff --

Let's remove `private[sql]`. We are already in an internal package, `catalyst`.
[GitHub] spark issue #22923: [SPARK-25910][CORE] accumulator updates from previous st...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22923

retest this please
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22919

Let me go ahead with the documentation approach tomorrow, then.
[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22905#discussion_r230250684

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ColumnarFileFormat.scala ---
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+
+/**
+ * An optional mix-in for columnar [[FileFormat]]s. This trait provides some helpful metadata when
+ * debugging a physical query plan.
+ */
+private[sql] trait ColumnarFileFormat {
--- End diff --

Thanks for the explanation. Mind changing it to `private[datasources]`? I am still not clear on:

1. whether this information is worth putting in the metadata, and why it should be there;
2. whether this information can be generalised;
3. the purpose of this info - is it to check whether the columns are actually being pruned or not?
[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22905#discussion_r230249905

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -306,7 +306,15 @@ case class FileSourceScanExec(
         withOptPartitionCount
       }
 
-    withSelectedBucketsCount
+    val withOptColumnCount = relation.fileFormat match {
+      case columnar: ColumnarFileFormat =>
+        val sqlConf = relation.sparkSession.sessionState.conf
+        val columnCount = columnar.columnCountForSchema(sqlConf, requiredSchema)
+        withSelectedBucketsCount + ("ColumnCount" -> columnCount.toString)
--- End diff --

The purpose of this info is to check the number of columns actually selected, and that information can be shown via logging, no? Why should it be exposed in the metadata then? Maybe add debug logging that shows the number of columns actually being selected from the underlying source.
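A sketch of that logging alternative, reusing the names from the diff above; the placement and message are hypothetical, and `logDebug` assumes the enclosing class mixes in Spark's `Logging` trait:

```scala
// Hypothetical: report the selected column count at debug level instead of
// adding it to the plan's metadata map.
relation.fileFormat match {
  case columnar: ColumnarFileFormat =>
    val sqlConf = relation.sparkSession.sessionState.conf
    val columnCount = columnar.columnCountForSchema(sqlConf, requiredSchema)
    logDebug(s"Columnar scan selects $columnCount columns for schema $requiredSchema")
  case _ => // non-columnar formats: nothing to report
}
```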
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22919

True, but it's a bit difficult to call it a downside, because `-i` has a quite major issue as described above - users' code suddenly stops working with implicit methods (like the symbol `'id` or `.toDF`, which are pretty common) across a minor release bump. Also, it looks like the only reason `:load` was changed to `:paste` for the `-i` option was a bug (SI-7898). For documentation, I can mention that the Scala shell flags can be used. It looks like @cloud-fan is going to leave this JIRA as a known issue, so I guess we are good. Maybe I can leave an additional note in the migration guide as well if we're not going ahead to fix it.
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22919

Thanks, @dbtsai. Also adding @vanzin and @srowen - WDYT about the `-i` option at 2.4.x and 3.0.0?
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22919

Also, the difference introduced between `-i` and `-I` sounds rather like a fix for a bug ([SI-7898](https://issues.scala-lang.org/browse/SI-7898)) that already existed in the previous shell. I ended up thinking that end users upgrading Spark and facing such issues matters more than syncing with the Scala 2.11.12 shell in Spark 2.4.x. I don't feel super strongly about it, but that was my conclusion. I will defer to the folks here about whether to go ahead in this way or not.
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22919

Note that the `python` shell also takes a file, but the `pyspark` shell doesn't:

```
$ cat tmp.py
print "a"
```
```
$ python2 tmp.py
a
```
```
$ ./bin/pyspark tmp.py
Running python applications through 'pyspark' is not supported as of Spark 2.0.
Use ./bin/spark-submit <python file>
```
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22919

I was actually thinking hard about that. My conclusion was that basically we should, but `spark-shell` is a Spark entry point where we make a release - so I thought it's better to keep it in 2.4.0. In 3.0.0, I actually think we should disallow both `-i` and `-I`, like the `pyspark` or `sparkr` shells (we allowed a file before, but not anymore).
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22919

Yup, that's what I thought - and treat them as the same.
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Revives '-i' option in spark-shell
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22919

That's exactly what I initially thought. At least we could do that in 3.0.0, but I think we should keep it as is in 2.4.x.
[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22626#discussion_r229981376

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3905,6 +3905,47 @@ object functions {
     withExpr(SchemaOfCsv(csv.expr, options.asScala.toMap))
   }
 
+  /**
+   * (Scala-specific) Converts a column containing a `StructType` into a CSV string
+   * with the specified schema. Throws an exception, in the case of an unsupported type.
+   *
+   * @param e a column containing a struct.
+   * @param options options to control how the struct column is converted into a CSV string.
+   *                It accepts the same options and the CSV data source.
+   *
+   * @group collection_funcs
+   * @since 3.0.0
+   */
+  def to_csv(e: Column, options: Map[String, String]): Column = withExpr {
--- End diff --

Let's get rid of this Scala version for now.
[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22626#discussion_r229981416

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala ---
@@ -45,7 +45,6 @@ class CsvFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(Row(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0"))))
   }
 
-
--- End diff --

Seems like a mistake.
[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22626#discussion_r229981271

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala ---
@@ -174,3 +176,66 @@ case class SchemaOfCsv(
   override def prettyName: String = "schema_of_csv"
 }
 
+/**
+ * Converts a [[StructType]] to a CSV output string.
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(expr[, options]) - Returns a CSV string with a given struct value",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(named_struct('a', 1, 'b', 2));
+       1,2
+      > SELECT _FUNC_(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat', 'dd/MM/yyyy'));
+       "26/08/2015"
+  """,
+  since = "3.0.0")
+// scalastyle:on line.size.limit
+case class StructsToCsv(
+    options: Map[String, String],
+    child: Expression,
+    timeZoneId: Option[String] = None)
--- End diff --

Seems like an indentation mistake.
[GitHub] spark issue #22626: [SPARK-25638][SQL] Adding new function - to_csv()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22626

@MaxGekk, BTW, I added `schema_of_csv` on the R side. You can add `schema_of_json` on the R side in a similar way.
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Revives '-i' option in spark-shell
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22919

cc @felixcheung as well. This is a similar code path to https://github.com/apache/zeppelin/pull/3206
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Revives '-i' option in spark-shell
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22919

Note that some commands like `:replay` wouldn't work either if the `-i` option is given to the shell, since `:paste` itself doesn't appear to work. This PR also fixes that.
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Revive -i option in spark-shell
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22919

cc @dbtsai, @dongjoon-hyun, @gatorsmile and @cloud-fan
[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Revive -i option in spark-sh...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/22919

[SPARK-25906][SHELL] Revive -i option in spark-shell

## What changes were proposed in this pull request?

It looks like we mistakenly removed the `-i` option from `spark-shell`. This PR targets to restore the previous option and behaviour. The root cause seems to be https://github.com/scala/scala/commit/99dad60d984d3f72338f3bad4c4fe905090edd51. That commit changed what `-i` means: the `-i` option was replaced by `-I`. The _newly replaced_ option `-i` at Scala 2.11.12 works like `:paste` (previously it worked like `:load`). `:paste` doesn't appear to work so far - at least I verified Spark 2.4.0, 2.3.2, 2.0.0, and the current master:

```
scala> :paste test.scala
Pasting file test.scala...
<console>:19: error: value toDF is not a member of org.apache.spark.rdd.RDD[Record]
Error occurred in an application involving default arguments.
  spark.sparkContext.parallelize((1 to 2).map(i => Record(i, s"val_$i"))).toDF.show
                                                                           ^
```

Note that `./bin/spark-shell --help` does not describe this option, so I guess it's not explicitly documented (I guess?); however, it's best not to break it. The changes are only two lines.

## How was this patch tested?

Manually tested. With the input below:

```
$ cat test.scala
spark.version
case class Record(key: Int, value: String)
spark.sparkContext.parallelize((1 to 2).map(i => Record(i, s"val_$i"))).toDF.show
```

**Spark 2.3.2:**

```scala
$ bin/spark-shell -i test.scala
...
+---+-----+
|key|value|
+---+-----+
|  1|val_1|
|  2|val_2|
+---+-----+
```

**Before:**

```scala
$ bin/spark-shell -i test.scala
...
test.scala:17: error: value toDF is not a member of org.apache.spark.rdd.RDD[Record]
Error occurred in an application involving default arguments.
  spark.sparkContext.parallelize((1 to 2).map(i => Record(i, s"val_$i"))).toDF.show
```

**After:**

```scala
$ ./bin/spark-shell -i test.scala
...
+---+-----+
|key|value|
+---+-----+
|  1|val_1|
|  2|val_2|
+---+-----+
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-25906

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22919.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22919

commit 17b3c6c5ad4a39b1ecfd33111c392de47c6df967
Author: hyukjinkwon
Date: 2018-11-01T08:53:39Z

    Revive -i option in spark-shell
[GitHub] spark issue #22912: [SPARK-25901][CORE] Use only one thread in BarrierTaskCo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22912

cc @jiangxb1987 and @mengxr
[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22905#discussion_r229933689

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ColumnarFileFormat.scala ---
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+
+/**
+ * An optional mix-in for columnar [[FileFormat]]s. This trait provides some helpful metadata when
+ * debugging a physical query plan.
+ */
+private[sql] trait ColumnarFileFormat {
--- End diff --

If it's supposed to be exposed as an interface to external data sources, then I wouldn't even add this one. It looks like a rough guess that it can be generalised.
[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22905#discussion_r229933544

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -306,7 +306,15 @@ case class FileSourceScanExec(
       withOptPartitionCount
     }

-    withSelectedBucketsCount
+    val withOptColumnCount = relation.fileFormat match {
+      case columnar: ColumnarFileFormat =>
+        val sqlConf = relation.sparkSession.sessionState.conf
+        val columnCount = columnar.columnCountForSchema(sqlConf, requiredSchema)
+        withSelectedBucketsCount + ("ColumnCount" -> columnCount.toString)
--- End diff --

Is this something we really should include in the metadata? If the purpose of this is to check whether the column pruning works or not, logging should be good enough. Adding a trait for it sounds like overkill for the current status. Let's not add an abstraction just for a rough guess that it can be generalised.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
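If logging were preferred, a minimal sketch could look like the following (hypothetical, not the PR's code; it assumes a home for the helper and reuses the `requiredSchema` name from the diff above):

```scala
import org.apache.spark.internal.Logging
import org.apache.spark.sql.types.StructType

// Hypothetical sketch of the logging alternative: report how many columns
// survive pruning instead of threading a new trait through the scan node.
object ColumnPruningLog extends Logging {
  def logPrunedColumnCount(formatName: String, requiredSchema: StructType): Unit =
    logInfo(s"Post-pruning column count for $formatName: ${requiredSchema.fields.length}")
}
```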
[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22905#discussion_r229932338

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ColumnarFileFormat.scala ---
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+
+/**
+ * An optional mix-in for columnar [[FileFormat]]s. This trait provides some helpful metadata when
+ * debugging a physical query plan.
+ */
+private[sql] trait ColumnarFileFormat {
--- End diff --

and I would actually make it `private[datasources]` since it's currently only used in ParquetFileFormat.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22905#discussion_r229932234

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ColumnarFileFormat.scala ---
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+
+/**
+ * An optional mix-in for columnar [[FileFormat]]s. This trait provides some helpful metadata when
+ * debugging a physical query plan.
+ */
+private[sql] trait ColumnarFileFormat {
--- End diff --

It's already in a private package. `private[sql]` can be removed.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
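To illustrate the access-modifier point being made here, a standalone sketch (illustrative only, not Spark source; the trait names are placeholders):

```scala
// Illustrative only: qualified access modifiers in Scala.
package org.apache.spark.sql.execution.datasources {

  // Visible anywhere under org.apache.spark.sql - which is what
  // `private[sql]` means, and is redundant once the enclosing package
  // is itself private to `sql`.
  private[sql] trait VisibleUnderSql

  // Narrower: visible only under execution.datasources, matching the
  // `private[datasources]` suggestion above, since only ParquetFileFormat
  // currently uses the trait.
  private[datasources] trait VisibleUnderDatasources
}
```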
[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22666 Argh, sorry, it was my mistake. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22666

Ah no, I am sorry @MaxGekk. I mistakenly set the primary author to me. It showed my email first.

```
=== Pull Request #22666 ===
title		[SPARK-25672][SQL] schema_of_csv() - schema inference from an example
source		MaxGekk/schema_of_csv-function
target		master
url		https://api.github.com/repos/apache/spark/pulls/22666

Proceed with merging pull request #22666? (y/n): y
git fetch apache-github pull/22666/head:PR_TOOL_MERGE_PR_22666
From https://github.com/apache/spark
 * [new ref]         refs/pull/22666/head -> PR_TOOL_MERGE_PR_22666
git fetch apache master:PR_TOOL_MERGE_PR_22666_MASTER
remote: Counting objects: 303, done.
remote: Compressing objects: 100% (153/153), done.
remote: Total 209 (delta 91), reused 0 (delta 0)
Receiving objects: 100% (209/209), 91.89 KiB | 445.00 KiB/s, done.
Resolving deltas: 100% (91/91), completed with 65 local objects.
From https://git-wip-us.apache.org/repos/asf/spark
 * [new branch]      master     -> PR_TOOL_MERGE_PR_22666_MASTER
   57eddc7182e..c5ef477d2f6  master     -> apache/master
git checkout PR_TOOL_MERGE_PR_22666_MASTER
Switched to branch 'PR_TOOL_MERGE_PR_22666_MASTER'
['git', 'merge', 'PR_TOOL_MERGE_PR_22666', '--squash']
Automatic merge went well; stopped before committing as requested
['git', 'log', 'HEAD..PR_TOOL_MERGE_PR_22666', '--pretty=format:%an <%ae>']
Enter primary author in the format of "name <email>" [hyukjinkwon ]: hyukjinkwon
['git', 'log', 'HEAD..PR_TOOL_MERGE_PR_22666', '--pretty=format:%h [%an] %s']
```

Looks like the commit order affects the name appearing for `Enter primary author in the format of "name <email>"`.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22666 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22901: [SPARK-25891][PYTHON] Upgrade to Py4J 0.10.8.1
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22901 Late LGTM! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22895: [SPARK-25886][SQL][Minor] Improve error message of `Fail...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22895 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22902: [SPARK-25893][SQL] Show a directional error message for ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22902 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22666 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3206 Thanks for confirmation. ---
[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22666 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22895: [SPARK-25886][SQL][Minor] Improve error message of `Fail...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22895 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3206 I also tested this patch against Spark RC5. ---
[GitHub] zeppelin issue #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/zeppelin/pull/3206

BTW, I tested this via my Travis CI - https://travis-ci.org/HyukjinKwon/zeppelin/builds/448215776. Tests seem to have passed.

---
[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
GitHub user HyukjinKwon reopened a pull request:

    https://github.com/apache/zeppelin/pull/3206

[WIP][ZEPPELIN-3810] Support Spark 2.4

### What is this PR for?

Spark 2.4 changed its Scala version from 2.11.8 to 2.11.12 (see SPARK-24418). There are two problems for this upgrade on the Zeppelin side:

1. Some methods that were used privately via reflection, for instance `loopPostInit`, became inaccessible. See:

- https://github.com/scala/scala/blob/v2.11.8/src/repl/scala/tools/nsc/interpreter/ILoop.scala
- https://github.com/scala/scala/blob/v2.11.12/src/repl/scala/tools/nsc/interpreter/ILoop.scala

To work around this, I manually ported `loopPostInit` from 2.11.8 to retain the behaviour. Some functions that exist in both Scala 2.11.8 and Scala 2.11.12 are used inside the new `loopPostInit` via reflection.

2. The upgrade from 2.11.8 to 2.11.12 requires a `jline.version` upgrade. Otherwise, we hit:

```
Caused by: java.lang.NoSuchMethodError: jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V
	at scala.tools.nsc.interpreter.jline.JLineConsoleReader.initCompletion(JLineReader.scala:139)
```

To work around this, I upgraded jline from `2.12.1` to `2.14.3`.

### What type of PR is it?

[Improvement]

### Todos

* [ ] - Wait until Spark 2.4.0 is officially released.

### What is the Jira issue?

* https://issues.apache.org/jira/browse/ZEPPELIN-3810

### How should this be tested?

Verified manually against Spark 2.4.0 RC3

### Questions:

* Does the licenses files need update? Yes
* Is there breaking changes for older versions? No
* Does this needs documentation? No

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/zeppelin ZEPPELIN-3810

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zeppelin/pull/3206.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3206

commit 9ac1797a7e175a73350d61a0e62b3e4f89b29b8f
Author: hyukjinkwon
Date: 2018-10-17T14:41:29Z

    Support Spark 2.4

---
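The reflection workaround in point 1 can be sketched as follows (illustrative only - the helper below is a placeholder, not the actual Zeppelin code):

```scala
// Illustrative sketch: invoking a method that is no longer publicly
// accessible, in the spirit of the ported loopPostInit. It walks the class
// hierarchy so methods declared on superclasses (e.g. ILoop internals)
// are found as well.
def invokeInaccessible(target: AnyRef, name: String, args: AnyRef*): AnyRef = {
  val method = Iterator
    .iterate[Class[_]](target.getClass)(_.getSuperclass)
    .takeWhile(_ != null)
    .flatMap(_.getDeclaredMethods)
    .find(_.getName == name)
    .getOrElse(throw new NoSuchMethodException(name))
  method.setAccessible(true) // bypass the access tightened in Scala 2.11.12
  method.invoke(target, args: _*)
}
```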
[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon closed the pull request at: https://github.com/apache/zeppelin/pull/3206 ---
[GitHub] spark pull request #22891: [SPARK-25881][pyspark] df.toPandas() convert deci...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22891#discussion_r229564044

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -2064,6 +2064,7 @@ def toDF(self, *cols):
     @since(1.3)
     def toPandas(self):
         """
+        :param coerce_float: default False, if Ture, will handle decimal type to np.float64 instand of type object.
--- End diff --

There's no such param. Also, we don't convert it to a string. Also, the Arrow optimization code path should be fixed as well, BTW.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22895: [SPARK-25886][SQL][Minor] Improve error message o...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22895#discussion_r229563551

--- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala ---
@@ -100,9 +100,14 @@ case class AvroDataToCatalyst(
       case NonFatal(e) => parseMode match {
         case PermissiveMode => nullResultRow
         case FailFastMode =>
-          throw new SparkException("Malformed records are detected in record parsing. " +
+          val msg = if (e.getMessage != null) {
+            e.getMessage + "\n"
+          } else {
+            ""
+          }
+          throw new SparkException(msg + "Malformed records are detected in record parsing. " +
             s"Current parse Mode: ${FailFastMode.name}. To process malformed records as null " +
-            "result, try setting the option 'mode' as 'PERMISSIVE'.", e.getCause)
+            "result, try setting the option 'mode' as 'PERMISSIVE'.", e)
--- End diff --

How about we just remove the `msg` change? If I am not mistaken, usually it's just attached as the cause alone.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22891: [SPARK-25881][pyspark] df.toPandas() convert decimal to ...
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22891

I don't think we should map decimals to float. It will lose precision and it's a breaking change.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
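A quick illustration of the precision loss (plain Scala, not the PR's code; the value is an arbitrary example):

```scala
// Sketch: a decimal with more significant digits than a Double can represent.
val d = BigDecimal("1234567890.123456789012345678")
val asDouble = d.toDouble          // ~1.2345678901234567E9
// Round-tripping does not recover the original value: digits beyond the
// ~15-17 significant digits a Double can hold are silently dropped.
println(BigDecimal(asDouble) == d) // false
```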
[GitHub] spark pull request #22895: [SPARK-25886][SQL][Minor] Improve error message o...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22895#discussion_r229562284

--- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala ---
@@ -100,9 +100,14 @@ case class AvroDataToCatalyst(
       case NonFatal(e) => parseMode match {
         case PermissiveMode => nullResultRow
         case FailFastMode =>
-          throw new SparkException("Malformed records are detected in record parsing. " +
+          val msg = if (e.getMessage != null) {
+            e.getMessage + "\n"
+          } else {
+            ""
+          }
+          throw new SparkException(msg + "Malformed records are detected in record parsing. " +
             s"Current parse Mode: ${FailFastMode.name}. To process malformed records as null " +
-            "result, try setting the option 'mode' as 'PERMISSIVE'.", e.getCause)
+            "result, try setting the option 'mode' as 'PERMISSIVE'.", e)
--- End diff --

Adding the cause looks fine. I was actually wondering whether adding `msg` is necessary.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22896: [SPARKR]found some extra whitespace in the R tests
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22896

@shaneknapp, I already merged this into the master branch (https://github.com/apache/spark/commit/243ce319a06f20365d5b08d479642d75748645d9) :-). The AppVeyor tests passed (https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/19932574).

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22896: [SPARKR]found some extra whitespace in the R tests
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22896

@shaneknapp, I just got this in since it's orthogonal to this PR :-).

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22896: [SPARKR]found some extra whitespace in the R tests
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22896 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22844: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22844 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22895: [SPARK-25886][SQL][Minor] Improve error message o...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22895#discussion_r229539289

--- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala ---
@@ -100,9 +100,14 @@ case class AvroDataToCatalyst(
       case NonFatal(e) => parseMode match {
         case PermissiveMode => nullResultRow
         case FailFastMode =>
-          throw new SparkException("Malformed records are detected in record parsing. " +
+          val msg = if (e.getMessage != null) {
+            e.getMessage + "\n"
+          } else {
+            ""
+          }
--- End diff --

I think it's okay to inline this if-else.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
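The inlined form being suggested would look roughly like this (a fragment reusing the names and context from the diff above, not standalone code):

```scala
// Sketch of the inline if-else, in the same FailFastMode case clause:
val msg = if (e.getMessage != null) e.getMessage + "\n" else ""
throw new SparkException(msg + "Malformed records are detected in record parsing. " +
  s"Current parse Mode: ${FailFastMode.name}. To process malformed records as null " +
  "result, try setting the option 'mode' as 'PERMISSIVE'.", e)
```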
[GitHub] spark pull request #22895: [SPARK-25886][SQL][Minor] Improve error message o...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22895#discussion_r229539055

--- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala ---
@@ -100,9 +100,14 @@ case class AvroDataToCatalyst(
       case NonFatal(e) => parseMode match {
         case PermissiveMode => nullResultRow
         case FailFastMode =>
-          throw new SparkException("Malformed records are detected in record parsing. " +
+          val msg = if (e.getMessage != null) {
+            e.getMessage + "\n"
+          } else {
+            ""
+          }
+          throw new SparkException(msg + "Malformed records are detected in record parsing. " +
             s"Current parse Mode: ${FailFastMode.name}. To process malformed records as null " +
-            "result, try setting the option 'mode' as 'PERMISSIVE'.", e.getCause)
+            "result, try setting the option 'mode' as 'PERMISSIVE'.", e)
--- End diff --

@gengliangwang, looks okay. What do the error messages look like before/after, btw?

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22844: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22844 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22844: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22844 Lgtm --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16429: [SPARK-19019][PYTHON] Fix hijacked `collections.namedtup...
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16429

Install the Spark version I mentioned above.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22893

How much performance does it gain in an end-to-end test, and how does it provide better performance?

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22893

Also, please fill in the PR description. How much performance does it gain?

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org