[GitHub] spark pull request #22942: [SPARK-25884][SQL][FOLLOW-UP] Add sample.json bac...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22942 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22429 **[Test build #98461 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98461/testReport)** for PR 22429 at commit [`76f4248`](https://github.com/apache/spark/commit/76f424830418129c12a2a08d81f19377490c95eb).
[GitHub] spark issue #22942: [SPARK-25884][SQL][FOLLOW-UP] Add sample.json back.
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22942 thanks, merging to master!
[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/22429 jenkins, retest this, please
[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22919#discussion_r230655513 --- Diff: bin/spark-shell --- @@ -32,7 +32,10 @@ if [ -z "${SPARK_HOME}" ]; then source "$(dirname "$0")"/find-spark-home fi -export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]" +export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options] + +Scala REPL options: + -I <file> preload <file>, enforcing line-by-line interpretation" --- End diff -- where do we define other options?
[GitHub] spark issue #22942: [SPARK-25884][SQL][FOLLOW-UP] Add sample.json back.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22942 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98458/ Test PASSed.
[GitHub] spark issue #22942: [SPARK-25884][SQL][FOLLOW-UP] Add sample.json back.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22942 Merged build finished. Test PASSed.
[GitHub] spark issue #22942: [SPARK-25884][SQL][FOLLOW-UP] Add sample.json back.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22942 **[Test build #98458 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98458/testReport)** for PR 22942 at commit [`18ccff1`](https://github.com/apache/spark/commit/18ccff15a771d3e0221b49114ff300b0ef41a25b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r230649513 --- Diff: R/pkg/R/functions.R --- @@ -205,11 +205,18 @@ NULL #' also supported for the schema. #' \item \code{from_csv}: a DDL-formatted string #' } -#' @param ... additional argument(s). In \code{to_json}, \code{to_csv} and \code{from_json}, -#'this contains additional named properties to control how it is converted, accepts -#'the same options as the JSON/CSV data source. Additionally \code{to_json} supports -#'the "pretty" option which enables pretty JSON generation. In \code{arrays_zip}, -#'this contains additional Columns of arrays to be merged. +#' @param ... additional argument(s). +#' \itemize{ +#' \item \code{to_json}, \code{from_json} and \code{schema_of_json}: this contains +#' additional named properties to control how it is converted and accepts the +#' same options as the JSON data source. +#' \item \code{to_json}: it supports the "pretty" option which enables pretty --- End diff -- actually, how does `pretty` work? is it `pretty = TRUE`?
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r230650176 --- Diff: R/pkg/R/functions.R --- @@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType") column(jc) }) +#' @details +#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format. +#' +#' @rdname column_collection_functions +#' @aliases schema_of_json schema_of_json,characterOrColumn-method +#' @examples +#' +#' \dontrun{ +#' json <- '{"name":"Bob"}' +#' df <- sql("SELECT * FROM range(1)") +#' head(select(df, schema_of_json(json)))} +#' @note schema_of_json since 3.0.0 +setMethod("schema_of_json", signature(x = "characterOrColumn"), + function(x, ...) { +if (class(x) == "character") { + col <- callJStatic("org.apache.spark.sql.functions", "lit", x) +} else { + col <- x@jc --- End diff -- what's the use when x is a Column? `schema_of_csv(lit("Amsterdam,2018")))` seems a bit odd to me...
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r230649120 --- Diff: R/pkg/R/functions.R --- @@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType") column(jc) }) +#' @details +#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format. +#' +#' @rdname column_collection_functions +#' @aliases schema_of_json schema_of_json,characterOrColumn-method +#' @examples +#' +#' \dontrun{ +#' json <- '{"name":"Bob"}' --- End diff -- I think we should avoid mixing `'` and `"` in doc
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r230649693 --- Diff: R/pkg/R/functions.R --- @@ -2260,6 +2293,32 @@ setMethod("from_csv", signature(x = "Column", schema = "characterOrColumn"), column(jc) }) +#' @details +#' \code{schema_of_csv}: Parses a CSV string and infers its schema in DDL format. +#' +#' @rdname column_collection_functions +#' @aliases schema_of_csv schema_of_csv,characterOrColumn-method +#' @examples +#' +#' \dontrun{ +#' csv <- "'Amsterdam,2018'" --- End diff -- I'm a bit confused: `"'Amsterdam,2018'"` vs `"Amsterdam,2018"`. Does the latter work?
[GitHub] spark pull request #22693: [SPARK-25701][SQL] Supports calculation of table ...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22693#discussion_r230639634 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -115,26 +116,45 @@ class ResolveHiveSerdeTable(session: SparkSession) extends Rule[LogicalPlan] { class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] { override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case filterPlan @ Filter(_, SubqueryAlias(_, relation: HiveTableRelation)) => + val predicates = PhysicalOperation.unapply(filterPlan).map(_._2).getOrElse(Nil) + computeTableStats(relation, predicates) case relation: HiveTableRelation if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty => - val table = relation.tableMeta - val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) { -try { - val hadoopConf = session.sessionState.newHadoopConf() - val tablePath = new Path(table.location) - val fs: FileSystem = tablePath.getFileSystem(hadoopConf) - fs.getContentSummary(tablePath).getLength -} catch { - case e: IOException => -logWarning("Failed to get table size from hdfs.", e) -session.sessionState.conf.defaultSizeInBytes -} - } else { -session.sessionState.conf.defaultSizeInBytes + computeTableStats(relation) + } + + private def computeTableStats( + relation: HiveTableRelation, + predicates: Seq[Expression] = Nil): LogicalPlan = { +val table = relation.tableMeta +val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) { + try { +val hadoopConf = session.sessionState.newHadoopConf() +val tablePath = new Path(table.location) +val fs: FileSystem = tablePath.getFileSystem(hadoopConf) +BigInt(fs.getContentSummary(tablePath).getLength) + } catch { +case e: IOException => + logWarning("Failed to get table size from hdfs.", e) + getSizeInBytesFromTablePartitions(table.identifier, predicates) } +} else { + 
getSizeInBytesFromTablePartitions(table.identifier, predicates) +} +val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = sizeInBytes))) +relation.copy(tableMeta = withStats) + } - val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes)))) - relation.copy(tableMeta = withStats) + private def getSizeInBytesFromTablePartitions( + tableIdentifier: TableIdentifier, + predicates: Seq[Expression] = Nil): BigInt = { +session.sessionState.catalog.listPartitionsByFilter(tableIdentifier, predicates) match { --- End diff -- After [this refactor](https://github.com/apache/spark/pull/22743), we can avoid computing stats if the `LogicalRelation` is already cached, because the computed stats will not take effect.
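The fallback order in the diff above (file system size first, partition-based estimate on failure or when the fallback is disabled) can be sketched in plain Java. This is only an illustration: `TableSizeFallback`, `SizeProbe`, and the list of partition sizes are hypothetical names, and summing per-partition sizes is an assumption, since the body of `getSizeInBytesFromTablePartitions` is cut off in the quoted diff.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.util.List;

// Hypothetical illustration; none of these names come from Spark.
class TableSizeFallback {
    // Stand-in for a file system content-summary call that may fail.
    interface SizeProbe {
        long lengthInBytes() throws IOException;
    }

    static BigInteger sizeInBytes(boolean fallBackToHdfs, SizeProbe probe, List<Long> partitionSizes) {
        if (fallBackToHdfs) {
            try {
                return BigInteger.valueOf(probe.lengthInBytes());
            } catch (IOException e) {
                // Fall through to the partition-based estimate, like the catch branch in the diff.
            }
        }
        // Assumption: the partition-based estimate is a plain sum of per-partition sizes.
        BigInteger total = BigInteger.ZERO;
        for (long size : partitionSizes) {
            total = total.add(BigInteger.valueOf(size));
        }
        return total;
    }

    public static void main(String[] args) {
        List<Long> partitions = List.of(100L, 250L);
        System.out.println(sizeInBytes(true, () -> 4096L, partitions));   // 4096
        System.out.println(sizeInBytes(true, () -> { throw new IOException("probe failed"); }, partitions)); // 350
        System.out.println(sizeInBytes(false, () -> 4096L, partitions));  // 350
    }
}
```

Both the failure path and the disabled path end at the same estimate, which is the behavioral change relative to the removed code that returned a configured default size.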
[GitHub] spark issue #22943: [SPARK-25098][SQL] Trim the string when cast stringToTim...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22943 Merged build finished. Test PASSed.
[GitHub] spark issue #22943: [SPARK-25098][SQL] Trim the string when cast stringToTim...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22943 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4760/ Test PASSed.
[GitHub] spark issue #22943: [SPARK-25098][SQL] Trim the string when cast stringToTim...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22943 **[Test build #98460 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98460/testReport)** for PR 22943 at commit [`d297817`](https://github.com/apache/spark/commit/d297817b7457fef40eb78b803542aed213afb7fc).
[GitHub] spark pull request #22943: [SPARK-25098][SQL] Trim the string when cast stri...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22943 [SPARK-25098][SQL] Trim the string when cast stringToTimestamp and stringToDate ## What changes were proposed in this pull request? **Hive** and **Oracle** trim the string when performing `stringToTimestamp` and `stringToDate` casts. This PR adds support for the same behavior: ![image](https://user-images.githubusercontent.com/5399861/47979721-793b1e80-e0ff-11e8-97c8-24b10950ee9e.png) ![image](https://user-images.githubusercontent.com/5399861/47979725-7dffd280-e0ff-11e8-87d4-5767a00ed46e.png) ## How was this patch tested? unit tests Closes https://github.com/apache/spark/pull/22089 You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-25098 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22943.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22943 commit d297817b7457fef40eb78b803542aed213afb7fc Author: Yuming Wang Date: 2018-11-05T05:31:22Z trim() the string when cast stringToTimestamp and stringToDate
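The effect of trimming before a string-to-date cast can be illustrated outside Spark with `java.time`. This is a minimal sketch, not the PR's implementation: `TrimBeforeParse` and `toDate` are hypothetical names, and the empty `Optional` only loosely stands in for a cast that yields NULL on unparseable input.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper for illustration only; this is not Spark's cast code.
class TrimBeforeParse {
    // Parses an ISO-8601 date (yyyy-MM-dd), optionally trimming surrounding whitespace first.
    // Returns Optional.empty() when the input cannot be parsed.
    static Optional<LocalDate> toDate(String s, boolean trim) {
        try {
            return Optional.of(LocalDate.parse(trim ? s.trim() : s));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toDate(" 2018-11-05 ", false)); // Optional.empty
        System.out.println(toDate(" 2018-11-05 ", true));  // Optional[2018-11-05]
    }
}
```

The strict ISO parser rejects the padded input, so the only difference between the two calls is the `trim()` applied before parsing.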
[GitHub] spark issue #22928: [SPARK-25926][CORE] Move config entries in core module t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22928 **[Test build #98459 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98459/testReport)** for PR 22928 at commit [`6144e01`](https://github.com/apache/spark/commit/6144e01fc6eb612e07a532cc10e3fafb8ccd71ee).
[GitHub] spark issue #22928: [SPARK-25926][CORE] Move config entries in core module t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22928 Merged build finished. Test PASSed.
[GitHub] spark issue #22928: [SPARK-25926][CORE] Move config entries in core module t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22928 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4759/ Test PASSed.
[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22913#discussion_r230635196 --- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java --- @@ -414,6 +416,21 @@ final int getInt(int rowId) { } } + private static class DateMilliAccessor extends ArrowVectorAccessor { + +private final DateMilliVector accessor; + +DateMilliAccessor(DateMilliVector vector) { + super(vector); + this.accessor = vector; +} + +@Override +final long getLong(int rowId) { --- End diff -- We should use `getInt()` to represent the number of days from 1970-01-01 if we map the type to date type.
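Representing a millisecond value as a number of days since 1970-01-01 comes down to a floor division. The following is a rough sketch under the assumption that days are counted in UTC; `EpochDays` and `millisToDays` are hypothetical names and not part of the accessor in the diff.

```java
// Hypothetical helper for illustration; assumes days are counted in UTC.
class EpochDays {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // floorDiv rounds toward negative infinity, so instants before 1970-01-01
    // land on the previous day instead of being rounded toward zero.
    static int millisToDays(long epochMillis) {
        return Math.toIntExact(Math.floorDiv(epochMillis, MILLIS_PER_DAY));
    }

    public static void main(String[] args) {
        System.out.println(millisToDays(1541395882000L)); // 17840 (2018-11-05T05:31:22Z)
        System.out.println(millisToDays(-1L));            // -1 (1969-12-31)
    }
}
```

Plain integer division would map `-1` milliseconds to day `0`, which is why the sketch uses `Math.floorDiv`, and `Math.toIntExact` fails loudly rather than silently overflowing the `int` result.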
[GitHub] spark issue #22942: [SPARK-25884][SQL][FOLLOW-UP] Add sample.json back.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22942 Merged build finished. Test PASSed.
[GitHub] spark issue #22942: [SPARK-25884][SQL][FOLLOW-UP] Add sample.json back.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22942 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4758/ Test PASSed.
[GitHub] spark issue #22942: [SPARK-25884][SQL][FOLLOW-UP] Add sample.json back.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22942 **[Test build #98458 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98458/testReport)** for PR 22942 at commit [`18ccff1`](https://github.com/apache/spark/commit/18ccff15a771d3e0221b49114ff300b0ef41a25b).
[GitHub] spark pull request #22942: [SPARK-25884][SQL][FOLLOW-UP] Add sample.json bac...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/22942 [SPARK-25884][SQL][FOLLOW-UP] Add sample.json back. ## What changes were proposed in this pull request? This is a follow-up PR to #22892, which moved `sample.json` from the hive module to the sql module, but we still need the file in the hive module. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ueshin/apache-spark issues/SPARK-25884/sample.json Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22942.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22942 commit 18ccff15a771d3e0221b49114ff300b0ef41a25b Author: Takuya UESHIN Date: 2018-11-05T04:54:42Z Add sample.json back.
[GitHub] spark issue #22942: [SPARK-25884][SQL][FOLLOW-UP] Add sample.json back.
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22942 cc @srowen @cloud-fan
[GitHub] spark issue #22892: [SPARK-25884][SQL] Add TBLPROPERTIES and COMMENT, and us...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22892 Seems like we still need `sample.json` in the hive module. I'll submit a follow-up PR.
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22939 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98457/ Test PASSed.
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22939 **[Test build #98457 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98457/testReport)** for PR 22939 at commit [`c0a9384`](https://github.com/apache/spark/commit/c0a9384d292cdeff3a8799b20e166522f64ff50d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22939 Merged build finished. Test PASSed.
[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22913#discussion_r230628333 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala --- @@ -71,6 +71,7 @@ object ArrowUtils { case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale) case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType +case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType --- End diff -- Note that Spark doesn't have a date type with millisecond precision, so if we map to the date type, the hours, minutes, and so on will be lost. Otherwise we have to map to the timestamp type. Which is the proper behavior? cc @BryanCutler
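The trade-off between the two mappings can be made concrete with `java.time`. This is an illustrative sketch only: `DateMilliMapping` and its methods are hypothetical names, UTC is assumed for the date boundary, and the microsecond unit follows the `TimeUnit.MICROSECOND => TimestampType` case quoted in the diff.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical illustration of the two options discussed above.
class DateMilliMapping {
    // Option 1: keep the full value by widening milliseconds to microseconds.
    static long toTimestampMicros(long epochMillis) {
        return Math.multiplyExact(epochMillis, 1000L);
    }

    // Option 2: keep only the calendar date (UTC); the time of day is discarded.
    static LocalDate toDate(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).toLocalDate();
    }

    public static void main(String[] args) {
        long millis = Instant.parse("2018-11-05T05:31:22Z").toEpochMilli();
        System.out.println(toTimestampMicros(millis)); // 1541395882000000
        System.out.println(toDate(millis));            // 2018-11-05
    }
}
```

Two inputs on the same day but at different times map to the same date under option 2, so the original millisecond value cannot be recovered, whereas option 1 is lossless.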
[GitHub] spark pull request #20675: [SPARK-23033][SS][Follow Up] Task level retry for...
Github user xuanyuanking closed the pull request at: https://github.com/apache/spark/pull/20675
[GitHub] spark issue #20675: [SPARK-23033][SS][Follow Up] Task level retry for contin...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/20675 @HeartSaVioR Thanks for your reply, and sorry for only just seeing your comment. Yep, I will keep tracking this feature after we support shuffled stateful operators.
[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22932 Merged build finished. Test PASSed.
[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22932 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98456/ Test PASSed.
[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22932 **[Test build #98456 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98456/testReport)** for PR 22932 at commit [`ef49a27`](https://github.com/apache/spark/commit/ef49a277d3fd39c6fd91b3fcda65f660b833ec95). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22939 Will make another PR after this gets merged to allow the cases below:

```r
df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_json(df$people))
head(select(df, from_json(df$people_json, schema_of_json(head(df)$people_json))))
```

```
  from_json(people_json)
1                    Bob
```

```r
df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_csv(df$people))
head(select(df, from_csv(df$people_json, schema_of_csv(head(df)$people_json))))
```

```
  from_csv(people_json)
1                   Bob
```
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22939 **[Test build #98457 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98457/testReport)** for PR 22939 at commit [`c0a9384`](https://github.com/apache/spark/commit/c0a9384d292cdeff3a8799b20e166522f64ff50d).
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22939 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4757/ Test PASSed.
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22939 Merged build finished. Test PASSed.
[GitHub] spark pull request #22941: [SPARK-25936][SQL] Fix InsertIntoDataSourceComman...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22941#discussion_r230622708 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala --- @@ -589,4 +590,33 @@ class InsertSuite extends DataSourceTest with SharedSQLContext { sql("INSERT INTO TABLE test_table SELECT 2, null") } } + + test("SPARK-25936 InsertIntoDataSourceCommand does not use Cached Data") { --- End diff -- It works. Do we need to fix this plan issue?
[GitHub] spark issue #22918: [SPARK-25892][SQL]Change AttributeReference.withMetadata...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/22918 The `as the spark-25902 mentioned.` in the PR description may be a typo. Should it be SPARK-25892?
[GitHub] spark issue #22903: [SPARK-24196][SQL] Implement Spark's own GetSchemasOpera...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22903 cc @gatorsmile
[GitHub] spark pull request #22925: [SPARK-25913][SQL] Extend UnaryExecNode by unary ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22925
[GitHub] spark issue #22928: [SPARK-25926][CORE] Move config entries in core module t...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22928 Keeping them in separate source files is also fine with me. I think we should put them in the same package.
[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22913 Merged build finished. Test PASSed.
[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22913 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98455/ Test PASSed.
[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22913 **[Test build #98455 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98455/testReport)** for PR 22913 at commit [`3afb870`](https://github.com/apache/spark/commit/3afb8708c0394368a9435a7911201de31143f41e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22905#discussion_r230616072 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -306,7 +306,15 @@ case class FileSourceScanExec( withOptPartitionCount } -withSelectedBucketsCount +val withOptColumnCount = relation.fileFormat match { + case columnar: ColumnarFileFormat => +val sqlConf = relation.sparkSession.sessionState.conf +val columnCount = columnar.columnCountForSchema(sqlConf, requiredSchema) +withSelectedBucketsCount + ("ColumnCount" -> columnCount.toString) --- End diff -- I was wondering how important it is to know whether the columns are pruned or not. By that logic, other logs should be put in the metadata too. For instance, we're not even showing the actual filters (not the Catalyst ones; I mean the actual pushed filters that are applied at each source implementation level, such as the filters from `ParquetFilters.createFilter`) on the Spark side.
[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22913 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98454/ Test PASSed.
[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22913 Merged build finished. Test PASSed.
[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22913 **[Test build #98454 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98454/testReport)** for PR 22913 at commit [`2c14694`](https://github.com/apache/spark/commit/2c146941adb294ec9c5acc93cf55108e88075ad2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22089: [SPARK-25098][SQL]‘Cast’ will return NULL when input...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22089 Sure, @gatorsmile . --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18581: [SPARK-21289][SQL][ML] Supports custom line separator fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18581 What you see is what you get. It's not yet finished. See also https://github.com/apache/spark/pull/20877#issuecomment-429182740 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22889#discussion_r230614726 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -883,6 +883,31 @@ class Dataset[T] private[sql]( join(right, Seq(usingColumn)) } + /** +* Equi-join with another `DataFrame` using the given column. +* +* Different from other join functions, the join column will only appear once in the output, +* i.e. similar to SQL's `JOIN USING` syntax. +* +* {{{ +* // Left join of df1 and df2 using the column "user_id" +* df1.join(df2, "user_id", "left") +* }}} +* +* @param right Right side of the join operation. +* @param usingColumn Name of the column to join on. This column must exist on both sides. +* @param joinType Type of join to perform. Default `inner`. Must be one of: +* `inner`, `cross`, `outer`, `full`, `full_outer`, `left`, `left_outer`, +* `right`, `right_outer`, `left_semi`, `left_anti`. +* @note If you perform a self-join using this function without aliasing the input +* `DataFrame`s, you will NOT be able to reference any columns after the join, since +* there is no way to disambiguate which side of the join you would like to reference. +* @group untypedrel +*/ + def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame = { --- End diff -- Normally, we do not add such an API. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21856: [SPARK-24738] [HistoryServer] FsHistoryProvider c...
Github user LiShuMing closed the pull request at: https://github.com/apache/spark/pull/21856 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22923: [SPARK-25910][CORE] accumulator updates from previous st...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/22923 We always need to update user accumulators. Right now such task metrics just cause some annoying error logs, which does not seem worth fixing. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22932 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4756/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22932 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22932 **[Test build #98456 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98456/testReport)** for PR 22932 at commit [`ef49a27`](https://github.com/apache/spark/commit/ef49a277d3fd39c6fd91b3fcda65f660b833ec95). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r230610261 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/package.scala --- @@ -44,4 +44,13 @@ package object sql { type Strategy = SparkStrategy type DataFrame = Dataset[Row] + + /** + * Metadata key which is used to write Spark version in the followings: + * - Parquet file metadata + * - ORC file metadata + * + * Note that Hive table property `spark.sql.create.version` also has Spark version. + */ + private[sql] val CREATE_VERSION = "org.apache.spark.sql.create.version" --- End diff -- Thank you for the review, @hvanhovell. Yes, we can use `org.apache.spark.version` since this is a new key. Although the Hive table property `spark.sql.create.version` has a `.create.` part, it seems that we don't need to follow that convention here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22913 Also cc @ueshin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22693: [SPARK-25701][SQL] Supports calculation of table ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22693#discussion_r230609824 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -115,26 +116,45 @@ class ResolveHiveSerdeTable(session: SparkSession) extends Rule[LogicalPlan] { class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] { override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case filterPlan @ Filter(_, SubqueryAlias(_, relation: HiveTableRelation)) => + val predicates = PhysicalOperation.unapply(filterPlan).map(_._2).getOrElse(Nil) + computeTableStats(relation, predicates) case relation: HiveTableRelation if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty => - val table = relation.tableMeta - val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) { -try { - val hadoopConf = session.sessionState.newHadoopConf() - val tablePath = new Path(table.location) - val fs: FileSystem = tablePath.getFileSystem(hadoopConf) - fs.getContentSummary(tablePath).getLength -} catch { - case e: IOException => -logWarning("Failed to get table size from hdfs.", e) -session.sessionState.conf.defaultSizeInBytes -} - } else { -session.sessionState.conf.defaultSizeInBytes + computeTableStats(relation) + } + + private def computeTableStats( + relation: HiveTableRelation, + predicates: Seq[Expression] = Nil): LogicalPlan = { +val table = relation.tableMeta +val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) { + try { +val hadoopConf = session.sessionState.newHadoopConf() +val tablePath = new Path(table.location) +val fs: FileSystem = tablePath.getFileSystem(hadoopConf) +BigInt(fs.getContentSummary(tablePath).getLength) + } catch { +case e: IOException => + logWarning("Failed to get table size from hdfs.", e) + getSizeInBytesFromTablePartitions(table.identifier, predicates) } +} else { + 
getSizeInBytesFromTablePartitions(table.identifier, predicates) +} +val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = sizeInBytes))) +relation.copy(tableMeta = withStats) + } - val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes - relation.copy(tableMeta = withStats) + private def getSizeInBytesFromTablePartitions( + tableIdentifier: TableIdentifier, + predicates: Seq[Expression] = Nil): BigInt = { +session.sessionState.catalog.listPartitionsByFilter(tableIdentifier, predicates) match { --- End diff -- The perf will be pretty bad when the number of partitions is huge. Thus, I think we can close this PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19796: [SPARK-22581][SQL] Catalog api does not allow to ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19796#discussion_r230609716 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala --- @@ -411,7 +410,29 @@ abstract class Catalog { tableName: String, source: String, schema: StructType, - options: Map[String, String]): DataFrame + options: Map[String, String]): DataFrame = { +createTable(tableName, source, schema, options, Nil) + } + + /** +* :: Experimental :: +* (Scala-specific) +* Create a table based on the dataset in a data source, a schema, a set of options and a set of partition columns. +* Then, returns the corresponding DataFrame. +* +* @param tableName is either a qualified or unqualified name that designates a table. +* If no database identifier is provided, it refers to a table in +* the current database. +* @since ??? +*/ + @Experimental + @InterfaceStability.Evolving + def createTable( +tableName: String, +source: String, +schema: StructType, +options: Map[String, String], +partitionColumnNames : Seq[String]): DataFrame --- End diff -- I think we will not introduce a new API for partitioning columns in the current stage. Let us use SQL DDL for creating the table. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22089: [SPARK-25098][SQL]‘Cast’ will return NULL when input...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22089 @wangyum Could you please take it over? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22089: [SPARK-25098][SQL]‘Cast’ will return NULL whe...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22089#discussion_r230609486 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala --- @@ -98,6 +98,7 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper { c.set(Calendar.MILLISECOND, 0) checkEvaluation(Cast(Literal("2015-03-18"), DateType), new Date(c.getTimeInMillis)) checkEvaluation(Cast(Literal("2015-03-18 "), DateType), new Date(c.getTimeInMillis)) +checkEvaluation(Cast(Literal(" 2015-03-18"), DateType), new Date(c.getTimeInMillis)) --- End diff -- > SELECT CAST(' 22-OCT-1997' AS TIMESTAMP) FROM dual; Oracle also trims the leading space. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
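To illustrate the behavior the new `CastSuite` case checks (a date string with leading or trailing whitespace still casts to a date), here is a minimal sketch in plain Scala. It is not Spark's `Cast` implementation: `DateCastSketch` and `castToDate` are hypothetical names, `java.time.LocalDate` stands in for Spark's date handling, and `None` stands in for SQL `NULL`.

```scala
import java.time.LocalDate
import scala.util.Try

object DateCastSketch {
  // Hypothetical helper, not Spark code: trim surrounding whitespace first,
  // then parse an ISO-8601 date (yyyy-MM-dd). A parse failure yields None,
  // which plays the role of NULL in this sketch.
  def castToDate(s: String): Option[LocalDate] =
    Try(LocalDate.parse(s.trim)).toOption
}
```

With this helper, `DateCastSketch.castToDate(" 2015-03-18")` and `DateCastSketch.castToDate("2015-03-18 ")` yield the same date, while an unparseable string yields `None` instead of throwing.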
[GitHub] spark issue #22941: [SPARK-25936][SQL] Fix InsertIntoDataSourceCommand does ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22941 I think this is not a bug, although the plan is confusing. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22930: [SPARK-24869][SQL] Fix SaveIntoDataSourceCommand'...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22930#discussion_r230609078 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala --- @@ -37,13 +37,12 @@ case class SaveIntoDataSourceCommand( query: LogicalPlan, dataSource: CreatableRelationProvider, options: Map[String, String], -mode: SaveMode) extends RunnableCommand { +mode: SaveMode, +outputColumnNames: Seq[String]) extends DataWritingCommand { - override protected def innerChildren: Seq[QueryPlan[_]] = Seq(query) - - override def run(sparkSession: SparkSession): Seq[Row] = { -dataSource.createRelation( - sparkSession.sqlContext, mode, options, Dataset.ofRows(sparkSession, query)) --- End diff -- See what I commented in https://github.com/apache/spark/pull/22941 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22941: [SPARK-25936][SQL] Fix InsertIntoDataSourceComman...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22941#discussion_r230609046 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala --- @@ -589,4 +590,33 @@ class InsertSuite extends DataSourceTest with SharedSQLContext { sql("INSERT INTO TABLE test_table SELECT 2, null") } } + + test("SPARK-25936 InsertIntoDataSourceCommand does not use Cached Data") { --- End diff -- You can move this test suite to CachedTableSuite.scala and use the helper functions to verify whether the cache is used. See the example.
```
spark.range(2).createTempView("test_view")
spark.catalog.cacheTable("test_view")
val rddId = rddIdOf("test_view")
assert(!isMaterialized(rddId))
sql("INSERT INTO TABLE test_table SELECT * FROM test_view")
assert(isMaterialized(rddId))
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18581: [SPARK-21289][SQL][ML] Supports custom line separator fo...
Github user don4of4 commented on the issue: https://github.com/apache/spark/pull/18581 Was this finished and merged in? I see https://issues.apache.org/jira/browse/SPARK-21289 is still open. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22941: [SPARK-25936][SQL] Fix InsertIntoDataSourceComman...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22941#discussion_r230608937 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoDataSourceCommand.scala --- @@ -30,14 +30,13 @@ import org.apache.spark.sql.sources.InsertableRelation case class InsertIntoDataSourceCommand( logicalRelation: LogicalRelation, query: LogicalPlan, -overwrite: Boolean) - extends RunnableCommand { +overwrite: Boolean, +outputColumnNames: Seq[String]) + extends DataWritingCommand { - override protected def innerChildren: Seq[QueryPlan[_]] = Seq(query) - - override def run(sparkSession: SparkSession): Seq[Row] = { + override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = { val relation = logicalRelation.relation.asInstanceOf[InsertableRelation] -val data = Dataset.ofRows(sparkSession, query) --- End diff -- This will use the cached data, although the plan does not show the cached data is used. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22913 **[Test build #98455 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98455/testReport)** for PR 22913 at commit [`3afb870`](https://github.com/apache/spark/commit/3afb8708c0394368a9435a7911201de31143f41e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...
Github user javierluraschi commented on a diff in the pull request: https://github.com/apache/spark/pull/22913#discussion_r230607581 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala --- @@ -71,6 +71,7 @@ object ArrowUtils { case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale) case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType +case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType --- End diff -- Good catch, thanks. Yes, this should be mapped to `Date` in the Arrow schema, not `TimeStamp`. To give more background, Arrow dates can have a unit of `DateUnit.DAY` or `DateUnit.MILLISECOND` (see [arrow/vector/types/DateUnit.java#L21-L22](https://github.com/apache/arrow/blob/73d379f4631cd3013371f60876a52615171e6c3b/java/vector/src/main/java/org/apache/arrow/vector/types/DateUnit.java#L21-L22)). Currently, if a date with milliseconds is passed, the conversion simply fails. Therefore, this change does not affect other type conversions, and it is fine to map all Arrow dates to Spark dates, since all cases are now properly handled. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
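As a rough illustration of why both Arrow date units can map to a Spark date, the sketch below normalizes either unit to days since the Unix epoch. It is plain Scala with no Arrow or Spark dependency: the `DateUnit` trait here is a local stand-in for Arrow's enum, and `ArrowDateSketch` and `toEpochDays` are hypothetical names, not part of `ArrowUtils`. It assumes a day-unit value is a count of days since the epoch and a millisecond-unit value is a count of milliseconds since the epoch.

```scala
object ArrowDateSketch {
  // Local stand-ins for Arrow's DateUnit.DAY and DateUnit.MILLISECOND.
  sealed trait DateUnit
  case object Day extends DateUnit
  case object Millisecond extends DateUnit

  private val MillisPerDay = 86400000L

  // Hypothetical helper: normalize a date value in either unit to
  // days since 1970-01-01.
  def toEpochDays(value: Long, unit: DateUnit): Int = unit match {
    case Day => value.toInt
    // floorDiv rounds toward negative infinity, so instants before the
    // epoch land on the correct (earlier) day instead of rounding to 0.
    case Millisecond => Math.floorDiv(value, MillisPerDay).toInt
  }
}
```

Under these assumptions a millisecond-unit date loses no information relative to a day-unit date once it is truncated to whole days, which is why treating both as dates rather than timestamps is consistent.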
[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22913 **[Test build #98454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98454/testReport)** for PR 22913 at commit [`2c14694`](https://github.com/apache/spark/commit/2c146941adb294ec9c5acc93cf55108e88075ad2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22936: [SPARK-19799] Support WITH clause (CTE) in subqueries
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22936 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22936: [SPARK-19799] Support WITH clause (CTE) in subqueries
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22936 **[Test build #98453 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98453/testReport)** for PR 22936 at commit [`66cd537`](https://github.com/apache/spark/commit/66cd5379a17e05707ae162bb20e9c64812737d78). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22936: [SPARK-19799] Support WITH clause (CTE) in subqueries
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22936 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98453/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22936: [SPARK-19799] Support WITH clause (CTE) in subqueries
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22936 **[Test build #98453 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98453/testReport)** for PR 22936 at commit [`66cd537`](https://github.com/apache/spark/commit/66cd5379a17e05707ae162bb20e9c64812737d78). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22936: [SPARK-19799] Support WITH clause (CTE) in subqueries
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22936 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22931: [SPARK-25930][K8s] Fix scala string detection in k8s tes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22931 **[Test build #4412 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4412/testReport)** for PR 22931 at commit [`bf85974`](https://github.com/apache/spark/commit/bf85974e769b86056a83be6f051cb15ff3279022). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r230604337 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/package.scala --- @@ -44,4 +44,13 @@ package object sql { type Strategy = SparkStrategy type DataFrame = Dataset[Row] + + /** + * Metadata key which is used to write Spark version in the followings: + * - Parquet file metadata + * - ORC file metadata + * + * Note that Hive table property `spark.sql.create.version` also has Spark version. + */ + private[sql] val CREATE_VERSION = "org.apache.spark.sql.create.version" --- End diff -- Is this a pre-existing key? Seems that `org.apache.spark.version` should be enough. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22938: [SPARK-25935][SQL] Prevent null rows from JSON parser
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22938 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98452/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22938: [SPARK-25935][SQL] Prevent null rows from JSON parser
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22938 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22938: [SPARK-25935][SQL] Prevent null rows from JSON parser
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22938 **[Test build #98452 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98452/testReport)** for PR 22938 at commit [`c4d6a80`](https://github.com/apache/spark/commit/c4d6a8066031c4f1b0f9323f9998f0f0b10b74c7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22818: [SPARK-25904][CORE] Allocate arrays smaller than Int.Max...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22818 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22818: [SPARK-25904][CORE] Allocate arrays smaller than Int.Max...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22818 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98451/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22818: [SPARK-25904][CORE] Allocate arrays smaller than Int.Max...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22818 **[Test build #98451 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98451/testReport)** for PR 22818 at commit [`ca3efd8`](https://github.com/apache/spark/commit/ca3efd8f636706abf8c994cb75c14432f4e4939a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22914 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22914 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98450/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22914 **[Test build #98450 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98450/testReport)** for PR 22914 at commit [`fc1e542`](https://github.com/apache/spark/commit/fc1e5423547fb86156e2b76bd3857c5a75139300). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22932 I see. Thanks, @gatorsmile . --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22932 Will take a look this week. Thanks for your work! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22932 Could you review this please, @gatorsmile ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22754: [SPARK-25776][CORE]The disk write buffer size must be gr...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22754 Thanks! merging to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22754: [SPARK-25776][CORE]The disk write buffer size mus...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22754 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22931: [SPARK-25930][K8s] Fix scala string detection in k8s tes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22931 **[Test build #4412 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4412/testReport)** for PR 22931 at commit [`bf85974`](https://github.com/apache/spark/commit/bf85974e769b86056a83be6f051cb15ff3279022). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22938: [SPARK-25935][SQL] Prevent null rows from JSON parser
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22938 **[Test build #98452 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98452/testReport)** for PR 22938 at commit [`c4d6a80`](https://github.com/apache/spark/commit/c4d6a8066031c4f1b0f9323f9998f0f0b10b74c7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22939 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98449/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22939 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22939 **[Test build #98449 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98449/testReport)** for PR 22939 at commit [`5f0a3b6`](https://github.com/apache/spark/commit/5f0a3b658b1512cceccb6a2e90bc39942851d815). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org