[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22213 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95820/ Test FAILed.
[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22213 Merged build finished. Test FAILed.
[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22213 **[Test build #95820 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95820/testReport)** for PR 22213 at commit [`603fdc9`](https://github.com/apache/spark/commit/603fdc9c148e2a36393140054c559f5a0a380e5d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22311: [WIP][SPARK-25305][SQL] Respect attribute name in Collap...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22311 @maropu Let's keep the #22320 change in for a while. Thanks.
[GitHub] spark pull request #21649: [SPARK-23648][R][SQL]Adds more types for hint in ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21649#discussion_r216122842 --- Diff: R/pkg/R/DataFrame.R --- @@ -3905,6 +3905,16 @@ setMethod("rollup", groupedData(sgd) }) +isTypeAllowedForSqlHint <- function(x) { + if (is.character(x) || is.numeric(x)) { +TRUE + } else if (is.list(x)) { +all(sapply(x, (function(y) is.character(y) || is.numeric(y --- End diff -- Also, if it is a `list`, could we clarify whether it is supposed to work with multiple hints of different types in that list (this might be "unique" to R)? For example:
```
> x <- list("a", 3)
> all(sapply(x, function(y) { is.character(y) || is.numeric(y) } ))
[1] TRUE
> x <- list("a", NA)
> all(sapply(x, function(y) { is.character(y) || is.numeric(y) } ))
[1] FALSE
```
[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22359 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95818/ Test PASSed.
[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22359 Merged build finished. Test PASSed.
[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22359 **[Test build #95818 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95818/testReport)** for PR 22359 at commit [`71f382b`](https://github.com/apache/spark/commit/71f382bd7c9fdfda38b1bc8063b2a55dd56c00b4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21649: [SPARK-23648][R][SQL]Adds more types for hint in ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21649#discussion_r216122804 --- Diff: R/pkg/R/DataFrame.R --- @@ -3905,6 +3905,16 @@ setMethod("rollup", groupedData(sgd) }) +isTypeAllowedForSqlHint <- function(x) { + if (is.character(x) || is.numeric(x)) { +TRUE + } else if (is.list(x)) { +all(sapply(x, (function(y) is.character(y) || is.numeric(y --- End diff -- I looked into this more deeply, and this style seems a bit odd. As a nit, I think this should be `all(sapply(x, function(y) { is.character(y) || is.numeric(y) } ))`; it's more readable this way. Also see L2458 for an example: https://github.com/apache/spark/blob/aec391c9dcb6362874736e663d435f9dd8400125/R/pkg/R/DataFrame.R#L2458
[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22144 hey, this looks important, could someone review this?
[GitHub] spark pull request #22362: [SPARK-25372][YARN][K8S] Deprecate and generalize...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22362#discussion_r216122659 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala --- @@ -199,8 +199,8 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S numExecutors = Option(numExecutors) .getOrElse(sparkProperties.get("spark.executor.instances").orNull) queue = Option(queue).orElse(sparkProperties.get("spark.yarn.queue")).orNull -keytab = Option(keytab).orElse(sparkProperties.get("spark.yarn.keytab")).orNull -principal = Option(principal).orElse(sparkProperties.get("spark.yarn.principal")).orNull +keytab = Option(keytab).orElse(sparkProperties.get("spark.kerberos.keytab")).orNull --- End diff -- agreed, shouldn't the "old" config still work? `spark.yarn.keytab` etc
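For context on the backward-compatibility question above, here is a minimal sketch of the usual deprecated-config fallback pattern: read the new key first, then fall back to the old one with a warning. The helper name and warning wording are illustrative assumptions, not the PR's actual code:

```scala
// Prefer the new key; fall back to the deprecated spark.yarn.* key.
// `sparkProperties` stands in for the properties map used in SparkSubmitArguments.
def getWithFallback(
    sparkProperties: Map[String, String],
    newKey: String,
    deprecatedKey: String): Option[String] = {
  sparkProperties.get(newKey).orElse {
    val old = sparkProperties.get(deprecatedKey)
    if (old.isDefined) {
      // A real implementation would use Spark's logging instead of stderr.
      System.err.println(s"Warning: $deprecatedKey is deprecated; use $newKey instead.")
    }
    old
  }
}

// keytab = Option(keytab)
//   .orElse(getWithFallback(sparkProperties, "spark.kerberos.keytab", "spark.yarn.keytab"))
//   .orNull
```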
[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22227#discussion_r216122621 --- Diff: R/pkg/R/functions.R --- @@ -3404,19 +3404,24 @@ setMethod("collect_set", #' Equivalent to \code{split} SQL function. #' #' @rdname column_string_functions +#' @param limit determines the size of the returned array. If `limit` is positive, +#'size of the array will be at most `limit`. If `limit` is negative, the --- End diff -- you can't use backticks in R docs
[GitHub] spark pull request #19773: [SPARK-22546][SQL] Supporting for changing column...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/19773#discussion_r216122599 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala --- @@ -350,16 +366,11 @@ case class AlterTableChangeColumnCommand( s"${schema.fieldNames.mkString("[`", "`, `", "`]")}")) } - // Add the comment to a column, if comment is empty, return the original column. - private def addComment(column: StructField, comment: Option[String]): StructField = { -comment.map(column.withComment(_)).getOrElse(column) - } - --- End diff -- Thanks for your question; that's also what I considered while doing the compatibility check. Hive does this column type change work in [HiveAlterHandler](https://github.com/apache/hive/blob/3287a097e31063cc805ca55c2ca7defffe761b6f/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java#L175) and the detailed compatibility check is in [ColumnType](https://github.com/apache/hive/blob/3287a097e31063cc805ca55c2ca7defffe761b6f/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ColumnType.java#L206). You can see that the ColumnType check actually uses `canCast` semantics to judge compatibility.
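Since the comment above leans on cast semantics for compatibility, here is a minimal sketch of what such a check could look like on the Spark side, assuming Catalyst's `Cast.canCast` is an acceptable stand-in for Hive's `ColumnType` logic; this is illustrative, not the PR's implementation:

```scala
import org.apache.spark.sql.catalyst.expressions.Cast
import org.apache.spark.sql.types._

// Hypothetical helper: a column type change is considered compatible
// when the old type can be cast to the new one.
def isCompatibleTypeChange(oldType: DataType, newType: DataType): Boolean =
  Cast.canCast(oldType, newType)

// e.g. widening an int column to long would be accepted:
// isCompatibleTypeChange(IntegerType, LongType) == true
```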
[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22262 The following is the sequence.
```scala
scala> sql("insert overwrite local directory '/tmp/parquet' stored as parquet select 1 id, 2 id")
```
```
$ parquet-tools schema /tmp/parquet
message hive_schema {
  optional int32 id;
  optional int32 id;
}
```
```scala
scala> sql("create table parquet(id int) USING parquet LOCATION '/tmp/parquet'")
res3: org.apache.spark.sql.DataFrame = []

scala> sql("select * from parquet")
res4: org.apache.spark.sql.DataFrame = [id: int]

scala> sql("select * from parquet").show
18/09/07 23:31:03 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.RuntimeException: [id] INT32 was added twice
```
[GitHub] spark issue #22349: [SPARK-25345][ML] Deprecate public APIs from ImageSchema
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/22349 @WeichenXu123 Could you address the comments?
[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22359 Oh, it seems to have been the Parquet behavior since the beginning of this command in 2.3.0. I was confused because it's different from ORC.
[GitHub] spark pull request #22353: [SPARK-25357][SQL] Abbreviated simpleString in Da...
Github user LantaoJin commented on a diff in the pull request: https://github.com/apache/spark/pull/22353#discussion_r216122293 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -54,7 +54,7 @@ trait DataSourceScanExec extends LeafExecNode with CodegenSupport { override def simpleString: String = { val metadataEntries = metadata.toSeq.sorted.map { case (key, value) => -key + ": " + StringUtils.abbreviate(redact(value), 100) --- End diff -- It's not for debugging; it's for offline analysis based on the event log.
[GitHub] spark pull request #22353: [SPARK-25357][SQL] Abbreviated simpleString in Da...
Github user LantaoJin commented on a diff in the pull request: https://github.com/apache/spark/pull/22353#discussion_r216122273 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -54,7 +54,7 @@ trait DataSourceScanExec extends LeafExecNode with CodegenSupport { override def simpleString: String = { val metadataEntries = metadata.toSeq.sorted.map { case (key, value) => -key + ": " + StringUtils.abbreviate(redact(value), 100) --- End diff -- Spark users don't care about it, but platform teams do. The purpose is to add back the information that has been missing from the event log since PR #18600.
[GitHub] spark issue #21791: [SPARK-24925][SQL] input bytesRead metrics fluctuate fro...
Github user yucai commented on the issue: https://github.com/apache/spark/pull/21791 @kiszk I see #22324 has been resolved, which is actually one of this PR's dependencies (see https://issues.apache.org/jira/browse/SPARK-24925?focusedCommentId=16556818&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16556818). I have been stuck on that for a long time; I will update this one soon. Thanks!
[GitHub] spark issue #21791: [SPARK-24925][SQL] input bytesRead metrics fluctuate fro...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21791 gentle ping @yucai
[GitHub] spark pull request #22353: [SPARK-25357][SQL] Abbreviated simpleString in Da...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22353#discussion_r216122062 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -54,7 +54,7 @@ trait DataSourceScanExec extends LeafExecNode with CodegenSupport { override def simpleString: String = { val metadataEntries = metadata.toSeq.sorted.map { case (key, value) => -key + ": " + StringUtils.abbreviate(redact(value), 100) --- End diff -- We already have `spark.debug.maxToStringFields` for debugging, so is it bad to add `spark.debug.`? Anyway, if most users don't care about this, I think changing the default number goes too far.
[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22262 > ... we need this duplication check in case-sensitive mode ... Do you mean we may define an ORC/Parquet schema with identical field names (even in the same letter case)? Would you please explain this a bit more?
[GitHub] spark issue #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCodec are n...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22358 just FYI about a related discussion: https://github.com/apache/spark/pull/21070#issuecomment-382086510 cc: @rdblue
[GitHub] spark issue #22343: [SPARK-25132][SQL][FOLLOW-UP] The behavior must be consi...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 @HyukjinKwon @cloud-fan @gatorsmile Could you please kindly help review this if you have time?
[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22262 BTW, I think we need this duplication check in case-sensitive mode, too. I'll ping on the previous Parquet PR.
[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCode...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22358#discussion_r216121746 --- Diff: docs/sql-programming-guide.md --- @@ -964,7 +964,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession Sets the compression codec used when writing Parquet files. If either `compression` or `parquet.compression` is specified in the table-specific options/properties, the precedence would be `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include: -none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd. --- End diff -- ah, ok.
[GitHub] spark issue #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.memory lim...
Github user ifilonenko commented on the issue: https://github.com/apache/spark/pull/22298 @holdenk or @mccheah for merge
[GitHub] spark pull request #22262: [SPARK-25175][SQL] Field resolution should fail i...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22262#discussion_r216121510 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala --- @@ -116,6 +116,14 @@ object OrcUtils extends Logging { }) } else { val resolver = if (isCaseSensitive) caseSensitiveResolution else caseInsensitiveResolution +// Need to fail if there is ambiguity, i.e. more than one field is matched. +requiredSchema.fieldNames.foreach { requiredFieldName => + val matchedOrcFields = orcFieldNames.filter(resolver(_, requiredFieldName)) --- End diff -- That's all right :)
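For readers following the diff above, the quoted hunk is truncated; a plausible completion of the ambiguity check is sketched below. The exception type and message wording are assumptions, not necessarily what was merged:

```scala
// Fail fast when a case-insensitive lookup matches more than one ORC field.
requiredSchema.fieldNames.foreach { requiredFieldName =>
  val matchedOrcFields = orcFieldNames.filter(resolver(_, requiredFieldName))
  if (matchedOrcFields.size > 1) {
    val fieldList = matchedOrcFields.mkString("[", ", ", "]")
    throw new RuntimeException(
      s"""Found duplicate field(s) "$requiredFieldName": $fieldList """ +
        "in case-insensitive mode")
  }
}
```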
[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22262 **[Test build #95825 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95825/testReport)** for PR 22262 at commit [`26b4710`](https://github.com/apache/spark/commit/26b4710bb47e700d3532da5652fd5ea3f128be20).
[GitHub] spark pull request #21654: [SPARK-24671][PySpark] DataFrame length using a d...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21654#discussion_r216121454 --- Diff: python/pyspark/sql/dataframe.py --- @@ -375,6 +375,9 @@ def _truncate(self): return int(self.sql_ctx.getConf( "spark.sql.repl.eagerEval.truncate", "20")) +def __len__(self): --- End diff -- Can we just not define this? RDD doesn't have this one either. IMHO, allowing such things bit by bit wouldn't be so ideal. For example, `column.py` ended up with a weird limit:
```python
>>> iter(spark.range(1).id)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__
    raise TypeError("Column is not iterable")
TypeError: Column is not iterable
>>> isinstance(spark.range(1).id, collections.Iterable)
True
```
It makes general sense though. This `__iter__` can't be removed, BTW, because we implement `__getitem__` and `__getattr__` to access columns in dataframes IIRC. `__repr__` was added because it's commonly used and it had a strong use case for notebooks, etc. However, for `len()` I wouldn't add it for now. Think about `if len(df) ...` being eagerly evaluated.
[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22262 @dongjoon-hyun That's all right :). I have reverted to the first commit and adjusted the indentation.
[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22262 **[Test build #95824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95824/testReport)** for PR 22262 at commit [`366bb35`](https://github.com/apache/spark/commit/366bb35ad62edb8e6707e65c681a5b9001cc868e).
[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22337 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2943/ Test PASSed.
[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22337 **[Test build #95822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95822/testReport)** for PR 22337 at commit [`a314776`](https://github.com/apache/spark/commit/a3147760b025e6592dd80d858ae4757bd907a72c).
[GitHub] spark issue #22360: [MINOR][ML] Remove `BisectingKMeansModel.setDistanceMeas...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22360 Oh right, the model, not the implementation. CC @mgaido91 too. OK, we can take it out because it was only introduced for 2.4.0. Hm, so the model still has a distanceMeasure, but it's just not something that should be settable. The bisecting k-means model just wraps the older .mllib model, but I don't see that its getDistanceMeasure would actually return the value inside that wrapped model. I am missing where the wrapper gets updated to return the actual distance measure used. Is that also an issue, or did I miss something?
[GitHub] spark issue #21654: [SPARK-24671][PySpark] DataFrame length using a dunder/m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21654 **[Test build #95823 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95823/testReport)** for PR 21654 at commit [`4d0afaf`](https://github.com/apache/spark/commit/4d0afaf3cd046b11e8bae43dc00ddf4b1eb97732).
[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22337 Merged build finished. Test PASSed.
[GitHub] spark issue #21654: [SPARK-24671][PySpark] DataFrame length using a dunder/m...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21654 ok to test
[GitHub] spark pull request #22353: [SPARK-25357][SQL] Abbreviated simpleString in Da...
Github user LantaoJin commented on a diff in the pull request: https://github.com/apache/spark/pull/22353#discussion_r216121128 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -54,7 +54,7 @@ trait DataSourceScanExec extends LeafExecNode with CodegenSupport { override def simpleString: String = { val metadataEntries = metadata.toSeq.sorted.map { case (key, value) => -key + ": " + StringUtils.abbreviate(redact(value), 100) --- End diff -- I think it's overkill to parameterize it, and Spark users don't care about it; no one will reset it before submitting an app. Besides, simply raising it to 1000 also resolves the problem in most cases, but anything longer than 1000 chars is still meaningless.
[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21180 @superbobry, sorry for the back and forth. branch-2.4 has been cut. Can we open another PR that removes the hack? Let's make sure to leave the potential impact, workaround, and strong justification (your blog) in the PR description. I am still hesitant, but it would be good to get rid of it in 3.0.0. From a cursory look so far, I also think we should eventually get rid of it.
[GitHub] spark issue #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.memory lim...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22298 Merged build finished. Test PASSed.
[GitHub] spark issue #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.memory lim...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22298 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95813/ Test PASSed.
[GitHub] spark pull request #22351: [MINOR][SQL] Add a debug log when a SQL text is u...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22351
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22192 Merged build finished. Test PASSed.
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22192 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95814/ Test PASSed.
[GitHub] spark issue #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.memory lim...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22298 **[Test build #95813 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95813/testReport)** for PR 22298 at commit [`ea25b8b`](https://github.com/apache/spark/commit/ea25b8b14013d17fbafe5c338cca0eb28a490482). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22192 **[Test build #95814 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95814/testReport)** for PR 22192 at commit [`cf30ed5`](https://github.com/apache/spark/commit/cf30ed52c0aefc1eda8efffc11e73cafc93a032d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22359 @wangyum @gengliangwang Does this PR aim at the correct behavior for `INSERT OVERWRITE LOCAL DIRECTORY`? In SPARK-25313, it may make sense because the generated files are under the table and respect the table definition. However, this PR accidentally introduces **case-sensitivity** into the SQL SELECT statement.
```
The schema should be StructType(StructField(ID,LongType,true)) as we SELECT ID FROM view1.
```
This case seems to require a new SPARK JIRA issue and more discussion on the goal. Also, this should be allowed only when `spark.sql.caseSensitive=true`. cc @gatorsmile @cloud-fan
[GitHub] spark issue #22351: [MINOR][SQL] Add a debug log when a SQL text is used for...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22351 Merged to master.
[GitHub] spark pull request #22349: [SPARK-25345][ML] Deprecate public APIs from Imag...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22349#discussion_r216120943 --- Diff: python/pyspark/ml/image.py --- @@ -30,6 +30,7 @@ from pyspark import SparkContext from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string from pyspark.sql import DataFrame, SparkSession +import warnings --- End diff -- Technically, built-in packages should be ordered above third-party ones per PEP 8. I wonder why this is not caught.
```
import sys
import warnings
import numpy as np
from pyspark import SparkContext
from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string
from pyspark.sql import DataFrame, SparkSession
```
[GitHub] spark pull request #22349: [SPARK-25345][ML] Deprecate public APIs from Imag...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22349#discussion_r216120953 --- Diff: python/pyspark/ml/image.py --- @@ -222,7 +226,8 @@ def readImages(self, path, recursive=False, numPartitions=-1, .. versionadded:: 2.3.0 """ - +warnings.warn("`ImageSchema.readImage` is deprecated. " + + "Use `spark.read.format(\"image\").load(path)` instead.") --- End diff -- `warnings.warn(..., DeprecationWarning)`
[GitHub] spark pull request #22337: [SPARK-25338][Test] Ensure to call super.beforeAl...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22337#discussion_r216120896 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDDSuite.scala --- @@ -49,8 +49,11 @@ class StateStoreRDDSuite extends SparkFunSuite with BeforeAndAfter with BeforeAn } override def afterAll(): Unit = { -super.afterAll() -Utils.deleteRecursively(new File(tempDir)) +try { + Utils.deleteRecursively(new File(tempDir)) +} finally { + super.afterAll() --- End diff -- Hmm, I see. Let me keep the original order as
```
try {
  super.afterAll()
} finally {
  Utils.deleteRecursively(...)
}
```
Thank you for listing them out.
[GitHub] spark issue #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCodec are n...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22358 If the codecs are found, then we support them. One thing we should do might be to document how to explicitly provide the codec, but I am not sure how many users are confused about it.
[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/22357 I have reconstructed my original patch for this issue, but I've discovered it will require more work to complete. However, as part of that reconstruction I've discovered a couple of cases where our patches create different physical plans. The query results are the same, but I'm not sure which plan, if either, is correct. I want to go into detail on that, but it's complicated and I have to call it quits tonight. I have a flight in the morning, and I'll be on break next week. In the meantime, I'll just copy and paste two queries (based on the data in `ParquetSchemaPruningSuite.scala`) with two query plans each. First query:

select employer.id from contacts where employer is not null

This PR (as of d68f808) produces:
```
== Physical Plan ==
*(1) Project [employer#4442.id AS id#4452]
+- *(1) Filter isnotnull(employer#4442)
   +- *(1) FileScan parquet [employer#4442,p#4443] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)], ReadSchema: struct>
```
My WIP patch produces:
```
== Physical Plan ==
*(1) Project [employer#4442.id AS id#4452]
+- *(1) Filter isnotnull(employer#4442)
   +- *(1) FileScan parquet [employer#4442,p#4443] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)], ReadSchema: struct>>
```
Second query:

select employer.id from contacts where employer.id = 0

This PR produces:
```
== Physical Plan ==
*(1) Project [employer#4297.id AS id#4308]
+- *(1) Filter (isnotnull(employer#4297) && (employer#4297.id = 0))
   +- *(1) FileScan parquet [employer#4297,p#4298] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)], ReadSchema: struct>
```
My WIP patch produces:
```
== Physical Plan ==
*(1) Project [employer#4445.id AS id#4456]
+- *(1) Filter (isnotnull(employer#4445.id) && (employer#4445.id = 0))
   +- *(1) FileScan parquet [employer#4445,p#4446] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct>
```
I wanted to give my thoughts on the differences between these in detail, but I have to wrap up my work for the night. I'll be visiting family next week. I don't know how responsive I'll be in that time, but I'll at least try to check back. Cheers.
[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCode...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22358#discussion_r216120854 --- Diff: docs/sql-programming-guide.md --- @@ -964,7 +964,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession Sets the compression codec used when writing Parquet files. If either `compression` or `parquet.compression` is specified in the table-specific options/properties, the precedence would be `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include: -none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd. --- End diff -- I think so given the download page. ![screen shot 2018-09-08 at 12 41 41 pm](https://user-images.githubusercontent.com/6477701/45250388-94d6b280-b364-11e8-85ee-1a67daa3a123.png)
[GitHub] spark pull request #22337: [SPARK-25338][Test] Ensure to call super.beforeAl...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22337#discussion_r216120459 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDDSuite.scala --- @@ -49,8 +49,11 @@ class StateStoreRDDSuite extends SparkFunSuite with BeforeAndAfter with BeforeAn } override def afterAll(): Unit = { -super.afterAll() -Utils.deleteRecursively(new File(tempDir)) +try { + Utils.deleteRecursively(new File(tempDir)) +} finally { + super.afterAll() --- End diff -- Ur, are you sure? `StreamingAggregationSuite`, `StreamTest`, and `StreamingDeduplicationSuite` also have the reverse order for `StateStore.stop()`.
[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22337 **[Test build #95821 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95821/testReport)** for PR 22337 at commit [`6a4cbcf`](https://github.com/apache/spark/commit/6a4cbcf7e7c78c9cd932910f9431b5ea1dd223be).
[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22337 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2942/ Test PASSed.
[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22337 Merged build finished. Test PASSed.
[GitHub] spark pull request #22337: [SPARK-25338][Test] Ensure to call super.beforeAl...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22337#discussion_r216120240 --- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala --- @@ -44,20 +44,25 @@ class KafkaRDDSuite extends SparkFunSuite with BeforeAndAfterAll { private var sc: SparkContext = _ override def beforeAll { +super.beforeAll() sc = new SparkContext(sparkConf) kafkaTestUtils = new KafkaTestUtils kafkaTestUtils.setup() } override def afterAll { -if (sc != null) { - sc.stop - sc = null -} - -if (kafkaTestUtils != null) { - kafkaTestUtils.teardown() - kafkaTestUtils = null +try { + if (sc != null) { +sc.stop +sc = null + } + + if (kafkaTestUtils != null) { +kafkaTestUtils.teardown() +kafkaTestUtils = null + } +} finally { + super.afterAll() --- End diff -- `sc.stop` can throw an exception. Thus, I updated this using nested `try-finally`.
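To make the "nested `try-finally`" shape concrete, here is a minimal sketch of how the `afterAll` above could be structured so that a failing `sc.stop` still lets the Kafka teardown and `super.afterAll()` run; this mirrors the discussion, not necessarily the final committed code:

```scala
override def afterAll(): Unit = {
  try {
    if (sc != null) {
      sc.stop()
      sc = null
    }
  } finally {
    try {
      if (kafkaTestUtils != null) {
        kafkaTestUtils.teardown()
        kafkaTestUtils = null
      }
    } finally {
      // Always runs, even if sc.stop() or teardown() throws.
      super.afterAll()
    }
  }
}
```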
[GitHub] spark pull request #22337: [SPARK-25338][Test] Ensure to call super.beforeAl...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22337#discussion_r216120228 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaRelationSuite.scala --- @@ -51,8 +51,8 @@ class KafkaRelationSuite extends QueryTest with SharedSQLContext with KafkaTest if (testUtils != null) { testUtils.teardown() testUtils = null - super.afterAll() } +super.afterAll() --- End diff -- Good catch, added `try-finally`
[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22262 @seancxmao, I made a mistake yesterday. Could you restore your first commit? In that commit, please adjust the indentation at [line 140](https://github.com/apache/spark/pull/22262/commits/366bb35ad62edb8e6707e65c681a5b9001cc868e#diff-3fb8426b690ab771c4f67f9cad336498R140). Sorry for the back and forth!
[GitHub] spark pull request #22337: [SPARK-25338][Test] Ensure to call super.beforeAl...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22337#discussion_r216119902 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDDSuite.scala --- @@ -49,8 +49,11 @@ class StateStoreRDDSuite extends SparkFunSuite with BeforeAndAfter with BeforeAn } override def afterAll(): Unit = { -super.afterAll() -Utils.deleteRecursively(new File(tempDir)) +try { + Utils.deleteRecursively(new File(tempDir)) +} finally { + super.afterAll() --- End diff -- Sure, let me change the order in `FlatMapGroupsWithStateSuite`, too.
[GitHub] spark issue #22363: [SPARK-25375][SQL][TEST] Reenable qualified perm. functi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22363 Merged build finished. Test PASSed.
[GitHub] spark issue #22363: [SPARK-25375][SQL][TEST] Reenable qualified perm. functi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22363 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95816/ Test PASSed.
[GitHub] spark issue #22363: [SPARK-25375][SQL][TEST] Reenable qualified perm. functi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22363 **[Test build #95816 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95816/testReport)** for PR 22363 at commit [`e71920e`](https://github.com/apache/spark/commit/e71920ec6e9c2560e39d01943dda0f49716ea1b7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22162: [spark-24442][SQL] Added parameters to control the defau...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22162 @AndrewKL Thanks for your work! I think we can add enough tests for this patch without using a spy library.
[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22162#discussion_r216119774 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -969,6 +969,20 @@ class DatasetSuite extends QueryTest with SharedSQLContext { checkShowString(ds, expected) } + test("SPARK-24442 Show should follow spark.show.default.number.of.rows") { +withSQLConf("spark.sql.show.defaultNumRows" -> "100") { + (1 to 1000).toDS().as[Int].show +} + } + + test("SPARK-24442 Show should follow spark.show.default.truncate.characters.per.column") { +withSQLConf("spark.sql.show.truncateMaxCharsPerColumn" -> "30") { + (1 to 1000).map(x => "123456789_123456789_123456789_") +.toDS().as[String] +.show --- End diff -- Please compare the show result, as in https://github.com/apache/spark/blob/9241e1e7e66574cfafa68791771959dfc39c9684/sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala#L326
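For reference, the `UDFSuite` example linked above captures `show()` output so the test can assert on it. A minimal sketch of that pattern applied here, assuming the suite's `withSQLConf` helper and `toDS` implicits are in scope; the asserted string is illustrative:

```scala
import java.io.ByteArrayOutputStream

test("SPARK-24442 show should respect spark.sql.show.truncateMaxCharsPerColumn") {
  withSQLConf("spark.sql.show.truncateMaxCharsPerColumn" -> "30") {
    val outputStream = new ByteArrayOutputStream()
    Console.withOut(outputStream) {
      (1 to 10).map(_ => "123456789_123456789_123456789_").toDS().show()
    }
    // With a 30-char limit, the full 30-char value should appear untruncated.
    assert(outputStream.toString.contains("123456789_123456789_123456789_"))
  }
}
```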
[GitHub] spark issue #22162: [spark-24442][SQL] Added parameters to control the defau...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22162 Can you fix the style errors?
[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22162#discussion_r216119706 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -969,6 +969,20 @@ class DatasetSuite extends QueryTest with SharedSQLContext { checkShowString(ds, expected) } + test("SPARK-24442 Show should follow spark.show.default.number.of.rows") { +withSQLConf("spark.sql.show.defaultNumRows" -> "100") { + (1 to 1000).toDS().as[Int].show +} + } + + test("SPARK-24442 Show should follow spark.show.default.truncate.characters.per.column") { +withSQLConf("spark.sql.show.truncateMaxCharsPerColumn" -> "30") { --- End diff -- ditto
[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22162#discussion_r216119703 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -969,6 +969,20 @@ class DatasetSuite extends QueryTest with SharedSQLContext { checkShowString(ds, expected) } + test("SPARK-24442 Show should follow spark.show.default.number.of.rows") { +withSQLConf("spark.sql.show.defaultNumRows" -> "100") { --- End diff -- `SQLConf.SQL_SHOW_DEFAULT_MAX_ROWS.key -> "100"`
[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22162#discussion_r216119684 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1950,6 +1962,11 @@ class SQLConf extends Serializable with Logging { def parallelFileListingInStatsComputation: Boolean = getConf(SQLConf.PARALLEL_FILE_LISTING_IN_STATS_COMPUTATION) + def sqlShowDefaultMaxRecordsToShow: Int = getConf(SQLConf.SQL_SHOW_DEFAULT_MAX_ROWS) + + def sqlShowDefaultMaxCharsPerTruncatedRow: Int = --- End diff -- `sqlShowTruncateMaxCharsPerColumn`?
[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22162#discussion_r216119676 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1950,6 +1962,11 @@ class SQLConf extends Serializable with Logging { def parallelFileListingInStatsComputation: Boolean = getConf(SQLConf.PARALLEL_FILE_LISTING_IN_STATS_COMPUTATION) + def sqlShowDefaultMaxRecordsToShow: Int = getConf(SQLConf.SQL_SHOW_DEFAULT_MAX_ROWS) --- End diff -- `sqlShowDefaultMaxRows`?
[GitHub] spark issue #22311: [WIP][SPARK-25305][SQL] Respect attribute name in Collap...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22311 Can this PR include the revert of the #22320 change? Or should we keep that change as a safeguard for a while? (I think we'd better choose the second approach, so this is just a question.)
[GitHub] spark issue #22362: [SPARK-25372][YARN][K8S] Deprecate and generalize keytab...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22362 Merged build finished. Test PASSed.
[GitHub] spark issue #22362: [SPARK-25372][YARN][K8S] Deprecate and generalize keytab...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22362 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95812/ Test PASSed.
[GitHub] spark issue #22362: [SPARK-25372][YARN][K8S] Deprecate and generalize keytab...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22362 **[Test build #95812 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95812/testReport)** for PR 22362 at commit [`c5a7fec`](https://github.com/apache/spark/commit/c5a7fec7d8f6784e50b14fafb9575eb4d23208bc). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCode...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22358#discussion_r216119384 --- Diff: docs/sql-programming-guide.md --- @@ -964,7 +964,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession Sets the compression codec used when writing Parquet files. If either `compression` or `parquet.compression` is specified in the table-specific options/properties, the precedence would be `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include: -none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd. --- End diff -- Is hadoop-2.9.x officially supported in Spark?
[GitHub] spark issue #22362: [SPARK-25372][YARN][K8S] Deprecate and generalize keytab...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22362 Merged build finished. Test PASSed.
[GitHub] spark issue #22362: [SPARK-25372][YARN][K8S] Deprecate and generalize keytab...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22362 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95811/ Test PASSed.
[GitHub] spark issue #22362: [SPARK-25372][YARN][K8S] Deprecate and generalize keytab...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22362 **[Test build #95811 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95811/testReport)** for PR 22362 at commit [`30286f8`](https://github.com/apache/spark/commit/30286f85ac315cad2578fe2e11c8718b8883630a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22262: [SPARK-25175][SQL] Field resolution should fail i...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22262#discussion_r216119226 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala --- @@ -116,6 +116,14 @@ object OrcUtils extends Logging { }) } else { val resolver = if (isCaseSensitive) caseSensitiveResolution else caseInsensitiveResolution +// Need to fail if there is ambiguity, i.e. more than one field is matched. +requiredSchema.fieldNames.foreach { requiredFieldName => + val matchedOrcFields = orcFieldNames.filter(resolver(_, requiredFieldName)) --- End diff -- Sorry, @seancxmao. We need to update once more. I'll make a PR to you.
[GitHub] spark pull request #22353: [SPARK-25357][SQL] Abbreviated simpleString in Da...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22353#discussion_r216119211 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -54,7 +54,7 @@ trait DataSourceScanExec extends LeafExecNode with CodegenSupport { override def simpleString: String = { val metadataEntries = metadata.toSeq.sorted.map { case (key, value) => -key + ": " + StringUtils.abbreviate(redact(value), 100) --- End diff -- How about parameterizing the second value as a new `SQLConf` entry? Or defining a `verboseString`?
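To make the parameterization suggestion concrete, here is a minimal sketch of how such a `SQLConf` entry could look, following the existing `buildConf` pattern in `SQLConf.scala`; the config name, default, and accessor are assumptions, not an agreed-upon API:

```scala
// In object SQLConf:
val MAX_METADATA_STRING_LENGTH = buildConf("spark.sql.maxMetadataStringLength")
  .doc("Maximum number of characters to output for a metadata string in " +
    "DataSourceScanExec.simpleString, e.g. the Location entry.")
  .intConf
  .createWithDefault(100)

// DataSourceScanExec.simpleString would then read something like:
// key + ": " + StringUtils.abbreviate(redact(value), conf.maxMetadataStringLength)
```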
[GitHub] spark issue #22349: [SPARK-25345][ML] Deprecate public APIs from ImageSchema
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22349 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95817/ Test PASSed.
[GitHub] spark issue #22349: [SPARK-25345][ML] Deprecate public APIs from ImageSchema
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22349 Merged build finished. Test PASSed.
[GitHub] spark issue #22349: [SPARK-25345][ML] Deprecate public APIs from ImageSchema
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22349 **[Test build #95817 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95817/testReport)** for PR 22349 at commit [`5cefd23`](https://github.com/apache/spark/commit/5cefd23b3842736632c29c40ece1a2251d56efaf). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22347: [SPARK-25353][SQL] executeTake in SparkPlan is modified ...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22347 btw, do we have any actual performance benefit (wall time) from this patch?
[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21618#discussion_r216118997
--- Diff: core/src/main/java/org/apache/hadoop/fs/SparkGlobber.java ---
@@ -0,0 +1,293 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.fs;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.commons.lang3.StringUtils;
+import org.apache.commons.logging.LogFactory;
+import org.apache.commons.logging.Log;
+
+/**
+ * This is based on hadoop-common-2.7.2
+ * {@link org.apache.hadoop.fs.Globber}.
+ * This class exposes globWithThreshold, which can be used to glob paths in parallel.
+ */
+public class SparkGlobber {
+  public static final Log LOG = LogFactory.getLog(SparkGlobber.class.getName());
+
+  private final FileSystem fs;
+  private final FileContext fc;
+  private final Path pathPattern;
+
+  public SparkGlobber(FileSystem fs, Path pathPattern) {
+    this.fs = fs;
+    this.fc = null;
+    this.pathPattern = pathPattern;
+  }
+
+  public SparkGlobber(FileContext fc, Path pathPattern) {
+    this.fs = null;
+    this.fc = fc;
+    this.pathPattern = pathPattern;
+  }
+
+  private FileStatus getFileStatus(Path path) throws IOException {
+    try {
+      if (fs != null) {
+        return fs.getFileStatus(path);
+      } else {
+        return fc.getFileStatus(path);
+      }
+    } catch (FileNotFoundException e) {
+      return null;
+    }
+  }
+
+  private FileStatus[] listStatus(Path path) throws IOException {
+    try {
+      if (fs != null) {
+        return fs.listStatus(path);
+      } else {
+        return fc.util().listStatus(path);
+      }
+    } catch (FileNotFoundException e) {
+      return new FileStatus[0];
+    }
+  }
+
+  private Path fixRelativePart(Path path) {
+    if (fs != null) {
+      return fs.fixRelativePart(path);
+    } else {
+      return fc.fixRelativePart(path);
+    }
+  }
+
+  /**
+   * Convert a path component that contains backslash escape sequences to a
+   * literal string. This is necessary when you want to explicitly refer to a
+   * path that contains globber metacharacters.
+   */
+  private static String unescapePathComponent(String name) {
+    return name.replaceAll("\\\\(.)", "$1");
+  }
+
+  /**
+   * Translate an absolute path into a list of path components.
+   * We merge double slashes into a single slash here.
+   * POSIX root path, i.e. '/', does not get an entry in the list.
+   */
+  private static List<String> getPathComponents(String path)
+      throws IOException {
+    ArrayList<String> ret = new ArrayList<String>();
+    for (String component : path.split(Path.SEPARATOR)) {
+      if (!component.isEmpty()) {
+        ret.add(component);
+      }
+    }
+    return ret;
+  }
+
+  private String schemeFromPath(Path path) throws IOException {
+    String scheme = path.toUri().getScheme();
+    if (scheme == null) {
+      if (fs != null) {
+        scheme = fs.getUri().getScheme();
+      } else {
+        scheme = fc.getFSofPath(fc.fixRelativePart(path)).getUri().getScheme();
+      }
+    }
+    return scheme;
+  }
+
+  private String authorityFromPath(Path path) throws IOException {
+    String authority = path.toUri().getAuthority();
+    if (authority == null) {
+      if (fs != null) {
+        authority = fs.getUri().getAuthority();
+      } else {
+        authority = fc.getFSofPath(fc.fixRelativePart(path)).getUri().getAuthority();
+      }
+    }
+    return authority;
+  }
+
+  public FileStatus[] globWithThreshold(int threshold) throws IOException {
+    // First we get the scheme and authority of the pattern
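The globWithThreshold body above is cut off by the archive, but the technique it is built around — expanding the remaining glob components under many candidate directories at once — can be sketched separately. The following Scala sketch is illustrative only, not the PR's code; parallelGlob and all of its parameter names are hypothetical, and it assumes a plain Hadoop FileSystem.
```
import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical helper: glob the remaining pattern under each candidate
// directory on a bounded thread pool, then merge the matches.
def parallelGlob(
    fs: FileSystem,
    candidateDirs: Seq[Path],
    remainingPattern: String,
    numThreads: Int): Seq[FileStatus] = {
  val pool = Executors.newFixedThreadPool(numThreads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val futures = candidateDirs.map { dir =>
      // FileSystem.globStatus returns null when nothing matches, so guard it.
      Future(Option(fs.globStatus(new Path(dir, remainingPattern)))
        .map(_.toSeq).getOrElse(Seq.empty[FileStatus]))
    }
    Await.result(Future.sequence(futures), Duration.Inf).flatten
  } finally {
    pool.shutdown()
  }
}
```
A caller would presumably take this path only once the number of candidate directories crosses the configured threshold; below it, an ordinary sequential glob is cheaper than spinning up a pool.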
[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22213 **[Test build #95820 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95820/testReport)** for PR 22213 at commit [`603fdc9`](https://github.com/apache/spark/commit/603fdc9c148e2a36393140054c559f5a0a380e5d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22213 **[Test build #95819 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95819/testReport)** for PR 22213 at commit [`d341dae`](https://github.com/apache/spark/commit/d341dae0511fd9456ab3e0bd686e11749d43aeb7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21618#discussion_r216118822
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -724,4 +726,37 @@ object DataSource extends Logging {
         """.stripMargin)
     }
   }
+
+  /**
+   * Return all paths represented by the wildcard string.
+   * This is done in the main thread by default; when the value of the config
+   * `spark.sql.sources.parallelGetGlobbedPath.numThreads` is > 0, a local thread
+   * pool will expand the globbed paths.
--- End diff --
Could you add a note describing how this differs from `InMemoryFileIndex.listLeafFiles`?
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21618#discussion_r216118793
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -724,4 +726,37 @@ object DataSource extends Logging {
         """.stripMargin)
     }
   }
+
+  /**
+   * Return all paths represented by the wildcard string.
+   * This is done in the main thread by default; when the value of the config
+   * `spark.sql.sources.parallelGetGlobbedPath.numThreads` is > 0, a local thread
+   * pool will expand the globbed paths.
+   */
+  private def getGlobbedPaths(
--- End diff --
Move this function into `SparkHadoopUtil`?
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21618#discussion_r216118763
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -656,6 +656,25 @@ object SQLConf {
       .intConf
       .createWithDefault(1)
 
+  val PARALLEL_GET_GLOBBED_PATH_THRESHOLD =
+    buildConf("spark.sql.sources.parallelGetGlobbedPath.threshold")
+      .doc("The maximum number of subfiles or directories allowed after a globbed path " +
+        "expansion.")
+      .intConf
+      .checkValue(threshold => threshold >= 0, "The maximum number of subfiles or directories " +
--- End diff --
`internal`?
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
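The one-word review (`internal`?) asks whether this entry should be hidden from user-facing documentation. Spark's ConfigBuilder does provide an internal() flag for that; applied to the quoted entry, inside object SQLConf, it would look roughly like the sketch below. The checkValue error message and the default value are truncated in the quoted diff, so the ones here are placeholders.
```
val PARALLEL_GET_GLOBBED_PATH_THRESHOLD =
  buildConf("spark.sql.sources.parallelGetGlobbedPath.threshold")
    .internal()  // hide from user-facing docs; the rest mirrors the quoted diff
    .doc("The maximum number of subfiles or directories allowed after a globbed path " +
      "expansion.")
    .intConf
    .checkValue(threshold => threshold >= 0,
      "The threshold must be non-negative.")  // placeholder: message truncated in the diff
    .createWithDefault(32)  // placeholder: the real default is not shown in the diff
```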
[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22359 **[Test build #95818 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95818/testReport)** for PR 22359 at commit [`71f382b`](https://github.com/apache/spark/commit/71f382bd7c9fdfda38b1bc8063b2a55dd56c00b4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22359 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22359 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2941/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21618#discussion_r216118629
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -724,4 +726,37 @@ object DataSource extends Logging {
         """.stripMargin)
     }
   }
+
+  /**
+   * Return all paths represented by the wildcard string.
+   * This is done in the main thread by default; when the value of the config
+   * `spark.sql.sources.parallelGetGlobbedPath.numThreads` is > 0, a local thread
+   * pool will expand the globbed paths.
+   */
+  private def getGlobbedPaths(
+      sparkSession: SparkSession,
--- End diff --
Could you drop `sparkSession` and then add `parallelGetGlobbedPathThreshold`?
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
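One reading of this suggestion, sketched here rather than taken from the merged change: pass the already-resolved config values in directly, so the helper no longer needs a SparkSession at all — which would also make the move into SparkHadoopUtil suggested above straightforward. The signature is hypothetical, is assumed to sit inside an enclosing object, and the body is elided.
```
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical post-review signature; ??? stands in for the unchanged body.
private def getGlobbedPaths(
    fs: FileSystem,
    qualifiedPattern: Path,
    parallelGetGlobbedPathThreshold: Int,
    parallelGetGlobbedPathNumThreads: Int): Seq[Path] = ???
```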
[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21618#discussion_r216118583
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1557,6 +1576,15 @@ class SQLConf extends Serializable with Logging {
   def parallelPartitionDiscoveryParallelism: Int =
     getConf(SQLConf.PARALLEL_PARTITION_DISCOVERY_PARALLELISM)
 
+  def parallelGetGlobbedPathThreshold: Int =
+    getConf(SQLConf.PARALLEL_GET_GLOBBED_PATH_THRESHOLD)
+
+  def parallelGetGlobbedPathNumThreads: Int =
+    getConf(SQLConf.PARALLEL_GET_GLOBBED_PATH_NUM_THREADS)
+
+  def parallelGetGlobbedPathEnabled: Boolean =
--- End diff --
Please move this into `DataSource`.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
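A plausible way to honor this, again only a sketch: compute the flag inside DataSource from the other two accessors instead of exposing a third SQLConf method. That "enabled" reduces to numThreads > 0 is an assumption based on the doc comment quoted earlier, not something the diff states.
```
// Hypothetical helper inside object DataSource, replacing the SQLConf accessor;
// assumes "enabled" means a positive thread count was configured.
private def parallelGetGlobbedPathEnabled(conf: SQLConf): Boolean =
  conf.parallelGetGlobbedPathNumThreads > 0
```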
[GitHub] spark issue #22361: Revert [SPARK-10399] [SPARK-23879] [SPARK-23762] [SPARK-...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22361
@gatorsmile Overall, I agree with the revert since the performance degradation is confirmed. When I ran TPC-DS in #19222, I did not see a performance regression like [here](https://github.com/apache/spark/pull/19222#issuecomment-378154854). For future analysis, I am very interested in:
* execution time of each query
* environment (machine, OS, JVM, cluster or local, ...)
* scaling factor
* # of repeats per query
* others

@npoggi and @winglungngai Thanks for your investigations. Would it be possible to share the above information?
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveD...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22359#discussion_r216118310
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala ---
@@ -803,6 +803,25 @@ class HiveDDLSuite
     }
   }
 
+  test("Insert overwrite directory should output correct schema") {
--- End diff --
In this PR, let's handle this test case only.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
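The quoted diff stops at the test's name. A plausible shape for such a test, assuming the usual QueryTest helpers available in HiveDDLSuite (withTempDir, sql, checkAnswer, Row), with SQL and assertions that are illustrative rather than taken from the PR:
```
test("Insert overwrite directory should output correct schema") {
  withTempDir { dir =>
    // Write with an explicit column alias, then verify the alias survives in
    // the files written to the directory (the schema issue this follow-up covers).
    sql(s"INSERT OVERWRITE LOCAL DIRECTORY '${dir.toURI}' STORED AS PARQUET SELECT 1 AS id")
    checkAnswer(spark.read.parquet(dir.getCanonicalPath).select("id"), Row(1))
  }
}
```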