[GitHub] [spark] zhengruifeng commented on a diff in pull request #39602: [SPARK-42021][CONNECT][PYTHON] Make `createDataFrame` support `array.array`
zhengruifeng commented on code in PR #39602: URL: https://github.com/apache/spark/pull/39602#discussion_r1070927723

## python/pyspark/sql/connect/session.py:

@@ -299,9 +296,9 @@ def createDataFrame(
             # we need to convert it to [[1], [2], [3]] to be able to infer schema.
             _data = [[d] for d in _data]
-        try:
-            _inferred_schema = self._inferSchemaFromList(_data, _cols)
-        except Exception:
+        _inferred_schema = self._inferSchemaFromList(_data, _cols)

Review Comment: the changes in this file are to make the error thrown in `test_array_types`

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng opened a new pull request, #39602: [SPARK-42021][CONNECT][PYTHON] Make `createDataFrame` support `array.array`
zhengruifeng opened a new pull request, #39602: URL: https://github.com/apache/spark/pull/39602

### What changes were proposed in this pull request?

Make `createDataFrame` support `array.array`.

### Why are the changes needed?

For parity with PySpark.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Enabled an existing UT and added a new UT.
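For context, a minimal standalone sketch (plain Python, no Spark) of what "supporting `array.array`" entails: inferring a Spark SQL element type from the array's typecode. The mapping below is an illustrative assumption, not PySpark's actual table (PySpark keeps its own typecode mapping in `pyspark.sql.types`).

```python
import array

# Illustrative typecode -> Spark SQL type-name mapping (an assumption,
# not PySpark's actual table).
TYPECODE_TO_SPARK = {
    "b": "tinyint",
    "h": "smallint",
    "i": "int",
    "l": "bigint",
    "f": "float",
    "d": "double",
}

def infer_element_type(arr: array.array) -> str:
    """Infer a Spark SQL element type name from an array.array's typecode."""
    if arr.typecode not in TYPECODE_TO_SPARK:
        raise TypeError(f"unsupported typecode: {arr.typecode!r}")
    return TYPECODE_TO_SPARK[arr.typecode]

print(infer_element_type(array.array("d", [1.0, 2.0])))  # double
```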
[GitHub] [spark] ulysses-you commented on a diff in pull request #39598: [SPARK-41708][SQL][FOLLOWUP] Override `equals` and `hashCode` of `TableIdentifier` to make `CatalogFileIndex` print the same re
ulysses-you commented on code in PR #39598: URL: https://github.com/apache/spark/pull/39598#discussion_r1070918433

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala:

@@ -104,6 +104,13 @@ case class TableIdentifier(table: String, database: Option[String], catalog: Opt
   def this(table: String) = this(table, None, None)
   def this(table: String, database: Option[String]) = this(table, database, None)
+
+  override def equals(obj: Any): Boolean = obj match {

Review Comment: it has already been used
[GitHub] [spark] peter-toth commented on a diff in pull request #38034: [SPARK-40599][SQL] Add multiTransform methods to TreeNode to generate alternatives
peter-toth commented on code in PR #38034: URL: https://github.com/apache/spark/pull/38034#discussion_r1070234296

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala:

@@ -618,6 +618,165 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product with Tre
     }
   }

+  /**
+   * Returns alternative copies of this node where `rule` has been recursively applied to the tree.
+   *
+   * Users should not expect a specific directionality. If a specific directionality is needed,
+   * multiTransformDown or multiTransformUp should be used.
+   *
+   * @param rule a function used to generate transformed alternatives for a node
+   * @return the stream of alternatives
+   */
+  def multiTransform(rule: PartialFunction[BaseType, Seq[BaseType]]): Stream[BaseType] = {
+    multiTransformDown(rule)
+  }
+
+  /**
+   * Returns alternative copies of this node where `rule` has been recursively applied to the tree.
+   *
+   * Users should not expect a specific directionality. If a specific directionality is needed,
+   * multiTransformDownWithPruning or multiTransformUpWithPruning should be used.
+   *
+   * @param rule a function used to generate transformed alternatives for a node
+   * @param cond a Lambda expression to prune tree traversals. If `cond.apply` returns false
+   *             on a TreeNode T, skips processing T and its subtree; otherwise, processes
+   *             T and its subtree recursively.
+   * @param ruleId is a unique Id for `rule` to prune unnecessary tree traversals. When it is
+   *               UnknownRuleId, no pruning happens. Otherwise, if `rule` (with id `ruleId`)
+   *               has been marked as in effective on a TreeNode T, skips processing T and its
+   *               subtree. Do not pass it if the rule is not purely functional and reads a
+   *               varying initial state for different invocations.
+   * @return the stream of alternatives
+   */
+  def multiTransformWithPruning(
+      cond: TreePatternBits => Boolean,
+      ruleId: RuleId = UnknownRuleId
+    )(rule: PartialFunction[BaseType, Seq[BaseType]]): Stream[BaseType] = {
+    multiTransformDownWithPruning(cond, ruleId)(rule).map(_._1)
+  }
+
+  /**
+   * Returns alternative copies of this node where `rule` has been recursively applied to it and all
+   * of its children (pre-order).
+   *
+   * @param rule the function used to generate transformed alternatives for a node
+   * @return the stream of alternatives
+   */
+  def multiTransformDown(rule: PartialFunction[BaseType, Seq[BaseType]]): Stream[BaseType] = {
+    multiTransformDownWithPruning(AlwaysProcess.fn, UnknownRuleId)(rule).map(_._1)
+  }
+
+  /**
+   * Returns alternative copies of this node where `rule` has been recursively applied to it and all
+   * of its children (pre-order).
+   *
+   * As it is very easy to generate enormous number of alternatives when the input tree is huge or
+   * when the rule returns large number of alternatives, this function returns the alternatives as a
+   * lazy `Stream` to be able to limit the number of alternatives generated at the caller side as
+   * needed.
+   *
+   * To indicate that the original node without any transformation is a valid alternative the rule
+   * can either:
+   *   - not apply or
+   *   - return an empty `Seq` or
+   *   - a `Seq` that contains a node that is equal to the original node.
+   *
+   * Please note that this function always consider the original node as a valid alternative (even
+   * if the original node is not included in the returned `Seq`) if the rule can transform any of
+   * the descendants of the node. E.g. consider a simple expression:
+   *   `Add(a, b)`
+   * and a rule that returns:
+   *   `Seq(1, 2)` for `a` and
+   *   `Seq(10, 20)` for `b` and
+   *   `Seq(11, 12, 21, 22)` for `Add(a, b)` (note that the original `Add(a, b)` is not returned)
+   * then the result of `multiTransform` is:
+   *   `Seq(11, 12, 21, 22, Add(1, 10), Add(2, 10), Add(1, 20), Add(2, 20))`.
+   * This feature makes the usage of `multiTransform` easier as a non-leaf transforming rule doesn't
+   * need to take into account that it can transform a descendant node of the non-leaf node as well
+   * and so it doesn't need return the non-leaf node itself in the list of alternatives to not stop
+   * generating alternatives.
+   *
+   * @param rule a function used to generate transformed alternatives for a node
+   * @param cond a Lambda expression to prune tree traversals. If `cond.apply` returns false
+   *             on a TreeNode T, skips processing T and its subtree; otherwise, processes
+   *             T and its subtree recursively.
+   * @param ruleId is a unique Id for `rule` to prune unnecessary tree traversals. When it is
+   *               UnknownRuleId, no pruning happens. Otherwise, if `rule` (with id `ruleId`)
+   *               has been marked
[GitHub] [spark] dongjoon-hyun opened a new pull request, #39601: [SPARK-42087][SQL][TESTS] Use `--no-same-owner` when `HiveExternalCatalogVersionsSuite` untars
dongjoon-hyun opened a new pull request, #39601: URL: https://github.com/apache/spark/pull/39601

### What changes were proposed in this pull request?

This PR aims to use a new parameter, `--no-same-owner`, when `HiveExternalCatalogVersionsSuite` untars.

### Why are the changes needed?

**BEFORE**
```
root@edeccfee6d53:/spark# tar xfz spark-3.3.1-bin-hadoop3.tgz
root@edeccfee6d53:/spark# ls -al spark-3.3.1-bin-hadoop3
total 96
drwxr-xr-x  17 110302528 1000   544 Oct 15 10:32 .
drwxr-xr-x   4 root      root   128 Jan 16 07:30 ..
-rw-r--r--   1 root      root 22940 Oct 15 10:32 LICENSE
-rw-r--r--   1 root      root 57842 Oct 15 10:32 NOTICE
drwxr-xr-x   3 110302528 1000    96 Oct 15 10:32 R
-rw-r--r--   1 root      root  4461 Oct 15 10:32 README.md
-rw-r--r--   1 root      root   165 Oct 15 10:32 RELEASE
drwxr-xr-x  29 110302528 1000   928 Oct 15 10:32 bin
drwxr-xr-x   8 110302528 1000   256 Oct 15 10:32 conf
drwxr-xr-x   5 110302528 1000   160 Oct 15 10:32 data
drwxr-xr-x   4 110302528 1000   128 Oct 15 10:32 examples
drwxr-xr-x 250 110302528 1000  8000 Oct 15 10:32 jars
drwxr-xr-x   4 110302528 1000   128 Oct 15 10:32 kubernetes
drwxr-xr-x  60 110302528 1000  1920 Oct 15 10:32 licenses
drwxr-xr-x  19 110302528 1000   608 Oct 15 10:32 python
drwxr-xr-x  29 110302528 1000   928 Oct 15 10:32 sbin
drwxr-xr-x   3 110302528 1000    96 Oct 15 10:32 yarn
```

**AFTER**
```
root@edeccfee6d53:/spark# tar xfz spark-3.3.1-bin-hadoop3.tgz --no-same-owner
root@edeccfee6d53:/spark# ls -al spark-3.3.1-bin-hadoop3
total 96
drwxr-xr-x  17 root root   544 Oct 15 10:32 .
drwxr-xr-x   4 root root   128 Jan 16 07:34 ..
-rw-r--r--   1 root root 22940 Oct 15 10:32 LICENSE
-rw-r--r--   1 root root 57842 Oct 15 10:32 NOTICE
drwxr-xr-x   3 root root    96 Oct 15 10:32 R
-rw-r--r--   1 root root  4461 Oct 15 10:32 README.md
-rw-r--r--   1 root root   165 Oct 15 10:32 RELEASE
drwxr-xr-x  29 root root   928 Oct 15 10:32 bin
drwxr-xr-x   8 root root   256 Oct 15 10:32 conf
drwxr-xr-x   5 root root   160 Oct 15 10:32 data
drwxr-xr-x   4 root root   128 Oct 15 10:32 examples
drwxr-xr-x 250 root root  8000 Oct 15 10:32 jars
drwxr-xr-x   4 root root   128 Oct 15 10:32 kubernetes
drwxr-xr-x  60 root root  1920 Oct 15 10:32 licenses
drwxr-xr-x  19 root root   608 Oct 15 10:32 python
drwxr-xr-x  29 root root   928 Oct 15 10:32 sbin
drwxr-xr-x   3 root root    96 Oct 15 10:32 yarn
```

### Does this PR introduce _any_ user-facing change?

No. This is a test-only improvement.

### How was this patch tested?

Pass the CIs.
[GitHub] [spark] LuciferYang commented on a diff in pull request #39598: [SPARK-41708][SQL][FOLLOWUP] Override `equals` and `hashCode` of `TableIdentifier` to make `CatalogFileIndex` print the same re
LuciferYang commented on code in PR #39598: URL: https://github.com/apache/spark/pull/39598#discussion_r1070909295

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala:

@@ -104,6 +104,13 @@ case class TableIdentifier(table: String, database: Option[String], catalog: Opt
   def this(table: String) = this(table, None, None)
   def this(table: String, database: Option[String]) = this(table, database, None)
+
+  override def equals(obj: Any): Boolean = obj match {

Review Comment: @ulysses-you do you mean changing https://github.com/apache/spark/blob/2062c743df40c81ee2347ead8e9ae8d078f10c57/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala#L173 to use `SQLQueryTestHelper.replaceNotIncludedMsg(output)`?
[GitHub] [spark] ulysses-you commented on a diff in pull request #39598: [SPARK-41708][SQL][FOLLOWUP] Override `equals` and `hashCode` of `TableIdentifier` to make `CatalogFileIndex` print the same re
ulysses-you commented on code in PR #39598: URL: https://github.com/apache/spark/pull/39598#discussion_r1070905951

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala:

@@ -104,6 +104,13 @@ case class TableIdentifier(table: String, database: Option[String], catalog: Opt
   def this(table: String) = this(table, None, None)
   def this(table: String, database: Option[String]) = this(table, database, None)
+
+  override def equals(obj: Any): Boolean = obj match {

Review Comment: We can replace `CatalogFileIndex@xxx` with `CatalogFileIndex` in `SQLQueryTestHelper#replaceNotIncludedMsg` to fix the test.
[GitHub] [spark] EnricoMi commented on a diff in pull request #39556: [SPARK-42049][SQL] Improve AliasAwareOutputExpression
EnricoMi commented on code in PR #39556: URL: https://github.com/apache/spark/pull/39556#discussion_r1070900791

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:

@@ -435,6 +435,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
+
+  val EXPRESSION_PROJECTION_CANDIDATE_LIMIT =
+    buildConf("spark.sql.optimizer.expressionProjectionCandidateLimit")
+      .doc("The maximum number of the candidate of out put expressions whose alias are replaced." +
+        " It can preserve the output partitioning and ordering." +
+        " Negative value means disable this optimization.")

Review Comment:
```suggestion
      .doc("The maximum number of candidates for output expressions whose aliases are replaced." +
        " This can preserve the output partitioning and ordering." +
        " A negative value means to disable this optimization.")
```
[GitHub] [spark] LuciferYang commented on pull request #39598: [SPARK-41708][SQL][FOLLOWUP] Override `equals` and `hashCode` of `TableIdentifier` to make `CatalogFileIndex` print the same results when
LuciferYang commented on PR #39598: URL: https://github.com/apache/spark/pull/39598#issuecomment-1383586233

> Could you make a separate test PR to scrub the hashCode instead of this follow-up, @LuciferYang ?
>
> https://github.com/apache/spark/blob/2062c743df40c81ee2347ead8e9ae8d078f10c57/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L843-L845
>
> I believe you can do that here in SQLQueryTestSuite.
>
> https://github.com/apache/spark/blob/2062c743df40c81ee2347ead8e9ae8d078f10c57/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala#L173

ok, let me try this
[GitHub] [spark] EnricoMi commented on a diff in pull request #37525: [WIP][SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases in
EnricoMi commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1070898729

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:

@@ -435,6 +435,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
+
+  val EXPRESSION_PROJECTION_CANDIDATE_LIMIT =
+    buildConf("spark.sql.optimizer.expressionProjectionCandidateLimit")
+      .doc("The maximum number of the candidate of out put expressions whose alias are replaced." +
+        " It can preserve the output partitioning and ordering." +
+        " Negative value means disable this optimization.")

Review Comment:
```suggestion
      .doc("The maximum number of candidates for output expressions whose aliases are replaced." +
        " This can preserve the output partitioning and ordering." +
        " Negative value means to disable this optimization.")
```
[GitHub] [spark] zhengruifeng commented on pull request #39600: [SPARK-42032][SPARK-41988][CONNECT][PYTHON][TESTS] `MapType` related doctests sort the dicts
zhengruifeng commented on PR #39600: URL: https://github.com/apache/spark/pull/39600#issuecomment-1383583994

@HyukjinKwon @beliefer
[GitHub] [spark] zhengruifeng opened a new pull request, #39600: [SPARK-42032][SPARK-41988][CONNECT][PYTHON][TESTS] `MapType` related doctests sort the dicts
zhengruifeng opened a new pull request, #39600: URL: https://github.com/apache/spark/pull/39600

### What changes were proposed in this pull request?

Make `MapType`-related doctests sort the dicts.

### Why are the changes needed?

`MapType` in Spark doesn't guarantee the ordering of entries.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Updated UT.
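The fix pattern, as a standalone sketch: since `MapType` results come back as Python dicts with no guaranteed entry order, a doctest can print `sorted(...)` to get stable output. The sample data below is illustrative.

```python
# A value as it might be collected from a MapType column (illustrative data).
row = {"b": 2, "a": 1, "c": 3}

# Sorting the items makes the printed form deterministic for doctests,
# regardless of the order Spark returned the map entries in.
print(sorted(row.items()))  # [('a', 1), ('b', 2), ('c', 3)]
```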
[GitHub] [spark] LuciferYang commented on a diff in pull request #39598: [SPARK-41708][SQL][FOLLOWUP] Override `equals` and `hashCode` of `TableIdentifier` to make `CatalogFileIndex` print the same re
LuciferYang commented on code in PR #39598: URL: https://github.com/apache/spark/pull/39598#discussion_r1070897426

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala:

@@ -104,6 +104,13 @@ case class TableIdentifier(table: String, database: Option[String], catalog: Opt
   def this(table: String) = this(table, None, None)
   def this(table: String, database: Option[String]) = this(table, database, None)
+
+  override def equals(obj: Any): Boolean = obj match {

Review Comment: just print `org.apache.spark.sql.execution.datasources.CatalogFileIndex`?
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39598: [SPARK-41708][SQL][FOLLOWUP] Override `equals` and `hashCode` of `TableIdentifier` to make `CatalogFileIndex` print the same
dongjoon-hyun commented on code in PR #39598: URL: https://github.com/apache/spark/pull/39598#discussion_r1070896775

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala:

@@ -104,6 +104,13 @@ case class TableIdentifier(table: String, database: Option[String], catalog: Opt
   def this(table: String) = this(table, None, None)
   def this(table: String, database: Option[String]) = this(table, database, None)
+
+  override def equals(obj: Any): Boolean = obj match {

Review Comment: The better solution would be to normalize the output by removing the `CatalogFileIndex@xxx` part.
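The suggested normalization could look like the following sketch (a hypothetical helper written here in Python for illustration, not the actual Scala `SQLQueryTestHelper` code): strip the unstable `@hashCode` suffix so the golden-file output matches across Scala 2.12 and 2.13.

```python
import re

def scrub_catalog_file_index(output: str) -> str:
    """Replace 'CatalogFileIndex@<hex hashCode>' with plain 'CatalogFileIndex'."""
    return re.sub(r"CatalogFileIndex@[0-9a-fA-F]+", "CatalogFileIndex", output)

line = "org.apache.spark.sql.execution.datasources.CatalogFileIndex@5f2e3c1a"
print(scrub_catalog_file_index(line))
# org.apache.spark.sql.execution.datasources.CatalogFileIndex
```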
[GitHub] [spark] LuciferYang commented on a diff in pull request #39598: [SPARK-41708][SQL][FOLLOWUP] Override `equals` and `hashCode` of `TableIdentifier` to make `CatalogFileIndex` print the same re
LuciferYang commented on code in PR #39598: URL: https://github.com/apache/spark/pull/39598#discussion_r1070893736

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala:

@@ -104,6 +104,13 @@ case class TableIdentifier(table: String, database: Option[String], catalog: Opt
   def this(table: String) = this(table, None, None)
   def this(table: String, database: Option[String]) = this(table, database, None)
+
+  override def equals(obj: Any): Boolean = obj match {

Review Comment: Currently it prints `org.apache.spark.sql.execution.datasources.CatalogFileIndex@hashCode`
[GitHub] [spark] LuciferYang commented on a diff in pull request #39598: [SPARK-41708][SQL][FOLLOWUP] Override `equals` and `hashCode` of `TableIdentifier` to make `CatalogFileIndex` print the same re
LuciferYang commented on code in PR #39598: URL: https://github.com/apache/spark/pull/39598#discussion_r1070893144

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala:

@@ -104,6 +104,13 @@ case class TableIdentifier(table: String, database: Option[String], catalog: Opt
   def this(table: String) = this(table, None, None)
   def this(table: String, database: Option[String]) = this(table, database, None)
+
+  override def equals(obj: Any): Boolean = obj match {

Review Comment: cc @cloud-fan @HyukjinKwon @dongjoon-hyun @ulysses-you

The daily build GA `Master Scala 2.13 + Hadoop 3 + JDK 8` has been failing since https://github.com/apache/spark/pull/39468 was merged:

- https://github.com/apache/spark/actions/runs/3924877161
- https://github.com/apache/spark/actions/runs/3919886270
- https://github.com/apache/spark/actions/runs/3914162890
- https://github.com/apache/spark/actions/runs/3905252989
- https://github.com/apache/spark/actions/runs/3895875210

Are there any suggestions for a better solution? For example, overriding `toString` of `CatalogFileIndex`? What would be better to print?
[GitHub] [spark] dongjoon-hyun commented on pull request #39599: [SPARK-42086][SQL][TESTS] Sort test cases in SQLQueryTestSuite
dongjoon-hyun commented on PR #39599: URL: https://github.com/apache/spark/pull/39599#issuecomment-1383569534

Thank you, @HyukjinKwon !
[GitHub] [spark] dongjoon-hyun closed pull request #39593: [SPARK-42083][K8S] Make `(Executor|StatefulSet)PodsAllocator` extendable
dongjoon-hyun closed pull request #39593: [SPARK-42083][K8S] Make `(Executor|StatefulSet)PodsAllocator` extendable URL: https://github.com/apache/spark/pull/39593
[GitHub] [spark] dongjoon-hyun commented on pull request #39593: [SPARK-42083][K8S] Make `(Executor|StatefulSet)PodsAllocator` extendable
dongjoon-hyun commented on PR #39593: URL: https://github.com/apache/spark/pull/39593#issuecomment-1383568837

All K8s tests passed. Merged to master for Apache Spark 3.4.0.
[GitHub] [spark] dongjoon-hyun opened a new pull request, #39599: [SPARK-42086][SQL][TESTS] Sort test cases in SQLQueryTestSuite
dongjoon-hyun opened a new pull request, #39599: URL: https://github.com/apache/spark/pull/39599

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39492: [SPARK-41876][CONNECT][PYTHON] `test_dataframe` should catch both `AnalysisException` and `SparkConnectAnalysisException`
HyukjinKwon commented on code in PR #39492: URL: https://github.com/apache/spark/pull/39492#discussion_r1070885000

## python/pyspark/sql/tests/test_dataframe.py:

@@ -42,6 +42,7 @@
     DayTimeIntervalType,
 )
 from pyspark.sql.utils import AnalysisException, IllegalArgumentException
+from pyspark.sql.connect.client import SparkConnectAnalysisException

Review Comment: https://github.com/apache/spark/pull/39387 is merged. Could we move the Spark Connect errors there, and fix the test cases here? See also https://github.com/apache/spark/pull/39591. cc @itholic
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39450: [SPARK-41897][CONNECT][TESTS] Enable tests with error mismatch in connect/test_parity_functions.py
HyukjinKwon commented on code in PR #39450: URL: https://github.com/apache/spark/pull/39450#discussion_r1070884942

## python/pyspark/sql/tests/test_functions.py:

@@ -24,6 +24,7 @@
 from py4j.protocol import Py4JJavaError
 from pyspark.sql import Row, Window, types
+from pyspark.sql.connect.client import SparkConnectException

Review Comment: https://github.com/apache/spark/pull/39387 is merged. Could we move the Spark Connect errors there, and fix the test cases here? See also https://github.com/apache/spark/pull/39591. cc @itholic
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39591: [SPARK-42078][PYTHON] Migrate errors thrown by JVM into `PySparkException`.
HyukjinKwon commented on code in PR #39591: URL: https://github.com/apache/spark/pull/39591#discussion_r1070884205

## python/pyspark/errors/exceptions.py: ## @@ -74,3 +79,187 @@
```
    def getMessageParameters(self) -> Optional[Dict[str, str]]:

    def __str__(self) -> str:
        return f"[{self.getErrorClass()}] {self.message}"


class CapturedException(PySparkException):
    def __init__(
        self,
        desc: Optional[str] = None,
        stackTrace: Optional[str] = None,
        cause: Optional[Py4JJavaError] = None,
        origin: Optional[Py4JJavaError] = None,
    ):
        # desc & stackTrace vs origin are mutually exclusive.
        # cause is optional.
        assert (origin is not None and desc is None and stackTrace is None) or (
            origin is None and desc is not None and stackTrace is not None
        )

        self.desc = desc if desc is not None else cast(Py4JJavaError, origin).getMessage()
        assert SparkContext._jvm is not None
        self.stackTrace = (
            stackTrace
            if stackTrace is not None
            else (SparkContext._jvm.org.apache.spark.util.Utils.exceptionString(origin))
        )
        self.cause = convert_exception(cause) if cause is not None else None
        if self.cause is None and origin is not None and origin.getCause() is not None:
            self.cause = convert_exception(origin.getCause())
        self._origin = origin

    def __str__(self) -> str:
        assert SparkContext._jvm is not None

        jvm = SparkContext._jvm
        sql_conf = jvm.org.apache.spark.sql.internal.SQLConf.get()
        debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
        desc = self.desc
        if debug_enabled:
            desc = desc + "\n\nJVM stacktrace:\n%s" % self.stackTrace
        return str(desc)

    def getErrorClass(self) -> Optional[str]:
        assert SparkContext._gateway is not None

        gw = SparkContext._gateway
        if self._origin is not None and is_instance_of(
            gw, self._origin, "org.apache.spark.SparkThrowable"
        ):
            return self._origin.getErrorClass()
        else:
            return None

    def getSqlState(self) -> Optional[str]:
        assert SparkContext._gateway is not None

        gw = SparkContext._gateway
        if self._origin is not None and is_instance_of(
            gw, self._origin, "org.apache.spark.SparkThrowable"
        ):
            return self._origin.getSqlState()
        else:
            return None


def convert_exception(e: Py4JJavaError) -> CapturedException:
```
Review Comment: If you're exposing these classes as API, you should list the classes in `__all__` and document them.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
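The `CapturedException` constructor above enforces that either `origin` alone, or `desc` together with `stackTrace`, is supplied. As a minimal sketch of that mutually-exclusive-arguments pattern (using a hypothetical `FakeJavaError` stand-in for `Py4JJavaError`, not the real class):

```python
from typing import Optional


class FakeJavaError:
    """Hypothetical stand-in for a Py4JJavaError, for illustration only."""

    def __init__(self, message: str):
        self._message = message

    def getMessage(self) -> str:
        return self._message


class CapturedLikeException(Exception):
    """Sketch of the constructor contract: either `origin` alone,
    or `desc` and `stackTrace` together -- never a mix of both."""

    def __init__(
        self,
        desc: Optional[str] = None,
        stackTrace: Optional[str] = None,
        origin: Optional[FakeJavaError] = None,
    ):
        assert (origin is not None and desc is None and stackTrace is None) or (
            origin is None and desc is not None and stackTrace is not None
        )
        # Fall back to the JVM-side message when only `origin` is given.
        self.desc = desc if desc is not None else origin.getMessage()
        self.stackTrace = stackTrace


e1 = CapturedLikeException(origin=FakeJavaError("boom"))
e2 = CapturedLikeException(desc="boom", stackTrace="at Foo.bar")
```

Passing both `desc` and `origin` trips the assertion, which is what surfaces misuse early in the real class as well.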
[GitHub] [spark] LuciferYang opened a new pull request, #39598: [SPARK-41708][SQL][FOLLOWUP] Override `equals` and `hashCode` to make `CatalogFileIndex` print the same results when using Scala 2.12 an
LuciferYang opened a new pull request, #39598: URL: https://github.com/apache/spark/pull/39598

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] dongjoon-hyun commented on pull request #39593: [SPARK-42083][K8S] Make `(Executor|StatefulSet)PodsAllocator` extendable
dongjoon-hyun commented on PR #39593: URL: https://github.com/apache/spark/pull/39593#issuecomment-1383542749 Thank you, @viirya and @huaxingao !
[GitHub] [spark] dongjoon-hyun commented on pull request #39597: [SPARK-41990][SQL][FOLLOWUP] Add comments to explain why `FieldReference.column` is used when `ParseException` happens
dongjoon-hyun commented on PR #39597: URL: https://github.com/apache/spark/pull/39597#issuecomment-1383542350 +1 from my side. I'll leave this PR to @cloud-fan . Thanks!
[GitHub] [spark] huaxingao commented on pull request #39597: [SPARK-41990][SQL][FOLLOWUP] Add comments to explain why `FieldReference.column` is used when `ParseException` happens
huaxingao commented on PR #39597: URL: https://github.com/apache/spark/pull/39597#issuecomment-1383529292 cc @cloud-fan @dongjoon-hyun
[GitHub] [spark] huaxingao opened a new pull request, #39597: [SPARK-41990][SQL][FOLLOWUP] Add comments to explain why `FieldReference.column` is used when `ParseException` happens
huaxingao opened a new pull request, #39597: URL: https://github.com/apache/spark/pull/39597

### What changes were proposed in this pull request?
Add comments to explain why `FieldReference.column` is used when `ParseException` happens

### Why are the changes needed?
To address this [comment](https://github.com/apache/spark/pull/39564#discussion_r1070788648)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only comments are added. No need to test.
[GitHub] [spark] cloud-fan commented on pull request #39596: [SPARK-42084][SQL] Avoid leaking the qualified-access-only restriction
cloud-fan commented on PR #39596: URL: https://github.com/apache/spark/pull/39596#issuecomment-1383528602 cc @viirya @gengliangwang
[GitHub] [spark] cloud-fan opened a new pull request, #39596: [SPARK-42084][SQL] Avoid leaking the qualified-access-only restriction
cloud-fan opened a new pull request, #39596: URL: https://github.com/apache/spark/pull/39596

### What changes were proposed in this pull request?
This is a better fix than https://github.com/apache/spark/pull/39077 and https://github.com/apache/spark/pull/38862

The special attribute metadata `__qualified_access_only` is very risky, as it breaks normal column resolution. The aforementioned 2 PRs remove the restriction in `SubqueryAlias` and `Alias`, but it's not good enough as we may forget to do the same thing for new logical plans/expressions in the future. It's also problematic if advanced users manipulate logical plans and expressions directly, when there is no `SubqueryAlias` and `Alias` to remove the restriction.

To be safe, we should only apply this restriction when resolving join hidden columns, which means the plan node right above `Project(Join(using or natural join))`. This PR simply removes the restriction when a column is resolved from a sequence of `Attributes`, and also when adding the `Project` hidden columns to its output. This makes sure that the qualified-access-only restriction will not be leaked to normal column resolution, but only metadata column resolution.

### Why are the changes needed?
To make the join hidden column feature more robust

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing tests
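The idea behind the `__qualified_access_only` metadata flag (the key name is taken from the PR description; the resolver below is a toy model, not Spark's actual analyzer) is that a flagged attribute is invisible to unqualified lookups and only resolvable through its qualifier:

```python
QUALIFIED_ACCESS_ONLY = "__qualified_access_only"


class Attribute:
    """Toy attribute: a name, an optional qualifier, and a metadata dict."""

    def __init__(self, name, qualifier=None, metadata=None):
        self.name = name
        self.qualifier = qualifier
        self.metadata = metadata or {}


def resolve(attrs, name, qualifier=None):
    """Toy column resolution: attributes flagged qualified-access-only
    are matched only when referenced with their qualifier."""
    for a in attrs:
        if a.name != name:
            continue
        if a.metadata.get(QUALIFIED_ACCESS_ONLY) and qualifier != a.qualifier:
            continue  # hidden from unqualified (or wrongly qualified) access
        return a
    return None


# e.g. a join hidden column from t2 vs. a normal column from t1
attrs = [
    Attribute("id", qualifier="t2", metadata={QUALIFIED_ACCESS_ONLY: True}),
    Attribute("value", qualifier="t1"),
]
```

The PR's point is that this filtering should happen only during metadata column resolution right above the join's `Project`, so the flag never leaks into normal column resolution elsewhere in the plan.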
[GitHub] [spark] xinrong-meng commented on a diff in pull request #39585: [WIP] Unregistered Python UDF in Spark Connect
xinrong-meng commented on code in PR #39585: URL: https://github.com/apache/spark/pull/39585#discussion_r1070861631 ## connector/connect/common/src/main/protobuf/spark/connect/expressions.proto: ## @@ -43,6 +43,8 @@ message Expression { Window window = 11; UnresolvedExtractValue unresolved_extract_value = 12; UpdateFields update_fields = 13; +PythonFunction python_function = 14; Review Comment: Sounds good.
[GitHub] [spark] xinrong-meng commented on pull request #39585: [WIP] Unregistered Python UDF in Spark Connect
xinrong-meng commented on PR #39585: URL: https://github.com/apache/spark/pull/39585#issuecomment-1383524848 Thanks @grundprinzip! I didn't notice your reviews before the force push. I was going to make the commit history clear by categorizing commits into client, server, etc. (via commit messages) before the PR is ready for review.
[GitHub] [spark] huaxingao commented on a diff in pull request #39564: [SPARK-41990][SQL] Use `FieldReference.column` instead of `apply` in V1 to V2 filter conversion
huaxingao commented on code in PR #39564: URL: https://github.com/apache/spark/pull/39564#discussion_r1070860837 ## sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala: ## @@ -74,6 +75,15 @@ sealed abstract class Filter { * Converts V1 filter to V2 filter */ private[sql] def toV2: Predicate + + protected def toV2Column(attribute: String): NamedReference = { +try { + FieldReference(attribute) Review Comment: I will have a follow up to add some comments.
[GitHub] [spark] huaxingao commented on a diff in pull request #39564: [SPARK-41990][SQL] Use `FieldReference.column` instead of `apply` in V1 to V2 filter conversion
huaxingao commented on code in PR #39564: URL: https://github.com/apache/spark/pull/39564#discussion_r1070860368 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala: ## @@ -1963,4 +1977,9 @@ class JDBCSuite extends QueryTest with SharedSparkSession { } } } + + test("SPARK-41990: Filter with composite name") { +val df = sql("SELECT * FROM composite_name WHERE `last name` = 'smith'") Review Comment: Right. Support for nested predicate pushdown is calculated by this [method](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala#L108), which returns false for JDBC V1.
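The distinction SPARK-41990 is about can be sketched abstractly: parsing a field reference splits on dots into nesting levels, whereas treating it as a single top-level column keeps spaces and dots intact. The two functions below are a toy model of the two behaviors, not Spark's actual `FieldReference` implementation (which also honors backtick quoting):

```python
def field_reference_apply(s):
    """Toy model of *parsing* a field reference: dots separate
    nesting levels, so 'a.b' becomes a nested path ('a', 'b')."""
    return tuple(s.split("."))


def field_reference_column(s):
    """Toy model of treating the whole string as one top-level
    column name, dots and spaces included."""
    return (s,)
```

For a source that does not support nested predicate pushdown (like JDBC V1 in the review comment above), a column literally named `last name` or `a.b` must go through the column-style path so it is not mis-split.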
[GitHub] [spark] xinrong-meng commented on a diff in pull request #39585: [WIP] Unregistered Python UDF in Spark Connect
xinrong-meng commented on code in PR #39585: URL: https://github.com/apache/spark/pull/39585#discussion_r1070860268

## python/pyspark/sql/connect/udf.py: ## @@ -0,0 +1,184 @@
```
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
"""
User-defined function related classes and functions
"""
import functools
from typing import Callable, Any, TYPE_CHECKING, Optional

from pyspark import cloudpickle
from pyspark.sql.connect.expressions import ColumnReference, PythonFunction, PythonUDF
from pyspark.sql.connect.column import Column
from pyspark.sql.types import DataType


if TYPE_CHECKING:
    from pyspark.sql.connect._typing import (
        ColumnOrName,
        DataTypeOrString,
        UserDefinedFunctionLike,
    )
    from pyspark.sql.types import StringType
    from pyspark.sql.session import SparkSession


def _wrap_function(
```
Review Comment: Removed.
[GitHub] [spark] dongjoon-hyun commented on pull request #39593: [SPARK-42083][K8S] Make `ExecutorPodsAllocator` extendable
dongjoon-hyun commented on PR #39593: URL: https://github.com/apache/spark/pull/39593#issuecomment-1383521322 Thank you. Sure, let me open it too.
[GitHub] [spark] czxm opened a new pull request, #39595: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases
czxm opened a new pull request, #39595: URL: https://github.com/apache/spark/pull/39595

### What changes were proposed in this pull request?
This is based on @coalchan 's previous work of https://github.com/apache/spark/pull/35549 . I enhanced it as per @jackylee-ch 's comment below so that no new parameter is needed. https://github.com/apache/spark/pull/35549#issuecomment-1062790013

### Why are the changes needed?
1) The method listPartitions is called in order to get the locations of partitions and compute custom partition locations (the variable customPartitionLocations), but customPartitionLocations is only used while overwriting static Hive partitions.
2) The method listPartitions is costly and only used in one case, so we can use listPartitionNames to reduce requests on the Hive metastore DB.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
1) Created a nested partitioned table.
2) Ran the insert/overwrite static/dynamic partition cases, such as:
insert overwrite table p1 partition(p1="1",p2="1.1",p3="1.1.1") select 1
insert into table p1 partition(p1="1",p2="1.1",p3="1.1.1") select 1
insert overwrite table p1 partition(p1="1",p2,p3) select 1,"1.1","1.1.1"
insert into table p1 partition(p1="1",p2,p3) select 1,"1.1","1.1.1"
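The cost difference the PR exploits is that `listPartitionNames` returns only partition-name strings (e.g. `p1=1/p2=1.1/p3=1.1.1`), from which the partition spec is trivially recoverable, while `listPartitions` additionally fetches full storage metadata per partition. A toy parser for such name strings (illustrative only, not Spark's implementation):

```python
def parse_partition_name(name):
    """Parse a Hive-style partition name like 'p1=1/p2=1.1/p3=1.1.1'
    into a spec dict -- the cheap information that listPartitionNames
    already provides, without a per-partition metadata fetch."""
    return dict(part.split("=", 1) for part in name.split("/"))
```

When only the spec is needed (the common case here), parsing names avoids round trips to the metastore DB for partition details.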
[GitHub] [spark] zhengruifeng opened a new pull request, #39594: [SPARK-42085][CONNECT][PYTHON] Make `from_arrow_schema` support nested types
zhengruifeng opened a new pull request, #39594: URL: https://github.com/apache/spark/pull/39594

### What changes were proposed in this pull request?
Make `from_arrow_schema` support nested types

### Why are the changes needed?
So that we no longer need to fetch the schema via `self.schema`, which was only a workaround for `from_arrow_schema` not supporting some nested types.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing UT
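Supporting nested types in a schema converter essentially means recursing into child types. The sketch below is a toy arrow-to-Spark type mapper over a simplified type description (the type names and encoding here are illustrative assumptions, not pyspark's or pyarrow's actual API):

```python
def from_arrow_like_type(t):
    """Toy recursive mapping from an arrow-like type description to a
    Spark-like type string. Primitives are plain strings; nested types
    are tuples like ('list', child) or ('struct', (name, child), ...)."""
    if isinstance(t, str):
        # Primitive leaf types (only a couple mapped, for illustration).
        return {"int64": "long", "string": "string"}[t]
    kind, *children = t
    if kind == "list":
        return f"array<{from_arrow_like_type(children[0])}>"
    if kind == "struct":
        fields = ",".join(f"{n}:{from_arrow_like_type(c)}" for n, c in children)
        return f"struct<{fields}>"
    raise ValueError(f"unsupported type: {t!r}")
```

The recursion is the whole point: a converter that handles only the primitive leaves forces callers into workarounds (like fetching the schema another way) as soon as an `array` or `struct` column appears.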
[GitHub] [spark] dongjoon-hyun commented on pull request #39593: [SPARK-42083][K8S] Make `ExecutorPodsAllocator` extendable
dongjoon-hyun commented on PR #39593: URL: https://github.com/apache/spark/pull/39593#issuecomment-1383513453 Could you review this please, @viirya ?
[GitHub] [spark] dongjoon-hyun opened a new pull request, #39593: [SPARK-42083][K8S] Make `ExecutorPodsAllocator` extendable
dongjoon-hyun opened a new pull request, #39593: URL: https://github.com/apache/spark/pull/39593

### What changes were proposed in this pull request?
This PR aims to make `ExecutorPodsAllocator` extendable by switching `private` to `protected`.

### Why are the changes needed?
This will help users of the `ExternalClusterManager` interface in a K8s environment by letting them reuse more of the existing implementation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.
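The private-to-protected change is in Scala, but the effect on extendability has a close Python analogue: double-underscore attributes are name-mangled and effectively private to the defining class, while single-underscore members are conventionally open to subclasses. A sketch (class and member names are invented for illustration):

```python
class BaseAllocator:
    def __init__(self):
        self.__private_state = "hidden"   # name-mangled: hostile to subclasses
        self._protected_state = "open"    # conventionally available to subclasses

    def _on_allocate(self):
        # Hook a subclass may override once it is not private.
        return "base"


class CustomAllocator(BaseAllocator):
    def _on_allocate(self):
        # The subclass can build on the protected member directly.
        return f"custom:{self._protected_state}"
```

With private members, a custom allocator would have to copy the base-class logic; exposing them as protected lets an external cluster manager subclass and override only the hooks it needs.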
[GitHub] [spark] cloud-fan commented on pull request #39592: [SPARK-42081][SQL] Improve the plan change validation
cloud-fan commented on PR #39592: URL: https://github.com/apache/spark/pull/39592#issuecomment-1383506642 cc @dongjinleekr @viirya @gengliangwang @dtenedor
[GitHub] [spark] cloud-fan opened a new pull request, #39592: [SPARK-42081][SQL] Improve the plan change validation
cloud-fan opened a new pull request, #39592: URL: https://github.com/apache/spark/pull/39592

### What changes were proposed in this pull request?
Today, we validate the plan change after each rule execution, which is very useful to find issues in the rule. However, it has 2 problems:
1. it can only be used in the tests
2. it just returns a boolean, making it hard to know what's wrong with the rule

This PR addresses these 2 problems:
1. adds a new config to allow people to turn it on to debug production workloads
2. throws an exception immediately if a violation is found.

### Why are the changes needed?
To make it easier to debug queries using plan change validation.

### Does this PR introduce _any_ user-facing change?
No, the new config is internal.

### How was this patch tested?
updated tests

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
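The shift from "return a boolean" to "raise immediately with details" can be sketched in a few lines (a toy validator over a toy plan, under the assumption that duplicate output columns stand in for a real plan invariant; none of these names come from Spark's code):

```python
class PlanValidationError(Exception):
    """Raised as soon as a rule produces an invalid plan, carrying the
    rule name and the specific violation instead of a bare False."""


def validate_plan_change(rule_name, plan, validate=False):
    """Toy plan-change validation: `plan` is a dict with an 'output'
    list; the `validate` flag plays the role of the new config."""
    if not validate:
        return  # validation disabled, e.g. in production by default
    seen = set()
    for name in plan["output"]:
        if name in seen:
            raise PlanValidationError(
                f"rule {rule_name} produced duplicate output column: {name}"
            )
        seen.add(name)
```

A raised exception pinpoints the offending rule and column at the moment of violation, which is exactly what a boolean result cannot do.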
[GitHub] [spark] zhengruifeng commented on pull request #39588: [SPARK-42077][CONNECT][PYTHON] Literal should throw TypeError for unsupported DataType
zhengruifeng commented on PR #39588: URL: https://github.com/apache/spark/pull/39588#issuecomment-1383504836 merged into master
[GitHub] [spark] zhengruifeng closed pull request #39588: [SPARK-42077][CONNECT][PYTHON] Literal should throw TypeError for unsupported DataType
zhengruifeng closed pull request #39588: [SPARK-42077][CONNECT][PYTHON] Literal should throw TypeError for unsupported DataType URL: https://github.com/apache/spark/pull/39588
[GitHub] [spark] dongjoon-hyun closed pull request #39589: [MINOR][CORE][TESTS] Suppress unchecked warnings in OneForOneStreamManagerSuite
dongjoon-hyun closed pull request #39589: [MINOR][CORE][TESTS] Suppress unchecked warnings in OneForOneStreamManagerSuite URL: https://github.com/apache/spark/pull/39589
[GitHub] [spark] dongjoon-hyun commented on pull request #39589: [MINOR][CORE][TESTS] Suppress unchecked warnings in OneForOneStreamManagerSuite
dongjoon-hyun commented on PR #39589: URL: https://github.com/apache/spark/pull/39589#issuecomment-1383503641 Thank you. Building tests passed. Merged to master.
[GitHub] [spark] zhengruifeng commented on pull request #39587: [SPARK-42076][CONNECT][PYTHON] Factor data conversion `arrow -> rows` out to `conversion.py`
zhengruifeng commented on PR #39587: URL: https://github.com/apache/spark/pull/39587#issuecomment-1383502337 merged into master, thanks @HyukjinKwon for the reviews!
[GitHub] [spark] zhengruifeng closed pull request #39587: [SPARK-42076][CONNECT][PYTHON] Factor data conversion `arrow -> rows` out to `conversion.py`
zhengruifeng closed pull request #39587: [SPARK-42076][CONNECT][PYTHON] Factor data conversion `arrow -> rows` out to `conversion.py` URL: https://github.com/apache/spark/pull/39587
[GitHub] [spark] itholic opened a new pull request, #39591: [SPARK-42078][PYTHON] Migrate errors thrown by JVM into `PySparkException`.
itholic opened a new pull request, #39591: URL: https://github.com/apache/spark/pull/39591

### What changes were proposed in this pull request?
This PR proposes to migrate errors thrown by the JVM into `PySparkException`.

### Why are the changes needed?
Because the errors thrown by the JVM side (e.g. `AnalysisException`, `ParseException`, `IllegalArgumentException`, ...) are managed by `pyspark/sql/utils.py`, which is not the right place to handle PySpark errors. Now that we have the separate error package `pyspark.errors`, introduced in https://github.com/apache/spark/pull/39387, we should centralize all errors generated by PySpark into a single management point. So, here I suggest that all errors occurring in the JVM inherit from PySparkException so that errors can be managed consistently.

### Does this PR introduce _any_ user-facing change?
No, this PR improves the internal error handling structure without any user-facing change.

### How was this patch tested?
The existing CI should pass.
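The payoff of having JVM-side errors inherit from a single base is that one `except` clause handles all of them uniformly. A minimal sketch of the proposed hierarchy (`PySparkException` and `getErrorClass` appear in the PR's diff; the `AnalysisException` subclass and `handle` helper here are illustrative):

```python
class PySparkException(Exception):
    """Sketch of the base class all PySpark errors would inherit from."""

    def __init__(self, message, error_class=None):
        super().__init__(message)
        self.message = message
        self._error_class = error_class

    def getErrorClass(self):
        return self._error_class


class AnalysisException(PySparkException):
    """Illustrative JVM-originated error, now part of the hierarchy."""


def handle(fn):
    # One catch site handles every PySpark error consistently.
    try:
        fn()
        return None
    except PySparkException as e:
        return e.getErrorClass()
```

Without a shared base, callers would need a separate `except` clause per JVM exception type; with it, error-class-based handling works across the board.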
[GitHub] [spark] zhengruifeng opened a new pull request, #39590: [SPARK-42079][CONNECT][PYTHON] Rename proto messages for `toDF` and `withColumnsRenamed`
zhengruifeng opened a new pull request, #39590: URL: https://github.com/apache/spark/pull/39590

### What changes were proposed in this pull request?
Rename proto messages for `toDF` and `withColumnsRenamed`

### Why are the changes needed?
to be consistent with other messages: we usually name the proto messages after the corresponding logical plan or DataFrame API.

### Does this PR introduce _any_ user-facing change?
no, dev-only

### How was this patch tested?
existing UT
[GitHub] [spark] cloud-fan commented on a diff in pull request #39558: [SPARK-41982][SQL] Partitions of type string should not be treated as numeric types
cloud-fan commented on code in PR #39558: URL: https://github.com/apache/spark/pull/39558#discussion_r1070835545 ## sql/core/src/test/scala/org/apache/spark/sql/SQLInsertTestSuite.scala: ## @@ -351,6 +351,36 @@ trait SQLInsertTestSuite extends QueryTest with SQLTestUtils { } } } + + test("SPARK-41982: When the partition is not enclosed with quotation marks, " + +"the partition type is string and keepPartitionSpecAsStringLiteral is enabled, " + +"we need to treat it as a string type partition") { Review Comment: can we shorten the test name?
[GitHub] [spark] cloud-fan commented on a diff in pull request #39558: [SPARK-41982][SQL] Partitions of type string should not be treated as numeric types
cloud-fan commented on code in PR #39558: URL: https://github.com/apache/spark/pull/39558#discussion_r1070835291 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala: ## @@ -363,7 +363,8 @@ class SparkSqlAstBuilder extends AstBuilder { * Convert a constants list into a String sequence. */ override def visitConstantList(ctx: ConstantListContext): Seq[String] = withOrigin(ctx) { -ctx.constant.asScala.map(v => visitStringConstant(v, legacyNullAsString = false)).toSeq +ctx.constant.asScala.map(v => visitStringConstant(v, Review Comment: can we add default parameter value for `visitStringConstant`?
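The reviewer's suggestion (a default parameter value for `visitStringConstant`) is a Scala change, but the idea is language-agnostic: give the flag a default so existing call sites need not pass it explicitly. A Python illustration (the function name and semantics here are adapted loosely from the diff, not a real pyspark API):

```python
def visit_string_constant(value, legacy_null_as_string=False):
    """Illustrative default-parameter version: callers that don't care
    about the legacy null behavior can simply omit the flag."""
    if value is None:
        return "null" if legacy_null_as_string else None
    return str(value)
```

Call sites that previously wrote `visit_string_constant(v, legacy_null_as_string=False)` can then shrink to `visit_string_constant(v)` without changing behavior.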
[GitHub] [spark] cloud-fan commented on a diff in pull request #39558: [SPARK-41982][SQL] Partitions of type string should not be treated as numeric types
cloud-fan commented on code in PR #39558: URL: https://github.com/apache/spark/pull/39558#discussion_r1070835079 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -4059,6 +4059,15 @@ object SQLConf { .booleanConf .createWithDefault(true) + val LEGACY_KEEP_PARTITION_SPEC_AS_STRING_LITERAL = Review Comment: Can we put it near LEGACY_PARSE_NULL_PARTITION_SPEC_AS_STRING_LITERAL and follow its config doc as well?
[GitHub] [spark] dongjoon-hyun opened a new pull request, #39589: [MINOR][CORE][TESTS] Suppress unchecked warnings in OneForOneStreamManagerSuite
dongjoon-hyun opened a new pull request, #39589: URL: https://github.com/apache/spark/pull/39589 ### What changes were proposed in this pull request? This PR aims to suppress the two test code compilation warnings in `network-common` module. ``` [warn] /Users/dongjoon/APACHE/spark-merge/common/network-common/src/test/java/org/apache/spark/network/server/OneForOneStreamManagerSuite.java:105:1: [unchecked] unchecked conversion [warn] Iterator buffers = Mockito.mock(Iterator.class); [warn] ^ required: Iterator [warn] found:Iterator [warn] /Users/dongjoon/APACHE/spark-merge/common/network-common/src/test/java/org/apache/spark/network/server/OneForOneStreamManagerSuite.java:111:1: [unchecked] unchecked conversion [warn] Iterator buffers2 = Mockito.mock(Iterator.class); [warn]^ required: Iterator [warn] found:Iterator ``` ### Why are the changes needed? To suppress warnings in `common` directory. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually check the output with the following. ``` $ build/sbt "network-common/test" ```
[GitHub] [spark] cloud-fan commented on a diff in pull request #39558: [SPARK-41982][SQL] Partitions of type string should not be treated as numeric types
cloud-fan commented on code in PR #39558: URL: https://github.com/apache/spark/pull/39558#discussion_r1070834762 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -507,14 +507,19 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit ctx: PartitionSpecContext): Map[String, Option[String]] = withOrigin(ctx) { val legacyNullAsString = conf.getConf(SQLConf.LEGACY_PARSE_NULL_PARTITION_SPEC_AS_STRING_LITERAL) +val keepPartitionTypeAsString = Review Comment: ```suggestion val keepPartitionSpecAsString = ```
[GitHub] [spark] grundprinzip commented on a diff in pull request #39585: [WIP] Unregistered Python UDF in Spark Connect
grundprinzip commented on code in PR #39585: URL: https://github.com/apache/spark/pull/39585#discussion_r1070822033 ## python/pyspark/sql/connect/udf.py: ## @@ -0,0 +1,184 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +""" +User-defined function related classes and functions +""" +import functools +from typing import Callable, Any, TYPE_CHECKING, Optional + +from pyspark import cloudpickle +from pyspark.sql.connect.expressions import ColumnReference, PythonFunction, PythonUDF +from pyspark.sql.connect.column import Column +from pyspark.sql.types import DataType + + +if TYPE_CHECKING: +from pyspark.sql.connect._typing import ( +ColumnOrName, +DataTypeOrString, +UserDefinedFunctionLike, +) +from pyspark.sql.types import StringType +from pyspark.sql.session import SparkSession + + +def _wrap_function( Review Comment: is this still called? Sincere the client does not have a JVM, it should not. ## connector/connect/common/src/main/protobuf/spark/connect/expressions.proto: ## @@ -216,6 +218,19 @@ message Expression { bool is_user_defined_function = 4; } + message PythonUDF { Review Comment: I'm wondering if it makes sense to generalize this message to something like "session UDF" in which there is a oneof for the different languages. 
This will avoid logical duplication for the Scala UDF work (and other languages). @hvanhovell what's your take? ## connector/connect/common/src/main/protobuf/spark/connect/expressions.proto: ## @@ -43,6 +43,8 @@ message Expression { Window window = 11; UnresolvedExtractValue unresolved_extract_value = 12; UpdateFields update_fields = 13; +PythonFunction python_function = 14; Review Comment: It looks like this message is never used outside the Python UDF so it should not be listed here.
[GitHub] [spark] tedyu closed pull request #39586: [MINOR] Drop saslTimeoutSeen
tedyu closed pull request #39586: [MINOR] Drop saslTimeoutSeen URL: https://github.com/apache/spark/pull/39586
[GitHub] [spark] databricks-david-lewis commented on pull request #39488: [SPARK-41970] Introduce SparkPath for typesafety
databricks-david-lewis commented on PR #39488: URL: https://github.com/apache/spark/pull/39488#issuecomment-1383440531 Thanks for the comment @JoshRosen! You put it very well. One of the other places I wanted to use `SparkPath` was in the streaming interfaces (`FileStreamSink`, `FileStreamSource`, `SinkFileStatus`, and maybe more). I also wanted to standardize on the currently expected behavior of users providing hadoop-path-strings in SQL statements (table locations, SQLOnFiles, DataFrame-Reader/-Writer, etc). Right now they all (I think) end up in a new Path(_), but could benefit from type-safety. @mridulm , this PR is ready for review I think. Please take a look. I think the sooner we can get it into a Spark minor version the better. Even if that means we are incrementally changing the internal interfaces in subsequent minor versions.
[GitHub] [spark] cloud-fan commented on a diff in pull request #37525: [WIP][SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases i
cloud-fan commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1070800149 ## sql/core/src/main/scala/org/apache/spark/sql/execution/AliasAwareOutputExpression.scala: ## @@ -16,24 +16,53 @@ */ package org.apache.spark.sql.execution -import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, NamedExpression, SortOrder} -import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning, PartitioningCollection, UnknownPartitioning} +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Expression, NamedExpression, SortOrder} +import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, PartitioningCollection, UnknownPartitioning} +import org.apache.spark.sql.internal.SQLConf /** * A trait that provides functionality to handle aliases in the `outputExpressions`. */ trait AliasAwareOutputExpression extends UnaryExecNode { protected def outputExpressions: Seq[NamedExpression] - private lazy val aliasMap = outputExpressions.collect { -case a @ Alias(child, _) => child.canonicalized -> a.toAttribute - }.toMap + private lazy val aliasMap = { +val aliases = mutable.Map[Expression, mutable.ListBuffer[Attribute]]() +// Add aliases to the map. If multiple alias is defined for a source attribute then add all. +outputExpressions.foreach { + case a @ Alias(child, _) => +// This prepend is needed to make the first element of the `ListBuffer` point to the last +// occurrence of an aliased child. 
This is to keep the previous behavior: +// - when we return `Partitioning`s in `outputPartitioning()`, the first should be same as +// it was previously +// - when we return a `SortOrder` in `outputOrdering()`, it should be should be same as +// previously +a.toAttribute +=: aliases.getOrElseUpdate(child.canonicalized, mutable.ListBuffer.empty) + case _ => +} +// Append identity mapping of an attribute to the map if both the attribute and its aliased +// version can be found in `outputExpressions`. +outputExpressions.foreach { + case a: Attribute if aliases.contains(a.canonicalized) => aliases(a.canonicalized) += a + case _ => +} +aliases + } protected def hasAlias: Boolean = aliasMap.nonEmpty - protected def normalizeExpression(exp: Expression): Expression = { + // This function returns a limited number of alternatives to mapped `exp`. + protected def normalizeExpression(exp: Expression): Seq[Expression] = { +exp.multiTransform { + case e: Expression if aliasMap.contains(e.canonicalized) => aliasMap(e.canonicalized).toSeq Review Comment: Yea the key is that `multiTransform` allows you to return the original expr so that it can be expanded again with other mappings. ``` input.multiTransform { case e if exprAliasMap.contains(e) => e +: exprAliasMap(e) case e if attrAliasMap(e) => attrAliasMap(e) } ``` @peter-toth do you agree that it's better to let the caller decide if the original expr should be included or not? It seems currently the framework always include the original expr.
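The behavior debated above (a transform that expands a matched sub-expression into several alternatives, optionally keeping the original) can be sketched with a toy multi-transform over tuple-based expression trees. This is a simplification for illustration, not Catalyst's actual `multiTransform` API:

```python
from itertools import product

def multi_transform(expr, rule):
    """Return every rewrite of `expr`. `rule(e)` returns a list of
    alternatives for `e` (which may include `e` itself to keep the
    original), or None when `e` is unmatched and its children should
    be visited instead."""
    alts = rule(expr)
    if alts is not None:
        return alts
    if isinstance(expr, tuple):  # an operator node: (op, child1, child2, ...)
        op, *children = expr
        child_alts = [multi_transform(c, rule) for c in children]
        # One candidate per combination of child alternatives.
        return [(op,) + combo for combo in product(*child_alts)]
    return [expr]  # a leaf with no alternatives

# Alias map in the spirit of the discussion: attribute -> aliased versions.
aliases = {"a": ["x", "y"]}

# Expanding `a + b` yields one candidate per alias of `a`; a rule that
# returned `[e] + aliases[e]` would additionally keep `a + b` itself.
out = multi_transform(("+", "a", "b"), lambda e: aliases.get(e))
print(out)  # [('+', 'x', 'b'), ('+', 'y', 'b')]
```

Whether the original expression is kept is decided entirely by the rule, which is the caller-controlled behavior cloud-fan argues for.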
[GitHub] [spark] databricks-david-lewis commented on a diff in pull request #39488: [SPARK-41970] Introduce SparkPath for typesafety
databricks-david-lewis commented on code in PR #39488: URL: https://github.com/apache/spark/pull/39488#discussion_r1070796985 ## core/src/main/scala/org/apache/spark/paths/SparkPath.scala: ## @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.paths + +import java.net.URI + +import org.apache.hadoop.fs.{FileStatus, Path} + +/** + * A SparkPath is the result of a URI.toString call. + */ +case class SparkPath private (private val underlying: String) { + def uriEncoded: String = underlying + def toUri: URI = new URI(underlying) + def toPath: Path = new Path(toUri) + override def toString: String = underlying +} + +object SparkPath { + def fromPathString(str: String): SparkPath = fromPath(new Path(str)) Review Comment: I have revamped the comments in this class, please let me know what you think. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
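The pitfall `SparkPath` guards against is that a raw path string and its URI-encoded form diverge as soon as the path contains special characters, so code must know which form a given `String` holds. A small standard-library illustration of the divergence (Python here, not Spark's Hadoop `Path`/`URI` handling):

```python
from urllib.parse import quote, unquote

raw_path = "/tmp/dir with a space/part-00000"

# URI-encoding the raw path escapes the spaces...
encoded = quote(raw_path)
print(encoded)  # /tmp/dir%20with%20a%20space/part-00000

# ...so encoding an already-encoded string (or decoding an already-raw
# one) silently corrupts the path: '%' itself gets re-escaped.
assert unquote(encoded) == raw_path
assert quote(encoded) != encoded
```

Wrapping one of the two forms in a dedicated type means the compiler, rather than the reader, tracks which encoding a string is in.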
[GitHub] [spark] ulysses-you commented on pull request #39571: [SPARK-42064][SQL] Implement bloom filter join hint
ulysses-you commented on PR #39571: URL: https://github.com/apache/spark/pull/39571#issuecomment-1383430507 I mean something like https://github.com/apache/spark/pull/32355, e.g. we cannot prune the left side for a left outer join, so it should show logs if the bloom_filter_join hint points to the left. There are a lot of limits that prevent injecting a runtime filter even when we have the bloom_filter_hint.
[GitHub] [spark] cloud-fan commented on pull request #35969: [SPARK-38651][SQL] Add `spark.sql.legacy.allowEmptySchemaWrite`
cloud-fan commented on PR #35969: URL: https://github.com/apache/spark/pull/35969#issuecomment-1383428417 sorry for the late review. If this is a valid use case, shall we just allow it in certain file formats like ORC? We can pass the entire query schema to `FileFormat.supportsDataType` and only fail empty `StructType` if we are sure the format doesn't support it (parquet is one of them).
[GitHub] [spark] databricks-david-lewis commented on a diff in pull request #39488: [SPARK-41970] Introduce SparkPath for typesafety
databricks-david-lewis commented on code in PR #39488: URL: https://github.com/apache/spark/pull/39488#discussion_r1070791606 ## common/network-common/src/main/java/org/apache/spark/network/util/JavaUtils.java: ## @@ -383,6 +383,7 @@ public static File createTempDir() throws IOException { public static File createTempDir(String root, String namePrefix) throws IOException { if (root == null) root = System.getProperty("java.io.tmpdir"); if (namePrefix == null) namePrefix = "spark"; +namePrefix = namePrefix + " with a space"; Review Comment: In my brief search I found that they all came to this method in the end (withTempDir, etc). I'm actually undoing this change for now, because it causes too many failures and I don't have the bandwidth to fix the tests right now. Here is a PR in my branch where I made only this change: https://github.com/databricks-david-lewis/spark-opensource/pull/2. I am hoping to fix the problems with spark's tests, and then propose that as a PR. ## common/network-common/src/main/java/org/apache/spark/network/util/JavaUtils.java: ## @@ -383,6 +383,7 @@ public static File createTempDir() throws IOException { public static File createTempDir(String root, String namePrefix) throws IOException { if (root == null) root = System.getProperty("java.io.tmpdir"); if (namePrefix == null) namePrefix = "spark"; +namePrefix = namePrefix + " with a space"; Review Comment: In my brief search I found that they all came to this method in the end (withTempDir, etc). I'm actually undoing this change for now, because it causes too many failures and I don't have the bandwidth to fix the tests right now. Here is a PR in my branch where I made only this change: https://github.com/databricks-david-lewis/spark-opensource/pull/2. I am hoping to fix the problems with spark's tests, and then propose that as a PR, but I don't know when I will be able to. -- This is an automated message from the Apache Git Service. 
[GitHub] [spark] zhengruifeng opened a new pull request, #39588: [SPARK-42077][CONNECT][PYTHON] Literal should throw TypeError for unsupported DataType
zhengruifeng opened a new pull request, #39588: URL: https://github.com/apache/spark/pull/39588 ### What changes were proposed in this pull request? Make `Literal` throw TypeError for unsupported DataType ### Why are the changes needed? Literal should throw TypeError for unsupported DataType ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? updated UT
[GitHub] [spark] databricks-david-lewis commented on a diff in pull request #39488: [SPARK-41970] Introduce SparkPath for typesafety
databricks-david-lewis commented on code in PR #39488: URL: https://github.com/apache/spark/pull/39488#discussion_r1070790755 ## core/src/main/scala/org/apache/spark/paths/SparkPath.scala: ## @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.paths + +import java.net.URI + +import org.apache.hadoop.fs.{FileStatus, Path} + +/** + * A SparkPath is the result of a URI.toString call. + */ +case class SparkPath private (private val underlying: String) { + def uriEncoded: String = underlying Review Comment: I don't feel strongly. I went with "uri" because it was from a `java.net.URI`. I will change it to `urlEncoded`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #39131: [SPARK-41162][SQL] Fix anti- and semi-join for self-join with aggregations
cloud-fan commented on code in PR #39131: URL: https://github.com/apache/spark/pull/39131#discussion_r1070790235 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushDownLeftSemiAntiJoin.scala: ## @@ -61,9 +61,10 @@ object PushDownLeftSemiAntiJoin extends Rule[LogicalPlan] } // LeftSemi/LeftAnti over Aggregate, only push down if join can be planned as broadcast join. -case join @ Join(agg: Aggregate, rightOp, LeftSemiOrAnti(_), _, _) +case join @ Join(agg: Aggregate, rightOp, LeftSemiOrAnti(_), joinCond, _) if agg.aggregateExpressions.forall(_.deterministic) && agg.groupingExpressions.nonEmpty && !agg.aggregateExpressions.exists(ScalarSubquery.hasCorrelatedScalarSubquery) && + canPushThroughCondition(agg.children, joinCond, rightOp) && Review Comment: This is probably an existing issue. If the `Aggregate` does `a as x, b as y`, then the join condition won't reference `a` and `b` and this check is useless.
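The gap cloud-fan points out is that the check compares the join condition's references against the aggregate's *input* attributes, while the condition is written against the aggregate's *output* aliases. A toy illustration with plain Python sets (the attribute names are hypothetical, and this is not Catalyst's actual reference tracking):

```python
# Aggregate projects `a AS x, b AS y`: the output aliases hide the inputs.
agg_input_attrs = {"a", "b"}
alias_map = {"x": "a", "y": "b"}  # output alias -> source attribute

# The join condition references the aggregate's output, not its input.
join_cond_refs = {"x", "y"}

# Checking raw references against the inputs matches nothing, so a
# guard based on this overlap is vacuously satisfied ("useless"):
print(join_cond_refs & agg_input_attrs)  # set()

# Resolving aliases first makes the real overlap visible again:
resolved = {alias_map.get(r, r) for r in join_cond_refs}
print(sorted(resolved & agg_input_attrs))  # ['a', 'b']
```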
[GitHub] [spark] zhengruifeng opened a new pull request, #39587: [SPARK-42076][CONNECT][PYTHON] Factor data conversion `arrow -> rows` out to `conversion.py`
zhengruifeng opened a new pull request, #39587: URL: https://github.com/apache/spark/pull/39587 ### What changes were proposed in this pull request? Factor data conversion `arrow -> rows` out to `conversion.py` ### Why are the changes needed? to make it clearer, and also more consistent with PySpark ### Does this PR introduce _any_ user-facing change? no, refactor-only ### How was this patch tested? existing UT
[GitHub] [spark] cloud-fan commented on a diff in pull request #39564: [SPARK-41990][SQL] Use `FieldReference.column` instead of `apply` in V1 to V2 filter conversion
cloud-fan commented on code in PR #39564: URL: https://github.com/apache/spark/pull/39564#discussion_r107072 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala: ## @@ -1963,4 +1977,9 @@ class JDBCSuite extends QueryTest with SharedSparkSession { } } } + + test("SPARK-41990: Filter with composite name") { +val df = sql("SELECT * FROM composite_name WHERE `last name` = 'smith'") Review Comment: is it because we always disable nested predicate pushdown for JDBC v1?
[GitHub] [spark] cloud-fan commented on a diff in pull request #39564: [SPARK-41990][SQL] Use `FieldReference.column` instead of `apply` in V1 to V2 filter conversion
cloud-fan commented on code in PR #39564: URL: https://github.com/apache/spark/pull/39564#discussion_r1070788648 ## sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala: ## @@ -74,6 +75,15 @@ sealed abstract class Filter { * Converts V1 filter to V2 filter */ private[sql] def toV2: Predicate + + protected def toV2Column(attribute: String): NamedReference = { +try { + FieldReference(attribute) Review Comment: We should add some comments to explain it.
[GitHub] [spark] wangyum commented on pull request #39571: [SPARK-42064][SQL] Implement bloom filter join hint
wangyum commented on PR #39571: URL: https://github.com/apache/spark/pull/39571#issuecomment-1383418899 > Make sense to me. Shall we show some warn logs if the hint can not work ? We do the similar things for other join hints. Please see `ResolveJoinStrategyHints`.
[GitHub] [spark] yaooqinn commented on a diff in pull request #39555: [SPARK-42051][SQL] Codegen Support for HiveGenericUDF
yaooqinn commented on code in PR #39555: URL: https://github.com/apache/spark/pull/39555#discussion_r1070776840 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala: ## @@ -128,11 +129,10 @@ private[hive] class DeferredObjectAdapter(oi: ObjectInspector, dataType: DataTyp override def get(): AnyRef = wrapper(func()).asInstanceOf[AnyRef] } -private[hive] case class HiveGenericUDF( +case class HiveGenericUDF( Review Comment: Seems quite complicated to handle the hive value mapping and function wrapping
[GitHub] [spark] LuciferYang commented on pull request #39584: [SPARK-42074][SQL] Enable `KryoSerializer` in `TPCDSQueryBenchmark` to enforce SQL class registration
LuciferYang commented on PR #39584: URL: https://github.com/apache/spark/pull/39584#issuecomment-1383392745 Late LGTM
[GitHub] [spark] HyukjinKwon commented on pull request #38376: [SPARK-40817][K8S] `spark.files` should preserve remote files
HyukjinKwon commented on PR #38376: URL: https://github.com/apache/spark/pull/38376#issuecomment-1383390931 > @antonipp Is this related to https://github.com/apache/spark/pull/39488 ? I don't think this is related. Maybe `isLocalDependency` itself is related because it assumes that `null` in scheme is a local file (which is wrong I guess? because it assumes the default URL scheme is a file) but this PR itself isn't related to that.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38376: [SPARK-40817][K8S] `spark.files` should preserve remote files
HyukjinKwon commented on code in PR #38376: URL: https://github.com/apache/spark/pull/38376#discussion_r1070769064 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala: ## @@ -168,27 +168,27 @@ private[spark] class BasicDriverFeatureStep(conf: KubernetesDriverConf) MEMORY_OVERHEAD_FACTOR.key -> defaultOverheadFactor.toString) // try upload local, resolvable files to a hadoop compatible file system Seq(JARS, FILES, ARCHIVES, SUBMIT_PYTHON_FILES).foreach { key => - val uris = conf.get(key).filter(uri => KubernetesUtils.isLocalAndResolvable(uri)) + val (localUris, remoteUris) = +conf.get(key).partition(uri => KubernetesUtils.isLocalAndResolvable(uri)) val value = { if (key == ARCHIVES) { Review Comment: In case of archives, it should also resolve the URL properly for `remoteUris` too in order to remove fragments (e.g., `file:///tmp/tmp4542734800151332666.txt.tar.gz#test_tar_gz`). Otherwise, it would fail to upload the file.
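For archives, the URL carries a fragment naming the unpack directory (the `#test_tar_gz` suffix in the example above), and that fragment has to be split off before the file can be uploaded. The standard-library `urldefrag` shows the split; this is an illustration in Python, not Spark's own URL resolution code:

```python
from urllib.parse import urldefrag

# Archive URI from the review comment: the fragment names the
# directory the archive is unpacked into, not part of the file path.
archive = "file:///tmp/tmp4542734800151332666.txt.tar.gz#test_tar_gz"

# Strip the fragment before handing the URI to an uploader.
url, fragment = urldefrag(archive)
print(url)       # file:///tmp/tmp4542734800151332666.txt.tar.gz
print(fragment)  # test_tar_gz
```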
[GitHub] [spark] ulysses-you commented on pull request #39571: [SPARK-42064][SQL] Implement bloom filter join hint
ulysses-you commented on PR #39571: URL: https://github.com/apache/spark/pull/39571#issuecomment-1383386714 Make sense to me. Shall we show some warn logs if the hint can not work ? We do the similar things for other join hints.
[GitHub] [spark] tedyu opened a new pull request, #39586: [MINOR] Drop saslTimeoutSeen
tedyu opened a new pull request, #39586: URL: https://github.com/apache/spark/pull/39586 ### What changes were proposed in this pull request? This PR proposes to drop `saslTimeoutSeen` field from RetryingBlockTransferor. ### Why are the changes needed? This field doesn't affect the return value for `shouldRetry`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suite.
[GitHub] [spark] ulysses-you commented on a diff in pull request #37525: [WIP][SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases
ulysses-you commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1070753260

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:

@@ -70,53 +66,16 @@ trait AliasAwareOutputExpression extends SQLConfHelper {
   protected def normalizeExpression(
       expr: Expression,
       pruneFunc: (Expression, AttributeSet) => Option[Expression]): Seq[Expression] = {
-    val normalizedCandidates = new mutable.HashSet[Expression]()
-    normalizedCandidates.add(expr)
     val outputSet = AttributeSet(outputExpressions.map(_.toAttribute))
-
-    def pruneCandidate(candidate: Expression): Option[Expression] = {
+    expr.multiTransform {

Review Comment: According to my usage, `multiTransform` should support at least 3 kinds of pruning:
1. a max limit on the size of the returned result
2. an eager pruning function that can:
   - prune a result whose references are not a subset of the output
   - prune an intermediate result if the alias map does not contain any other sub-expression
   - prune a sub-expression, e.g. `PartitionCollection(a, b)` -> `PartitionCollection(a)` if `b` is not a subset of the output

If all these requirements can be met, I think it's good to switch to `multiTransform`.
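A toy model of the pruning hooks the comment asks for: a cap on the number of surviving candidates, plus an eager filter that drops candidates whose references are not a subset of the output attributes. This is a hypothetical Python illustration, not Spark's actual `multiTransform` API; expressions are modeled as `(name, referenced_attributes)` pairs:

```python
def prune_candidates(candidates, output_set, max_candidates, prune=None):
    """candidates: iterable of (expr_name, set_of_referenced_attributes).
    Keeps only candidates whose references are a subset of output_set,
    applies an optional extra pruning predicate, and stops eagerly at
    max_candidates rather than materializing everything first."""
    kept = []
    for name, refs in candidates:
        if not refs <= output_set:  # references must be covered by the output
            continue
        if prune is not None and prune(name, refs):
            continue
        kept.append((name, refs))
        if len(kept) >= max_candidates:  # enforce the size limit eagerly
            break
    return kept
```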
[GitHub] [spark] xinrong-meng opened a new pull request, #39585: [WIP] Unregistered Python UDF in Spark Connect
xinrong-meng opened a new pull request, #39585: URL: https://github.com/apache/spark/pull/39585

### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
[GitHub] [spark] HyukjinKwon closed pull request #39387: [SPARK-41586][PYTHON] Introduce `pyspark.errors` and error classes for PySpark.
HyukjinKwon closed pull request #39387: [SPARK-41586][PYTHON] Introduce `pyspark.errors` and error classes for PySpark. URL: https://github.com/apache/spark/pull/39387
[GitHub] [spark] HyukjinKwon commented on pull request #39387: [SPARK-41586][PYTHON] Introduce `pyspark.errors` and error classes for PySpark.
HyukjinKwon commented on PR #39387: URL: https://github.com/apache/spark/pull/39387#issuecomment-1383328922 Merged to master.
[GitHub] [spark] beliefer commented on pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement `DataFrame.observe`
beliefer commented on PR #39091: URL: https://github.com/apache/spark/pull/39091#issuecomment-1383316479 ping @hvanhovell
[GitHub] [spark] HyukjinKwon commented on pull request #39572: [SPARK-39979][SQL] Add option to use large variable width vectors for arrow UDF operations
HyukjinKwon commented on PR #39572: URL: https://github.com/apache/spark/pull/39572#issuecomment-1383312668 cc @BryanCutler @viirya @ueshin FYI
[GitHub] [spark] HyukjinKwon closed pull request #39570: [SPARK-41903][CONNECT][PYTHON] `Literal` should support 1-dim ndarray
HyukjinKwon closed pull request #39570: [SPARK-41903][CONNECT][PYTHON] `Literal` should support 1-dim ndarray URL: https://github.com/apache/spark/pull/39570
[GitHub] [spark] HyukjinKwon commented on pull request #39570: [SPARK-41903][CONNECT][PYTHON] `Literal` should support 1-dim ndarray
HyukjinKwon commented on PR #39570: URL: https://github.com/apache/spark/pull/39570#issuecomment-1383310216 Merged to master.
[GitHub] [spark] HyukjinKwon closed pull request #39583: [SPARK-42073][CONNECT][PYTHON][TESTS] Enable tests in common/test_parity_serde, common/test_parity_types
HyukjinKwon closed pull request #39583: [SPARK-42073][CONNECT][PYTHON][TESTS] Enable tests in common/test_parity_serde, common/test_parity_types URL: https://github.com/apache/spark/pull/39583
[GitHub] [spark] wangyum commented on pull request #39571: [SPARK-42064][SQL] Implement bloom filter join hint
wangyum commented on PR #39571: URL: https://github.com/apache/spark/pull/39571#issuecomment-1383309269 @sigmod @cloud-fan
[GitHub] [spark] HyukjinKwon commented on pull request #39583: [SPARK-42073][CONNECT][PYTHON][TESTS] Enable tests in common/test_parity_serde, common/test_parity_types
HyukjinKwon commented on PR #39583: URL: https://github.com/apache/spark/pull/39583#issuecomment-1383309267 Merged to master.
[GitHub] [spark] mridulm commented on pull request #39580: [MINOR] [FOLLOW-UP] Throw Timeout exception when SASL Request Retry is enabled
mridulm commented on PR #39580: URL: https://github.com/apache/spark/pull/39580#issuecomment-1383309053 Thanks for the interest, @tedyu! Please feel free to ping me if there are other improvements or bug fixes you'd like to contribute.
[GitHub] [spark] HyukjinKwon closed pull request #39582: [SPARK-41921][CONNECT][TESTS][FOLLOW-UP] Enable doctests in connect/column and connect/readwriter
HyukjinKwon closed pull request #39582: [SPARK-41921][CONNECT][TESTS][FOLLOW-UP] Enable doctests in connect/column and connect/readwriter URL: https://github.com/apache/spark/pull/39582
[GitHub] [spark] HyukjinKwon commented on pull request #39582: [SPARK-41921][CONNECT][TESTS][FOLLOW-UP] Enable doctests in connect/column and connect/readwriter
HyukjinKwon commented on PR #39582: URL: https://github.com/apache/spark/pull/39582#issuecomment-1383303216 Actually https://github.com/apache/spark/pull/39581 had the same change. I merged that instead.
[GitHub] [spark] HyukjinKwon commented on pull request #39582: [SPARK-41921][CONNECT][TESTS][FOLLOW-UP] Enable doctests in connect/column and connect/readwriter
HyukjinKwon commented on PR #39582: URL: https://github.com/apache/spark/pull/39582#issuecomment-1383301651 Merged to master.
[GitHub] [spark] HyukjinKwon closed pull request #39581: [SPARK-42011][SPARK-42012][CONNECT][PYTHON][TESTS][FOLLOW-UP] Enable csv, orc tests in connect/test_parity_datasources.py
HyukjinKwon closed pull request #39581: [SPARK-42011][SPARK-42012][CONNECT][PYTHON][TESTS][FOLLOW-UP] Enable csv, orc tests in connect/test_parity_datasources.py URL: https://github.com/apache/spark/pull/39581
[GitHub] [spark] HyukjinKwon commented on pull request #39581: [SPARK-42011][SPARK-42012][CONNECT][PYTHON][TESTS][FOLLOW-UP] Enable csv, orc tests in connect/test_parity_datasources.py
HyukjinKwon commented on PR #39581: URL: https://github.com/apache/spark/pull/39581#issuecomment-1383301343 Merged to master.
[GitHub] [spark] github-actions[bot] closed pull request #37822: [SPARK-40381][DEPLOY] Support standalone worker recommission
github-actions[bot] closed pull request #37822: [SPARK-40381][DEPLOY] Support standalone worker recommission URL: https://github.com/apache/spark/pull/37822
[GitHub] [spark] github-actions[bot] closed pull request #38073: [SPARK-40632][SQL] Do not inject runtime filter if join condition reference non simple expression
github-actions[bot] closed pull request #38073: [SPARK-40632][SQL] Do not inject runtime filter if join condition reference non simple expression URL: https://github.com/apache/spark/pull/38073
[GitHub] [spark] github-actions[bot] closed pull request #37809: [SPARK-40355][SQL] Improve pushdown for orc & parquet when cast scenario
github-actions[bot] closed pull request #37809: [SPARK-40355][SQL] Improve pushdown for orc & parquet when cast scenario URL: https://github.com/apache/spark/pull/37809
[GitHub] [spark] github-actions[bot] commented on pull request #38092: [SPARK-40647][CORE] DAGScheduler should fail job until all related running tasks have been killed
github-actions[bot] commented on PR #38092: URL: https://github.com/apache/spark/pull/38092#issuecomment-1383298058 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] closed pull request #37996: [SPARK-40558][SQL] Add Reusable Exchange in Bloom creation side plan
github-actions[bot] closed pull request #37996: [SPARK-40558][SQL] Add Reusable Exchange in Bloom creation side plan URL: https://github.com/apache/spark/pull/37996
[GitHub] [spark] dongjoon-hyun commented on pull request #39584: [SPARK-42074][SQL] Enable `KryoSerializer` in `TPCDSQueryBenchmark` to enforce SQL class registration
dongjoon-hyun commented on PR #39584: URL: https://github.com/apache/spark/pull/39584#issuecomment-1383296513 Merged to master for Apache Spark 3.4.0.