[PR] [SPARK-48421][SQL] SPJ: Add documentation [spark]
szehon-ho opened a new pull request, #46745: URL: https://github.com/apache/spark/pull/46745

### What changes were proposed in this pull request?
Add docs for SPJ.

### Why are the changes needed?
There are no docs describing SPJ, even though it is mentioned in the migration notes: https://github.com/apache/spark/pull/46673

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Checked the new text.

### Was this patch authored or co-authored using generative AI tooling?
No
Re: [PR] [SPARK-48329][SQL] Enable `spark.sql.sources.v2.bucketing.pushPartValues.enabled` by default [spark]
szehon-ho commented on PR #46673: URL: https://github.com/apache/spark/pull/46673#issuecomment-2130960125 @dongjoon-hyun do you have any guidance on how we could create a new Spark 4.0+ tracking JIRA to link all the new SPJ items? I think it would be nice to have all of them in one list.
Re: [PR] [SPARK-48320][CORE][DOCS] Add external third-party ecosystem access guide to the `scala/java` doc [spark]
panbingkun commented on code in PR #46634: URL: https://github.com/apache/spark/pull/46634#discussion_r1614383361

## common/utils/src/main/scala/org/apache/spark/internal/README.md:

Review Comment: Okay, I have adopted the plan of deleting `README.md`.
Re: [PR] [SPARK-48394][CORE] Cleanup mapIdToMapIndex on mapoutput unregister [spark]
Ngone51 commented on PR #46706: URL: https://github.com/apache/spark/pull/46706#issuecomment-2130842872

@dongjoon-hyun @mridulm Sorry, can we make it a bug and backport it to the maintenance release branches? This actually caused an issue for us internally. I was pushing a quick fix before realizing it is the root cause. The issue leads to shuffle fetch failures and, ultimately, job failure. It happens this way:

1) Stage A computes partition P0 by task t1 (TID, a.k.a. mapId) on executor e1
2) Executor Y starts decommission
3) Executor Y reports a false-positive loss to the driver during its decommission
4) Stage B reuses the shuffle dependency of Stage A and computes partition P0 again by task t2 on executor e2
5) When task t2 finishes, we see two items ((t1 -> P0), (t2 -> P0)) for the same partition in `mapIdToMapIndex` but only one item (mapStatuses(P0)=MapStatus(t2, e2)) in `mapStatuses`.
6) Executor Y starts to migrate task t1's map status (to executor e3, for example) and calls `updateMapOutput` on the driver. Because of 5), we'd use the mapId (i.e., t1) to get the mapIndex (i.e., P0) and use P0 to get task t2's map status.
```scala
// updateMapOutput
val mapIndex = mapIdToMapIndex.get(mapId)
val mapStatusOpt = mapIndex.map(mapStatuses(_)).flatMap(Option(_))
```
7) Task t2's map status location would then be updated to executor e3, but it is in fact still located on executor e2. This finally leads to the fetch failure in the end.
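The sketch below (added for illustration) shows the kind of cleanup the PR title describes, using simplified stand-in structures rather than Spark's actual `MapOutputTracker`/`ShuffleStatus` internals; the class and method names are hypothetical. The idea is that unregistering a map output must also drop its stale `mapId` entry, so a later `updateMapOutput(mapId, ...)` cannot resolve an old `mapId` to a partition that has since been recomputed by a different task.

```scala
import scala.collection.concurrent.TrieMap

// Simplified stand-ins: mapStatuses is indexed by partition (map index),
// while mapIdToMapIndex maps a task's mapId (its TID) to that index.
case class MapStatusSketch(mapId: Long, execId: String)

class ShuffleStatusSketch(numPartitions: Int) {
  val mapStatuses = new Array[MapStatusSketch](numPartitions)
  val mapIdToMapIndex = TrieMap.empty[Long, Int]

  def addMapOutput(mapIndex: Int, status: MapStatusSketch): Unit = {
    mapStatuses(mapIndex) = status
    mapIdToMapIndex(status.mapId) = mapIndex
  }

  // Cleanup on unregister: drop the stale mapId entry together with the map
  // status, so (t1 -> P0) and (t2 -> P0) can never coexist after P0 has been
  // recomputed by a different task.
  def removeMapOutput(mapIndex: Int): Unit = {
    val old = mapStatuses(mapIndex)
    if (old != null) {
      mapIdToMapIndex.remove(old.mapId)
      mapStatuses(mapIndex) = null
    }
  }

  // Migration path from the scenario above: with the stale entry removed,
  // an update keyed by a dead task's mapId simply becomes a no-op.
  def updateMapOutput(mapId: Long, newExecId: String): Unit = {
    mapIdToMapIndex.get(mapId).foreach { mapIndex =>
      val status = mapStatuses(mapIndex)
      if (status != null && status.mapId == mapId) {
        mapStatuses(mapIndex) = status.copy(execId = newExecId)
      }
    }
  }
}
```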
Re: [PR] [SPARK-48320][CORE][DOCS] Add external third-party ecosystem access guide to the `scala/java` doc [spark]
mridulm commented on code in PR #46634: URL: https://github.com/apache/spark/pull/46634#discussion_r1614353283

## common/utils/src/main/scala/org/apache/spark/internal/README.md:

Review Comment: IIRC @panbingkun included the contents in the existing code, which is also a nice addition to the code documentation. IMO that is a better overall approach given this is internal to Spark.
[PR] [SPARK-48420][BUILD] Upgrade netty to `4.1.110.Final` [spark]
panbingkun opened a new pull request, #46744: URL: https://github.com/apache/spark/pull/46744

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-48239][INFRA][FOLLOWUP] install the missing `jq` library [spark]
cloud-fan closed pull request #46743: [SPARK-48239][INFRA][FOLLOWUP] install the missing `jq` library URL: https://github.com/apache/spark/pull/46743
Re: [PR] [SPARK-48239][INFRA][FOLLOWUP] install the missing `jq` library [spark]
cloud-fan commented on PR #46743: URL: https://github.com/apache/spark/pull/46743#issuecomment-2130735901 thanks for review, merging to master!
Re: [PR] [SPARK-47579][CORE][PART3] Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on PR #46724: URL: https://github.com/apache/spark/pull/46724#issuecomment-2130729394 @zeotuan I was aware of this PR when I worked on https://github.com/apache/spark/pull/46739, since you were not responding to the JIRA comment. Please resolve the merge conflicts and I will review and merge this one.
Re: [PR] [SPARK-47579][CORE][PART3] Spark core: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang closed pull request #46739: [SPARK-47579][CORE][PART3] Spark core: Migrate logInfo with variables to structured logging framework URL: https://github.com/apache/spark/pull/46739
Re: [PR] [SPARK-47579][CORE][PART3] Spark core: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on PR #46739: URL: https://github.com/apache/spark/pull/46739#issuecomment-2130728636 @cloud-fan @dongjoon-hyun thanks for the review. Merging to master.
Re: [PR] [SPARK-48239][INFRA][FOLLOWUP] install the missing `jq` library [spark]
cloud-fan commented on PR #46743: URL: https://github.com/apache/spark/pull/46743#issuecomment-2130727208 cc @HyukjinKwon @dongjoon-hyun
Re: [PR] [SPARK-48239][INFRA][FOLLOWUP] install the missing `jq` library [spark]
cloud-fan commented on code in PR #46743: URL: https://github.com/apache/spark/pull/46743#discussion_r1614307758

## dev/create-release/release-util.sh:
@@ -128,6 +128,9 @@ function get_release_info { RC_COUNT=1 fi + if [ "$GIT_BRANCH" = "master" ]; then +RELEASE_VERSION="$RELEASE_VERSION-preview1"

Review Comment: This is an unrelated change, but it gives a better prompt for preview releases.
[PR] [SPARK-48239][INFRA][FOLLOWUP] install the missing `jq` library [spark]
cloud-fan opened a new pull request, #46743: URL: https://github.com/apache/spark/pull/46743

### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/46534 . We missed the `jq` library, which is needed to create git tags.

### Why are the changes needed?
Fix a bug.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual.

### Was this patch authored or co-authored using generative AI tooling?
No
Re: [PR] [SPARK-48419][SQL] Foldable propagation replace foldable column shoul… [spark]
KnightChess closed pull request #46742: [SPARK-48419][SQL] Foldable propagation replace foldable column shoul… URL: https://github.com/apache/spark/pull/46742
Re: [PR] [SPARK-48415][PYTHON] Refactor `TypeName` to support parameterized datatypes [spark]
HyukjinKwon commented on PR #46738: URL: https://github.com/apache/spark/pull/46738#issuecomment-2130648241 I think we should list all behaviour changes. `typeName` is sort of an API too.
[PR] [SPARK-48419][SQL] Foldable propagation replace foldable column shoul… [spark]
KnightChess opened a new pull request, #46742: URL: https://github.com/apache/spark/pull/46742 …d use origin column name

### What changes were proposed in this pull request?
Fix the optimizer rule `FoldablePropagation` so that it keeps the original column name instead of changing it.

### Why are the changes needed?
Fix a bug.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT.

### Was this patch authored or co-authored using generative AI tooling?
No
Re: [PR] Collations proof of concept [spark]
github-actions[bot] commented on PR #44537: URL: https://github.com/apache/spark/pull/44537#issuecomment-2130548277 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [SPARK-47041] PushDownUtils should support not only FileScanBuilder but any SupportsPushDownCatalystFilters [spark]
github-actions[bot] commented on PR #45099: URL: https://github.com/apache/spark/pull/45099#issuecomment-2130548267 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [SPARK-47579][CORE][PART3] Migrate logInfo with variables to structured logging framework [spark]
zeotuan commented on PR #46724: URL: https://github.com/apache/spark/pull/46724#issuecomment-2130535309 @gengliangwang Please help review this. I will merge this after #46739
[PR] [SPARK-48416][SQL] Support related nested WITH expression [spark]
zml1206 opened a new pull request, #46741: URL: https://github.com/apache/spark/pull/46741

### What changes were proposed in this pull request?
Refactor the `RewriteWithExpression` logic to support related nested `WITH` expressions. Generated `Project` order:
1. internally nested `WITH` of the main expression's definitions
2. main expression
3. internally nested `WITH` of the main expression

### Why are the changes needed?
Optimize nested `WITH` to apply to more scenarios, such as predicate pushdown.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test.

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-48292[CORE] Revert [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status [spark]
viirya commented on PR #46696: URL: https://github.com/apache/spark/pull/46696#issuecomment-2130496440 Looks good to me.
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using a scala TreeMap (RB Tree) [spark]
GideonPotok closed pull request #46404: [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using a scala TreeMap (RB Tree) URL: https://github.com/apache/spark/pull/46404
Re: [PR] [SPARK-48394][CORE] Cleanup mapIdToMapIndex on mapoutput unregister [spark]
dongjoon-hyun commented on PR #46706: URL: https://github.com/apache/spark/pull/46706#issuecomment-2130495121 Merged to master only because this was defined as `Improvement`. (Screenshot: https://github.com/apache/spark/assets/9700541/e6901f63-cfb8-491b-9368-a840493befdc)
Re: [PR] [WIP] Don't review: E2e [spark]
GideonPotok closed pull request #46670: [WIP] Don't review: E2e URL: https://github.com/apache/spark/pull/46670
Re: [PR] [SPARK-48394][CORE] Cleanup mapIdToMapIndex on mapoutput unregister [spark]
dongjoon-hyun closed pull request #46706: [SPARK-48394][CORE] Cleanup mapIdToMapIndex on mapoutput unregister URL: https://github.com/apache/spark/pull/46706
Re: [PR] [SPARK-47579][CORE][PART3] Spark core: Migrate logInfo with variables to structured logging framework [spark]
dongjoon-hyun commented on PR #46739: URL: https://github.com/apache/spark/pull/46739#issuecomment-2130487265 Pending CIs.
Re: [PR] [SPARK-48407][SQL][DOCS] Teradata: Document Type Conversion rules between Spark SQL and teradata [spark]
dongjoon-hyun closed pull request #46728: [SPARK-48407][SQL][DOCS] Teradata: Document Type Conversion rules between Spark SQL and teradata URL: https://github.com/apache/spark/pull/46728
Re: [PR] [SPARK-48325][CORE] Always specify messages in ExecutorRunner.killProcess [spark]
dongjoon-hyun commented on PR #46641: URL: https://github.com/apache/spark/pull/46641#issuecomment-2130485187 Merged to master for Apache Spark 4.0.0. Thank you, @bozhang2820 and all.
Re: [PR] [SPARK-48325][CORE] Always specify messages in ExecutorRunner.killProcess [spark]
dongjoon-hyun closed pull request #46641: [SPARK-48325][CORE] Always specify messages in ExecutorRunner.killProcess URL: https://github.com/apache/spark/pull/46641
Re: [PR] [SPARK-48292[CORE] Revert [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status [spark]
cloud-fan commented on PR #46696: URL: https://github.com/apache/spark/pull/46696#issuecomment-2130363916 can we also revert https://github.com/apache/spark/pull/46562 in this PR?
Re: [PR] [SPARK-47008][CORE] Added Hadoops fileSystems hasPathCapability check to avoid FileNotFoundException(s) when using S3 Express One Zone Storage. [spark]
leovegas commented on code in PR #46678: URL: https://github.com/apache/spark/pull/46678#discussion_r1613994258

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala:
@@ -62,10 +62,10 @@ object OrcUtils extends Logging { val CATALYST_TYPE_ATTRIBUTE_NAME = "spark.sql.catalyst.type" def listOrcFiles(pathStr: String, conf: Configuration): Seq[Path] = { +// TODO: Check if the paths coming in are already qualified and simplify. val origPath = new Path(pathStr) val fs = origPath.getFileSystem(conf) val paths = SparkHadoopUtil.get.listLeafStatuses(fs, origPath) - .filterNot(_.isDirectory)

Review Comment: `listFiles` inside `listLeafStatuses` returns `FileStatus` entries for files only, so the removed `.filterNot(_.isDirectory)` was redundant.
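For context on the removed filter: a recursive leaf listing only ever returns file statuses, never directories, so filtering out directories afterwards is a no-op. The sketch below illustrates that idea only; it is not the actual `SparkHadoopUtil.listLeafStatuses` implementation, and the path in the usage comment is made up.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Recursively collect leaf FileStatus entries: directories are only descended
// into, never returned, so a trailing .filterNot(_.isDirectory) removes nothing.
def listLeafStatusesSketch(fs: FileSystem, base: Path): Seq[FileStatus] = {
  val (dirs, files) = fs.listStatus(base).toSeq.partition(_.isDirectory)
  files ++ dirs.flatMap(d => listLeafStatusesSketch(fs, d.getPath))
}

// Hypothetical usage:
// val path = new Path("/warehouse/orc_table")
// val fs = path.getFileSystem(new Configuration())
// val leaves = listLeafStatusesSketch(fs, path)  // files only, no directories
```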
Re: [PR] [SPARK-47008][CORE] Added Hadoops fileSystems hasPathCapability check to avoid FileNotFoundException(s) when using S3 Express One Zone Storage. [spark]
leovegas commented on code in PR #46678: URL: https://github.com/apache/spark/pull/46678#discussion_r1613992518

## sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala:
@@ -125,16 +125,4 @@ private[hive] object OrcFileOperator extends Logging { path: String, conf: Option[Configuration]): Option[StructObjectInspector] = { getFileReader(path, conf).map(_.getObjectInspector.asInstanceOf[StructObjectInspector]) } - - def listOrcFiles(pathStr: String, conf: Configuration): Seq[Path] = {

Review Comment: Removed this method because it is a duplicate of https://github.com/apache/spark/blob/99213d7123a88e1fc79bc73b898c4971c78f37ff/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L64
Re: [PR] [SPARK-48273][SQL] Fix late rewrite of PlanWithUnresolvedIdentifier [spark]
cloud-fan commented on code in PR #46580: URL: https://github.com/apache/spark/pull/46580#discussion_r1613945984

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveIdentifierClause.scala:
@@ -20,19 +20,23 @@ package org.apache.spark.sql.catalyst.analysis import org.apache.spark.sql.catalyst.expressions.{AliasHelper, EvalHelper, Expression} import org.apache.spark.sql.catalyst.parser.CatalystSqlParser import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan -import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor} import org.apache.spark.sql.catalyst.trees.TreePattern.UNRESOLVED_IDENTIFIER import org.apache.spark.sql.types.StringType /** * Resolves the identifier expressions and builds the original plans/expressions. */ -object ResolveIdentifierClause extends Rule[LogicalPlan] with AliasHelper with EvalHelper { +class ResolveIdentifierClause(earlyBatches: Seq[RuleExecutor[LogicalPlan]#Batch]) + extends Rule[LogicalPlan] with AliasHelper with EvalHelper { override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning( _.containsAnyPattern(UNRESOLVED_IDENTIFIER)) { case p: PlanWithUnresolvedIdentifier if p.identifierExpr.resolved => - p.planBuilder.apply(evalIdentifierExpr(p.identifierExpr)) + val executor = new RuleExecutor[LogicalPlan] {

Review Comment: We only need to create the `RuleExecutor` instance once in this rule.
Re: [PR] [SPARK-48380][CORE][WIP]: SerDeUtil.javaToPython to support batchSize parameter [spark]
JoshRosen commented on code in PR #46697: URL: https://github.com/apache/spark/pull/46697#discussion_r1613915291

## core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala:
@@ -104,12 +104,40 @@ private[spark] object SerDeUtil extends Logging { } } + /** + * Use a fixed batch size + */ + private[spark] class FixedBatchedPickler(iter: Iterator[Any], batch: Int) extends Iterator[Array[Byte]] {

Review Comment: Looks like Scalastyle is complaining about line length, so you might need to wrap this like
```suggestion
  private[spark] class FixedBatchedPickler(
      iter: Iterator[Any], batch: Int) extends Iterator[Array[Byte]] {
```
Re: [PR] [SPARK-48391][CORE]Using addAll instead of add function in fromAccumulatorInfos method of TaskMetrics Class [spark]
JoshRosen commented on code in PR #46705: URL: https://github.com/apache/spark/pull/46705#discussion_r1613904865

## core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:
@@ -328,16 +328,15 @@ private[spark] object TaskMetrics extends Logging { */ def fromAccumulators(accums: Seq[AccumulatorV2[_, _]]): TaskMetrics = { val tm = new TaskMetrics -for (acc <- accums) { +val (innerAccums, externalAccums) = accums. + partition(t => t.name.isDefined && tm.nameToAccums.contains(t.name.get))

Review Comment: Just thinking aloud here: Do we do a lot of intermediate collection allocations in this `.partition` operation itself? E.g. under the hood, would we be resizing an array builder or doing a bunch of linked-list construction operations?

I am wondering whether instead it might be faster (or at least more predictable / easier to reason about) to construct a Java ArrayList / ArrayBuffer, imperatively append the external accumulators to it, then do a single addAll call at the end, e.g. something like

```scala
def fromAccumulators(accums: Seq[AccumulatorV2[_, _]]): TaskMetrics = {
  val tm = new TaskMetrics
  val externalAccums = new java.util.ArrayList[AccumulatorV2[Any, Any]]()
  for (acc <- accums) {
    val name = acc.name
    if (name.isDefined && tm.nameToAccums.contains(name.get)) {
      val tmAcc = tm.nameToAccums(name.get).asInstanceOf[AccumulatorV2[Any, Any]]
      tmAcc.metadata = acc.metadata
      tmAcc.merge(acc.asInstanceOf[AccumulatorV2[Any, Any]])
    } else {
      externalAccums.add(acc)
    }
  }
  tm._externalAccums.addAll(externalAccums)
  tm
}
```

This is less net code churn and is probably either comparable in performance or faster, and avoids any performance unpredictability as a function of the Seq type. Given that this code runs in a hot path, I think the use of imperative mutation vs. a more pure functional approach is fine.
Re: [PR] [SPARK-48391][CORE]Using addAll instead of add function in fromAccumulatorInfos method of TaskMetrics Class [spark]
JoshRosen commented on code in PR #46705: URL: https://github.com/apache/spark/pull/46705#discussion_r1613896716

## core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:
@@ -328,16 +328,15 @@ private[spark] object TaskMetrics extends Logging { */ def fromAccumulators(accums: Seq[AccumulatorV2[_, _]]): TaskMetrics = { val tm = new TaskMetrics -for (acc <- accums) { +val (innerAccums, externalAccums) = accums.

Review Comment:
> tm.nameToAccums is a fixed size LinkedHashMap, why do we need create another local variable?

I think the concern was that each call to `.nameToAccums` under the hood goes through a `lazy val`, which is slightly more expensive to access than an ordinary field. On the other hand, I think the old and new code are performing the same total number of `nameToAccums` accesses:

- Before, we'd call `tm.nameToAccums` once per named accumulator to check whether it was a task metric, then a second time for each task metric to retrieve the actual value.
- In the new code we perform the same calls but in a different sequence: we now do all of the initial calls in a batch, then do the "retrieve the actual accum" calls in a second pass.

Given that we're performing the same total number of calls, I don't think that the new code represents a perf. regression w.r.t. those accesses, so the suggestion of storing `val nameToAccums = tm.nameToAccums` would just be a small performance optimization rather than a perf. regression fix.
Re: [PR] [SPARK-48391][CORE]Using addAll instead of add function in fromAccumulatorInfos method of TaskMetrics Class [spark]
JoshRosen commented on code in PR #46705: URL: https://github.com/apache/spark/pull/46705#discussion_r1613891588

## core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:
@@ -328,16 +328,15 @@ private[spark] object TaskMetrics extends Logging { */ def fromAccumulators(accums: Seq[AccumulatorV2[_, _]]): TaskMetrics = { val tm = new TaskMetrics -for (acc <- accums) { +val (innerAccums, externalAccums) = accums. + partition(t => t.name.isDefined && tm.nameToAccums.contains(t.name.get)) +for (acc <- innerAccums) { val name = acc.name - if (name.isDefined && tm.nameToAccums.contains(name.get)) { -val tmAcc = tm.nameToAccums(name.get).asInstanceOf[AccumulatorV2[Any, Any]] -tmAcc.metadata = acc.metadata -tmAcc.merge(acc.asInstanceOf[AccumulatorV2[Any, Any]]) - } else { -tm._externalAccums.add(acc) - } + val tmAcc = tm.nameToAccums(name.get).asInstanceOf[AccumulatorV2[Any, Any]] + tmAcc.metadata = acc.metadata + tmAcc.merge(acc.asInstanceOf[AccumulatorV2[Any, Any]]) } +tm._externalAccums.addAll(externalAccums.asJava)

Review Comment: Yeah, it looks like `CopyOnWriteArrayList.addAll` performs at most one array copy: https://github.com/AdoptOpenJDK/openjdk-jdk12u/blame/master/src/java.base/share/classes/java/util/concurrent/CopyOnWriteArrayList.java#L724-L751
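A small sketch of the point being made, relying only on the documented behavior of `java.util.concurrent.CopyOnWriteArrayList` (the element values are arbitrary): every `add` copies the entire backing array, so appending n elements one at a time does O(n^2) element copies in total, while a single `addAll` copies the array at most once.

```scala
import java.util.concurrent.CopyOnWriteArrayList
import scala.jdk.CollectionConverters._

val external = Seq("acc1", "acc2", "acc3")

// One full array copy per element: n copies for n appends.
val perElement = new CopyOnWriteArrayList[String]()
external.foreach(perElement.add)

// At most one array copy for the whole batch.
val batched = new CopyOnWriteArrayList[String]()
batched.addAll(external.asJava)
```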
Re: [PR] [MINOR][SQL] Remove outdated `TODO`s from `UnsafeHashedRelation` [spark]
JoshRosen commented on code in PR #46736: URL: https://github.com/apache/spark/pull/46736#discussion_r1613884389

## sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala:
@@ -409,9 +407,6 @@ private[joins] class UnsafeHashedRelation( val pageSizeBytes = Option(SparkEnv.get).map(_.memoryManager.pageSizeBytes) .getOrElse(new SparkConf().get(BUFFER_PAGESIZE).getOrElse(16L * 1024 * 1024)) -// TODO(josh): We won't need this dummy memory manager after future refactorings; revisit

Review Comment: These are ancient TODOs, but my recollection is that they were related to some limitations in storage memory management related to transferring memory between execution and storage: at the time (and even now) I don't think we had a reliable way to transfer task-allocated memory pages into a shared context where they can be used by multiple tasks and counted towards storage memory, hence some of the hacks here.

I'm not firmly opposed to removing TODOs like this, but I also don't particularly understand the motivation for doing so: sure, it's a bit of clutter in the code, but if we're not replacing them with a newer comment or making a code change then it just seems like cleanup for cleanup's sake, which I am generally opposed to for code churn / review bandwidth reasons.
Re: [PR] [SPARK-47257][SQL] Assign names to error classes _LEGACY_ERROR_TEMP_105[3-4] and _LEGACY_ERROR_TEMP_1113 [spark]
wayneguow commented on code in PR #46731: URL: https://github.com/apache/spark/pull/46731#discussion_r1613880314

## sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala:
@@ -90,8 +91,8 @@ class ResolveSessionCatalog(val catalogManager: CatalogManager) table.schema.findNestedField(Seq(colName), resolver = conf.resolver) .map(_._2.dataType) .getOrElse { -throw QueryCompilationErrors.alterColumnCannotFindColumnInV1TableError( - quoteIfNeeded(colName), table) +throw QueryCompilationErrors.unresolvedColumnError(

Review Comment: Would it be better if we could provide users with table and context information?
Re: [PR] [SPARK-48286] Fix analysis and creation of column with exists default expression [spark]
cloud-fan commented on PR #46594: URL: https://github.com/apache/spark/pull/46594#issuecomment-2130135726 Can we add a test? Basically any default column value that is not foldable, like `current_date()`, can trigger this bug.
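One possible shape for such a test, sketched under assumptions: the test name, table definition, and choice of `current_date()` as the non-foldable default are illustrative rather than taken from the PR, and the usual `QueryTest`/`SQLTestUtils` helpers (`test`, `withTable`, `sql`, `checkAnswer`, `Row`) are assumed to be in scope.

```scala
// Hypothetical regression test for a column with a non-foldable default value.
test("SPARK-48286: create and insert into a table with a non-foldable default") {
  withTable("t") {
    sql("CREATE TABLE t (id INT, dt DATE DEFAULT current_date()) USING parquet")
    // Omitting dt should fill in the default without an analysis error.
    sql("INSERT INTO t (id) VALUES (1)")
    checkAnswer(sql("SELECT id FROM t WHERE dt IS NOT NULL"), Row(1))
  }
}
```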
Re: [PR] [SPARK-48286] Fix analysis and creation of column with exists default expression [spark]
cloud-fan commented on PR #46594: URL: https://github.com/apache/spark/pull/46594#issuecomment-2130132833 Can you retry the GitHub Actions job? It seems flaky.
Re: [PR] [SPARK-48320][CORE][DOCS] Add external third-party ecosystem access guide to the `scala/java` doc [spark]
gengliangwang commented on code in PR #46634: URL: https://github.com/apache/spark/pull/46634#discussion_r1613860819

## common/utils/src/main/scala/org/apache/spark/internal/README.md:

Review Comment: I will wait for @mridulm's response until this weekend.
Re: [PR] [SPARK-47257][SQL] Assign names to error classes _LEGACY_ERROR_TEMP_105[3-4] and _LEGACY_ERROR_TEMP_1113 [spark]
wayneguow commented on code in PR #46731: URL: https://github.com/apache/spark/pull/46731#discussion_r1613124606

## sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala:
@@ -90,8 +91,8 @@ class ResolveSessionCatalog(val catalogManager: CatalogManager) table.schema.findNestedField(Seq(colName), resolver = conf.resolver) .map(_._2.dataType) .getOrElse { -throw QueryCompilationErrors.alterColumnCannotFindColumnInV1TableError( - quoteIfNeeded(colName), table) +throw QueryCompilationErrors.unresolvedColumnError(

Review Comment:
> Can we reuse `UNRESOLVED_COLUMN.WITH_SUGGESTION`?

For the scenario of `ALTER COLUMN`, since there is one table, I think it is OK; for `resolveFieldNames` in `Analyzer`, I think it would be better if the table can be explicitly specified.
[PR] [SPARK-48411][SS][PYTHON] Add E2E test for DropDuplicateWithinWatermark [spark]
eason-yuchen-liu opened a new pull request, #46740: URL: https://github.com/apache/spark/pull/46740

### What changes were proposed in this pull request?
This PR adds a test for API DropDuplicateWithinWatermark, which was previously missing.

### Why are the changes needed?
Check the correctness of API DropDuplicateWithinWatermark.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passed:
```
python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming
python/run-tests --testnames pyspark.sql.tests.connect.streaming.test_parity_streaming
```

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-47579][CORE][PART3] Spark core: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on PR #46739: URL: https://github.com/apache/spark/pull/46739#issuecomment-2130009429 cc @panbingkun @zeotuan
[PR] [SPARK-47579][CORE][PART3] Spark core: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang opened a new pull request, #46739: URL: https://github.com/apache/spark/pull/46739

### What changes were proposed in this pull request?
The PR aims to migrate logInfo calls with variables in the Core module to the structured logging framework.

### Why are the changes needed?
To enhance Apache Spark's logging system by implementing structured logging.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA tests

### Was this patch authored or co-authored using generative AI tooling?
Yes, Generated-by: GPT-4
Re: [PR] [SPARK-47579][SQL][FOLLOWUP] Restore the `--help` print format of spark sql shell [spark]
gengliangwang closed pull request #46735: [SPARK-47579][SQL][FOLLOWUP] Restore the `--help` print format of spark sql shell URL: https://github.com/apache/spark/pull/46735
Re: [PR] [SPARK-47579][SQL][FOLLOWUP] Restore the `--help` print format of spark sql shell [spark]
gengliangwang commented on PR #46735: URL: https://github.com/apache/spark/pull/46735#issuecomment-2129973290 Merging to master.
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
GideonPotok commented on PR #46597: URL: https://github.com/apache/spark/pull/46597#issuecomment-2129923419

@uros-db I forgot to ask: should I add collation support to `org.apache.spark.sql.catalyst.expressions.aggregate.PandasMode`? The only differences will be:
1. Support for null keys (thus StringType won't necessarily mean all values in the buffer are UTF8String; some might just be null, right?)
2. PandasMode returns a list of all values that are tied for the mode. In that case, should all the values be present? E.g. if you have the pandas_mode of ['a', 'a', 'a', 'b', 'b', 'B'] with utf_binary_lcase collation, what do you think pandas_mode should return?

If we want to support PandasMode, I can do a little research on what other databases seem to favor for this type of question.
Re: [PR] [SPARK-48168][SQL] Add bitwise shifting operators support [spark]
cloud-fan commented on code in PR #46440: URL: https://github.com/apache/spark/pull/46440#discussion_r1613651414

## connector/connect/common/src/test/resources/query-tests/explain-results/function_shiftleft.explain:
@@ -1,2 +1,2 @@
-Project [shiftleft(cast(b#0 as int), 2) AS shiftleft(b, 2)#0]
+Project [(cast(b#0 as int) << 2) AS (b << 2)#0]

Review Comment: Can we keep the pretty string unchanged? If people explicitly use the `shiftleft` function, then the generated alias name should still be `shiftleft`.
Re: [PR] [SPARK-48273][SQL] Fix late rewrite of PlanWithUnresolvedIdentifier [spark]
nikolamand-db commented on PR #46580: URL: https://github.com/apache/spark/pull/46580#issuecomment-2129615940

> I think this fixes https://issues.apache.org/jira/browse/SPARK-46625 as well. Can we add a test to verify?

Checked locally, seems like these changes don't resolve the issue:
```
spark-sql (default)> explain extended WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)), T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd')) SELECT IDENTIFIER('max')(IDENTIFIER('c1')) FROM IDENTIFIER('T');
== Parsed Logical Plan ==
CTE [S, T]
:  :- 'SubqueryAlias S
:  :  +- 'UnresolvedSubqueryColumnAliases [c1, c2]
:  :     +- 'UnresolvedInlineTable [col1, col2], [[1, 2], [2, 3]]
:  +- 'SubqueryAlias T
:     +- 'UnresolvedSubqueryColumnAliases [c1, c2]
:        +- 'UnresolvedInlineTable [col1, col2], [[a, b], [c, d]]
+- 'Project [unresolvedalias(expressionwithunresolvedidentifier(max, expressionwithunresolvedidentifier(c1, ), org.apache.spark.sql.catalyst.parser.AstBuilder$$Lambda$1453/0x000401c95138@312ddc51))]
   +- 'PlanWithUnresolvedIdentifier T, org.apache.spark.sql.catalyst.parser.AstBuilder$$Lambda$1420/0x000401c5e688@330f3046

== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `c1` cannot be resolved. Did you mean one of the following? [`Z`, `s`]. SQLSTATE: 42703; line 1 pos 129;
'WithCTE
:- CTERelationDef 0, false
:  +- SubqueryAlias S
:     +- Project [col1#5 AS c1#9, col2#6 AS c2#10]
:        +- LocalRelation [col1#5, col2#6]
:- CTERelationDef 1, false
:  +- SubqueryAlias T
:     +- Project [col1#7 AS c1#11, col2#8 AS c2#12]
:        +- LocalRelation [col1#7, col2#8]
+- 'Project [unresolvedalias('max('c1))]
   +- SubqueryAlias spark_catalog.default.t
      +- Relation spark_catalog.default.t[Z#13,s#14] parquet
```
`PlanWithUnresolvedIdentifier` gets resolved, but it tries to use table `t` from the catalog instead of the CTE definition.
Re: [PR] [SPARK-48342][SQL] Introduction of SQL Scripting Parser [spark]
davidm-db commented on code in PR #46665: URL: https://github.com/apache/spark/pull/46665#discussion_r1613161561

## sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
@@ -42,6 +42,29 @@ options { tokenVocab = SqlBaseLexer; } public boolean double_quoted_identifiers = false; } +compoundOrSingleStatement +: singleStatement +| singleCompound +; + +singleCompound +: compound SEMICOLON* EOF +; + +compound +: BEGIN compoundBody END +; + +// Last semicolon in body is optional. +compoundBody

Review Comment: As Milan said, we will reuse `compoundBody` in the future in the blocks that don't require `BEGIN` and `END` (if/else, while, etc.). One more reason is that in such blocks (that don't begin with `BEGIN`) variable declarations are not allowed, so it is easier to have these cases separated for the sake of additional checks in visitor functions.
Re: [PR] [SPARK-48282][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringReplace, FindInSet) [spark]
uros-db commented on code in PR #46682: URL: https://github.com/apache/spark/pull/46682#discussion_r1613448325 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -34,6 +34,155 @@ * Utility class for collation-aware UTF8String operations. */ public class CollationAwareUTF8String { + + /** + * The constant value to indicate that the match is not found when searching for a pattern + * string in a target string. + */ + private static final int MATCH_NOT_FOUND = -1; + + /** + * Returns whether the target string starts with the specified prefix, starting from the + * specified position (0-based index referring to character position in UTF8String), with respect + * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is already lowercased + * prior to method call to avoid the overhead of calling .toLowerCase() multiple times on the + * same prefix string. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return whether the target string starts with the specified prefix in UTF8_BINARY_LCASE + */ + public static boolean lowercaseMatchFrom( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != MATCH_NOT_FOUND; + } + + /** + * Returns the length of the substring of the target string that starts with the specified + * prefix, starting from the specified position (0-based index referring to character position + * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method assumes that the + * prefix is already lowercased. The method only considers the part of target string that + * starts from the specified (inclusive) position (that is, the method does not look at UTF8 + * characters of the target string at or after position `endPos`). If the prefix is not found, + * MATCH_NOT_FOUND is returned. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return length of the target substring that ends with the specified prefix in lowercase + */ + public static int lowercaseMatchLengthFrom( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +assert startPos >= 0; +for (int len = 0; len <= target.numChars() - startPos; ++len) { + if (target.substring(startPos, startPos + len).toLowerCase().equals(lowercasePattern)) { +return len; + } +} +return MATCH_NOT_FOUND; + } + + /** + * Returns the position of the first occurrence of the pattern string in the target string, + * starting from the specified position (0-based index referring to character position in + * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method assumes that the + * pattern string is already lowercased prior to call. If the pattern is not found, + * MATCH_NOT_FOUND is returned. 
+ * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return the position of the first occurrence of pattern in target + */ + public static int lowercaseFind( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +assert startPos >= 0; +for (int i = startPos; i <= target.numChars(); ++i) { + if (lowercaseMatchFrom(target, lowercasePattern, i)) { +return i; + } +} +return MATCH_NOT_FOUND; + } + + /** + * Returns whether the target string ends with the specified suffix, ending at the specified + * position (0-based index referring to character position in UTF8String), with respect to the + * UTF8_BINARY_LCASE collation. The method assumes that the suffix is already lowercased prior + * to method call to avoid the overhead of calling .toLowerCase() multiple times on the same + * suffix string. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param endPos the end position for searching (in the target string) + * @return whether the target string ends with the specified suffix in lowercase + */ + public static boolean lowercaseMatchUntil( + final UTF8String target, + final UTF8String lowercasePattern, + int endPos) { +return lowercaseMatchLengthUntil(target, lowercasePattern, endPos) != MATCH_NOT_FOUND; + } + + /** + * Returns the length of the substring of the target string that ends with the specified + * suffix, ending at the specified position (0-based index refer
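A self-contained `java.lang.String` sketch (not the Spark code above, which operates on `UTF8String` character positions) of why the matched length in the target can differ from the pattern's length:

```java
import java.util.Locale;

public class LowercaseMatchLengthSketch {
  public static void main(String[] args) {
    String target = "\u0130abc";          // "İabc" -- 'İ' is a single code point
    String lowercasePattern = "i\u0307";  // "i̇" -- two code points, already lowercased
    for (int len = 0; len <= target.length(); ++len) {
      if (target.substring(0, len).toLowerCase(Locale.ROOT).equals(lowercasePattern)) {
        // Prints 1: a one-character prefix of the target matches a two-character pattern.
        System.out.println("match length in target: " + len);
      }
    }
  }
}
```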
Re: [PR] [SPARK-48282][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringReplace, FindInSet) [spark]
uros-db commented on code in PR #46682: URL: https://github.com/apache/spark/pull/46682#discussion_r1613447350 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -34,6 +34,155 @@ * Utility class for collation-aware UTF8String operations. */ public class CollationAwareUTF8String { + + /** + * The constant value to indicate that the match is not found when searching for a pattern + * string in a target string. + */ + private static final int MATCH_NOT_FOUND = -1; + + /** + * Returns whether the target string starts with the specified prefix, starting from the + * specified position (0-based index referring to character position in UTF8String), with respect + * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is already lowercased + * prior to method call to avoid the overhead of calling .toLowerCase() multiple times on the + * same prefix string. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return whether the target string starts with the specified prefix in UTF8_BINARY_LCASE + */ + public static boolean lowercaseMatchFrom( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != MATCH_NOT_FOUND; + } + + /** + * Returns the length of the substring of the target string that starts with the specified + * prefix, starting from the specified position (0-based index referring to character position + * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method assumes that the + * prefix is already lowercased. The method only considers the part of target string that + * starts from the specified (inclusive) position (that is, the method does not look at UTF8 + * characters of the target string at or after position `endPos`). If the prefix is not found, + * MATCH_NOT_FOUND is returned. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return length of the target substring that ends with the specified prefix in lowercase + */ + public static int lowercaseMatchLengthFrom( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +assert startPos >= 0; +for (int len = 0; len <= target.numChars() - startPos; ++len) { + if (target.substring(startPos, startPos + len).toLowerCase().equals(lowercasePattern)) { +return len; + } +} +return MATCH_NOT_FOUND; + } + + /** + * Returns the position of the first occurrence of the pattern string in the target string, + * starting from the specified position (0-based index referring to character position in + * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method assumes that the + * pattern string is already lowercased prior to call. If the pattern is not found, + * MATCH_NOT_FOUND is returned. 
+ * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return the position of the first occurrence of pattern in target + */ + public static int lowercaseFind( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +assert startPos >= 0; +for (int i = startPos; i <= target.numChars(); ++i) { + if (lowercaseMatchFrom(target, lowercasePattern, i)) { +return i; + } +} +return MATCH_NOT_FOUND; + } + + /** + * Returns whether the target string ends with the specified suffix, ending at the specified + * position (0-based index referring to character position in UTF8String), with respect to the + * UTF8_BINARY_LCASE collation. The method assumes that the suffix is already lowercased prior + * to method call to avoid the overhead of calling .toLowerCase() multiple times on the same + * suffix string. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param endPos the end position for searching (in the target string) + * @return whether the target string ends with the specified suffix in lowercase + */ + public static boolean lowercaseMatchUntil( + final UTF8String target, + final UTF8String lowercasePattern, + int endPos) { +return lowercaseMatchLengthUntil(target, lowercasePattern, endPos) != MATCH_NOT_FOUND; + } + + /** + * Returns the length of the substring of the target string that ends with the specified + * suffix, ending at the specified position (0-based index refer
Re: [PR] [SPARK-48413][SQL] ALTER COLUMN with collation [spark]
johanl-db commented on code in PR #46734: URL: https://github.com/apache/spark/pull/46734#discussion_r1613401422 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -396,8 +396,9 @@ case class AlterTableChangeColumnCommand( val newDataSchema = table.dataSchema.fields.map { field => if (field.name == originColumn.name) { // Create a new column from the origin column with the new comment. +val newField = field.copy(dataType = newColumn.dataType) Review Comment: Can you make it more explicit that we effectively allow the data type to change now but we only allow type changes that differ by collation? E.g. by splitting `columnEqual` into checking that names match on one side and that the type change is supported on the other + add a ` withNewType` method. ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -396,8 +396,9 @@ case class AlterTableChangeColumnCommand( val newDataSchema = table.dataSchema.fields.map { field => if (field.name == originColumn.name) { // Create a new column from the origin column with the new comment. +val newField = field.copy(dataType = newColumn.dataType) Review Comment: Lines 359-367: the comment needs to be updated to be clear we now allow changing collation. Side note: it gives the hive-style syntax for ALTER COLUMN but that's only one of the two syntaxes, see: https://github.com/apache/spark/blob/master/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L130 The other one is the one (partially) documented and the more capable/preferred one: https://spark.apache.org/docs/3.5.1/sql-ref-syntax-ddl-alter-table.html#parameters-4 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
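A rough sketch of the suggested split (the object and method names here are hypothetical, not the actual Spark code):

```scala
import org.apache.spark.sql.types.{DataType, StringType, StructField}

object AlterColumnChecks {
  // Name matching stays as before, delegated to the session's resolver.
  def sameName(field: StructField, other: StructField, resolver: (String, String) => Boolean): Boolean =
    resolver(field.name, other.name)

  // Only a collation-only change between two string types is accepted;
  // any other data type change is still rejected.
  def canChangeType(from: DataType, to: DataType): Boolean = (from, to) match {
    case (_: StringType, _: StringType) => true
    case _ => from == to
  }
}
```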
Re: [PR] [SPARK-48282][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringReplace, FindInSet) [spark]
uros-db commented on code in PR #46682: URL: https://github.com/apache/spark/pull/46682#discussion_r1613408156 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -34,6 +34,155 @@ * Utility class for collation-aware UTF8String operations. */ public class CollationAwareUTF8String { + + /** + * The constant value to indicate that the match is not found when searching for a pattern + * string in a target string. + */ + private static final int MATCH_NOT_FOUND = -1; + + /** + * Returns whether the target string starts with the specified prefix, starting from the + * specified position (0-based index referring to character position in UTF8String), with respect + * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is already lowercased + * prior to method call to avoid the overhead of calling .toLowerCase() multiple times on the + * same prefix string. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return whether the target string starts with the specified prefix in UTF8_BINARY_LCASE + */ + public static boolean lowercaseMatchFrom( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != MATCH_NOT_FOUND; + } + + /** + * Returns the length of the substring of the target string that starts with the specified + * prefix, starting from the specified position (0-based index referring to character position + * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method assumes that the + * prefix is already lowercased. The method only considers the part of target string that + * starts from the specified (inclusive) position (that is, the method does not look at UTF8 + * characters of the target string at or after position `endPos`). If the prefix is not found, + * MATCH_NOT_FOUND is returned. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return length of the target substring that ends with the specified prefix in lowercase + */ + public static int lowercaseMatchLengthFrom( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +assert startPos >= 0; +for (int len = 0; len <= target.numChars() - startPos; ++len) { + if (target.substring(startPos, startPos + len).toLowerCase().equals(lowercasePattern)) { +return len; + } +} +return MATCH_NOT_FOUND; + } + + /** + * Returns the position of the first occurrence of the pattern string in the target string, + * starting from the specified position (0-based index referring to character position in + * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method assumes that the + * pattern string is already lowercased prior to call. If the pattern is not found, + * MATCH_NOT_FOUND is returned. 
+ * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return the position of the first occurrence of pattern in target + */ + public static int lowercaseFind( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +assert startPos >= 0; +for (int i = startPos; i <= target.numChars(); ++i) { + if (lowercaseMatchFrom(target, lowercasePattern, i)) { +return i; + } +} +return MATCH_NOT_FOUND; + } + + /** + * Returns whether the target string ends with the specified suffix, ending at the specified + * position (0-based index referring to character position in UTF8String), with respect to the + * UTF8_BINARY_LCASE collation. The method assumes that the suffix is already lowercased prior + * to method call to avoid the overhead of calling .toLowerCase() multiple times on the same + * suffix string. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param endPos the end position for searching (in the target string) + * @return whether the target string ends with the specified suffix in lowercase + */ + public static boolean lowercaseMatchUntil( Review Comment: yes, this will need to be a public method (changes in https://github.com/apache/spark/pull/46511) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries
Re: [PR] [SPARK-48282][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringReplace, FindInSet) [spark]
uros-db commented on code in PR #46682: URL: https://github.com/apache/spark/pull/46682#discussion_r1613407352 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -34,6 +34,155 @@ * Utility class for collation-aware UTF8String operations. */ public class CollationAwareUTF8String { + + /** + * The constant value to indicate that the match is not found when searching for a pattern + * string in a target string. + */ + private static final int MATCH_NOT_FOUND = -1; + + /** + * Returns whether the target string starts with the specified prefix, starting from the + * specified position (0-based index referring to character position in UTF8String), with respect + * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is already lowercased + * prior to method call to avoid the overhead of calling .toLowerCase() multiple times on the + * same prefix string. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return whether the target string starts with the specified prefix in UTF8_BINARY_LCASE + */ + public static boolean lowercaseMatchFrom( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != MATCH_NOT_FOUND; + } + + /** + * Returns the length of the substring of the target string that starts with the specified + * prefix, starting from the specified position (0-based index referring to character position + * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method assumes that the + * prefix is already lowercased. The method only considers the part of target string that + * starts from the specified (inclusive) position (that is, the method does not look at UTF8 + * characters of the target string at or after position `endPos`). If the prefix is not found, + * MATCH_NOT_FOUND is returned. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return length of the target substring that ends with the specified prefix in lowercase + */ + public static int lowercaseMatchLengthFrom( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +assert startPos >= 0; +for (int len = 0; len <= target.numChars() - startPos; ++len) { + if (target.substring(startPos, startPos + len).toLowerCase().equals(lowercasePattern)) { +return len; + } +} +return MATCH_NOT_FOUND; + } + + /** + * Returns the position of the first occurrence of the pattern string in the target string, + * starting from the specified position (0-based index referring to character position in + * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method assumes that the + * pattern string is already lowercased prior to call. If the pattern is not found, + * MATCH_NOT_FOUND is returned. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return the position of the first occurrence of pattern in target + */ + public static int lowercaseFind( Review Comment: this one can go private -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48282][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringReplace, FindInSet) [spark]
uros-db commented on code in PR #46682: URL: https://github.com/apache/spark/pull/46682#discussion_r1613404598 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -34,6 +34,155 @@ * Utility class for collation-aware UTF8String operations. */ public class CollationAwareUTF8String { + + /** + * The constant value to indicate that the match is not found when searching for a pattern + * string in a target string. + */ + private static final int MATCH_NOT_FOUND = -1; + + /** + * Returns whether the target string starts with the specified prefix, starting from the + * specified position (0-based index referring to character position in UTF8String), with respect + * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is already lowercased + * prior to method call to avoid the overhead of calling .toLowerCase() multiple times on the + * same prefix string. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return whether the target string starts with the specified prefix in UTF8_BINARY_LCASE + */ + public static boolean lowercaseMatchFrom( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != MATCH_NOT_FOUND; + } + + /** + * Returns the length of the substring of the target string that starts with the specified + * prefix, starting from the specified position (0-based index referring to character position + * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method assumes that the + * prefix is already lowercased. The method only considers the part of target string that + * starts from the specified (inclusive) position (that is, the method does not look at UTF8 + * characters of the target string at or after position `endPos`). If the prefix is not found, + * MATCH_NOT_FOUND is returned. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return length of the target substring that ends with the specified prefix in lowercase + */ + public static int lowercaseMatchLengthFrom( Review Comment: this one can go private -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48282][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringReplace, FindInSet) [spark]
uros-db commented on code in PR #46682: URL: https://github.com/apache/spark/pull/46682#discussion_r1613401443 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -34,6 +34,155 @@ * Utility class for collation-aware UTF8String operations. */ public class CollationAwareUTF8String { + + /** + * The constant value to indicate that the match is not found when searching for a pattern + * string in a target string. + */ + private static final int MATCH_NOT_FOUND = -1; + + /** + * Returns whether the target string starts with the specified prefix, starting from the + * specified position (0-based index referring to character position in UTF8String), with respect + * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is already lowercased + * prior to method call to avoid the overhead of calling .toLowerCase() multiple times on the + * same prefix string. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return whether the target string starts with the specified prefix in UTF8_BINARY_LCASE + */ + public static boolean lowercaseMatchFrom( Review Comment: yes, this will need to be a public method (changes in https://github.com/apache/spark/pull/46511) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48342][SQL] Introduction of SQL Scripting Parser [spark]
davidm-db commented on code in PR #46665: URL: https://github.com/apache/spark/pull/46665#discussion_r1613389337 ## sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -42,6 +42,29 @@ options { tokenVocab = SqlBaseLexer; } public boolean double_quoted_identifiers = false; } +compoundOrSingleStatement +: singleStatement +| singleCompound +; + +singleCompound +: compound SEMICOLON* EOF +; + +compound +: BEGIN compoundBody END +; + +// Last semicolon in body is optional. +compoundBody +: compoundStatement (SEMICOLON+ compoundStatement)* SEMICOLON* Review Comment: Talked with Serge offline. His ref spec says that `;` is single and mandatory and needs to be present even in the latest statement in the block. This is more restrictive, but the idea is that we can easily make it less restrictive if we need it in the future. So now, we have only `(compoundStatement SEMICOLON)*` which allows empty `BEGIN END` blocks, but that's also per ref spec. We can discuss further if we want this changed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
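For reference, the stricter rule described above would look roughly like this (illustrative, not necessarily the final grammar):

```antlr
compoundBody
    : (compoundStatement SEMICOLON)*   // every statement, including the last, ends with a single ';'
    ;                                  // zero repetitions => an empty BEGIN END block is allowed
```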
Re: [PR] [SPARK-48283][SQL] Modify string comparison for UTF8_BINARY_LCASE [spark]
uros-db commented on code in PR #46700: URL: https://github.com/apache/spark/pull/46700#discussion_r1613391451 ## common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java: ## @@ -17,15 +17,136 @@ package org.apache.spark.unsafe.types; import org.apache.spark.SparkException; +import org.apache.spark.sql.catalyst.util.CollationAwareUTF8String; import org.apache.spark.sql.catalyst.util.CollationFactory; import org.apache.spark.sql.catalyst.util.CollationSupport; import org.junit.jupiter.api.Test; import static org.junit.jupiter.api.Assertions.*; - +// checkstyle.off: AvoidEscapedUnicodeCharacters public class CollationSupportSuite { + /** + * Collation-aware UTF8String comparison. + */ + + private void assertStringCompare(String s1, String s2, String collationName, int expected) + throws SparkException { +UTF8String l = UTF8String.fromString(s1); +UTF8String r = UTF8String.fromString(s2); +int compare = CollationFactory.fetchCollation(collationName).comparator.compare(l, r); +assertEquals(Integer.signum(expected), Integer.signum(compare)); + } + + @Test + public void testCompare() throws SparkException { +// Edge cases +assertStringCompare("", "", "UTF8_BINARY", 0); +assertStringCompare("a", "", "UTF8_BINARY", 1); +assertStringCompare("", "a", "UTF8_BINARY", -1); +assertStringCompare("", "", "UTF8_BINARY_LCASE", 0); +assertStringCompare("a", "", "UTF8_BINARY_LCASE", 1); +assertStringCompare("", "a", "UTF8_BINARY_LCASE", -1); +assertStringCompare("", "", "UNICODE", 0); +assertStringCompare("a", "", "UNICODE", 1); +assertStringCompare("", "a", "UNICODE", -1); +assertStringCompare("", "", "UNICODE_CI", 0); +assertStringCompare("a", "", "UNICODE_CI", 1); +assertStringCompare("", "a", "UNICODE_CI", -1); +// Basic tests +assertStringCompare("AbCd", "aBcD", "UTF8_BINARY", -1); +assertStringCompare("ABCD", "abcd", "UTF8_BINARY_LCASE", 0); +assertStringCompare("AbcD", "aBCd", "UNICODE", 1); +assertStringCompare("abcd", "ABCD", "UNICODE_CI", 0); +// Accent variation +assertStringCompare("aBćD", "ABĆD", "UTF8_BINARY", 1); +assertStringCompare("AbCδ", "ABCΔ", "UTF8_BINARY_LCASE", 0); +assertStringCompare("äBCd", "ÄBCD", "UNICODE", -1); +assertStringCompare("Ab́cD", "AB́CD", "UNICODE_CI", 0); +// Case-variable character length +assertStringCompare("i\u0307", "İ", "UTF8_BINARY", -1); +assertStringCompare("İ", "i\u0307", "UTF8_BINARY", 1); +assertStringCompare("i\u0307", "İ", "UTF8_BINARY_LCASE", 0); +assertStringCompare("İ", "i\u0307", "UTF8_BINARY_LCASE", 0); +assertStringCompare("i\u0307", "İ", "UNICODE", -1); +assertStringCompare("İ", "i\u0307", "UNICODE", 1); +assertStringCompare("i\u0307", "İ", "UNICODE_CI", 0); +assertStringCompare("İ", "i\u0307", "UNICODE_CI", 0); +assertStringCompare("i\u0307İ", "i\u0307İ", "UTF8_BINARY_LCASE", 0); +assertStringCompare("i\u0307İ", "İi\u0307", "UTF8_BINARY_LCASE", 0); +assertStringCompare("İi\u0307", "i\u0307İ", "UTF8_BINARY_LCASE", 0); +assertStringCompare("İi\u0307", "İi\u0307", "UTF8_BINARY_LCASE", 0); +assertStringCompare("i\u0307İ", "i\u0307İ", "UNICODE_CI", 0); +assertStringCompare("i\u0307İ", "İi\u0307", "UNICODE_CI", 0); +assertStringCompare("İi\u0307", "i\u0307İ", "UNICODE_CI", 0); +assertStringCompare("İi\u0307", "İi\u0307", "UNICODE_CI", 0); +// Conditional case mapping +assertStringCompare("ς", "σ", "UTF8_BINARY", -1); +assertStringCompare("ς", "Σ", "UTF8_BINARY", 1); +assertStringCompare("σ", "Σ", "UTF8_BINARY", 1); +assertStringCompare("ς", "σ", "UTF8_BINARY_LCASE", 0); +assertStringCompare("ς", "Σ", "UTF8_BINARY_LCASE", 0); 
+assertStringCompare("σ", "Σ", "UTF8_BINARY_LCASE", 0); +assertStringCompare("ς", "σ", "UNICODE", 1); +assertStringCompare("ς", "Σ", "UNICODE", 1); +assertStringCompare("σ", "Σ", "UNICODE", -1); +assertStringCompare("ς", "σ", "UNICODE_CI", 0); +assertStringCompare("ς", "Σ", "UNICODE_CI", 0); +assertStringCompare("σ", "Σ", "UNICODE_CI", 0); + } + + private void assertLcaseCompare(String target, String expected, String collationName) + throws SparkException { +if (collationName.equals("UTF8_BINARY")) { + UTF8String targetUTF8 = UTF8String.fromString(target); + UTF8String expectedUTF8 = UTF8String.fromString(expected); + assertEquals(expectedUTF8, targetUTF8.toLowerCase()); +} else if (collationName.equals("UTF8_BINARY_LCASE")) { + assertEquals(expected, CollationAwareUTF8String.lowerCaseCodePoints(target)); +} else { + int collationId = CollationFactory.collationNameToId(collationName); + assertEquals(expected, CollationAwareUTF8String.toLowerCase(target, collationId)); +} + } + + @Test + public void testLcaseCompare() throws SparkException { +// Edge cases +assertLcaseCompare(""
Re: [PR] [SPARK-48342][SQL] Introduction of SQL Scripting Parser [spark]
davidm-db commented on code in PR #46665: URL: https://github.com/apache/spark/pull/46665#discussion_r1613386426 ## sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -42,6 +42,29 @@ options { tokenVocab = SqlBaseLexer; } public boolean double_quoted_identifiers = false; } +compoundOrSingleStatement +: singleStatement +| singleCompound +; + +singleCompound +: compound SEMICOLON* EOF +; + +compound Review Comment: Renamed to `beginEndCompoundBlock`. Added the `Compound` part so it's clear that it's related to compound statements. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48342][SQL] Introduction of SQL Scripting Parser [spark]
davidm-db commented on code in PR #46665: URL: https://github.com/apache/spark/pull/46665#discussion_r1613384389 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/SqlScriptingLogicalOperators.scala: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.parser + +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan + +sealed trait CompoundPlanStatement Review Comment: Done. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/SqlScriptingLogicalOperators.scala: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.parser + +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan + +sealed trait CompoundPlanStatement + +// Statement that is supposed to be executed against Spark. +// This can also be a Spark expression that is wrapped in a statement. +case class SparkStatementWithPlan( +parsedPlan: LogicalPlan, +sourceStart: Int, Review Comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48415][PYTHON] Refactor `TypeName` to support parameterized datatypes [spark]
zhengruifeng commented on code in PR #46738: URL: https://github.com/apache/spark/pull/46738#discussion_r1613379691 ## python/pyspark/sql/types.py: ## @@ -120,9 +120,8 @@ def __eq__(self, other: Any) -> bool: def __ne__(self, other: Any) -> bool: return not self.__eq__(other) -@classmethod -def typeName(cls) -> str: -return cls.__name__[:-4].lower() +def typeName(self) -> str: +return self.__class__.__name__.removesuffix("Type").removesuffix("UDT").lower() Review Comment: follow Scala side https://github.com/apache/spark/blob/6f6b4860268dc250d8e31a251d740733798aa512/sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala#L58-L62 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
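A rough illustration of what the instance-level `typeName` gives (outputs inferred from the implementation shown above, not verified against a Spark build):

```python
from pyspark.sql.types import DecimalType, IntegerType

IntegerType().typeName()       # 'integer'  ("IntegerType" -> strip "Type" -> lower())
DecimalType(10, 2).typeName()  # 'decimal' from the base implementation; a parameterized type
                               # can now override typeName() per instance, e.g. 'decimal(10,2)'
```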
[PR] [SPARK-48415][PYTHON] Refactor `TypeName` to support parameterized datatypes [spark]
zhengruifeng opened a new pull request, #46738: URL: https://github.com/apache/spark/pull/46738 ### What changes were proposed in this pull request? 1, refactor `TypeName` to support parameterized datatypes 2, remove redundant simpleString/jsonValue methods, since they are type name by default. ### Why are the changes needed? to be consistent with scala side ### Does this PR introduce _any_ user-facing change? no, the `typeName` was only internally used in tree string now ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48283][SQL] Modify string comparison for UTF8_BINARY_LCASE [spark]
uros-db commented on code in PR #46700: URL: https://github.com/apache/spark/pull/46700#discussion_r1613207707 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -147,6 +162,45 @@ public static String toLowerCase(final String target, final int collationId) { return UCharacter.toLowerCase(locale, target); } + /** + * Converts a single code point to lowercase using ICU rules, with special handling for + * conditional case mappings (i.e. characters that map to multiple characters in lowercase). Review Comment: ah, so "conditional case mapping" == "context-sensitivity" instead, what I wanted to say is just "one-to-many case mapping" -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
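A quick `java.lang.String` illustration of the terminology being sorted out here (plain JDK behavior, not Spark code): conditional case mapping is context-sensitive, while the case handled in this method is the one-to-many kind:

```java
import java.util.Locale;

public class ConditionalCaseMapping {
  public static void main(String[] args) {
    // Context-sensitive (conditional) mapping: capital sigma lowercases differently at word end.
    System.out.println("\u039F\u03A3".toLowerCase(Locale.ROOT)); // "ος" -- final sigma
    System.out.println("\u03A3\u039F".toLowerCase(Locale.ROOT)); // "σο" -- medial sigma
    // One-to-many mapping (the case discussed above) is different: a single code point such as
    // 'İ' (U+0130) lowercases to two code points, 'i' + U+0307.
  }
}
```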
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1613363048 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -74,16 +90,29 @@ case class Mode( if (buffer.isEmpty) { return null } - +val collationAwareBuffer = child.dataType match { + case c: StringType if +!CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality => +val collationId = c.collationId +val modeMap = buffer.toSeq.groupMapReduce { + case (key: String, _) => +CollationFactory.getCollationKey(UTF8String.fromString(key), collationId) + case (key: UTF8String, _) => +CollationFactory.getCollationKey(key, collationId) + case (key, _) => key Review Comment: also, what case exactly does this catch? if `case c: StringType` is already matched, then we should be able to do `CollationFactory.getCollationKey(key.asInstanceOf[UTF8String], collationId)` right away? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1613361676 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -74,16 +90,29 @@ case class Mode( if (buffer.isEmpty) { return null } - +val collationAwareBuffer = child.dataType match { + case c: StringType if +!CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality => +val collationId = c.collationId +val modeMap = buffer.toSeq.groupMapReduce { + case (key: String, _) => +CollationFactory.getCollationKey(UTF8String.fromString(key), collationId) + case (key: UTF8String, _) => +CollationFactory.getCollationKey(key, collationId) Review Comment: could we combine these with something like: `CollationFactory.getCollationKey(key.asInstanceOf[UTF8String], collationId)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
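A toy `groupMapReduce` illustration (not the actual `Mode` code) of how entries that share a grouping key get collapsed; `toLowerCase` stands in for `CollationFactory.getCollationKey`:

```scala
val counts = Seq(("a", 2L), ("A", 3L), ("b", 1L))
val merged = counts.groupMapReduce { case (k, _) => k.toLowerCase } { case (_, c) => c } (_ + _)
// merged == Map("a" -> 5, "b" -> 1)
```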
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1613358004 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -74,16 +90,29 @@ case class Mode( if (buffer.isEmpty) { return null } - +val collationAwareBuffer = child.dataType match { + case c: StringType if +!CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality => +val collationId = c.collationId +val modeMap = buffer.toSeq.groupMapReduce { + case (key: String, _) => +CollationFactory.getCollationKey(UTF8String.fromString(key), collationId) + case (key: UTF8String, _) => +CollationFactory.getCollationKey(key, collationId) Review Comment: ```suggestion CollationFactory.getCollationKey(key.asInstanceOf[UTF8String], collationId) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1613351378 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -48,6 +49,21 @@ case class Mode( override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType) + override def checkInputDataTypes(): TypeCheckResult = { +val defaultCheck = super.checkInputDataTypes() +if (defaultCheck.isFailure) { + defaultCheck +} else if (UnsafeRowUtils.isBinaryStable(child.dataType) || + child.dataType.isInstanceOf[StringType]) { + defaultCheck Review Comment: ```suggestion TypeCheckResult.TypeCheckSuccess ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44050][K8S]add retry config when creating Kubernetes resources. [spark]
imtzer commented on PR #45911: URL: https://github.com/apache/spark/pull/45911#issuecomment-2129285181

> > same problem when using spark operator, it's weird why the code does not throw anything when configmap is not created
> > When using spark-submit, there is no error output in the console, and the client will show that the driver pod is always in the ContainerCreating state and will never end.

Maybe I met another issue: I use spark-submit in the spark operator pod with k8s mode, but sometimes the driver pod keeps getting stuck in the ContainerCreating state due to a missing ConfigMap, and the console output shows 'Killed'. I added some logging in KubernetesClientApplication.scala like this:

```
logInfo("before pod create, " + driverPodName)
var watch: Watch = null
var createdDriverPod: Pod = null
try {
  createdDriverPod = kubernetesClient.pods().inNamespace(conf.namespace).resource(resolvedDriverPod).create()
} catch {...}
logInfo("before pre resource refresh, " + driverPodName)
// Refresh all pre-resources' owner references
try {
  addOwnerReference(createdDriverPod, preKubernetesResources)
  kubernetesClient.resourceList(preKubernetesResources: _*).forceConflicts().serverSideApply()
} catch {...}
logInfo("before other resource, " + driverPodName)
// setup resources after pod creation, and refresh all resources' owner references
try {
  val otherKubernetesResources = resolvedDriverSpec.driverKubernetesResources ++ Seq(configMap)
  addOwnerReference(createdDriverPod, otherKubernetesResources)
  kubernetesClient.resourceList(otherKubernetesResources: _*).createOrReplace()
  logInfo("after other resource, " + driverPodName)
} catch {...}
```

The log did not show `after other resource` when spark-submit was 'Killed'; the ConfigMap that the driver pod needs had not been successfully created! Also, I checked for OOM using `dmesg`, and some process in the spark operator pod was killed because of OOM. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
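For reference, a typical way to spot such OOM kills on the node, as mentioned above (illustrative; the exact kernel log wording varies by kernel version):

```
dmesg | grep -iE 'killed process|out of memory'
```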
Re: [PR] [SPARK-48309][YARN]Stop am retry, in situations where some errors and retries may not be successful [spark]
guixiaowen commented on PR #46620: URL: https://github.com/apache/spark/pull/46620#issuecomment-2129261156 > Hi, @guixiaowen I need some time to think about it as it might break some existing workloads. > > Meantime, you can > > * Update the PR desc for better readability > * Update the PR desc according to the latest change > * Revise the doc of `spark.yarn.maxAppAttempts` @yaooqinn Ok, Thank you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-48414][PYTHON] Fix breaking change in python's `fromJson` [spark]
stefankandic opened a new pull request, #46737: URL: https://github.com/apache/spark/pull/46737 ### What changes were proposed in this pull request? Fix breaking change in `fromJson` method by having default param values. ### Why are the changes needed? In order to not break clients that don't care about collations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48309][YARN]Stop am retry, in situations where some errors and retries may not be successful [spark]
yaooqinn commented on PR #46620: URL: https://github.com/apache/spark/pull/46620#issuecomment-2129183148 Hi, @guixiaowen I need some time to think about it as it might break some existing workloads. Meantime, you can - Update the PR desc for better readability - Update the PR desc according to the latest change - Revise the doc of `spark.yarn.maxAppAttempts` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48168][SQL] Add bitwise shifting operators support [spark]
ulysses-you closed pull request #46440: [SPARK-48168][SQL] Add bitwise shifting operators support URL: https://github.com/apache/spark/pull/46440 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48168][SQL] Add bitwise shifting operators support [spark]
ulysses-you commented on PR #46440: URL: https://github.com/apache/spark/pull/46440#issuecomment-2129180382 thanks, merged to master(4.0.0) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48282][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringReplace, FindInSet) [spark]
uros-db commented on code in PR #46682: URL: https://github.com/apache/spark/pull/46682#discussion_r1613221638 ## common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java: ## @@ -610,8 +610,42 @@ public void testFindInSet() throws SparkException { assertFindInSet("界x", "test,大千,世,界X,大,千,世界", "UNICODE_CI", 4); assertFindInSet("界x", "test,大千,界Xx,世,界X,大,千,世界", "UNICODE_CI", 5); assertFindInSet("大", "test,大千,世,界X,大,千,世界", "UNICODE_CI", 5); +assertFindInSet("i̇", "İ", "UNICODE_CI", 1); +assertFindInSet("i", "İ", "UNICODE_CI", 0); +assertFindInSet("i̇", "i̇", "UNICODE_CI", 1); +assertFindInSet("i", "i̇", "UNICODE_CI", 0); +assertFindInSet("i̇", "İ,", "UNICODE_CI", 1); +assertFindInSet("i", "İ,", "UNICODE_CI", 0); +assertFindInSet("i̇", "i̇,", "UNICODE_CI", 1); +assertFindInSet("i", "i̇,", "UNICODE_CI", 0); +assertFindInSet("i̇", "ab,İ", "UNICODE_CI", 2); +assertFindInSet("i", "ab,İ", "UNICODE_CI", 0); +assertFindInSet("i̇", "ab,i̇", "UNICODE_CI", 2); +assertFindInSet("i", "ab,i̇", "UNICODE_CI", 0); +assertFindInSet("i̇", "ab,İ,12", "UNICODE_CI", 2); +assertFindInSet("i", "ab,İ,12", "UNICODE_CI", 0); +assertFindInSet("i̇", "ab,i̇,12", "UNICODE_CI", 2); +assertFindInSet("i", "ab,i̇,12", "UNICODE_CI", 0); assertFindInSet("i̇o", "ab,İo,12", "UNICODE_CI", 2); assertFindInSet("İo", "ab,i̇o,12", "UNICODE_CI", 2); +assertFindInSet("i̇", "İ", "UTF8_BINARY_LCASE", 1); +assertFindInSet("i", "İ", "UTF8_BINARY_LCASE", 0); +assertFindInSet("i̇", "i̇", "UTF8_BINARY_LCASE", 1); +assertFindInSet("i", "i̇", "UTF8_BINARY_LCASE", 0); +assertFindInSet("i̇", "İ,", "UTF8_BINARY_LCASE", 1); +assertFindInSet("i", "İ,", "UTF8_BINARY_LCASE", 0); +assertFindInSet("i̇", "i̇,", "UTF8_BINARY_LCASE", 1); +assertFindInSet("i", "i̇,", "UTF8_BINARY_LCASE", 0); +assertFindInSet("i̇", "ab,İ", "UTF8_BINARY_LCASE", 2); +assertFindInSet("i", "ab,İ", "UTF8_BINARY_LCASE", 0); +assertFindInSet("i̇", "ab,i̇", "UTF8_BINARY_LCASE", 2); +assertFindInSet("i", "ab,i̇", "UTF8_BINARY_LCASE", 0); +assertFindInSet("i̇", "ab,İ,12", "UTF8_BINARY_LCASE", 2); +assertFindInSet("i", "ab,İ,12", "UTF8_BINARY_LCASE", 0); +assertFindInSet("i̇", "ab,i̇,12", "UTF8_BINARY_LCASE", 2); +assertFindInSet("i", "ab,i̇,12", "UTF8_BINARY_LCASE", 0); +assertFindInSet("i̇o", "ab,İo,12", "UTF8_BINARY_LCASE", 2); +assertFindInSet("İo", "ab,i̇o,12", "UTF8_BINARY_LCASE", 2); Review Comment: returns 3, as expected -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48342][SQL] Introduction of SQL Scripting Parser [spark]
dbatomic commented on code in PR #46665: URL: https://github.com/apache/spark/pull/46665#discussion_r1613213985 ## sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -42,6 +42,29 @@ options { tokenVocab = SqlBaseLexer; } public boolean double_quoted_identifiers = false; } +compoundOrSingleStatement +: singleStatement +| singleCompound +; + +singleCompound +: compound SEMICOLON* EOF +; + +compound +: BEGIN compoundBody END +; + +// Last semicolon in body is optional. +compoundBody +: compoundStatement (SEMICOLON+ compoundStatement)* SEMICOLON* Review Comment: Yeah, in the POC we did what other languages do (e.g. in C++ you are allowed to say `int x = 12`). Also, for single statements we do allow trailing semicolons - `SELECT 1;` is valid. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48283][SQL] Modify string comparison for UTF8_BINARY_LCASE [spark]
uros-db commented on code in PR #46700: URL: https://github.com/apache/spark/pull/46700#discussion_r1613207707 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -147,6 +162,45 @@ public static String toLowerCase(final String target, final int collationId) { return UCharacter.toLowerCase(locale, target); } + /** + * Converts a single code point to lowercase using ICU rules, with special handling for + * conditional case mappings (i.e. characters that map to multiple characters in lowercase). Review Comment: ah, so "conditional case mapping" == "context-sensitivity"; instead, what I wanted to say is just "case mapping" -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
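For context on the one-to-many case mappings mentioned above, here is a plain JVM illustration (not the ICU/Spark code path under review): the capital dotted "İ" (U+0130) lowercases to two code points, "i" plus a combining dot above (U+0307).

```scala
object OneToManyCaseMapping {
  def main(args: Array[String]): Unit = {
    val capitalDottedI = "\u0130" // "İ"
    val lower = capitalDottedI.toLowerCase(java.util.Locale.ROOT)
    // One code point expands into two on lowercasing: 'i' (U+0069) + combining dot above (U+0307).
    println(lower.length)                                                 // 2
    println(lower.codePoints().toArray.map(_.toHexString).mkString(" "))  // 69 307
  }
}
```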
Re: [PR] [SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]
uros-db commented on code in PR #46599: URL: https://github.com/apache/spark/pull/46599#discussion_r1612779931 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -279,6 +280,7 @@ abstract class Optimizer(catalogManager: CatalogManager) RewritePredicateSubquery.ruleName :: NormalizeFloatingNumbers.ruleName :: ReplaceUpdateFieldsExpression.ruleName :: + RewriteCollationJoin.ruleName :: Review Comment: update: I think there is no impact on correctness, so I went ahead and removed `RewriteCollationJoin` from `nonExcludableRules` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
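Since `RewriteCollationJoin` is no longer in `nonExcludableRules`, it can be switched off like any other optimizer rule through the standard exclusion config if that ever proves necessary. A minimal sketch; the fully qualified rule name below is an assumption based on the optimizer package, and `spark` is assumed to be an existing `SparkSession`:

```scala
// Sketch: exclude the collation join rewrite via the generic optimizer exclusion config.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.RewriteCollationJoin")
```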
Re: [PR] [SPARK-48412][PYTHON] Refactor data type json parse [spark]
zhengruifeng commented on PR #46733: URL: https://github.com/apache/spark/pull/46733#issuecomment-2129082503 merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48412][PYTHON] Refactor data type json parse [spark]
zhengruifeng closed pull request #46733: [SPARK-48412][PYTHON] Refactor data type json parse URL: https://github.com/apache/spark/pull/46733 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][SQL] Remove outdated `TODO`s from `UnsafeHashedRelation` [spark]
LuciferYang commented on code in PR #46736: URL: https://github.com/apache/spark/pull/46736#discussion_r1613184450 ## sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala: ## @@ -409,9 +407,6 @@ private[joins] class UnsafeHashedRelation( val pageSizeBytes = Option(SparkEnv.get).map(_.memoryManager.pageSizeBytes) .getOrElse(new SparkConf().get(BUFFER_PAGESIZE).getOrElse(16L * 1024 * 1024)) -// TODO(josh): We won't need this dummy memory manager after future refactorings; revisit Review Comment: friendly ping @JoshRosen , can we remove these 2 TODO? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48413][SQL] ALTER COLUMN with collation [spark]
olaky commented on code in PR #46734: URL: https://github.com/apache/spark/pull/46734#discussion_r1613167082 ## sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala: ## @@ -408,6 +408,37 @@ object DataType { } } + /** + * Check if `from` is equal to `to` type except for collations and nullability, which are + * both checked to be compatible so that data of type `from` can be interpreted as of type `to`. + */ + private[sql] def equalsIgnoreCompatibleCollationAndNullability( Review Comment: let's already give this a name that reflects that is returns true of the it is allowed to evolve the type. This will be true for more cases than collations in the future ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -445,7 +446,8 @@ case class AlterTableChangeColumnCommand( // name(by resolver) and dataType. private def columnEqual( field: StructField, other: StructField, resolver: Resolver): Boolean = { -resolver(field.name, other.name) && field.dataType == other.dataType +resolver(field.name, other.name) && + DataType.equalsIgnoreCompatibleCollationAndNullability(field.dataType, other.dataType) Review Comment: The comment of the command in line 359 is not outdated ## sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala: ## @@ -408,6 +408,37 @@ object DataType { } } + /** + * Check if `from` is equal to `to` type except for collations and nullability, which are + * both checked to be compatible so that data of type `from` can be interpreted as of type `to`. + */ + private[sql] def equalsIgnoreCompatibleCollationAndNullability( + from: DataType, + to: DataType): Boolean = { +(from, to) match { + case (_: StringType, _: StringType) => true + + case (ArrayType(fromElement, fn), ArrayType(toElement, tn)) => +(tn || !fn) && equalsIgnoreCompatibleCollationAndNullability(fromElement, toElement) + + case (MapType(fromKey, fromValue, fn), MapType(toKey, toValue, tn)) => +(tn || !fn) && + // Map keys cannot change collation. + equalsIgnoreCompatibleNullability(fromKey, toKey) && + equalsIgnoreCompatibleCollationAndNullability(fromValue, toValue) + + case (StructType(fromFields), StructType(toFields)) => +fromFields.length == toFields.length && + fromFields.zip(toFields).forall { case (fromField, toField) => Review Comment: I would rather do a forAll on fromFields and look up each field by name in toFields. That way we do not depend on the order ## sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala: ## @@ -408,6 +408,37 @@ object DataType { } } + /** + * Check if `from` is equal to `to` type except for collations and nullability, which are + * both checked to be compatible so that data of type `from` can be interpreted as of type `to`. + */ + private[sql] def equalsIgnoreCompatibleCollationAndNullability( + from: DataType, + to: DataType): Boolean = { +(from, to) match { + case (_: StringType, _: StringType) => true + + case (ArrayType(fromElement, fn), ArrayType(toElement, tn)) => +(tn || !fn) && equalsIgnoreCompatibleCollationAndNullability(fromElement, toElement) Review Comment: Why do we want to allow setting the type nullable? ## sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala: ## @@ -408,6 +408,37 @@ object DataType { } } + /** + * Check if `from` is equal to `to` type except for collations and nullability, which are + * both checked to be compatible so that data of type `from` can be interpreted as of type `to`. 
+ */ + private[sql] def equalsIgnoreCompatibleCollationAndNullability( + from: DataType, + to: DataType): Boolean = { +(from, to) match { + case (_: StringType, _: StringType) => true + + case (ArrayType(fromElement, fn), ArrayType(toElement, tn)) => +(tn || !fn) && equalsIgnoreCompatibleCollationAndNullability(fromElement, toElement) + + case (MapType(fromKey, fromValue, fn), MapType(toKey, toValue, tn)) => Review Comment: let's also not abbreviate here ## sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala: ## @@ -408,6 +408,37 @@ object DataType { } } + /** + * Check if `from` is equal to `to` type except for collations and nullability, which are + * both checked to be compatible so that data of type `from` can be interpreted as of type `to`. + */ + private[sql] def equalsIgnoreCompatibleCollationAndNullability( + from: DataType, + to: DataType): Boolean = { +(from, to) match { + case (_: StringType, _: StringType) => true + + case (ArrayType(fromElement, fn), ArrayType(toElement, tn)) => +(tn || !fn) && equalsIgnoreCompatibleCollationAndNullability(fromEl
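The `(tn || !fn)` checks in the snippets above encode the usual nullability-widening rule: data of a non-nullable `from` type may be read as nullable, but a nullable `from` type may not be read as non-nullable. A standalone sketch of just that check, independent of the Spark type classes:

```scala
// Nullability compatibility used by the recursive type checks above:
// reading `from` as `to` is fine if `to` is nullable, or if `from` is already non-nullable.
def nullabilityCompatible(fromNullable: Boolean, toNullable: Boolean): Boolean =
  toNullable || !fromNullable

assert(nullabilityCompatible(fromNullable = false, toNullable = true))   // widen: ok
assert(nullabilityCompatible(fromNullable = true, toNullable = true))    // unchanged: ok
assert(!nullabilityCompatible(fromNullable = true, toNullable = false))  // narrow: not ok
```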
Re: [PR] [MINOR][SQL] Remove an outdated `TODO` from UnsafeHashedRelation [spark]
LuciferYang commented on PR #46736: URL: https://github.com/apache/spark/pull/46736#issuecomment-2129074503 > How about removing the `TODO(josh)` at L412 [done](https://github.com/apache/spark/pull/9127/files#diff-127291a0287f790755be5473765ea03eb65f8b58b9ec0760955f124e21e3452f) ![image](https://github.com/apache/spark/assets/1475305/51904dc0-e8d9-470f-b6ed-9904246d9e47) seems ok -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48213][SQL] Do not push down predicate if non-cheap expression exceed reused limit [spark]
zml1206 commented on PR #46499: URL: https://github.com/apache/spark/pull/46499#issuecomment-2129064493 > Can we make `With` nested as well? I have thought about it for a long time, and I can change the logic of withRewrite so that the `alias` generated by the lowest-layer `With` ends up in the uppermost project, which is the reverse of the current order. This would handle the complex nesting above. `with(with(((commonexpressionref(CommonExpressionId(1,false), IntegerType, true) + commonexpressionref(CommonExpressionId(1,false), IntegerType, true)) > commonexpressionref(CommonExpressionId(0,false), IntegerType, true)), commonexpressiondef((commonexpressionref(CommonExpressionId(0,false), IntegerType, true) * commonexpressionref(CommonExpressionId(0,false), IntegerType, true)), CommonExpressionId(1,false))), commonexpressiondef((a#0 + a#0), CommonExpressionId(0,false)))` Do you think it’s feasible? If so, I’ll do it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
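For readers unfamiliar with the `With`/common-expression notation, the nested expression above can be read as ordinary let-bindings over the input column `a`. A small self-contained Scala rendering (illustration only, not the Catalyst API):

```scala
// The nested With above, read as let-bindings:
//   c0 = a + a       (outer CommonExpressionDef, id 0)
//   c1 = c0 * c0     (inner CommonExpressionDef, id 1)
//   predicate: (c1 + c1) > c0
def predicate(a: Int): Boolean = {
  val c0 = a + a
  val c1 = c0 * c0
  c1 + c1 > c0
}
println(predicate(3)) // (36 + 36) > 6 => true
```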
Re: [PR] [SPARK-48342][SQL] Introduction of SQL Scripting Parser [spark]
davidm-db commented on code in PR #46665: URL: https://github.com/apache/spark/pull/46665#discussion_r1613161561 ## sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -42,6 +42,29 @@ options { tokenVocab = SqlBaseLexer; } public boolean double_quoted_identifiers = false; } +compoundOrSingleStatement +: singleStatement +| singleCompound +; + +singleCompound +: compound SEMICOLON* EOF +; + +compound +: BEGIN compoundBody END +; + +// Last semicolon in body is optional. +compoundBody Review Comment: As Milan said, we will reuse `compoundBody` in future in the blocks that don't require `BEGIN` and `END` (if else, while, etc.). One more reason is that in such blocks (that don't begin with `BEGIN`) variable declaration is not allowed, so we need to have these cases separated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][SQL] Remove an outdated `TODO` from UnsafeHashedRelation [spark]
yaooqinn commented on PR #46736: URL: https://github.com/apache/spark/pull/46736#issuecomment-2129046316 How about removing the `TODO(josh)` at L412 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48396] Support configuring max cores can be used for SQL [spark]
yabola commented on PR #46713: URL: https://github.com/apache/spark/pull/46713#issuecomment-2129043469 @cloud-fan could you take a look, thank you~ This is useful in a shared SQL cluster and will make it easier to control SQL resource usage. The picture below shows that the number of cores used stays consistent. https://github.com/apache/spark/assets/31469905/f6812b84-d5fb-4436-b1a2-3171547c94e0 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48352][SQL]set max file counter through spark conf [spark]
guixiaowen commented on PR #46668: URL: https://github.com/apache/spark/pull/46668#issuecomment-2129040807 @LuciferYang Can you review this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48342][SQL] Introduction of SQL Scripting Parser [spark]
davidm-db commented on code in PR #46665: URL: https://github.com/apache/spark/pull/46665#discussion_r1613136311 ## sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -42,6 +42,29 @@ options { tokenVocab = SqlBaseLexer; } public boolean double_quoted_identifiers = false; } +compoundOrSingleStatement +: singleStatement +| singleCompound +; + +singleCompound +: compound SEMICOLON* EOF +; + +compound +: BEGIN compoundBody END +; + +// Last semicolon in body is optional. +compoundBody +: compoundStatement (SEMICOLON+ compoundStatement)* SEMICOLON* Review Comment: Not sure why we did it like this in the POC. I've tried in MySQL Workbench now, and it seems that MySQL doesn't allow multiple "empty" semicolons... I'll need to check what the desired behavior is and will update accordingly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48409][BUILD][TESTS] Upgrade MySQL & Postgres & Mariadb docker image version [spark]
yaooqinn commented on PR #46704: URL: https://github.com/apache/spark/pull/46704#issuecomment-2129013813 Merged to master. Thank you @panbingkun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48409][BUILD][TESTS] Upgrade MySQL & Postgres & Mariadb docker image version [spark]
yaooqinn closed pull request #46704: [SPARK-48409][BUILD][TESTS] Upgrade MySQL & Postgres & Mariadb docker image version URL: https://github.com/apache/spark/pull/46704 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][CORE] Remove an outdated TODO from UnsafeHashedRelation [spark]
LuciferYang commented on code in PR #46736: URL: https://github.com/apache/spark/pull/46736#discussion_r1613125740 ## sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala: ## @@ -396,8 +396,6 @@ private[joins] class UnsafeHashedRelation( val nKeys = readLong() val nValues = readLong() // This is used in Broadcast, shared by multiple tasks, so we use on-heap memory -// TODO(josh): This needs to be revisited before we merge this patch; making this change now -// so that tests compile: Review Comment: It seems that this was a reminder for the author to revisit this part of the code before merging, but deleting it was forgotten when the code was merged? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47257][SQL] Assign names to error classes _LEGACY_ERROR_TEMP_105[3-4] and _LEGACY_ERROR_TEMP_1113 [spark]
wayneguow commented on code in PR #46731: URL: https://github.com/apache/spark/pull/46731#discussion_r1613124606 ## sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala: ## @@ -90,8 +91,8 @@ class ResolveSessionCatalog(val catalogManager: CatalogManager) table.schema.findNestedField(Seq(colName), resolver = conf.resolver) .map(_._2.dataType) .getOrElse { -throw QueryCompilationErrors.alterColumnCannotFindColumnInV1TableError( - quoteIfNeeded(colName), table) +throw QueryCompilationErrors.unresolvedColumnError( Review Comment: > Can we reuse `UNRESOLVED_COLUMN.WITH_SUGGESTION`? For the `ALTER COLUMN` scenario, since there is only one table, I think it is OK; for `resolveFieldNames` in `Analyzer`, I think it would be better if the table could be explicitly specified. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org