[GitHub] [spark] sadikovi commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source
sadikovi commented on PR #36726: URL: https://github.com/apache/spark/pull/36726#issuecomment-1145652717 @gengliangwang Can you review? Thanks to the comments from the reviewers, I noticed that there could be inconsistencies when writing a timestamp with local time zone and reading it back as timestamp_ntz. For example, writing 2020-01-01 00:00:00 in a UTC+1 session and reading it back as timestamp_ntz yields 2019-12-31 23:00:00. This is because we always store timestamps in UTC and have no record of which time zone was used when writing. Could you advise on how to proceed? I am not sure we can do much about it, since we don't store time zone information. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
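The skew described above can be reproduced outside Spark with plain Python datetimes. This is a sketch, assuming a UTC+1 session time zone; the variable names are illustrative:

```python
from datetime import datetime, timezone, timedelta

# A session in UTC+1 writes a local wall-clock timestamp; Spark stores
# the corresponding instant normalized to UTC.
session_tz = timezone(timedelta(hours=1))
written = datetime(2020, 1, 1, 0, 0, 0, tzinfo=session_tz)

stored_utc = written.astimezone(timezone.utc)
print(stored_utc)  # 2019-12-31 23:00:00+00:00

# Reading back as TIMESTAMP_NTZ takes the stored wall-clock fields verbatim,
# because the original session time zone was never persisted.
read_ntz = stored_utc.replace(tzinfo=None)
print(read_ntz)  # 2019-12-31 23:00:00
```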
[GitHub] [spark] olaky commented on pull request #36752: [SPARK-39259][SQL][3.3] Evaluate timestamps consistently in subqueries
olaky commented on PR #36752: URL: https://github.com/apache/spark/pull/36752#issuecomment-1145665119 Thanks for picking this up so quickly. I wanted to first see if the build passes and then update the comment.
[GitHub] [spark] gengliangwang commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source
gengliangwang commented on PR #36726: URL: https://github.com/apache/spark/pull/36726#issuecomment-1145676931 @sadikovi yes will do. I just moved home. Sorry for the late reply.
[GitHub] [spark] olaky commented on pull request #36753: [SPARK-39259] Evaluate timestamps consistently in subqueries
olaky commented on PR #36753: URL: https://github.com/apache/spark/pull/36753#issuecomment-1145711345 There are some Python errors in the build, which look to me like a problem with the pandas or Python versions in the build pipeline. In any case, I really doubt that I broke those tests.
[GitHub] [spark] HyukjinKwon commented on pull request #36756: [SPARK-39369][INFRA] Use JAVA_OPTS for AppVeyor build to increase the memory properly
HyukjinKwon commented on PR #36756: URL: https://github.com/apache/spark/pull/36756#issuecomment-1145722312 Let me merge this into master to recover the build. Merged to master.
[GitHub] [spark] HyukjinKwon closed pull request #36756: [SPARK-39369][INFRA] Use JAVA_OPTS for AppVeyor build to increase the memory properly
HyukjinKwon closed pull request #36756: [SPARK-39369][INFRA] Use JAVA_OPTS for AppVeyor build to increase the memory properly URL: https://github.com/apache/spark/pull/36756
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py
HyukjinKwon commented on code in PR #36701: URL: https://github.com/apache/spark/pull/36701#discussion_r888738083

## python/pyspark/tests/test_shuffle.py: ##
```
@@ -54,6 +63,59 @@ def test_medium_dataset(self):
         self.assertTrue(m.spills >= 1)
         self.assertEqual(sum(sum(v) for k, v in m.items()), sum(range(self.N)) * 3)
+
+    def test_shuffle_data_with_multiple_locations(self):
+        # SPARK-39179: Test shuffle of data with multiple location also check
+        # shuffle locations get randomized
+        with tempfile.TemporaryDirectory() as tempdir1, tempfile.TemporaryDirectory() as tempdir2:
+            original = None
+            if "SPARK_LOCAL_DIRS" in os.environ:
+                original = os.environ["SPARK_LOCAL_DIRS"]
```

Review Comment: you can use `os.environ.get("SPARK_LOCAL_DIRS", None)`
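For reference, the suggested `os.environ.get` call collapses the three-line lookup into one. A quick sketch of the equivalence:

```python
import os

# Ensure the variable is unset for the demonstration.
os.environ.pop("SPARK_LOCAL_DIRS", None)

# Verbose form from the diff above.
original = None
if "SPARK_LOCAL_DIRS" in os.environ:
    original = os.environ["SPARK_LOCAL_DIRS"]

# Suggested one-liner; .get returns None by default when the key is absent.
original_short = os.environ.get("SPARK_LOCAL_DIRS", None)

assert original == original_short  # both None when unset

os.environ["SPARK_LOCAL_DIRS"] = "/tmp/a,/tmp/b"
assert os.environ.get("SPARK_LOCAL_DIRS") == "/tmp/a,/tmp/b"
```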
[GitHub] [spark] HyukjinKwon commented on pull request #36756: [SPARK-39369][INFRA] Use JAVA_OPTS for AppVeyor build to increase the memory properly
HyukjinKwon commented on PR #36756: URL: https://github.com/apache/spark/pull/36756#issuecomment-1145741362 Seems working fine. Let me get this in to recover the build. Merged to master.
[GitHub] [spark] HyukjinKwon opened a new pull request, #36758: [SPARK-39372][INFRA][R] Update R version to 4.2.0 in AppVeyor
HyukjinKwon opened a new pull request, #36758: URL: https://github.com/apache/spark/pull/36758

### What changes were proposed in this pull request?
This PR updates AppVeyor to use the latest R version, 4.2.0.

### Why are the changes needed?
To test the latest R. The R community tends to adopt the latest versions aggressively.

### Does this PR introduce _any_ user-facing change?
No, dev-only.

### How was this patch tested?
CI in this PR should test it out.
[GitHub] [spark] HyukjinKwon closed pull request #36756: [SPARK-39369][INFRA] Use JAVA_OPTS for AppVeyer build to increase the memory properly
HyukjinKwon closed pull request #36756: [SPARK-39369][INFRA] Use JAVA_OPTS for AppVeyor build to increase the memory properly URL: https://github.com/apache/spark/pull/36756
[GitHub] [spark] MaxGekk commented on pull request #36753: [SPARK-39259][SQL][3.2] Evaluate timestamps consistently in subqueries
MaxGekk commented on PR #36753: URL: https://github.com/apache/spark/pull/36753#issuecomment-1145741825

> There are some Python errors in the build, which seem to me like there is a problem with pandas or python versions in the build pipeline.

Looking at GAs for commits in https://github.com/apache/spark/commits/branch-3.2, they have been broken for some time already. cc @HyukjinKwon @ueshin @itholic
[GitHub] [spark] HyukjinKwon commented on pull request #36757: [SPARK-39371][DOCS][Core] Review and fix issues in Scala/Java API docs of Core module
HyukjinKwon commented on PR #36757: URL: https://github.com/apache/spark/pull/36757#issuecomment-1145743251 Merged to master and branch-3.3.
[GitHub] [spark] HyukjinKwon closed pull request #36757: [SPARK-39371][DOCS][Core] Review and fix issues in Scala/Java API docs of Core module
HyukjinKwon closed pull request #36757: [SPARK-39371][DOCS][Core] Review and fix issues in Scala/Java API docs of Core module URL: https://github.com/apache/spark/pull/36757
[GitHub] [spark] HyukjinKwon commented on pull request #36753: [SPARK-39259][SQL][3.2] Evaluate timestamps consistently in subqueries
HyukjinKwon commented on PR #36753: URL: https://github.com/apache/spark/pull/36753#issuecomment-1145749316 Let me fix it.
[GitHub] [spark] HyukjinKwon commented on pull request #36753: [SPARK-39259][SQL][3.2] Evaluate timestamps consistently in subqueries
HyukjinKwon commented on PR #36753: URL: https://github.com/apache/spark/pull/36753#issuecomment-1145754894 https://github.com/apache/spark/pull/36759
[GitHub] [spark] HyukjinKwon opened a new pull request, #36759: [SPARK-39373][PYTHON][3.2] Recover branch-3.2 build broken by SPARK-39273 and SPARK-39252
HyukjinKwon opened a new pull request, #36759: URL: https://github.com/apache/spark/pull/36759

### What changes were proposed in this pull request?
Backporting SPARK-39273 and SPARK-39252 brought a couple of mistakes into branch-3.2. This PR fixes them: one change avoids an exact-match assertion (results differ under a different pandas version); the other removes an unused import.

### Why are the changes needed?
To recover the build.

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
CI in this PR should test it out.
[GitHub] [spark] MaxGekk commented on pull request #36753: [SPARK-39259][SQL][3.2] Evaluate timestamps consistently in subqueries
MaxGekk commented on PR #36753: URL: https://github.com/apache/spark/pull/36753#issuecomment-1145768022 Let's wait for https://github.com/apache/spark/pull/36759 and then re-trigger the build via an empty commit:
```
$ git commit --allow-empty -m "Trigger build"
```
[GitHub] [spark] pralabhkumar commented on a diff in pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py
pralabhkumar commented on code in PR #36701: URL: https://github.com/apache/spark/pull/36701#discussion_r888778272

## python/pyspark/tests/test_shuffle.py: ##
```
@@ -54,6 +63,59 @@ def test_medium_dataset(self):
         self.assertTrue(m.spills >= 1)
         self.assertEqual(sum(sum(v) for k, v in m.items()), sum(range(self.N)) * 3)
+
+    def test_shuffle_data_with_multiple_locations(self):
+        # SPARK-39179: Test shuffle of data with multiple location also check
+        # shuffle locations get randomized
+        with tempfile.TemporaryDirectory() as tempdir1, tempfile.TemporaryDirectory() as tempdir2:
+            original = None
+            if "SPARK_LOCAL_DIRS" in os.environ:
+                original = os.environ["SPARK_LOCAL_DIRS"]
```

Review Comment: done
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36758: [SPARK-39372][INFRA][R] Update R version to 4.2.0 in AppVeyor
HyukjinKwon commented on code in PR #36758: URL: https://github.com/apache/spark/pull/36758#discussion_r10253

## dev/appveyor-install-dependencies.ps1: ##
```
@@ -129,7 +129,7 @@ $env:PATH = "$env:HADOOP_HOME\bin;" + $env:PATH
 Pop-Location

 # == R
-$rVer = "4.0.2"
+$rVer = "4.2.0"
 $rToolsVer = "4.0.2"
```

Review Comment: While the tests passed, it failed to download RTools 4.2.0. I reverted the RTools upgrade here for now.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36758: [SPARK-39372][INFRA][R] Update R version to 4.2.0 in AppVeyor
HyukjinKwon commented on code in PR #36758: URL: https://github.com/apache/spark/pull/36758#discussion_r10253

## dev/appveyor-install-dependencies.ps1: ##
```
@@ -129,7 +129,7 @@ $env:PATH = "$env:HADOOP_HOME\bin;" + $env:PATH
 Pop-Location

 # == R
-$rVer = "4.0.2"
+$rVer = "4.2.0"
 $rToolsVer = "4.0.2"
```

Review Comment: While the tests passed (one test failed but seems flaky), it failed to download RTools 4.2.0. I reverted the RTools upgrade here for now.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36759: [SPARK-39373][PYTHON][3.2] Recover branch-3.2 build broken by SPARK-39273 and SPARK-39252
HyukjinKwon commented on code in PR #36759: URL: https://github.com/apache/spark/pull/36759#discussion_r21336

## python/pyspark/pandas/tests/test_series.py: ##
```
@@ -2201,7 +2201,7 @@ def test_mad(self):
         pser.index = pmidx
         psser = ps.from_pandas(pser)

-        self.assert_eq(pser.mad(), psser.mad())
+        self.assert_eq(pser.mad(), psser.mad(), check_exact=False)
```

Review Comment:
```suggestion
        self.assert_eq(pser.mad(), psser.mad(), almost=True)
```
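Background on the suggestion: whether spelled `check_exact=False` or `almost=True`, the point is that `mad()` results can differ in the last few bits across pandas versions, so an exact float comparison is too strict. A minimal illustration with stdlib `math.isclose` (the values are hypothetical):

```python
import math

# Two results that agree mathematically but differ by floating-point
# rounding, e.g. 0.1 + 0.2 versus the literal 0.3.
result_a = 0.1 + 0.2   # 0.30000000000000004
result_b = 0.3

assert result_a != result_b                              # exact comparison fails
assert math.isclose(result_a, result_b, rel_tol=1e-9)    # approximate comparison passes
```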
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36758: [SPARK-39372][R] Support R 4.2.0
HyukjinKwon commented on code in PR #36758: URL: https://github.com/apache/spark/pull/36758#discussion_r10253

## dev/appveyor-install-dependencies.ps1: ##
```
@@ -129,7 +129,7 @@ $env:PATH = "$env:HADOOP_HOME\bin;" + $env:PATH
 Pop-Location

 # == R
-$rVer = "4.0.2"
+$rVer = "4.2.0"
 $rToolsVer = "4.0.2"
```

Review Comment: While the tests passed, it failed to download RTools 4.2.0. I reverted the RTools upgrade here for now.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36758: [SPARK-39372][R] Support R 4.2.0
HyukjinKwon commented on code in PR #36758: URL: https://github.com/apache/spark/pull/36758#discussion_r58660

## R/pkg/R/mllib_classification.R: ##
```
@@ -331,7 +331,7 @@ setMethod("spark.logit", signature(data = "SparkDataFrame", formula = "formula")
     }
     if (!is.null(upperBoundsOnCoefficients)) {
-      if (class(upperBoundsOnCoefficients) != "matrix") {
+      if (is.matrix(upperBoundsOnCoefficients)) {
```

Review Comment:
```suggestion
      if (!is.matrix(upperBoundsOnCoefficients)) {
```

## R/pkg/R/mllib_classification.R: ##
```
@@ -322,7 +322,7 @@ setMethod("spark.logit", signature(data = "SparkDataFrame", formula = "formula")
     }
     if (!is.null(lowerBoundsOnCoefficients)) {
-      if (class(lowerBoundsOnCoefficients) != "matrix") {
+      if (is.matrix(lowerBoundsOnCoefficients)) {
```

Review Comment:
```suggestion
      if (!is.matrix(lowerBoundsOnCoefficients)) {
```
[GitHub] [spark] wangyum opened a new pull request, #36760: [SPARK-39374][SQL] Improve error message for user specified column list
wangyum opened a new pull request, #36760: URL: https://github.com/apache/spark/pull/36760

### What changes were proposed in this pull request?
This PR improves the error message for a user-specified column list. For example:
```sql
create table t1(c1 int, c2 bigint, c3 string) using parquet;
insert into t1(c1, c2, c4) values(1, 2, 3);
```
Before this PR:
```
Cannot resolve column name c4; line 1 pos 0
org.apache.spark.sql.AnalysisException: Cannot resolve column name c4; line 1 pos 0
```
After this PR:
```
[MISSING_COLUMN] Column 'c4' does not exist. Did you mean one of the following? [c1, c2, c3]; line 1 pos 0
org.apache.spark.sql.AnalysisException: [MISSING_COLUMN] Column 'c4' does not exist. Did you mean one of the following? [c1, c2, c3]; line 1 pos 0
```

### Why are the changes needed?
Improve the error message.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.
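The "Did you mean" hint can be sketched with stdlib `difflib`; this is an illustration of the general technique, not Spark's actual implementation, and the function name is made up:

```python
import difflib

def missing_column_error(name, columns):
    # Rank the existing columns by string similarity to the unresolved name.
    candidates = difflib.get_close_matches(name, columns, n=3, cutoff=0.0)
    return (f"[MISSING_COLUMN] Column '{name}' does not exist. "
            f"Did you mean one of the following? [{', '.join(candidates)}]")

print(missing_column_error("c4", ["c1", "c2", "c3"]))
```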
[GitHub] [spark] gengliangwang opened a new pull request, #36761: [SPARK-39367][DOCS][SQL][3.2] Make SQL error objects private
gengliangwang opened a new pull request, #36761: URL: https://github.com/apache/spark/pull/36761

### What changes were proposed in this pull request?
Partially backport https://github.com/apache/spark/pull/36754 to branch-3.2: make the following objects private so that they won't show up in the API doc:
* QueryCompilationErrors
* QueryExecutionErrors
* QueryParsingErrors

### Why are the changes needed?
Fix a bug in the docs.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA tests.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36758: [SPARK-39372][R] Support R 4.2.0
HyukjinKwon commented on code in PR #36758: URL: https://github.com/apache/spark/pull/36758#discussion_r888911462

## R/pkg/R/serialize.R: ##
```
@@ -58,7 +58,12 @@ writeObject <- function(con, object, writeType = TRUE) {
   # Checking types is needed here, since 'is.na' only handles atomic vectors,
   # lists and pairlists
   if (type %in% c("integer", "character", "logical", "double", "numeric")) {
-    if (is.na(object)) {
+    if (is.na(object[[1]])) {
```

Review Comment:

**R 4.1 and below:**
```
Warning in if (is.na(c(1, 2))) print("abc") :
  the condition has length > 1 and only the first element will be used
```
**R 4.2+:**
```
Error in if (is.na(c(1, 2))) print("abc") : the condition has length > 1
```
[GitHub] [spark] HyukjinKwon commented on pull request #36758: [SPARK-39372][R] Support R 4.2.0
HyukjinKwon commented on PR #36758: URL: https://github.com/apache/spark/pull/36758#issuecomment-1145930588 Tests should pass now.
[GitHub] [spark] HyukjinKwon commented on pull request #36759: [SPARK-39373][PYTHON][3.2] Recover branch-3.2 build broken by SPARK-39273 and SPARK-39252
HyukjinKwon commented on PR #36759: URL: https://github.com/apache/spark/pull/36759#issuecomment-1145932875 Merged to branch-3.2.
[GitHub] [spark] HyukjinKwon closed pull request #36759: [SPARK-39373][PYTHON][3.2] Recover branch-3.2 build broken by SPARK-39273 and SPARK-39252
HyukjinKwon closed pull request #36759: [SPARK-39373][PYTHON][3.2] Recover branch-3.2 build broken by SPARK-39273 and SPARK-39252 URL: https://github.com/apache/spark/pull/36759
[GitHub] [spark] HyukjinKwon commented on pull request #36761: [SPARK-39367][DOCS][SQL][3.2] Make SQL error objects private
HyukjinKwon commented on PR #36761: URL: https://github.com/apache/spark/pull/36761#issuecomment-1145933920 Merged to branch-3.2.
[GitHub] [spark] HyukjinKwon closed pull request #36761: [SPARK-39367][DOCS][SQL][3.2] Make SQL error objects private
HyukjinKwon closed pull request #36761: [SPARK-39367][DOCS][SQL][3.2] Make SQL error objects private URL: https://github.com/apache/spark/pull/36761
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36758: [SPARK-39372][R] Support R 4.2.0
HyukjinKwon commented on code in PR #36758: URL: https://github.com/apache/spark/pull/36758#discussion_r888920686

## R/pkg/R/serialize.R: ##
```
@@ -58,7 +58,12 @@
   # Checking types is needed here, since 'is.na' only handles atomic vectors,
   # lists and pairlists
   if (type %in% c("integer", "character", "logical", "double", "numeric")) {
-    if (is.na(object)) {
+    if (is.na(object[[1]])) {
+      # Uses the first element for now to keep the behavior same as R before
+      # 4.2.0. This is wrong because we should differentiate c(NA) from a
+      # single NA as the former means array(null) and the latter means null
+      # in Spark SQL. However, it requires non-trivial comparison to distinguish
```

Review Comment: e.g.) we should check whether the input is a vector, list, array, etc., which is exactly what `getSerdeType` does. However, this comparison here (to the best of my knowledge) is a shortcut to avoid the overhead of `getSerdeType`, so I just decided to leave it as is for now.
[GitHub] [spark] MaxGekk commented on pull request #36753: [SPARK-39259][SQL][3.2] Evaluate timestamps consistently in subqueries
MaxGekk commented on PR #36753: URL: https://github.com/apache/spark/pull/36753#issuecomment-1145942007 @olaky Could you re-trigger the builds, please.
[GitHub] [spark] HyukjinKwon commented on pull request #36758: [SPARK-39372][R] Support R 4.2.0
HyukjinKwon commented on PR #36758: URL: https://github.com/apache/spark/pull/36758#issuecomment-1145940475 cc @felixcheung @shivaram @viirya too FYI
[GitHub] [spark] EnricoMi commented on a diff in pull request #36150: [SPARK-38864][SQL] Add melt / unpivot to Dataset
EnricoMi commented on code in PR #36150: URL: https://github.com/apache/spark/pull/36150#discussion_r889066692 ## sql/core/src/main/scala/org/apache/spark/sql/Melt.scala: ## @@ -0,0 +1,125 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.sql.catalyst.analysis.{AnsiTypeCoercion, TypeCoercion} +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Cast, Expression, Literal} +import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull +import org.apache.spark.sql.catalyst.plans.logical.Expand +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.{DataType, StringType} + +private[sql] object Melt { + def of[_](ds: Dataset[_], +ids: Seq[String], +values: Seq[String] = Seq.empty, +dropNulls: Boolean = false, +variableColumnName: String = "variable", +valueColumnName: String = "value"): DataFrame = { +// values should be disjoint to ids +if (values.intersect(ids).nonEmpty) { + throw new IllegalArgumentException(s"A column cannot be both an id and a value column: " + Review Comment: We are down to 2 melt-specific errors, and `MISSING_COLUMNS` is gone, so never mind. 
[GitHub] [spark] akpatnam25 commented on a diff in pull request #36734: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is se
akpatnam25 commented on code in PR #36734: URL: https://github.com/apache/spark/pull/36734#discussion_r889067758 ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -1885,6 +1885,16 @@ private[spark] class DAGScheduler( mapOutputTracker. unregisterMergeResult(shuffleId, reduceId, bmAddress, Option(mapIndex)) } + } else { +// Unregister the merge result if there is a FetchFailed event +// and it is not a MetaDataFetchException, which is signified by bmAddress being null +if (bmAddress != null + && bmAddress.executorId.equals(BlockManagerId.SHUFFLE_MERGER_IDENTIFIER)) { Review Comment: done ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -1885,6 +1885,16 @@ private[spark] class DAGScheduler( mapOutputTracker. unregisterMergeResult(shuffleId, reduceId, bmAddress, Option(mapIndex)) } + } else { +// Unregister the merge result if there is a FetchFailed event +// and it is not a MetaDataFetchException, which is signified by bmAddress being null +if (bmAddress != null + && bmAddress.executorId.equals(BlockManagerId.SHUFFLE_MERGER_IDENTIFIER)) { + assert(pushBasedShuffleEnabled, "Push-based shuffle needs to " + +"be enabled so that merge results are present.") Review Comment: done
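The guard in the diff above can be sketched as follows. This is an illustrative Python sketch, not the DAGScheduler code; the identifier value is an assumption (in Spark it is a constant on `BlockManagerId`):

```python
# Assumed value of BlockManagerId.SHUFFLE_MERGER_IDENTIFIER.
SHUFFLE_MERGER_IDENTIFIER = "shuffle-push-merger"

def should_unregister_merge_result(executor_id, push_based_shuffle_enabled):
    # A null block-manager address signals a metadata fetch failure,
    # so there is no merge result to unregister.
    if executor_id is None:
        return False
    # Only fetch failures attributed to the shuffle merger involve merge results.
    if executor_id != SHUFFLE_MERGER_IDENTIFIER:
        return False
    # Merge results can only exist when push-based shuffle is enabled.
    assert push_based_shuffle_enabled, "push-based shuffle must be enabled"
    return True
```

For example, a FetchFailed from a regular executor leaves merge results alone, while one attributed to the merger triggers `unregisterMergeResult`.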
[GitHub] [spark] EnricoMi commented on a diff in pull request #36150: [SPARK-38864][SQL] Add melt / unpivot to Dataset
EnricoMi commented on code in PR #36150: URL: https://github.com/apache/spark/pull/36150#discussion_r889068185 ## sql/core/src/main/scala/org/apache/spark/sql/Melt.scala: ## @@ -0,0 +1,125 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.sql.catalyst.analysis.{AnsiTypeCoercion, TypeCoercion} +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Cast, Expression, Literal} +import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull +import org.apache.spark.sql.catalyst.plans.logical.Expand +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.{DataType, StringType} + +private[sql] object Melt { Review Comment: Alright, `Melt` is now a `LogicalPlan`, which already fixed / implemented SPARK-39292. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [spark] EnricoMi commented on a diff in pull request #36150: [SPARK-38864][SQL] Add melt / unpivot to Dataset
EnricoMi commented on code in PR #36150: URL: https://github.com/apache/spark/pull/36150#discussion_r889072078 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2012,6 +2012,152 @@ class Dataset[T] private[sql]( @scala.annotation.varargs def agg(expr: Column, exprs: Column*): DataFrame = groupBy().agg(expr, exprs : _*)
+ /**
+ * (Scala-specific)
+ * Unpivot a DataFrame from wide format to long format, optionally
+ * leaving identifier variables set.
+ *
+ * This function is useful to massage a DataFrame into a format where some
+ * columns are identifier variables (`ids`), while all other columns,
+ * considered measured variables (`values`), are "unpivoted" to the rows,
+ * leaving just two non-identifier columns, 'variable' and 'value'.
+ *
+ * {{{
+ *   val df = Seq((1, 11, 12L), (2, 21, 22L)).toDF("id", "int", "long")
+ *   df.show()
+ *   // output:
+ *   // +---+---+----+
+ *   // | id|int|long|
+ *   // +---+---+----+
+ *   // |  1| 11|  12|
+ *   // |  2| 21|  22|
+ *   // +---+---+----+
+ *
+ *   df.melt(Seq("id")).show()
+ *   // output:
+ *   // +---+--------+-----+
+ *   // | id|variable|value|
Review Comment: done
[GitHub] [spark] EnricoMi commented on a diff in pull request #36150: [SPARK-38864][SQL] Add melt / unpivot to Dataset
EnricoMi commented on code in PR #36150: URL: https://github.com/apache/spark/pull/36150#discussion_r889074533 ## sql/core/src/main/scala/org/apache/spark/sql/Melt.scala: ## @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.sql.catalyst.analysis.{AnsiTypeCoercion, TypeCoercion, UnresolvedAttribute} +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Cast, Expression, Literal, NamedExpression} +import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull +import org.apache.spark.sql.catalyst.plans.logical.Expand +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.{DataType, StringType} + +private[sql] object Melt { + def of[_]( + ds: Dataset[_], + ids: Seq[String], + values: Seq[String], + dropNulls: Boolean, + variableColumnName: String, + valueColumnName: String): DataFrame = { +// values should be disjoint to ids +if (values.intersect(ids).nonEmpty) { + throw new AnalysisException("MELT_ID_AND_VALUE_COLUMNS_NOT_DISJOINT", Array( +ids.mkString(", "), values.mkString(", "), values.intersect(ids).mkString(", ") + )) +} + +// id columns must not 
contain variable or value column name +if (ids.contains(variableColumnName)) { + throw new AnalysisException("MELT_VARIABLE_COLUMN_IS_ID_COLUMN", Array( +variableColumnName, ids.mkString(", ") + )) +} +if (ids.contains(valueColumnName)) { + throw new AnalysisException("MELT_VALUE_COLUMN_IS_ID_COLUMN", Array( +valueColumnName, ids.mkString(", ") + )) +} + +// if no values given, all non-id columns are melted +val valueNameCandidates = ds.columns.diff(ids).toSeq +val valueNames = Some(values).filter(_.nonEmpty).getOrElse(valueNameCandidates) + +// if there are no values given and no non-id columns exist, we cannot melt +if (valueNames.isEmpty) { + throw new AnalysisException("MELT_REQUIRES_VALUE_COLUMNS", Array(ids.mkString(", "))) +} + +// resolve given id and value column names +val idExprs = ids.map(resolve(ds, _)) +val valueExprs = valueNames.map(resolve(ds, _)) + +// all given ids and values should exist in ds +val unresolvedIds = idExprs.filter(_.isLeft).map(_.swap.toOption.get) +if (unresolvedIds.nonEmpty) { + val idNameCandidates = ds.columns.diff(valueNames).toSeq + throw new AnalysisException("MISSING_COLUMNS", +Array(unresolvedIds.mkString(", "), idNameCandidates.mkString(", "))) +} +val unresolvedValues = valueExprs.filter(_.isLeft).map(_.swap.toOption.get) +if (unresolvedValues.nonEmpty) { + throw new AnalysisException("MISSING_COLUMNS", +Array(unresolvedValues.mkString(", "), valueNameCandidates.mkString(", "))) +} + +// all melted values have to have the same type +val resolvedValues = valueExprs.filter(_.isRight).map(_.toOption.get) +val valueType = resolvedValues.map(_.dataType).reduce(tightestCommonType) Review Comment: The best I could do is use `findWiderType` / `findWiderTypeWithoutStringPromotion`, applied to `Melt` value columns in a new type coercion rule `MeltCoercion`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [spark] EnricoMi commented on a diff in pull request #36150: [SPARK-38864][SQL] Add melt / unpivot to Dataset
EnricoMi commented on code in PR #36150: URL: https://github.com/apache/spark/pull/36150#discussion_r889077442 ## sql/core/src/main/scala/org/apache/spark/sql/Melt.scala: ## @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.sql.catalyst.analysis.{AnsiTypeCoercion, TypeCoercion, UnresolvedAttribute} +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Cast, Expression, Literal, NamedExpression} +import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull +import org.apache.spark.sql.catalyst.plans.logical.Expand +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.{DataType, StringType} + +private[sql] object Melt { + def of[_]( + ds: Dataset[_], + ids: Seq[String], + values: Seq[String], + dropNulls: Boolean, + variableColumnName: String, + valueColumnName: String): DataFrame = { Review Comment: `Melt` is now a `LogicalPlan` with some rules in the analyser to melt all value columns except for the id columns if no value columns are given. 
The `Melt` logical plan is replaced by the equivalent `Expand` operation during analysis, once the id columns, value columns and output value type are resolved. This is much cleaner now, and allows expressions, structs and `*` for id and value columns. Thanks for the pointers!
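Behaviourally, the melt-to-`Expand` rewrite amounts to the following: each input row produces one output row per value column, with the id columns repeated. A plain-Python sketch of the semantics (not the Catalyst implementation):

```python
def melt(rows, ids, variable_col="variable", value_col="value"):
    # Every input row expands into one output row per non-id column,
    # with the id columns repeated and the (name, value) pair unpivoted.
    out = []
    for row in rows:
        for col in (c for c in row if c not in ids):
            melted = {k: row[k] for k in ids}
            melted[variable_col] = col
            melted[value_col] = row[col]
            out.append(melted)
    return out

# Mirrors the df.melt(Seq("id")) example from the scaladoc discussion.
rows = [{"id": 1, "int": 11, "long": 12}, {"id": 2, "int": 21, "long": 22}]
print(melt(rows, ["id"]))
```

Two input rows with two value columns each yield four output rows, which is exactly the row multiplication that `Expand` performs.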
[GitHub] [spark] EnricoMi commented on pull request #36150: [SPARK-38864][SQL] Add melt / unpivot to Dataset
EnricoMi commented on PR #36150: URL: https://github.com/apache/spark/pull/36150#issuecomment-1146105405 @HyukjinKwon @cloud-fan which value do you prefer for the variable column? - `NamedExpression.name` - `NamedExpression.prettyName`
[GitHub] [spark] dtenedor commented on pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values
dtenedor commented on PR #36672: URL: https://github.com/apache/spark/pull/36672#issuecomment-1146161162 @gengliangwang @HyukjinKwon @cloud-fan can someone please merge this in (or leave more review comment(s) if desired for another pass)?
[GitHub] [spark] singhpk234 commented on a diff in pull request #36760: [SPARK-39374][SQL] Improve error message for user specified column list
singhpk234 commented on code in PR #36760: URL: https://github.com/apache/spark/pull/36760#discussion_r889140462 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -3424,9 +3424,10 @@ class Analyzer(override val catalogManager: CatalogManager) i.userSpecifiedCols, "in the column list", resolver) i.userSpecifiedCols.map { col => - i.table.resolve(Seq(col), resolver) -.getOrElse(throw QueryCompilationErrors.cannotResolveUserSpecifiedColumnsError( - col, i.table)) +i.table.resolve(Seq(col), resolver) + .getOrElse(i.failAnalysis( +errorClass = "MISSING_COLUMN", +messageParameters = Array(col, i.table.output.map(_.name).mkString(", " Review Comment: [question] A table can have more than 100 columns as well; showing all the columns in the error message would make it unusually large. Are we OK with that?
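One way to address the size concern raised above is to cap how many column names the message shows. A hypothetical helper (not what the PR does, purely illustrative):

```python
def format_columns_for_error(columns, max_shown=10):
    # Show at most max_shown column names, then summarize the rest,
    # so an error on a wide table stays readable.
    if len(columns) <= max_shown:
        return ", ".join(columns)
    return ", ".join(columns[:max_shown]) + f", ... ({len(columns) - max_shown} more)"

print(format_columns_for_error(["a", "b"]))  # a, b
```

A 100-column table would then contribute ten names plus a "90 more" suffix instead of the full list.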
[GitHub] [spark] dongjoon-hyun closed pull request #36758: [SPARK-39372][R] Support R 4.2.0
dongjoon-hyun closed pull request #36758: [SPARK-39372][R] Support R 4.2.0 URL: https://github.com/apache/spark/pull/36758
[GitHub] [spark] dongjoon-hyun opened a new pull request, #36762: [SPARK-39259][SQL][TEST] Fix Scala 2.13 ClassCastException
dongjoon-hyun opened a new pull request, #36762: URL: https://github.com/apache/spark/pull/36762 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] dongjoon-hyun commented on pull request #36753: [SPARK-39259][SQL][3.2] Evaluate timestamps consistently in subqueries
dongjoon-hyun commented on PR #36753: URL: https://github.com/apache/spark/pull/36753#issuecomment-1146210977 Here is the PR to fix the Scala 2.13 issue: - https://github.com/apache/spark/pull/36762
[GitHub] [spark] dongjoon-hyun commented on pull request #36752: [SPARK-39259][SQL][3.3] Evaluate timestamps consistently in subqueries
dongjoon-hyun commented on PR #36752: URL: https://github.com/apache/spark/pull/36752#issuecomment-1146211548 Hi, @olaky and @MaxGekk . Unfortunately, this broke Scala 2.13 in master/branch-3.3 and RC4. I made a PR to fix it. - https://github.com/apache/spark/pull/36762
[GitHub] [spark] dongjoon-hyun commented on pull request #36762: [SPARK-39259][SQL][TEST][FOLLOWUP] Fix Scala 2.13 ClassCastException
dongjoon-hyun commented on PR #36762: URL: https://github.com/apache/spark/pull/36762#issuecomment-1146213732 cc @olaky , @MaxGekk , @HyukjinKwon
[GitHub] [spark] dongjoon-hyun commented on pull request #36762: [SPARK-39259][SQL][TEST][FOLLOWUP] Fix Scala 2.13 `ClassCastException` in `ComputeCurrentTimeSuite`
dongjoon-hyun commented on PR #36762: URL: https://github.com/apache/spark/pull/36762#issuecomment-1146215037 cc @viirya
[GitHub] [spark] dongjoon-hyun commented on pull request #36762: [SPARK-39259][SQL][TEST][FOLLOWUP] Fix Scala 2.13 `ClassCastException` in `ComputeCurrentTimeSuite`
dongjoon-hyun commented on PR #36762: URL: https://github.com/apache/spark/pull/36762#issuecomment-1146234918 cc @huaxingao , too
[GitHub] [spark] karenfeng opened a new pull request, #36763: [SPARK-39376][SQL] Hide duplicated columns in star expansion of subquery alias from NATURAL/USING JOIN
karenfeng opened a new pull request, #36763: URL: https://github.com/apache/spark/pull/36763 ### What changes were proposed in this pull request? Follows up on https://github.com/apache/spark/pull/31666. That PR introduced a bug where the qualified star expansion of a subquery alias containing a NATURAL/USING join output duplicated columns. ### Why are the changes needed? Duplicated, hidden columns should not be output from a star expansion. ### Does this PR introduce _any_ user-facing change? The query ``` val df1 = Seq((3, 8)).toDF("a", "b") val df2 = Seq((8, 7)).toDF("b", "d") val joinDF = df1.join(df2, "b") joinDF.alias("r").select("r.*") ``` now outputs a single column `b`, instead of two (duplicate) columns for `b`. ### How was this patch tested? UTs
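The intended star-expansion behaviour after a USING join can be sketched as follows. This is an illustrative Python sketch of the column bookkeeping (names are hypothetical), not the analyzer code:

```python
def star_expand(left_cols, right_cols, using):
    # After a USING join, each join key appears once, up front; the hidden
    # duplicate copy of the key from the other side is not re-emitted.
    rest_left = [c for c in left_cols if c not in using]
    rest_right = [c for c in right_cols if c not in using]
    return using + rest_left + rest_right

# Mirrors df1("a", "b").join(df2("b", "d"), "b") from the example above.
print(star_expand(["a", "b"], ["b", "d"], ["b"]))  # ['b', 'a', 'd']
```

The bug being fixed is the case where the hidden second copy of `b` leaked back into `select("r.*")` through the subquery alias.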
[GitHub] [spark] dongjoon-hyun commented on pull request #36762: [SPARK-39259][SQL][TEST][FOLLOWUP] Fix Scala 2.13 `ClassCastException` in `ComputeCurrentTimeSuite`
dongjoon-hyun commented on PR #36762: URL: https://github.com/apache/spark/pull/36762#issuecomment-1146248944 Thank you for your review and approval, @huaxingao !
[GitHub] [spark] gengliangwang closed pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values
gengliangwang closed pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values URL: https://github.com/apache/spark/pull/36672
[GitHub] [spark] gengliangwang commented on pull request #36675: [SPARK-39294][SQL] Support vectorized Orc scans with DEFAULT values
gengliangwang commented on PR #36675: URL: https://github.com/apache/spark/pull/36675#issuecomment-1146266746 @dtenedor The PR https://github.com/apache/spark/pull/36672/ is merged. Can you remove the Parquet-related code in this one? I will review this one after that today.
[GitHub] [spark] amaliujia commented on a diff in pull request #36586: [SPARK-39236][SQL] Make CreateTable and ListTables be compatible with 3 layer namespace
amaliujia commented on code in PR #36586: URL: https://github.com/apache/spark/pull/36586#discussion_r889237990 ## sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala: ## @@ -553,4 +570,100 @@ class CatalogSuite extends SharedSparkSession with AnalysisTest { }.getMessage assert(errMsg.contains("my_temp_table is a temp view. 'recoverPartitions()' expects a table")) } + + test("three layer namespace compatibility - create managed table") { +spark.conf.set("spark.sql.catalog.testcat", classOf[InMemoryCatalog].getName) +val catalogName = "testcat" +val dbName = "my_db" +val tableName = "my_table" +val tableSchema = new StructType().add("i", "int") +val description = "this is a test table" + +val df = spark.catalog.createTable( + tableName = Array(catalogName, dbName, tableName).mkString("."), + source = classOf[FakeV2Provider].getName, + schema = tableSchema, + description = description, + options = Map.empty[String, String]) +assert(df.schema.equals(tableSchema)) + +val testCatalog = + spark.sessionState.catalogManager.catalog("testcat").asTableCatalog +val table = testCatalog.loadTable(Identifier.of(Array(dbName), tableName)) +assert(table.schema().equals(tableSchema)) + assert(table.properties().get("provider").equals(classOf[FakeV2Provider].getName)) +assert(table.properties().get("comment").equals(description)) + } + + test("three layer namespace compatibility - create external table") { +withTempDir { dir => + val catalogName = "testcat" + val dbName = "my_db" + val tableName = "my_table" + val tableSchema = new StructType().add("i", "int") + val description = "this is a test table" + + val df = spark.catalog.createTable( +tableName = Array(catalogName, dbName, tableName).mkString("."), +source = classOf[FakeV2Provider].getName, +schema = tableSchema, +description = description, +options = Map("path" -> dir.getAbsolutePath)) + assert(df.schema.equals(tableSchema)) + + val testCatalog = 
+spark.sessionState.catalogManager.catalog("testcat").asTableCatalog + val table = testCatalog.loadTable(Identifier.of(Array(dbName), tableName)) + assert(table.schema().equals(tableSchema)) + assert(table.properties().get("provider").equals(classOf[FakeV2Provider].getName)) + assert(table.properties().get("comment").equals(description)) + assert(table.properties().get("path").equals(dir.getAbsolutePath)) +} + } + + test("three layer namespace compatibility - list tables") { Review Comment: It turns out our code does not support `a.b` as table/database name: ``` `also_table_name.my_table1` is not a valid name for tables/databases. Valid names only contain alphabet characters, numbers and _. org.apache.spark.sql.AnalysisException: `also_table_name.my_table1` is not a valid name for tables/databases. Valid names only contain alphabet characters, numbers and _. at org.apache.spark.sql.errors.QueryCompilationErrors$.invalidNameForTableOrDatabaseError(QueryCompilationErrors.scala:572) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateName(SessionCatalog.scala:146) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:350) at org.apache.spark.sql.internal.CatalogSuite.createTable(CatalogSuite.scala:63) at org.apache.spark.sql.internal.CatalogSuite.$anonfun$new$101(CatalogSuite.scala:677) ``` In this case, we won't hit the ambiguity that a database name `a.b` could be either a database name or `a` is catalog name for `ListTables` API? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
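The validation failure quoted above comes from a simple character allowlist: a table/database name may only contain letters, digits, and underscores, so a dotted name like `a.b` is rejected before any catalog-vs-database ambiguity can arise. A minimal sketch of that kind of check (the regex and function name below are illustrative, not Spark's actual `validateName` implementation):

```python
import re

# Illustrative name check matching the error message above: valid names only
# contain alphabet characters, numbers and _. (VALID_NAME / is_valid_name are
# hypothetical names, not Spark internals.)
VALID_NAME = re.compile(r"^[A-Za-z0-9_]+$")

def is_valid_name(name: str) -> bool:
    # The whole name must match; a '.' anywhere causes rejection.
    return VALID_NAME.match(name) is not None
```

Because a dotted name can never pass this check, `ListTables` never has to disambiguate whether `a.b` means database `a.b` or catalog `a`, database `b`.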
[GitHub] [spark] dtenedor commented on pull request #36675: [SPARK-39294][SQL] Support vectorized Orc scans with DEFAULT values
dtenedor commented on PR #36675: URL: https://github.com/apache/spark/pull/36675#issuecomment-1146271997 @gengliangwang sure, I just synced my Apache Spark repository fork to include the PR https://github.com/apache/spark/pull/36672 and then merged the changes into this branch. This PR now just shows changes to two files. Thanks for your help.
[GitHub] [spark] dongjoon-hyun commented on pull request #36762: [SPARK-39259][SQL][TEST][FOLLOWUP] Fix Scala 2.13 `ClassCastException` in `ComputeCurrentTimeSuite`
dongjoon-hyun commented on PR #36762: URL: https://github.com/apache/spark/pull/36762#issuecomment-1146288710 Thank you, @MaxGekk !
[GitHub] [spark] viirya commented on pull request #36762: [SPARK-39259][SQL][TEST][FOLLOWUP] Fix Scala 2.13 `ClassCastException` in `ComputeCurrentTimeSuite`
viirya commented on PR #36762: URL: https://github.com/apache/spark/pull/36762#issuecomment-1146303346 Nice catch!
[GitHub] [spark] dongjoon-hyun commented on pull request #36762: [SPARK-39259][SQL][TEST][FOLLOWUP] Fix Scala 2.13 `ClassCastException` in `ComputeCurrentTimeSuite`
dongjoon-hyun commented on PR #36762: URL: https://github.com/apache/spark/pull/36762#issuecomment-1146309982 Thank you, @viirya and all. All tests passed in CI and I verified manually with Scala 2.13. Merged to master/3.3.
[GitHub] [spark] dongjoon-hyun closed pull request #36762: [SPARK-39259][SQL][TEST][FOLLOWUP] Fix Scala 2.13 `ClassCastException` in `ComputeCurrentTimeSuite`
dongjoon-hyun closed pull request #36762: [SPARK-39259][SQL][TEST][FOLLOWUP] Fix Scala 2.13 `ClassCastException` in `ComputeCurrentTimeSuite` URL: https://github.com/apache/spark/pull/36762
[GitHub] [spark] gengliangwang commented on pull request #36675: [SPARK-39294][SQL] Support vectorized Orc scans with DEFAULT values
gengliangwang commented on PR #36675: URL: https://github.com/apache/spark/pull/36675#issuecomment-1146349945 Merging to master
[GitHub] [spark] gengliangwang closed pull request #36675: [SPARK-39294][SQL] Support vectorized Orc scans with DEFAULT values
gengliangwang closed pull request #36675: [SPARK-39294][SQL] Support vectorized Orc scans with DEFAULT values URL: https://github.com/apache/spark/pull/36675
[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types
dtenedor commented on code in PR #36745: URL: https://github.com/apache/spark/pull/36745#discussion_r889272134 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala: ## @@ -122,7 +122,7 @@ abstract class SessionCatalogSuite extends AnalysisTest with Eventually { } test("create table with default columns") { -withBasicCatalog { catalog => +if (!isHiveExternalCatalog) withBasicCatalog { catalog => Review Comment: Good question. This test is part of the base class `SessionCatalogSuite`. It fails when running with the `HiveExternalSessionCatalogSuite` subclass, since Hive tables are not included in the allowlist of providers that allow DEFAULT columns. To simplify this, I edited the config to add Hive tables to the list of allowed providers and this test passes now in all cases. ## sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala: ## @@ -41,21 +41,28 @@ import org.apache.spark.sql.types.{MetadataBuilder, StructField, StructType} * * We can remove this rule once we implement all the catalog functionality in `V2SessionCatalog`. */ -class ResolveSessionCatalog(val catalogManager: CatalogManager) +class ResolveSessionCatalog(val analyzer: Analyzer) extends Rule[LogicalPlan] with LookupCatalog { + val catalogManager = analyzer.catalogManager Review Comment: SG, done. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala: ## @@ -83,16 +83,26 @@ object ResolveDefaultColumns { * * @param analyzer used for analyzing the result of parsing the expression stored as text. * @param tableSchema represents the names and types of the columns of the statement to process. + * @param tableProvider provider of the target table to store default values for, if any. * @param statementType name of the statement being processed, such as INSERT; useful for errors. * @return a copy of `tableSchema` with field metadata updated with the constant-folded values. 
*/ def constantFoldCurrentDefaultsToExistDefaults( analyzer: Analyzer, tableSchema: StructType, + tableProvider: Option[String], statementType: String): StructType = { if (SQLConf.get.enableDefaultColumns) { val newFields: Seq[StructField] = tableSchema.fields.map { field => if (field.metadata.contains(CURRENT_DEFAULT_COLUMN_METADATA_KEY)) { + // Make sure that the target table has a provider that supports default column values. + val allowedProviders: Array[String] = +SQLConf.get.getConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS) Review Comment: Good idea, done!
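The allowlist check being discussed (reading `SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS` and rejecting DEFAULT columns for providers outside it) amounts to a plain membership test. A hedged sketch — the function name, example provider list, and case-insensitive comparison below are assumptions for illustration, not Spark's actual implementation:

```python
# Sketch of a provider allowlist for DEFAULT column values. In the PR the list
# comes from the DEFAULT_COLUMN_ALLOWED_PROVIDERS conf; here it is a plain list.
# Case-insensitive matching is an assumption made for this sketch.
def provider_supports_defaults(table_provider, allowed_providers):
    if table_provider is None:
        return False
    return table_provider.lower() in {p.lower() for p in allowed_providers}
```

With Hive added to the configured list (as the comment above describes), the check passes for Hive tables and the `SessionCatalogSuite` test no longer needs a special case.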
[GitHub] [spark] otterc commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
otterc commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r888604583 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -209,9 +243,13 @@ private AppShufflePartitionInfo getOrCreateAppShufflePartitionInfo( appShuffleInfo.getMergedShuffleIndexFilePath(shuffleId, shuffleMergeId, reduceId)); File metaFile = appShuffleInfo.getMergedShuffleMetaFile(shuffleId, shuffleMergeId, reduceId); + // Make sure unuseful non-finalized merged data/index/meta files get cleaned up + // during service restart Review Comment: Is this comment misplaced? ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -316,23 +356,25 @@ public String[] getMergedBlockDirs(String appId) { public void applicationRemoved(String appId, boolean cleanupLocalDirs) { logger.info("Application {} removed, cleanupLocalDirs = {}", appId, cleanupLocalDirs); AppShuffleInfo appShuffleInfo = appsShuffleInfo.remove(appId); +cleanupAppAttemptPathInfoInDB( +new AppAttemptId(appShuffleInfo.appId, appShuffleInfo.attemptId)); if (null != appShuffleInfo) { mergedShuffleCleaner.execute( -() -> closeAndDeletePartitionFilesIfNeeded(appShuffleInfo, cleanupLocalDirs)); +() -> closeAndDeletePartitionFilesOrDBIfNeeded(appShuffleInfo, cleanupLocalDirs, true)); } } - /** * Clean up the AppShufflePartitionInfo for a specific AppShuffleInfo. * If cleanupLocalDirs is true, the merged shuffle files will also be deleted. * The cleanup will be executed in a separate thread. */ @SuppressWarnings("SynchronizationOnLocalVariableOrMethodParameter") @VisibleForTesting - void closeAndDeletePartitionFilesIfNeeded( + void closeAndDeletePartitionFilesOrDBIfNeeded( AppShuffleInfo appShuffleInfo, - boolean cleanupLocalDirs) { + boolean cleanupLocalDirs, + boolean cleanupDB) { Review Comment: `cleanupDB` -> `removeFromDb`? 
## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -286,7 +325,8 @@ public ManagedBuffer getMergedBlockData( shuffleId, shuffleMergeId, reduceId, ErrorHandler.BlockFetchErrorHandler.STALE_SHUFFLE_BLOCK_FETCH)); } -File dataFile = appShuffleInfo.getMergedShuffleDataFile(shuffleId, shuffleMergeId, reduceId); +File dataFile = + appShuffleInfo.getMergedShuffleDataFile(shuffleId, shuffleMergeId, reduceId); Review Comment: revert? ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -655,6 +747,194 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) { } } + + @Override + public void close() { +if (db != null) { + try { +db.close(); + } catch (IOException e) { +logger.error("Exception closing leveldb with registered app paths info and " ++ "shuffle partition info", e); + } +} + } + + private void writeAppPathsInfoToDb(String appId, int attemptId, AppPathsInfo appPathsInfo) { +if (db != null) { + try { +byte[] key = getDbAppAttemptPathsKey(new AppAttemptId(appId, attemptId)); +String valueStr = mapper.writeValueAsString(appPathsInfo); +byte[] value = valueStr.getBytes(StandardCharsets.UTF_8); +db.put(key, value); + } catch (Exception e) { +logger.error("Error saving registered app paths info", e); + } +} + } + + private void writeAppAttemptShuffleMergeInfo( + String appId, + int appAttemptId, + int shuffleId, + int shuffleMergeId) { +if (db != null) { + // Write AppAttemptShuffleMergeId into LevelDB for finalized shuffles + try{ +byte[] dbKey = getDbAppAttemptShufflePartitionKey( +new AppAttemptShuffleMergeId(appId, appAttemptId, shuffleId, shuffleMergeId)); +db.put(dbKey, new byte[0]); + } catch (Exception e) { +logger.error("Error saving active app shuffle partition", e); + } +} + + } + + private T parseDbKey(String key, String prefix, Class valueType) { +try { + String json = key.substring(prefix.length() + 1); + return 
mapper.readValue(json, valueType); +} catch (Exception exception) { + logger.error("Exception while parsing the DB key {}", key); + return null; +} + } + + private AppPathsInfo parseDBAppAttemptPathsValue(byte[] value, AppAttemptId appAttemptId) { +try { + return mapper.readValue(value, AppPathsInfo.class); +} catch (Exception exception) { + logger.error("Exception while parsing the DB value for {}", appAttemptId); + return null; +} + } + + private AppAttemptId parseDbAppAttemptPathsKey(String key) { +return p
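The `parseDbKey` helper quoted above strips a fixed prefix plus one separator character before handing the remainder to the JSON mapper, as the Java expression `key.substring(prefix.length() + 1)` shows. The same substring arithmetic in isolation — the key layout `<prefix><separator><json>` is inferred from that line, the example prefix string is hypothetical, and the stdlib `json` module stands in for Jackson:

```python
import json

# The DB key is assumed to be "<prefix>" + one separator character + JSON payload,
# so the payload starts at index len(prefix) + 1, mirroring the Java substring call.
def db_key_payload(key: str, prefix: str) -> dict:
    return json.loads(key[len(prefix) + 1:])
```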
[GitHub] [spark] HyukjinKwon commented on pull request #36758: [SPARK-39372][R] Support R 4.2.0
HyukjinKwon commented on PR #36758: URL: https://github.com/apache/spark/pull/36758#issuecomment-1146426846 Thanks!!!
[GitHub] [spark] BryanCutler commented on a diff in pull request #36683: [SPARK-39301][SQL][PYTHON] Leverage LocalRelation and respect Arrow batch size in createDataFrame with Arrow optimization
BryanCutler commented on code in PR #36683: URL: https://github.com/apache/spark/pull/36683#discussion_r889416243 ## python/pyspark/sql/pandas/conversion.py: ## @@ -596,7 +596,7 @@ def _create_from_pandas_with_arrow( ] # Slice the DataFrame to be batched -step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +step = self._jconf.arrowMaxRecordsPerBatch() Review Comment: I thought this was to control how many partitions were in the rdd? Each partition could have multiple batches, and probably should be capped at `arrowMaxRecordsPerBatch`, but since it was coming from a local Pandas DataFrame already in memory, that didn't seem to be a big deal. ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2575,6 +2575,18 @@ object SQLConf { .booleanConf .createWithDefault(false) + val ARROW_LOCAL_RELATION_THRESHOLD = +buildConf("spark.sql.execution.arrow.localRelationThreshold") + .doc( +"When converting Arrow batches to Spark DataFrame, local collections are used in the " + + "driver side if the byte size of Arrow batches is smaller than this threshold. " + + "Otherwise, the Arrow batches are sent and deserialized to Spark internal rows " + + "in the executors.") + .version("3.4.0") + .bytesConf(ByteUnit.BYTE) + .checkValue(_ >= 0, "This value must be equal to or greater than 0.") + .createWithDefaultString("48MB") Review Comment: Is this the max size of each batch or all batches together?
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36683: [SPARK-39301][SQL][PYTHON] Leverage LocalRelation and respect Arrow batch size in createDataFrame with Arrow optimization
HyukjinKwon commented on code in PR #36683: URL: https://github.com/apache/spark/pull/36683#discussion_r889421112 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2575,6 +2575,18 @@ object SQLConf { .booleanConf .createWithDefault(false) + val ARROW_LOCAL_RELATION_THRESHOLD = +buildConf("spark.sql.execution.arrow.localRelationThreshold") + .doc( +"When converting Arrow batches to Spark DataFrame, local collections are used in the " + + "driver side if the byte size of Arrow batches is smaller than this threshold. " + + "Otherwise, the Arrow batches are sent and deserialized to Spark internal rows " + + "in the executors.") + .version("3.4.0") + .bytesConf(ByteUnit.BYTE) + .checkValue(_ >= 0, "This value must be equal to or greater than 0.") + .createWithDefaultString("48MB") Review Comment: It's all together ... so pretty small
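The removed line in the diff, `step = -(-len(pdf) // self.sparkContext.defaultParallelism)`, is a ceiling-division trick: it sized each slice so the pandas DataFrame split into roughly `defaultParallelism` partitions. The replacement caps each slice at `arrowMaxRecordsPerBatch` instead, so no Arrow batch exceeds that record count. Both step computations, sketched side by side (names illustrative):

```python
# Old behaviour: ceil(num_rows / parallelism), i.e. one slice per target partition.
# Python's -(-n // d) idiom is equivalent to math.ceil(n / d) for positive ints.
def step_by_parallelism(num_rows: int, parallelism: int) -> int:
    return -(-num_rows // parallelism)

# New behaviour: a fixed per-batch cap taken from arrowMaxRecordsPerBatch.
def step_by_batch_size(max_records_per_batch: int) -> int:
    return max_records_per_batch
```

Under the old scheme, batch size grew with the input (10 million rows on 8 cores gives 1.25 million-row batches); under the new scheme it is bounded regardless of input size, which is what "respect Arrow batch size" in the PR title refers to.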
[GitHub] [spark] github-actions[bot] commented on pull request #33934: [SPARK-36691][PYTHON] PythonRunner failed should pass error message to ApplicationMaster too
github-actions[bot] commented on PR #33934: URL: https://github.com/apache/spark/pull/33934#issuecomment-1146474433 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] wangyum closed pull request #36755: [SPARK-39368][SQL] Move `RewritePredicateSubquery` into `InjectRuntimeFilter`
wangyum closed pull request #36755: [SPARK-39368][SQL] Move `RewritePredicateSubquery` into `InjectRuntimeFilter` URL: https://github.com/apache/spark/pull/36755
[GitHub] [spark] wangyum commented on pull request #36755: [SPARK-39368][SQL] Move `RewritePredicateSubquery` into `InjectRuntimeFilter`
wangyum commented on PR #36755: URL: https://github.com/apache/spark/pull/36755#issuecomment-1146477751 Merged to master.
[GitHub] [spark] tiagovrtr commented on pull request #33675: [SPARK-27997][K8S] Add support for kubernetes OAuth Token refresh
tiagovrtr commented on PR #33675: URL: https://github.com/apache/spark/pull/33675#issuecomment-1146480382 Looping in @renato-farias as I'm leaving the BBC and he'll very much welcome this fix in a future release 🙂
[GitHub] [spark] JoshRosen commented on pull request #36654: [SPARK-39259][SQL] Evaluate timestamps consistently in subqueries
JoshRosen commented on PR #36654: URL: https://github.com/apache/spark/pull/36654#issuecomment-1146483609 This looks like it might be a fix for a correctness issue? If so, we should probably backport this change to maintenance branches for the other currently-supported Spark versions.
[GitHub] [spark] dongjoon-hyun commented on pull request #36675: [SPARK-39294][SQL] Support vectorized Orc scans with DEFAULT values
dongjoon-hyun commented on PR #36675: URL: https://github.com/apache/spark/pull/36675#issuecomment-1146485190 Thank you all!
[GitHub] [spark] zhouyejoe commented on pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
zhouyejoe commented on PR #35906: URL: https://github.com/apache/spark/pull/35906#issuecomment-1146485325 Updated PR. Addressed comments and fixed the broken build. Will also add more comments to the code change.
[GitHub] [spark] wangyum opened a new pull request, #36764: [SPARK-39377][SQL][TESTS] Normalize expr ids in ListQuery and Exists expressions
wangyum opened a new pull request, #36764: URL: https://github.com/apache/spark/pull/36764

### What changes were proposed in this pull request?

This PR normalizes expr ids in `ListQuery` and `Exists` expressions. For example:

```scala
val testRelation = LocalRelation($"a".int, $"b".int, $"c".int)
val x = testRelation.as("x")
val y = testRelation.as("y")
val originalQuery = x.where($"x.a".in(ListQuery(y.select($"b")))).analyze

println(Optimize.execute(normalizeExprIds(originalQuery)))
```

Before this PR:
```
Filter a#0 IN (list#0 [])
:  +- Project [b#15]
:     +- SubqueryAlias y
:        +- LocalRelation <empty>, [a#14, b#15, c#16]
+- LocalRelation <empty>, [a#0, b#0, c#0]
```

After this PR:
```
Filter a#0 IN (list#0 [])
:  +- Project [b#0]
:     +- SubqueryAlias y
:        +- LocalRelation <empty>, [a#0, b#0, c#0]
+- LocalRelation <empty>, [a#0, b#0, c#0]
```

### Why are the changes needed?

`PlanTestBase.comparePlans` fails in some cases because the expr ids are not equal.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.
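The normalization shown in the before/after plans can be mimicked on rendered plan strings with a one-liner; this regex rewrite is illustrative only — Spark's `normalizeExprIds` rewrites the actual `ExprId` objects inside the plan, not strings:

```scala
// Rewrite every expression id suffix like "#15" to "#0" so that two
// plans differing only in expr ids render identically. Illustrative
// sketch, not Spark's normalizeExprIds.
def normalizeExprIdsInString(plan: String): String =
  plan.replaceAll("#\\d+", "#0")
```

Applied to the "Before" plan above, `Project [b#15]` and `LocalRelation <empty>, [a#14, b#15, c#16]` become `Project [b#0]` and `LocalRelation <empty>, [a#0, b#0, c#0]`, matching the "After" output.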
[GitHub] [spark] wangyum commented on a diff in pull request #36760: [SPARK-39374][SQL] Improve error message for user specified column list
wangyum commented on code in PR #36760: URL: https://github.com/apache/spark/pull/36760#discussion_r889430728

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:

```diff
@@ -3424,9 +3424,10 @@ class Analyzer(override val catalogManager: CatalogManager)
       i.userSpecifiedCols, "in the column list", resolver)
     i.userSpecifiedCols.map { col =>
-      i.table.resolve(Seq(col), resolver)
-        .getOrElse(throw QueryCompilationErrors.cannotResolveUserSpecifiedColumnsError(
-          col, i.table))
+      i.table.resolve(Seq(col), resolver)
+        .getOrElse(i.failAnalysis(
+          errorClass = "MISSING_COLUMN",
+          messageParameters = Array(col, i.table.output.map(_.name).mkString(", "))))
```

Review Comment: The following query also shows all columns:

```sql
select not_exist_col from tbl;
```
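The new error path resolves each user-specified column and, when resolution fails, reports every available column — matching the behavior of `select not_exist_col from tbl`. A self-contained sketch of that resolve-or-fail pattern (hypothetical names, not the real Analyzer API):

```scala
// Resolve a column name against a table's output columns, or build a
// MISSING_COLUMN-style message listing all available columns.
// Illustrative stand-in for i.table.resolve / failAnalysis.
def resolveOrFail(col: String, tableColumns: Seq[String]): Either[String, String] =
  tableColumns.find(_.equalsIgnoreCase(col)) match {
    case Some(resolved) => Right(resolved)
    case None =>
      Left(s"MISSING_COLUMN: Column '$col' does not exist. " +
        s"Available columns: [${tableColumns.mkString(", ")}]")
  }
```

The key design point of the change is that both the `INSERT` column-list path and the plain projection path now surface the same error class with the same "all columns" context.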
[GitHub] [spark] JoshRosen commented on a diff in pull request #36654: [SPARK-39259][SQL] Evaluate timestamps consistently in subqueries
JoshRosen commented on code in PR #36654: URL: https://github.com/apache/spark/pull/36654#discussion_r889431243

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala:

```diff
@@ -479,21 +479,24 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
    * first to this node, then this node's subqueries and finally this node's children.
    * When the partial function does not apply to a given node, it is left unchanged.
    */
-  def transformDownWithSubqueries(f: PartialFunction[PlanType, PlanType]): PlanType = {
+  def transformDownWithSubqueries(
```

Review Comment: Instead of modifying this method's signature, I slightly prefer adding an overload named `transformDownWithSubqueriesAndPruning` in order to remain consistent with the naming conventions for other transform methods. Adding a new method also avoids introducing source or binary compatibility issues for third-party code that calls Catalyst APIs. Technically speaking, Catalyst APIs are considered internal to Spark and are subject to change between minor releases (see [source](https://github.com/apache/spark/blob/bb51add5c79558df863d37965603387d40cc4387/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala#L20-L24)), but I think it's nice to try to avoid API breakage when feasible.

I also ran into problems when trying to call `transformDownWithSubqueries` without supplying any arguments in the first argument group: calling `transformDownWithSubqueries() { f }` resulted in confusing compilation errors. As a result, I'm going to submit a followup to this PR to split this into two methods as I've suggested above.
[GitHub] [spark] JoshRosen opened a new pull request, #36765: [SPARK-39259][FOLLOWUP] Fix source and binary incompatibilities in transformDownWithSubqueries
JoshRosen opened a new pull request, #36765: URL: https://github.com/apache/spark/pull/36765

### What changes were proposed in this pull request?

This is a followup to #36654. That PR modified `QueryPlan.transformDownWithSubqueries` to add tree pattern pruning. In this PR, I roll back the change to that method's signature and instead add a new `transformDownWithSubqueriesAndPruning` method.

### Why are the changes needed?

The original change breaks binary and source compatibility in Catalyst. Technically speaking, Catalyst APIs are considered internal to Spark and are subject to change between minor releases (see [source](https://github.com/apache/spark/blob/bb51add5c79558df863d37965603387d40cc4387/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala#L20-L24)), but I think it's nice to try to avoid API breakage when possible.

While trying to compile some custom Catalyst code, I ran into issues when trying to call the `transformDownWithSubqueries` method without supplying pattern bits. If I do `transformDownWithSubqueries() { f }` then I get a compilation error. I think this is due to the first parameter group containing all default parameters. This PR's solution of adding a new `transformDownWithSubqueriesAndPruning` method solves this problem. It's also more consistent with the naming convention of other pruning-enabled tree transformation methods.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.
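The overload-instead-of-default-parameters pattern described above can be shown on a toy tree; `Node`, `transformDown`, and `transformDownWithPruning` below are simplified stand-ins for Catalyst's `QueryPlan` methods, not the real API:

```scala
// A tiny tree with the transform in two flavors: a plain overload and a
// pruning-enabled one, mirroring the transformDownWithSubqueries /
// transformDownWithSubqueriesAndPruning split. Simplified sketch only.
case class Node(value: Int, children: Seq[Node] = Nil) {

  // Plain overload: callable as `transformDown { case ... }` with no
  // empty argument list, which is what the default-parameter version broke.
  def transformDown(f: PartialFunction[Node, Node]): Node =
    transformDownWithPruning(_ => true)(f)

  // Pruning-enabled overload: skips entire subtrees that fail `cond`.
  def transformDownWithPruning(cond: Node => Boolean)(
      f: PartialFunction[Node, Node]): Node = {
    if (!cond(this)) this
    else {
      val applied = f.applyOrElse(this, identity[Node])
      applied.copy(children =
        applied.children.map(_.transformDownWithPruning(cond)(f)))
    }
  }
}
```

With separate overloads, existing call sites that pass only the partial function keep compiling unchanged, while new call sites can opt in to pruning explicitly.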
[GitHub] [spark] JoshRosen commented on a diff in pull request #36654: [SPARK-39259][SQL] Evaluate timestamps consistently in subqueries
JoshRosen commented on code in PR #36654: URL: https://github.com/apache/spark/pull/36654#discussion_r889432632

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala:

```diff
@@ -479,21 +479,24 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
    * first to this node, then this node's subqueries and finally this node's children.
    * When the partial function does not apply to a given node, it is left unchanged.
    */
-  def transformDownWithSubqueries(f: PartialFunction[PlanType, PlanType]): PlanType = {
+  def transformDownWithSubqueries(
```

Review Comment: Here's my PR: https://github.com/apache/spark/pull/36765
[GitHub] [spark] wangyum opened a new pull request, #36766: [SPARK-32184][SQL] Removes inferred predicate if it has InOrCorrelatedExistsSubquery
wangyum opened a new pull request, #36766: URL: https://github.com/apache/spark/pull/36766

### What changes were proposed in this pull request?

This PR removes an inferred predicate if it contains an `InOrCorrelatedExistsSubquery`.

### Why are the changes needed?

Inferred predicates do not always have a benefit. For example:

| TPCH q18 before this PR | TPCH q18 after this PR |
| -- | -- |
| ![image](https://user-images.githubusercontent.com/5399861/171972270-14091986-cee5-411c-9d19-f335c4eb2b75.jpg) | ![image](https://user-images.githubusercontent.com/5399861/171972406-a6b8b188-ae76-4036-abcb-47932f3bc83f.jpg) |

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.
[GitHub] [spark] MaxGekk commented on pull request #36765: [SPARK-39259][SQL][FOLLOWUP] Fix source and binary incompatibilities in transformDownWithSubqueries
MaxGekk commented on PR #36765: URL: https://github.com/apache/spark/pull/36765#issuecomment-1146546802 +1, LGTM. Merging to master/3.3. Thank you, @JoshRosen.
[GitHub] [spark] MaxGekk closed pull request #36765: [SPARK-39259][SQL][FOLLOWUP] Fix source and binary incompatibilities in transformDownWithSubqueries
MaxGekk closed pull request #36765: [SPARK-39259][SQL][FOLLOWUP] Fix source and binary incompatibilities in transformDownWithSubqueries URL: https://github.com/apache/spark/pull/36765