Re: [PR] [SPARK-48215][SQL] Extending support for collated strings on date_format expression [spark]
uros-db commented on code in PR #46561: URL: https://github.com/apache/spark/pull/46561#discussion_r1599471162

## sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala:
## @@ -1275,6 +1275,38 @@ class CollationSQLExpressionsSuite
     })
   }
+  test("DateFormat expression with collation") {
+    case class DateFormatTestCase[R](date: String, format: String, collation: String, result: R)
+    val testCases = Seq(
+      DateFormatTestCase("2021-01-01", "yyyy-MM-dd", "UTF8_BINARY", "2021-01-01"),
+      DateFormatTestCase("2021-01-01", "yyyy-dd", "UTF8_BINARY_LCASE", "2021-01"),
+      DateFormatTestCase("2021-01-01", "yyyy-MM-dd", "UNICODE", "2021-01-01"),
+      DateFormatTestCase("2021-01-01", "yyyy", "UNICODE_CI", "2021")
+    )
+
+    for {
+      collateDate <- Seq(true, false)
+      collateFormat <- Seq(true, false)
+    } {
+      testCases.foreach(t => {
+        val dateArg = if (collateDate) s"collate('${t.date}', '${t.collation}')" else s"'${t.date}'"
+        val formatArg =
+          if (collateFormat) {
+            s"collate('${t.format}', '${t.collation}')"
+          } else {
+            s"'${t.format}'"
+          }
+
+        withSQLConf(SqlApiConf.DEFAULT_COLLATION -> t.collation) {
+          val query = s"SELECT date_format(${dateArg}, ${formatArg})"

Review Comment: I suppose this was the reason this test was failing
Re: [PR] [SPARK-48215][SQL] Extending support for collated strings on date_format expression [spark]
uros-db commented on code in PR #46561: URL: https://github.com/apache/spark/pull/46561#discussion_r1599470164

## sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala:
## @@ -1275,6 +1275,38 @@ class CollationSQLExpressionsSuite
     })
   }
+  test("DateFormat expression with collation") {
+    case class DateFormatTestCase[R](date: String, format: String, collation: String, result: R)
+    val testCases = Seq(
+      DateFormatTestCase("2021-01-01", "yyyy-MM-dd", "UTF8_BINARY", "2021-01-01"),
+      DateFormatTestCase("2021-01-01", "yyyy-dd", "UTF8_BINARY_LCASE", "2021-01"),
+      DateFormatTestCase("2021-01-01", "yyyy-MM-dd", "UNICODE", "2021-01-01"),
+      DateFormatTestCase("2021-01-01", "yyyy", "UNICODE_CI", "2021")
+    )
+
+    for {
+      collateDate <- Seq(true, false)
+      collateFormat <- Seq(true, false)
+    } {
+      testCases.foreach(t => {
+        val dateArg = if (collateDate) s"collate('${t.date}', '${t.collation}')" else s"'${t.date}'"
+        val formatArg =
+          if (collateFormat) {
+            s"collate('${t.format}', '${t.collation}')"
+          } else {
+            s"'${t.format}'"
+          }
+
+        withSQLConf(SqlApiConf.DEFAULT_COLLATION -> t.collation) {
+          val query = s"SELECT date_format(${dateArg}, ${formatArg})"

Review Comment:
```suggestion
          val query = s"SELECT date_format('${dateArg}', '${formatArg}')"
```
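For readers following this thread, here is a standalone sketch of what the loop above interpolates for the first test case; the object name and the hard-coded values are illustrative only and are not part of the PR:

```scala
object DateFormatQuerySketch extends App {
  // Values taken from the first test case above; everything else is hypothetical.
  val date = "2021-01-01"; val format = "yyyy-MM-dd"; val collation = "UTF8_BINARY"

  for {
    collateDate <- Seq(true, false)
    collateFormat <- Seq(true, false)
  } {
    // When the flag is false the argument already carries its own single quotes,
    // so the query template must not add another pair around it.
    val dateArg = if (collateDate) s"collate('$date', '$collation')" else s"'$date'"
    val formatArg = if (collateFormat) s"collate('$format', '$collation')" else s"'$format'"
    println(s"SELECT date_format($dateArg, $formatArg)")
  }
}
```

Running it prints, among others, `SELECT date_format(collate('2021-01-01', 'UTF8_BINARY'), collate('yyyy-MM-dd', 'UTF8_BINARY'))` and `SELECT date_format('2021-01-01', 'yyyy-MM-dd')`.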
Re: [PR] [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation [spark]
mkaravel commented on code in PR #46511: URL: https://github.com/apache/spark/pull/46511#discussion_r1599455730

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
## @@ -34,6 +34,27 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  public static boolean lowercaseMatchFrom(final UTF8String l, final UTF8String r, int pos) {
+    if (pos < 0) return false;
+    for (int len = 0; len <= l.numChars() - pos; len++) {

Review Comment:
```suggestion
    for (int len = 0; len <= l.numChars() - pos; ++len) {
```

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
## @@ -34,6 +34,27 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  public static boolean lowercaseMatchFrom(final UTF8String l, final UTF8String r, int pos) {
+    if (pos < 0) return false;

Review Comment: Why do we return false in this case? Also why not make this an assert and make sure we never call this method with a negative `pos`?

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
## @@ -34,6 +34,27 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  public static boolean lowercaseMatchFrom(final UTF8String l, final UTF8String r, int pos) {
+    if (pos < 0) return false;
+    for (int len = 0; len <= l.numChars() - pos; len++) {
+      if (l.substring(pos, pos + len).toLowerCase().equals(r)) {
+        return true;
+      }
+    }
+    return false;
+  }
+
+  public static boolean lowercaseMatchUntil(final UTF8String l, final UTF8String r, int pos) {
+    if (pos > l.numChars()) return false;

Review Comment: Same comment regarding the bounds for `pos`.

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
## @@ -183,6 +204,19 @@ public static int findInSet(final UTF8String match, final UTF8String set, int co
     return 0;
   }
+  public static int lowercaseIndexOf(final UTF8String target, final UTF8String pattern,
+      final int start) {
+    if (pattern.numChars() == 0) return 0;

Review Comment: I would expect `start` to be returned here, not `0`.

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
## @@ -34,6 +34,27 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  public static boolean lowercaseMatchFrom(final UTF8String l, final UTF8String r, int pos) {
+    if (pos < 0) return false;
+    for (int len = 0; len <= l.numChars() - pos; len++) {
+      if (l.substring(pos, pos + len).toLowerCase().equals(r)) {

Review Comment:
```suggestion
      if (l.substring(pos, pos + len).toLowerCase().equals(r.toLowerCase())) {
```
Somehow we need to state that `r` is already in lowercase (comment and/or change the name of the variable).

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
## @@ -183,6 +204,19 @@ public static int findInSet(final UTF8String match, final UTF8String set, int co
     return 0;
   }
+  public static int lowercaseIndexOf(final UTF8String target, final UTF8String pattern,
+      final int start) {
+    if (pattern.numChars() == 0) return 0;
+    int lenHaystack = target.numChars(), lenNeedle = pattern.numChars();
+    final UTF8String needle = pattern.toLowerCase();
+    for (int i = start; i <= (lenHaystack - lenNeedle); i++) {

Review Comment:
```suggestion
    for (int i = start; i <= (lenHaystack - lenNeedle); ++i) {
```

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
## @@ -118,7 +118,8 @@ public static boolean execBinary(final UTF8String l, final UTF8String r) {
       return l.contains(r);
     }
     public static boolean execLowercase(final UTF8String l, final UTF8String r) {
-      return l.containsInLowerCase(r);
+      if (r.numBytes() == 0) return true;

Review Comment: Do you need this? There is special case inside `lowercaseIndexOf` that handles it.

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
## @@ -156,7 +157,8 @@ public static boolean execBinary(final UTF8String l, final UTF8String r) {
       return l.startsWith(r);
     }
     public static boolean execLowercase(final UTF8String l, final UTF8String r) {
-      return l.startsWithInLowerCase(r);
+      if (r.numBytes() == 0) return true;

Review Comment: Same here.

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwar
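For readers following this review, here is a rough sketch of the matching semantics under discussion, written against plain `String` rather than `UTF8String`. The method names mirror the diff above, but the bodies are only an illustration of the intent (pattern pre-lowercased, variable-length match windows), not the actual Spark implementation:

```scala
object LowercaseMatchSketch {
  // Does any substring of `l` starting at `pos` lowercase to the (already
  // lowercased) pattern `r`? Lengths can differ because lowercasing some
  // characters (e.g. 'İ') changes the number of code points.
  def lowercaseMatchFrom(l: String, r: String, pos: Int): Boolean = {
    require(pos >= 0, "pos must be non-negative")
    (0 to l.length - pos).exists { len =>
      l.substring(pos, pos + len).toLowerCase == r
    }
  }

  // First position in `target` at or after `start` where the lowercased pattern matches.
  def lowercaseIndexOf(target: String, pattern: String, start: Int): Int = {
    if (pattern.isEmpty) return start // the review asks for `start` here, not 0
    val needle = pattern.toLowerCase
    (start to target.length).find(pos => lowercaseMatchFrom(target, needle, pos)).getOrElse(-1)
  }

  def main(args: Array[String]): Unit = {
    println(lowercaseMatchFrom("The KKelvin", "kkelvin", 4)) // true: "KKelvin" lowercases to "kkelvin"
    println(lowercaseIndexOf("The KKelvin", "KKELVIN", 0))   // 4
  }
}
```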
Re: [PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR commented on PR #46569: URL: https://github.com/apache/spark/pull/46569#issuecomment-2109409434 Also cherry-picked to 3.5 as well, as it's a clean cherry-pick.
Re: [PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR closed pull request #46569: [SPARK-48267][SS] Regression e2e test with SPARK-47305 URL: https://github.com/apache/spark/pull/46569
Re: [PR] [SPARK-47599][MLLIB] MLLib: Migrate logWarn with variables to structured logging framework [spark]
panbingkun commented on code in PR #46527: URL: https://github.com/apache/spark/pull/46527#discussion_r1599456251

## mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala:
## @@ -179,8 +179,8 @@ class LinearSVC @Since("2.2.0") (
       maxBlockSizeInMB)
     if (dataset.storageLevel != StorageLevel.NONE) {
-      instr.logWarning(s"Input instances will be standardized, blockified to blocks, and " +

Review Comment: Delete the unnecessary string interpolator `s"..."`.
Re: [PR] [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation [spark]
mkaravel commented on code in PR #46511: URL: https://github.com/apache/spark/pull/46511#discussion_r1599447728

## common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java:
## @@ -102,20 +102,30 @@ public void testContains() throws SparkException {
     assertContains("äbćδe", "ÄbćδE", "UNICODE_CI", true);
     assertContains("äbćδe", "ÄBcΔÉ", "UNICODE_CI", false);
     // Case-variable character length
-    assertContains("abİo12", "i̇o", "UNICODE_CI", true);
-    assertContains("abi̇o12", "İo", "UNICODE_CI", true);
-    assertContains("the İodine", "the i̇odine", "UTF8_BINARY_LCASE", true);
-    assertContains("the i̇odine", "the İodine", "UTF8_BINARY_LCASE", true);
-    assertContains("The İodiNe", " i̇oDin", "UTF8_BINARY_LCASE", true);
-    assertContains("İodiNe", "i̇oDine", "UTF8_BINARY_LCASE", true);
-    assertContains("İodiNe", " i̇oDin", "UTF8_BINARY_LCASE", false);
-    // Characters with the same binary lowercase representation
-    assertContains("The Kelvin.", "Kelvin", "UTF8_BINARY_LCASE", true);
-    assertContains("The Kelvin.", "Kelvin", "UTF8_BINARY_LCASE", true);
-    assertContains("The KKelvin.", "KKelvin", "UTF8_BINARY_LCASE", true);
-    assertContains("2 Kelvin.", "2 Kelvin", "UTF8_BINARY_LCASE", true);
-    assertContains("2 Kelvin.", "2 Kelvin", "UTF8_BINARY_LCASE", true);
-    assertContains("The KKelvin.", "KKelvin,", "UTF8_BINARY_LCASE", false);
+    assertContains("i̇", "i", "UNICODE_CI", false);
+    assertContains("i̇", "̇", "UNICODE_CI", false);

Review Comment:
```suggestion
    assertContains("i̇", "\u0307", "UNICODE_CI", false);
```

## common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java:
## @@ -279,20 +313,44 @@ public void testEndsWith() throws SparkException {
     assertEndsWith("ab世De", "AB世dE", "UNICODE_CI", true);
     assertEndsWith("äbćδe", "ÄbćδE", "UNICODE_CI", true);
     assertEndsWith("äbćδe", "ÄBcΔÉ", "UNICODE_CI", false);
-    // Case-variable character length
-    assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);
-    assertEndsWith("The i̇o", "İo", "UNICODE_CI", true);
-    assertEndsWith("the İodine", "the i̇odine", "UTF8_BINARY_LCASE", true);
-    assertEndsWith("the i̇odine", "the İodine", "UTF8_BINARY_LCASE", true);
-    assertEndsWith("The İodiNe", "i̇oDine", "UTF8_BINARY_LCASE", true);
-    assertEndsWith("The İodiNe", "i̇oDin", "UTF8_BINARY_LCASE", false);
     // Characters with the same binary lowercase representation
     assertEndsWith("The Kelvin", "Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The Kelvin", "Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The KKelvin", "KKelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The 2 Kelvin", "2 Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The 2 Kelvin", "2 Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The KKelvin", "KKelvin,", "UTF8_BINARY_LCASE", false);
+    // Case-variable character length
+    assertEndsWith("i̇", "̇", "UNICODE_CI", false);
+    assertEndsWith("i̇", "İ", "UNICODE_CI", true);
+    assertEndsWith("İ", "i", "UNICODE_CI", false);
+    assertEndsWith("İİİ", "i̇i̇", "UNICODE_CI", true);
+    assertEndsWith("İİİ", "ii̇", "UNICODE_CI", false);
+    assertEndsWith("İi̇İ", "İi̇", "UNICODE_CI", true);
+    assertEndsWith("i̇İi̇i̇", "̇İi̇İ", "UNICODE_CI", false);

Review Comment:
```suggestion
    assertEndsWith("i̇İi̇i̇", "\u0307İi̇İ", "UNICODE_CI", false);
```

## common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java:
## @@ -279,20 +313,44 @@ public void testEndsWith() throws SparkException {
     assertEndsWith("ab世De", "AB世dE", "UNICODE_CI", true);
     assertEndsWith("äbćδe", "ÄbćδE", "UNICODE_CI", true);
     assertEndsWith("äbćδe", "ÄBcΔÉ", "UNICODE_CI", false);
-    // Case-variable character length
-    assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);
-    assertEndsWith("The i̇o", "İo", "UNICODE_CI", true);
-    assertEndsWith("the İodine", "the i̇odine", "UTF8_BINARY_LCASE", true);
-    assertEndsWith("the i̇odine", "the İodine", "UTF8_BINARY_LCASE", true);
-    assertEndsWith("The İodiNe", "i̇oDine", "UTF8_BINARY_LCASE", true);
-    assertEndsWith("The İodiNe", "i̇oDin", "UTF8_BINARY_LCASE", false);
     // Characters with the same binary lowercase representation
     assertEndsWith("The Kelvin", "Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The Kelvin", "Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The KKelvin", "KKelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The 2 Kelvin", "2 Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The 2 Kelvin", "2 Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The KKelvin", "KKelvin,", "UTF8_BINARY_LCASE", false);
+    // Case-variable character length
+    assertEndsWith("i̇", "̇", "UNICODE_CI", false);
+    assertEndsWith("i̇", "İ", "UNICODE_CI", true);
+    asser
Re: [PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR commented on PR #46569: URL: https://github.com/apache/spark/pull/46569#issuecomment-2109390060 Thanks @viirya for quick reviewing! Merging to master.
Re: [PR] [SPARK-47599][MLLIB] MLLib: Migrate logWarn with variables to structured logging framework [spark]
panbingkun commented on code in PR #46527: URL: https://github.com/apache/spark/pull/46527#discussion_r1599449547

## mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala:
## @@ -451,8 +451,8 @@ class KMeans @Since("1.5.0") (
   private def trainWithBlock(dataset: Dataset[_], instr: Instrumentation) = {
     if (dataset.storageLevel != StorageLevel.NONE) {
-      instr.logWarning(s"Input vectors will be blockified to blocks, and " +

Review Comment: Delete the unnecessary string interpolator `s"..."`.
Re: [PR] [SPARK-48157][SQL] Add collation support for CSV expressions [spark]
cloud-fan closed pull request #46504: [SPARK-48157][SQL] Add collation support for CSV expressions URL: https://github.com/apache/spark/pull/46504
Re: [PR] [SPARK-48157][SQL] Add collation support for CSV expressions [spark]
cloud-fan commented on PR #46504: URL: https://github.com/apache/spark/pull/46504#issuecomment-2109363247 thanks, merging to master!
Re: [PR] [SPARK-48229][SQL] Add collation support for inputFile expressions [spark]
cloud-fan closed pull request #46503: [SPARK-48229][SQL] Add collation support for inputFile expressions URL: https://github.com/apache/spark/pull/46503
Re: [PR] [SPARK-48229][SQL] Add collation support for inputFile expressions [spark]
cloud-fan commented on PR #46503: URL: https://github.com/apache/spark/pull/46503#issuecomment-2109358482 thanks, merging to master!
Re: [PR] [SPARK-48265][SQL] Infer window group limit batch should do constant folding [spark]
beliefer commented on PR #46568: URL: https://github.com/apache/spark/pull/46568#issuecomment-2109342800 LGTM later.
Re: [PR] [SPARK-48251][BUILD] Disable `maven local cache` on GA's step `MIMA test` [spark]
panbingkun commented on code in PR #46551: URL: https://github.com/apache/spark/pull/46551#discussion_r1599416657

## project/SparkBuild.scala:
## @@ -273,10 +273,9 @@ object SparkBuild extends PomBuild {
     // Google Mirror of Maven Central, placed first so that it's used instead of flaky Maven Central.
     // See https://storage-download.googleapis.com/maven-central/index.html for more info.
     "gcs-maven-central-mirror" at "https://maven-central.storage-download.googleapis.com/maven2/",
-    DefaultMavenRepository,
-    Resolver.mavenLocal,
-    Resolver.file("ivyLocal", file(Path.userHome.absolutePath + "/.ivy2/local"))(Resolver.ivyStylePatterns)
-  ),
+    DefaultMavenRepository) ++
+    { if (sys.env.contains("SKIP_LOCAL_M2")) Nil else Seq(Resolver.mavenLocal) } :+

Review Comment: Let's make the conditions stricter. Updated.
[PR] [SPARK-46707][SQL][FOLLOWUP] Push down throwable predicate through aggregates [spark]
zml1206 opened a new pull request, #44975: URL: https://github.com/apache/spark/pull/44975

### What changes were proposed in this pull request?
Push down throwable predicates through aggregates and add a UT for "can't push down nondeterministic filter through aggregate".

### Why are the changes needed?
If we can push down a filter through Aggregate, it means the filter only references the grouping keys. The Aggregate operator can't reduce the grouping keys, so the filter won't see any new data after being pushed down. Therefore pushing a throwable filter through an aggregate does not change which exceptions are thrown.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT

### Was this patch authored or co-authored using generative AI tooling?
No.
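As a quick illustration of the argument above (the table, column names, and the ANSI division are hypothetical and not code from the PR): the filter below references only the grouping key and can throw under ANSI mode, yet pushing it beneath the aggregate exposes it to exactly the same key values the scan already produced.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ThrowableFilterPushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pushdown-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 10), (2, 20), (0, 30)).toDF("key", "value")

    // "100 / key" can throw (division by zero) when ANSI mode is enabled, but it
    // only reads the grouping key. Evaluating it below or above the Aggregate sees
    // the same set of key values, so pushing it down cannot introduce new failures.
    val result = df.groupBy($"key")
      .agg(sum($"value").as("total"))
      .filter(expr("100 / key > 10"))

    result.explain(true) // with the rule, the Filter can sit below the Aggregate
    spark.stop()
  }
}
```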
Re: [PR] [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory [spark]
LuciferYang commented on PR #46567: URL: https://github.com/apache/spark/pull/46567#issuecomment-2109338837 Thanks @HyukjinKwon @zhengruifeng @amaliujia
Re: [PR] [SPARK-48265][SQL] Infer window group limit batch should do constant folding [spark]
cloud-fan closed pull request #46568: [SPARK-48265][SQL] Infer window group limit batch should do constant folding URL: https://github.com/apache/spark/pull/46568
Re: [PR] [SPARK-48265][SQL] Infer window group limit batch should do constant folding [spark]
cloud-fan commented on PR #46568: URL: https://github.com/apache/spark/pull/46568#issuecomment-2109331575 thanks, merging to master/3.5!
Re: [PR] [SPARK-48027][SQL][FOLLOWUP] Add comments for the other code branch [spark]
beliefer commented on PR #46536: URL: https://github.com/apache/spark/pull/46536#issuecomment-2109330131 @cloud-fan Thank you!
Re: [PR] [SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns [spark]
cloud-fan closed pull request #46565: [SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns URL: https://github.com/apache/spark/pull/46565
Re: [PR] [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status [spark]
cloud-fan commented on code in PR #36564: URL: https://github.com/apache/spark/pull/36564#discussion_r1599405027

## core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala:
## @@ -155,9 +158,9 @@
       val taskId = TaskIdentifier(stageAttempt, attemptNumber)
       stageState.failures.getOrElseUpdate(partition, mutable.Set()) += taskId
       if (stageState.authorizedCommitters(partition) == taskId) {
-        logDebug(s"Authorized committer (attemptNumber=$attemptNumber, stage=$stage, " +
-          s"partition=$partition) failed; clearing lock")
-        stageState.authorizedCommitters(partition) = null
+        sc.foreach(_.dagScheduler.stageFailed(stage, s"Authorized committer " +
+          s"(attemptNumber=$attemptNumber, stage=$stage, partition=$partition) failed; " +
+          s"but task commit success, data duplication may happen."))
       }

Review Comment: +1. @AngersZh can you refine it?
Re: [PR] [SPARK-47599][MLLIB] MLLib: Migrate logWarn with variables to structured logging framework [spark]
panbingkun commented on PR #46527: URL: https://github.com/apache/spark/pull/46527#issuecomment-2109320498
> Let's continue on this one :)

Okay, let me check it first, 😄
Re: [PR] [SPARK-48027][SQL][FOLLOWUP] Add comments for the other code branch [spark]
cloud-fan closed pull request #46536: [SPARK-48027][SQL][FOLLOWUP] Add comments for the other code branch URL: https://github.com/apache/spark/pull/46536
Re: [PR] [SPARK-48027][SQL][FOLLOWUP] Add comments for the other code branch [spark]
cloud-fan commented on PR #46536: URL: https://github.com/apache/spark/pull/46536#issuecomment-2109312325 thanks, merging to master!
Re: [PR] [SPARK-48155][SQL] AQEPropagateEmptyRelation for join should check if remain child is just BroadcastQueryStageExec [spark]
cloud-fan commented on code in PR #46523: URL: https://github.com/apache/spark/pull/46523#discussion_r1599396352

## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEPropagateEmptyRelation.scala:
## @@ -82,6 +82,21 @@ object AQEPropagateEmptyRelation extends PropagateEmptyRelationBase {
     case _ => false
   }
+  // In AQE, query stage will be wrapped with LogicalQueryStage, if it's child is
+  // BroadcastQueryStage and join other side is empty relation like the follow plan:
+  //   Join
+  //   :- LocalTableScan , [a#23]
+  //   +- LogicalQueryStage(_, BroadcastQueryStage)
+  // After AQEPropagateEmptyRelation, the plan will be
+  //   Project
+  //   +- LogicalQueryStage(_, BroadcastQueryStage)
+  // Then after LogicalQueryStageStrategy, will only remain BroadcastQueryStage after project,
+  // the plan can't execute.

Review Comment: how hard is it to return the original query plan? Seems not hard as we just need to add a new `def returnSingleJoinSide` function in the base class, and unwrap broadcast stage in the AQE rule.
Re: [PR] [SPARK-48155][SQL] AQEPropagateEmptyRelation for join should check if remain child is just BroadcastQueryStageExec [spark]
cloud-fan commented on code in PR #46523: URL: https://github.com/apache/spark/pull/46523#discussion_r1599395049

## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEPropagateEmptyRelation.scala:
## @@ -82,6 +82,21 @@ object AQEPropagateEmptyRelation extends PropagateEmptyRelationBase {
     case _ => false
   }
+  // In AQE, query stage will be wrapped with LogicalQueryStage, if it's child is
+  // BroadcastQueryStage and join other side is empty relation like the follow plan:
+  //   Join
+  //   :- LocalTableScan , [a#23]
+  //   +- LogicalQueryStage(_, BroadcastQueryStage)
+  // After AQEPropagateEmptyRelation, the plan will be
+  //   Project
+  //   +- LogicalQueryStage(_, BroadcastQueryStage)
+  // Then after LogicalQueryStageStrategy, will only remain BroadcastQueryStage after project,
+  // the plan can't execute.

Review Comment: We can simply say
```
// A broadcast query stage can't be executed without the join operator.
// TODO: we can return the original query plan before broadcast.
```
Re: [PR] [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory [spark]
HyukjinKwon closed pull request #46567: [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory URL: https://github.com/apache/spark/pull/46567
Re: [PR] [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory [spark]
HyukjinKwon commented on PR #46567: URL: https://github.com/apache/spark/pull/46567#issuecomment-2109296872 Merged to master.
Re: [PR] [SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns [spark]
cloud-fan commented on PR #46565: URL: https://github.com/apache/spark/pull/46565#issuecomment-2109293728 thanks, merging to 3.5!
Re: [PR] [SPARK-48229][SQL] Add collation support for inputFile expressions [spark]
uros-db commented on code in PR #46503: URL: https://github.com/apache/spark/pull/46503#discussion_r1597873492

## sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala:
## @@ -1151,6 +1151,23 @@ class CollationSQLExpressionsSuite
     })
   }
+  test("Support InputFileName expression with collation") {
+    // Supported collations
+    Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI").foreach(collationName => {

Review Comment: I'd say most of the time tests are custom, depending on the function, and only rarely do we have functions with no parameters - so this is not very common (actually, this may be the only such function in this test suite). So I'd say there's no need for a preset suite-level collation name collection. On the other hand, we will soon have hundreds/thousands of collations in Spark, so these 4 collations will soon appear to be very random (i.e. we won't run this test for **all** supported collations, but these 4 should still be enough to cover all cases for all functions).
Re: [PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR commented on PR #46569: URL: https://github.com/apache/spark/pull/46569#issuecomment-2109270674 Looks like the behavior of PruneFilters is somewhat interesting in combination with filter pushdown... The more aggressively the filter can be pushed down, the smaller the subtree that PruneFilters ends up pruning. E.g.
```
val df = input1.toDF()
  .observe("myEvent", count(lit(1)).as("rows"))
  .filter(expr("false"))
```
to
```
== Analyzed Logical Plan ==
WriteToMicroBatchDataSource MemorySink, 5afa5bf2-9757-4e90-a604-bb2c4438ebec, Update, 0
+- Filter false
   +- CollectMetrics myEvent, [count(1) AS rows#3L], 0
      +- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource

== Optimized Logical Plan ==
WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: org.apache.spark.sql.execution.streaming.sources.MemoryStreamingWrite@64b874f0]
+- LocalRelation , [value#1]
```
vs
```
val df = input1.toDF()
  .withColumn("eventTime", timestamp_seconds($"value"))
  .withWatermark("eventTime", "0 second")
  .filter(expr("false"))
```
to
```
== Analyzed Logical Plan ==
WriteToMicroBatchDataSource MemorySink, 371ac05f-8c25-43b1-b82a-ed11e9a16c29, Update, 0
+- Filter false
   +- EventTimeWatermark eventTime#3: timestamp, 0 seconds
      +- Project [value#1, timestamp_seconds(value#1) AS eventTime#3]
         +- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource

== Optimized Logical Plan ==
WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: org.apache.spark.sql.execution.streaming.sources.MemoryStreamingWrite@4af7]
+- EventTimeWatermark eventTime#3: timestamp, 0 seconds
   +- LocalRelation , [value#1, eventTime#3]
```
For the former, the filter can't be pushed down under CollectMetrics, so PruneFilters takes effect and replaces both CollectMetrics and the relation with an empty relation. For the latter, EventTimeWatermark supports pushdown when the predicate doesn't compare against the event time column. In that case the filter is pushed down under EventTimeWatermark, and PruneFilters replaces only the relation while EventTimeWatermark is left as it is.
Re: [PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR commented on code in PR #46569: URL: https://github.com/apache/spark/pull/46569#discussion_r1599352970

## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryOptimizationCorrectnessSuite.scala:
## @@ -416,4 +416,39 @@ class StreamingQueryOptimizationCorrectnessSuite extends StreamTest {
     )
   }
 }
+
+  test("SPARK-48267: regression test, stream-stream union followed by stream-batch join") {
+    withTempDir { dir =>
+      val input1 = MemoryStream[Int]
+      val input2 = MemoryStream[Int]
+
+      val df1 = input1.toDF().withColumn("code", lit(1))

Review Comment: Given the sequence of optimizations in reproducer, it wouldn't happen if the union wasn't eliminated, as the value of 'code' won't be known during optimization phase. That said, if we call observe() or withWatermark() or etc in streaming DF to prevent optimizer to collapse everything at one side, the bug wouldn't be triggered.
Re: [PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR commented on code in PR #46569: URL: https://github.com/apache/spark/pull/46569#discussion_r1599352970

## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryOptimizationCorrectnessSuite.scala:
## @@ -416,4 +416,39 @@ class StreamingQueryOptimizationCorrectnessSuite extends StreamTest {
     )
   }
 }
+
+  test("SPARK-48267: regression test, stream-stream union followed by stream-batch join") {
+    withTempDir { dir =>
+      val input1 = MemoryStream[Int]
+      val input2 = MemoryStream[Int]
+
+      val df1 = input1.toDF().withColumn("code", lit(1))

Review Comment: Given the sequence of optimizations in reproducer, it wouldn't happen if the union wasn't eliminated, as the value of 'code' won't be known during optimization phase. That said, if we call observe() or withWatermark() or etc in these streaming DF, the bug wouldn't be triggered.
[PR] [[SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]
HyukjinKwon opened a new pull request, #46570: URL: https://github.com/apache/spark/pull/46570

### What changes were proposed in this pull request?
TBD

### Why are the changes needed?
TBD

### Does this PR introduce _any_ user-facing change?
TBD

### How was this patch tested?
TBD

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-47599][MLLIB] MLLib: Migrate logWarn with variables to structured logging framework [spark]
gengliangwang commented on PR #46527: URL: https://github.com/apache/spark/pull/46527#issuecomment-2109242560 Let's continue on this one :)
Re: [PR] [SPARK-48209][CORE] Common (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]
gengliangwang closed pull request #46493: [SPARK-48209][CORE] Common (java side): Migrate `error/warn/info` with variables to structured logging framework URL: https://github.com/apache/spark/pull/46493
Re: [PR] [SPARK-48209][CORE] Common (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]
gengliangwang commented on PR #46493: URL: https://github.com/apache/spark/pull/46493#issuecomment-2109237678 Thanks, merging to master
Re: [PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR commented on PR #46569: URL: https://github.com/apache/spark/pull/46569#issuecomment-2109235633 cc. @cloud-fan @viirya PTAL, thanks!
[PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR opened a new pull request, #46569: URL: https://github.com/apache/spark/pull/46569

### What changes were proposed in this pull request?
This PR proposes to add a regression test (e2e) for SPARK-47305 (https://issues.apache.org/jira/browse/SPARK-47305). As of commit cae2248bc13 (pre-Spark 4.0), the test query is represented by the logical plans below:

> Batch 0
>> analyzed plan
```
WriteToMicroBatchDataSource MemorySink, 5067923b-e1d0-484c-914c-b111c9e60aac, Append, 0
+- Project [value#1]
   +- Join Inner, (cast(code#5 as bigint) = ref_code#14L)
      :- Union false, false
      :  :- Project [value#1, 1 AS code#5]
      :  :  +- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource
      :  +- Project [value#3, cast(code#9 as int) AS code#16]
      :     +- Project [value#3, null AS code#9]
      :        +- LocalRelation , [value#3]
      +- Project [id#12L AS ref_code#14L]
         +- Range (1, 5, step=1, splits=Some(2))
```
>> optimized plan
```
WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: ...]
+- Join Inner
   :- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource
   +- Project
      +- Filter (1 = id#12L)
         +- Range (1, 5, step=1, splits=Some(2))
```
> Batch 1
>> analyzed plan
```
WriteToMicroBatchDataSource MemorySink, d1c8be66-88e7-437a-9f25-6b87db8efe17, Append, 1
+- Project [value#1]
   +- Join Inner, (cast(code#5 as bigint) = ref_code#14L)
      :- Union false, false
      :  :- Project [value#1, 1 AS code#5]
      :  :  +- LocalRelation , [value#1]
      :  +- Project [value#3, cast(code#9 as int) AS code#16]
      :     +- Project [value#3, null AS code#9]
      :        +- StreamingDataSourceV2ScanRelation[value#3] MemoryStreamDataSource
      +- Project [id#12L AS ref_code#14L]
         +- Range (1, 5, step=1, splits=Some(2))
```
>> optimized plan
```
WriteToDataSourceV2 MicroBatchWrite[epoch: 1, writer: ...]
+- Join Inner
   :- StreamingDataSourceV2ScanRelation[value#3] MemoryStreamDataSource
   +- LocalRelation
```
Notice the difference in the optimized plan between batch 0 and batch 1. In the optimized plan for batch 1, the batch side is pruned out, which goes through the path of PruneFilters. The sequence of optimizations is:
1) the left stream side is collapsed into an empty local relation
2) the union is replaced with the subtree for the right stream side, as the left stream side is simply an empty local relation
3) the value of the 'code' column is now known to be 'null' and it's propagated to the join criteria (`null = ref_code`)
4) the join criteria is extracted out of the join and pushed to the batch side
5) the value of the 'ref_code' column can never be null, hence the filter is optimized to `filter false`
6) `filter false` triggers PruneFilters (where we fix a bug in SPARK-47305)

Before SPARK-47305, the new empty local relation was incorrectly marked as streaming.

### Why are the changes needed?
In the PR for SPARK-47305 we only added a unit test to verify the fix, but it was not an e2e test of the workload where we encountered the issue. Given the complexity of QO, it'd be ideal to have an e2e reproducer (even if simplified) as a regression test.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New UT.

### Was this patch authored or co-authored using generative AI tooling?
No.
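For reference, a rough sketch of the kind of reproducer described above, written against the `StreamTest` harness; the second input's `code` column, the range-based batch side, and the assertions are inferred from the plans shown and are illustrative, not the exact test added by the PR:

```scala
// Sketch only: assumes it runs inside a StreamTest suite, so testStream, AddData,
// ProcessAllAvailable, CheckAnswer and MemoryStream come from that harness.
val input1 = MemoryStream[Int]
val input2 = MemoryStream[Int]

val df1 = input1.toDF().withColumn("code", lit(1))
val df2 = input2.toDF().withColumn("code", lit(null))
val union = df1.union(df2)

// Batch side: ref_code 1..4, so only rows with code = 1 can ever match.
val refDf = spark.range(1, 5).select($"id".as("ref_code"))
val joined = union.join(refDf, union("code") === refDf("ref_code")).select("value")

testStream(joined)(
  AddData(input1, 1, 2, 3),
  ProcessAllAvailable(),   // batch 0: only the left stream has data
  AddData(input2, 4, 5),
  ProcessAllAvailable(),   // batch 1: the left side collapses to an empty local relation
  CheckAnswer(1, 2, 3)     // input2 rows carry code = null and never match
)
```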
Re: [PR] [SPARK-41794][FOLLOWUP] Add `try_remainder` to python API references [spark]
zhengruifeng commented on PR #46566: URL: https://github.com/apache/spark/pull/46566#issuecomment-2109228922 merged to master
Re: [PR] [SPARK-41794][FOLLOWUP] Add `try_remainder` to python API references [spark]
zhengruifeng closed pull request #46566: [SPARK-41794][FOLLOWUP] Add `try_remainder` to python API references URL: https://github.com/apache/spark/pull/46566
Re: [PR] [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46494: URL: https://github.com/apache/spark/pull/46494#discussion_r1599334395

## core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala:
## @@ -117,10 +117,11 @@ private[deploy] class HadoopFSDelegationTokenProvider
     filesystems.foreach { fs =>
       if (fsToExclude.contains(fs.getUri.getHost)) {
         // YARN RM skips renewing token with empty renewer
-        logInfo(s"getting token for: $fs with empty renewer to skip renewal")
+        logInfo(log"getting token for: ${MDC(FILE_SYSTEM, fs)} with empty renewer to skip renewal")
         Utils.tryLogNonFatalError { fs.addDelegationTokens("", creds) }
       } else {
-        logInfo(s"getting token for: $fs with renewer $renewer")
+        logInfo(log"getting token for: ${MDC(FILE_SYSTEM, fs)} with" +
+          log" renewer ${MDC(TOKEN_RE_NEWER, renewer)}")

Review Comment: @zeotuan one last comment here.
Re: [PR] [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46494: URL: https://github.com/apache/spark/pull/46494#discussion_r1599334199

## core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala:
## @@ -117,10 +117,11 @@ private[deploy] class HadoopFSDelegationTokenProvider
     filesystems.foreach { fs =>
      if (fsToExclude.contains(fs.getUri.getHost)) {
         // YARN RM skips renewing token with empty renewer
-        logInfo(s"getting token for: $fs with empty renewer to skip renewal")
+        logInfo(log"getting token for: ${MDC(FILE_SYSTEM, fs)} with empty renewer to skip renewal")
         Utils.tryLogNonFatalError { fs.addDelegationTokens("", creds) }
       } else {
-        logInfo(s"getting token for: $fs with renewer $renewer")
+        logInfo(log"getting token for: ${MDC(FILE_SYSTEM, fs)} with" +
+          log" renewer ${MDC(TOKEN_RE_NEWER, renewer)}")

Review Comment:
```suggestion
          log" renewer ${MDC(TOKEN_RENEWER, renewer)}")
```
Re: [PR] [SPARK-48251][BUILD] Disable `maven local cache` on GA's step `MIMA test` [spark]
LuciferYang commented on code in PR #46551: URL: https://github.com/apache/spark/pull/46551#discussion_r1599332094

## project/SparkBuild.scala:
## @@ -273,10 +273,9 @@ object SparkBuild extends PomBuild {
     // Google Mirror of Maven Central, placed first so that it's used instead of flaky Maven Central.
     // See https://storage-download.googleapis.com/maven-central/index.html for more info.
     "gcs-maven-central-mirror" at "https://maven-central.storage-download.googleapis.com/maven2/",
-    DefaultMavenRepository,
-    Resolver.mavenLocal,
-    Resolver.file("ivyLocal", file(Path.userHome.absolutePath + "/.ivy2/local"))(Resolver.ivyStylePatterns)
-  ),
+    DefaultMavenRepository) ++
+    { if (sys.env.contains("SKIP_LOCAL_M2")) Nil else Seq(Resolver.mavenLocal) } :+

Review Comment: hmm... so, even if the value of `SKIP_LOCAL_M2` is "", `Resolver.mavenLocal` will be disabled?
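A minimal sketch of the stricter check being discussed, assuming it lives in SparkBuild.scala where sbt's `Resolver` is already in scope; the helper name and the accepted values are illustrative, not necessarily what the PR ended up with:

```scala
// Skip the local Maven cache only when the variable is explicitly set to a
// truthy value, so SKIP_LOCAL_M2="" (or unset) keeps the default resolver chain.
def skipLocalM2: Boolean =
  sys.env.get("SKIP_LOCAL_M2").map(_.trim.toLowerCase).exists(v => v == "1" || v == "true")

val extraLocalResolvers =
  if (skipLocalM2) Nil
  else Seq(Resolver.mavenLocal)
```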
Re: [PR] [SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns [spark]
yaooqinn commented on PR #46565: URL: https://github.com/apache/spark/pull/46565#issuecomment-2109177057 Can we port the PR description here too?
Re: [PR] [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory [spark]
LuciferYang commented on PR #46567: URL: https://github.com/apache/spark/pull/46567#issuecomment-2109176293
> can you write from where to where in the pr description

done
Re: [PR] [SPARK-48241][SQL] CSV parsing failure with char/varchar type columns [spark]
liujiayi771 commented on PR #46537: URL: https://github.com/apache/spark/pull/46537#issuecomment-2109168378
> it has conflicts with 3.5, can you create a new backport PR?

Created a backport PR in https://github.com/apache/spark/pull/46565.
Re: [PR] [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory [spark]
HyukjinKwon commented on PR #46567: URL: https://github.com/apache/spark/pull/46567#issuecomment-2109164398 can you write from where to where in the pr description -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory [spark]
LuciferYang commented on PR #46567: URL: https://github.com/apache/spark/pull/46567#issuecomment-2109161934 cc @zhengruifeng @amaliujia Let's test first -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48265][SQL] Infer window group limit batch should do constant folding [spark]
AngersZh commented on PR #46568: URL: https://github.com/apache/spark/pull/46568#issuecomment-2109161898 ping @cloud-fan cc @beliefer -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-48265][SQL] Infer window group limit batch should do constant folding [spark]
AngersZh opened a new pull request, #46568: URL: https://github.com/apache/spark/pull/46568 ### What changes were proposed in this pull request? After `PropagateEmptyRelation`, a plan may end up with a doubled local limit. For example,
```
GlobalLimit 21
+- LocalLimit 21
!  +- Union false, false
!     :- LocalLimit 21
!     :  +- Project [item_id#647L]
!     :     +- Filter ()
!     :        +- Relation db.table[,... 91 more fields] parquet
!     +- LocalLimit 21
!        +- Project [item_id#738L]
!           +- LocalRelation , [, ... 91 more fields]
```
becomes
```
GlobalLimit 21
+- LocalLimit 21
   +- LocalLimit 21
      +- Project [item_id#647L]
         +- Filter ()
            +- Relation db.table[,... 91 more fields] parquet
```
and after the `Infer window group limit batch` batch's `EliminateLimits` it becomes
```
GlobalLimit 21
+- LocalLimit least(21, 21)
   +- Project [item_id#647L]
      +- Filter ()
         +- Relation db.table[,... 91 more fields] parquet
```
This cannot be simplified further because the batch is missing a `ConstantFolding` rule to fold `least(21, 21)` back into a literal. ### Why are the changes needed? Fix a bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
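To make the gap concrete, a small Catalyst sketch (an illustration only, not the PR's actual rule wiring): the `least(21, 21)` expression left behind by `EliminateLimits` is foldable but not yet a `Literal`, so rules that expect a literal limit cannot fire until `ConstantFolding` rewrites it.
```
import org.apache.spark.sql.catalyst.expressions.{Least, Literal}
import org.apache.spark.sql.catalyst.optimizer.ConstantFolding
import org.apache.spark.sql.catalyst.plans.logical.{LocalLimit, OneRowRelation}

// Shape of what EliminateLimits leaves after merging the two LocalLimit(21) nodes.
val merged = LocalLimit(Least(Seq(Literal(21), Literal(21))), OneRowRelation())

// ConstantFolding evaluates the foldable least(21, 21) and turns it back into 21.
val folded = ConstantFolding(merged)
println(folded) // LocalLimit 21, +- OneRowRelation
```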
[PR] [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory [spark]
LuciferYang opened a new pull request, #46567: URL: https://github.com/apache/spark/pull/46567 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-41794][SQL] Add `try_remainder` function and re-enable column tests [spark]
zhengruifeng commented on PR #46434: URL: https://github.com/apache/spark/pull/46434#issuecomment-2109157257 add it to python API references in https://github.com/apache/spark/pull/46566 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-41794][FOLLOWUP] Add `try_remainder` to python API references [spark]
zhengruifeng opened a new pull request, #46566: URL: https://github.com/apache/spark/pull/46566 ### What changes were proposed in this pull request? Add `try_remainder` to python API references ### Why are the changes needed? new methods should be added to API references ### Does this PR introduce _any_ user-facing change? doc changes ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-36691][PYTHON] PythonRunner failed should pass error message to ApplicationMaster too [spark]
AngersZh commented on code in PR #33934: URL: https://github.com/apache/spark/pull/33934#discussion_r1599293672 ## core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala: ## @@ -92,15 +92,16 @@ object PythonRunner { // see https://github.com/numpy/numpy/issues/10455 sparkConf.getOption("spark.driver.cores").foreach(env.put("OMP_NUM_THREADS", _)) } -builder.redirectErrorStream(true) // Ugly but needed for stdout and stderr to synchronize Review Comment: since we do this in `ProcessRedirectThread` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
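For readers following along, a rough sketch of the idea behind a dedicated redirect thread (generic Scala, not the actual `ProcessRedirectThread` helper added in this PR and not its real signature): instead of merging stderr into stdout with `redirectErrorStream(true)`, each stream is drained on its own thread, so the error text can also be captured and attached to the failure that is reported upstream.
```
import java.io.{BufferedReader, InputStream, InputStreamReader}

// Drain one stream of a child process on a dedicated daemon thread.
def drain(in: InputStream, sink: String => Unit): Thread = {
  val t = new Thread(() => {
    val reader = new BufferedReader(new InputStreamReader(in))
    Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach(sink)
  })
  t.setDaemon(true)
  t.start()
  t
}

val process = new ProcessBuilder("python3", "app.py").start()  // hypothetical command
val stderrBuffer = new StringBuilder
drain(process.getInputStream, println)                                   // forward stdout
drain(process.getErrorStream, l => stderrBuffer.append(l).append('\n'))  // keep stderr for the error message
```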
Re: [PR] [SPARK-48259][CONNECT][TESTS] Add 3 missing methods in dsl [spark]
zhengruifeng commented on code in PR #46559: URL: https://github.com/apache/spark/pull/46559#discussion_r1599292541 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala: ## @@ -513,6 +513,25 @@ package object dsl { freqItems(cols.toArray, support) def freqItems(cols: Seq[String]): Relation = freqItems(cols, 0.01) + + def sampleBy(col: String, fractions: Map[Any, Double], seed: Long): Relation = { Review Comment: I am fine to move `dsl` to `test` if nothing is broken -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48259][CONNECT][TESTS] Add 3 missing methods in dsl [spark]
zhengruifeng commented on PR #46559: URL: https://github.com/apache/spark/pull/46559#issuecomment-2109148423 thanks, merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48259][CONNECT][TESTS] Add 3 missing methods in dsl [spark]
zhengruifeng closed pull request #46559: [SPARK-48259][CONNECT][TESTS] Add 3 missing methods in dsl URL: https://github.com/apache/spark/pull/46559 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47301][SQL][TESTS] Fix flaky ParquetIOSuite [spark]
yaooqinn commented on PR #45403: URL: https://github.com/apache/spark/pull/45403#issuecomment-2109146927 +1 @cloud-fan Since the LOCs have been moved to `ParquetIOWithoutOutputCommitCoordinationSuite`, we need a follow-up to revert them -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]
AngersZh commented on PR #44767: URL: https://github.com/apache/spark/pull/44767#issuecomment-2109143816 ping @cloud-fan @yaooqinn -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47172][CORE] Add support for AES-GCM for RPC encryption [spark]
mridulm commented on PR #46515: URL: https://github.com/apache/spark/pull/46515#issuecomment-2109142853 Took a quick pass through it, sorry for the delay. +CC @JoshRosen as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47172][CORE] Add support for AES-GCM for RPC encryption [spark]
mridulm commented on code in PR #46515: URL: https://github.com/apache/spark/pull/46515#discussion_r1598897504 ## docs/security.md: ## @@ -207,6 +207,14 @@ The following table describes the different options available for configuring th 2.2.0 + + spark.network.crypto.cipher + AES/CTR/NoPadding + +Cipher mode to use. Defaults "AES/CTR/NoPadding". Also supports "AES/GCM/NoPadding" Review Comment: Can we add more details on this ? What value should users set this to, why/impact, backward compatibility implication, ... ## common/network-common/src/main/java/org/apache/spark/network/crypto/GcmTransportCipher.java: ## @@ -0,0 +1,147 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.network.crypto; + +import com.google.common.annotations.VisibleForTesting; +import com.google.crypto.tink.subtle.AesGcmJce; +import io.netty.buffer.Unpooled; +import io.netty.channel.*; +import org.apache.spark.network.util.AbstractFileRegion; +import io.netty.buffer.ByteBuf; + +import javax.crypto.spec.SecretKeySpec; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.channels.ClosedChannelException; +import java.nio.channels.WritableByteChannel; +import java.security.GeneralSecurityException; + +public class GcmTransportCipher implements TransportCipher { +private static final byte[] DEFAULT_AAD = new byte[0]; + +private final SecretKeySpec aesKey; +public GcmTransportCipher(SecretKeySpec aesKey) { +this.aesKey = aesKey; +} + +@VisibleForTesting +EncryptionHandler getEncryptionHandler() { +return new EncryptionHandler(); +} + +@VisibleForTesting +DecryptionHandler getDecryptionHandler() { +return new DecryptionHandler(); +} + +public void addToChannel(Channel ch) { +ch.pipeline() +.addFirst("GcmTransportEncryption", getEncryptionHandler()) +.addFirst("GcmTransportDecryption", getDecryptionHandler()); +} + +@VisibleForTesting +class EncryptionHandler extends ChannelOutboundHandlerAdapter { +@Override +public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) +throws Exception { +ByteBuffer inputBuffer; +int bytesToRead; +if (msg instanceof ByteBuf byteBuf) { +bytesToRead = byteBuf.readableBytes(); +// This is allocating a buffer that is the size of the input +inputBuffer = ByteBuffer.allocate(bytesToRead); +// This will block while copying +while (inputBuffer.position() < bytesToRead) { +byteBuf.readBytes(inputBuffer); +} +} else if (msg instanceof AbstractFileRegion fileRegion) { +bytesToRead = (int) fileRegion.count(); +// This is allocating a buffer that is the size of the input +inputBuffer = ByteBuffer.allocate(bytesToRead); +ByteBufferWriteableChannel writeableChannel = +new ByteBufferWriteableChannel(inputBuffer); +long transferred = 0; +// This will block while copying +while 
(transferred < bytesToRead) { +transferred += +fileRegion.transferTo(writeableChannel, fileRegion.transferred()); +} Review Comment: This can be quite large - and result in a nontrivial memory consumption. We will need to do something similar to `CtrTransportCipher.EncryptedMessage` and transfer in chunks. ## common/network-common/src/main/java/org/apache/spark/network/crypto/GcmTransportCipher.java: ## @@ -0,0 +1,147 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License
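As a generic illustration of the chunked-transfer suggestion in the review above (a hypothetical sketch, not the actual `GcmTransportCipher` or `CtrTransportCipher` code), the copy can go through a small fixed-size buffer so memory use stays bounded regardless of the payload size:
```
import java.nio.ByteBuffer
import java.nio.channels.{ReadableByteChannel, WritableByteChannel}

// Copy src to dst through a bounded buffer instead of allocating one buffer
// that is as large as the whole message or file region.
def copyInChunks(src: ReadableByteChannel, dst: WritableByteChannel,
                 chunkSize: Int = 32 * 1024): Long = {
  val buf = ByteBuffer.allocate(chunkSize)
  var total = 0L
  var read = src.read(buf)
  while (read >= 0) {
    buf.flip()
    while (buf.hasRemaining) total += dst.write(buf)
    buf.clear()
    read = src.read(buf)
  }
  total
}
```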
[PR] [SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns [spark]
liujiayi771 opened a new pull request, #46565: URL: https://github.com/apache/spark/pull/46565 ### What changes were proposed in this pull request? Backport the fix for the CSV parsing failure with char/varchar type columns to Spark 3.5. ### Why are the changes needed? For char and varchar types, Spark will convert them to `StringType` in `CharVarcharUtils.replaceCharVarcharWithStringInSchema` and record `__CHAR_VARCHAR_TYPE_STRING` in the metadata. The reason for the parsing failure is that the `StringType` columns in the `dataSchema` and `requiredSchema` of `UnivocityParser` are not consistent. The `StringType` in the `dataSchema` has metadata, while the metadata in the `requiredSchema` is empty. We need to retain the metadata when resolving the schema. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add a new test case in `CSVSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [Only Test] Test label profiler [spark]
panbingkun opened a new pull request, #46564: URL: https://github.com/apache/spark/pull/46564 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48155][SQL] AQEPropagateEmptyRelation for join should check if remain child is just BroadcastQueryStageExec [spark]
AngersZh commented on code in PR #46523: URL: https://github.com/apache/spark/pull/46523#discussion_r1599275544 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEPropagateEmptyRelation.scala: ## @@ -82,6 +82,11 @@ object AQEPropagateEmptyRelation extends PropagateEmptyRelationBase { case _ => false } + override protected def canPropagate(plan: LogicalPlan): Boolean = plan match { +case LogicalQueryStage(_, _: BroadcastQueryStageExec) => false Review Comment: Done ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala: ## @@ -65,6 +65,8 @@ abstract class PropagateEmptyRelationBase extends Rule[LogicalPlan] with CastSup private def nullValueProjectList(plan: LogicalPlan): Seq[NamedExpression] = plan.output.map{ a => Alias(cast(Literal(null), a.dataType), a.name)(a.exprId) } + protected def canPropagate(plan: LogicalPlan): Boolean = true Review Comment: Done ## sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala: ## @@ -2772,6 +2778,30 @@ class AdaptiveQueryExecSuite } } + test("SPARK-48155: AQEPropagateEmptyRelation check remained child for join") { +withSQLConf( + SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") { + val (_, adaptivePlan) = runAdaptiveAndVerifyResult( +""" + |SELECT /*+ BROADCAST(t3) */ t3.b, count(t3.a) FROM testData2 t1 Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47301][SQL][TESTS] Fix flaky ParquetIOSuite [spark]
cloud-fan commented on PR #45403: URL: https://github.com/apache/spark/pull/45403#issuecomment-2109105455 I think https://github.com/apache/spark/pull/46562 is a better fix, can we revert this workaround now? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48027][SQL][FOLLOWUP] Add comments for the other code branch [spark]
beliefer commented on code in PR #46536: URL: https://github.com/apache/spark/pull/46536#discussion_r1599263801 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala: ## @@ -152,7 +152,15 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J } else if (right.output.exists(_.semanticEquals(targetKey))) { extract(right, AttributeSet.empty, hasHitFilter = false, hasHitSelectiveFilter = false, currentPlan = right, targetKey = targetKey).orElse { -// We can also extract from the left side if the join keys are transitive. +// We can also extract from the left side if the join keys are transitive, and +// the left side always produces a superset output of join right keys. +// Let's look at an example +// left table: 1, 2, 3 +// right table, 3, 4 +// right outer join output: (3, 3), (null, 4) +// right key output: 3, 4 +// Any join side always produce a superset output of its corresponding +// join keys, but for transitive join keys we need to check the join type. Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
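A tiny spark-shell illustration of the example in the comment above (hypothetical, assuming an active `SparkSession` named `spark`): with a right outer join, the left-side join column does not cover every right key value, which is why the rule must check the join type before extracting a runtime filter from the other side of a transitive key.
```
import spark.implicits._

val left  = Seq(1, 2, 3).toDF("a")
val right = Seq(3, 4).toDF("b")

// Right outer join output: (3, 3) and (null, 4). The right keys {3, 4} are not
// covered by the left column, so a filter built from `a` could wrongly prune b = 4.
left.join(right, $"a" === $"b", "right_outer").show()
```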
[PR] [SPARK-48264][BUILD] Upgrade `datasketches-java` to 6.0.0 [spark]
panbingkun opened a new pull request, #46563: URL: https://github.com/apache/spark/pull/46563 ### What changes were proposed in this pull request? The PR aims to upgrade `datasketches-java` from `5.0.1` to `6.0.0` ### Why are the changes needed? The full release notes: - https://github.com/apache/datasketches-java/releases/tag/6.0.0 - https://github.com/apache/datasketches-java/releases/tag/5.0.2 (screenshot: https://github.com/apache/spark/assets/15246973/fff5905a-25e8-4e2f-9492-1b6099b2bd05) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]
alippai commented on PR #46529: URL: https://github.com/apache/spark/pull/46529#issuecomment-2109086611 This makes the usage so much easier, thanks! What will happen with the nanosecond timestamps? Truncated to milliseconds? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48253][PS][SQL] Support default mode for Pandas API on Spark [spark]
itholic closed pull request #46553: [SPARK-48253][PS][SQL] Support default mode for Pandas API on Spark URL: https://github.com/apache/spark/pull/46553 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48253][PS][SQL] Support default mode for Pandas API on Spark [spark]
itholic commented on PR #46553: URL: https://github.com/apache/spark/pull/46553#issuecomment-2109072613 On second thought, the performance benefits would not be very meaningful. Rather, it may pollute the config surface, so let me close it for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48253][PS][SQL] Support default mode for Pandas API on Spark [spark]
itholic commented on code in PR #46553: URL: https://github.com/apache/spark/pull/46553#discussion_r1599234279 ## python/pyspark/pandas/utils.py: ## @@ -492,6 +504,10 @@ def default_session() -> SparkSession: "from pandas API on Spark since pandas API on Spark follows " "the behavior of pandas, not SQL." ) +default_mode_value = get_default_mode() +if default_mode_value is None: +default_mode_from_conf = spark.conf.get("spark.sql.pyspark.pandas.defaultMode") == "true" Review Comment: Sure, let me set the default as "false" -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46707][SQL][FOLLOWUP] Push down throwable predicate through aggregates [spark]
github-actions[bot] closed pull request #44975: [SPARK-46707][SQL][FOLLOWUP] Push down throwable predicate through aggregates URL: https://github.com/apache/spark/pull/44975 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-42727][CORE] Fix can't executing spark commands in the root directory when local mode is specified [spark]
github-actions[bot] closed pull request #44990: [SPARK-42727][CORE] Fix can't executing spark commands in the root directory when local mode is specified URL: https://github.com/apache/spark/pull/44990 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48259][CONNECT][TESTS] Add 3 missing methods in dsl [spark]
amaliujia commented on code in PR #46559: URL: https://github.com/apache/spark/pull/46559#discussion_r1599231832 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala: ## @@ -513,6 +513,25 @@ package object dsl { freqItems(cols.toArray, support) def freqItems(cols: Seq[String]): Relation = freqItems(cols, 0.01) + + def sampleBy(col: String, fractions: Map[Any, Double], seed: Long): Relation = { Review Comment: This file used to be here because we assumed that the DSL here could be used to build the Scala client. But now this DSL is purely used for testing, so we can probably move it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48209][CORE] Common (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]
panbingkun commented on code in PR #46493: URL: https://github.com/apache/spark/pull/46493#discussion_r1599229097 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -292,6 +301,7 @@ object LogKeys { case object LOADED_VERSION extends LogKey case object LOAD_FACTOR extends LogKey case object LOAD_TIME extends LogKey + case object LOCAL_DIRS extends LogKey Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework [spark]
zeotuan commented on code in PR #46494: URL: https://github.com/apache/spark/pull/46494#discussion_r1599223748 ## core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala: ## @@ -278,11 +278,12 @@ private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long, serializedO } case None => val estimatedTotalSize = Utils.bytesToString(numBlocks.toLong * blockSize) -logInfo(s"Started reading broadcast variable $id with $numBlocks pieces " + - s"(estimated total size $estimatedTotalSize)") +logInfo(log"Started reading broadcast variable ${MDC(BROADCAST_ID, id)} with ${MDC(NUM_BROADCAST_BLOCK, numBlocks)} pieces " + + log"(estimated total size ${MDC(NUM_BYTES, estimatedTotalSize)})") val startTimeNs = System.nanoTime() val blocks = readBlocks() -logInfo(s"Reading broadcast variable $id took ${Utils.getUsedTimeNs(startTimeNs)}") +logInfo(log"Reading broadcast variable ${MDC(BROADCAST_ID, id)}" + + log" took ${MDC(TOTAL_TIME, Utils.getUsedTimeNs(startTimeNs))} ms") Review Comment: My bad, that is actually ms since `getUsedTimeNs` convert it to ms. But the function already provided the unit value -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite [spark]
gengliangwang closed pull request #46562: [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite URL: https://github.com/apache/spark/pull/46562 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite [spark]
gengliangwang commented on PR #46562: URL: https://github.com/apache/spark/pull/46562#issuecomment-2109015398 @viirya @cloud-fan thanks for the review. Merging to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46494: URL: https://github.com/apache/spark/pull/46494#discussion_r1599216897 ## core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala: ## @@ -100,7 +99,7 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S // Property files may contain sensitive information, so redact before printing if (verbose) { Utils.redact(properties).foreach { case (k, v) => - logInfo(s"Adding default property: $k=$v") + logInfo(log"Adding default property: ${MDC(PROPERTY_NAME, k)}=${MDC(VALUE, v)}") Review Comment: ```suggestion logInfo(log"Adding default property: ${MDC(KEY, k)}=${MDC(VALUE, v)}") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46494: URL: https://github.com/apache/spark/pull/46494#discussion_r1599216687 ## core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala: ## @@ -90,7 +89,7 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S lazy val defaultSparkProperties: HashMap[String, String] = { val defaultProperties = new HashMap[String, String]() if (verbose) { - logInfo(s"Using properties file: $propertiesFile") + logInfo(log"Using properties file: ${MDC(PROPERTY_FILE, propertiesFile)}") Review Comment: Let's simply use PATH -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46494: URL: https://github.com/apache/spark/pull/46494#discussion_r1599216356 ## core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala: ## @@ -955,11 +959,12 @@ private[spark] class SparkSubmit extends Logging { } if (args.verbose) { - logInfo(s"Main class:\n$childMainClass") - logInfo(s"Arguments:\n${childArgs.mkString("\n")}") + logInfo(log"Main class:\n${MDC(LogKeys.CLASS_NAME, childMainClass)}") + logInfo(log"Arguments:\n${MDC(LogKeys.ARGS, childArgs.mkString("\n"))}") // sysProps may contain sensitive information, so redact before printing - logInfo(s"Spark config:\n${Utils.redact(sparkConf.getAll.toMap).sorted.mkString("\n")}") - logInfo(s"Classpath elements:\n${childClasspath.mkString("\n")}") + logInfo(log"Spark config:\n" + + log"${MDC(LogKeys.CONFIG, Utils.redact(sparkConf.getAll.toMap).sorted.mkString("\n"))}") + logInfo(log"Classpath elements:\n${MDC(LogKeys.CLASS_PATH, childClasspath.mkString("\n"))}") Review Comment: ```suggestion logInfo(log"Classpath elements:\n${MDC(LogKeys.CLASS_PATHS, childClasspath.mkString("\n"))}") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite [spark]
cloud-fan commented on PR #46560: URL: https://github.com/apache/spark/pull/46560#issuecomment-2109004973 closing in favor of https://github.com/apache/spark/pull/46562 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite [spark]
cloud-fan closed pull request #46560: [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite URL: https://github.com/apache/spark/pull/46560 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46494: URL: https://github.com/apache/spark/pull/46494#discussion_r1599208898 ## core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala: ## @@ -278,11 +278,12 @@ private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long, serializedO } case None => val estimatedTotalSize = Utils.bytesToString(numBlocks.toLong * blockSize) -logInfo(s"Started reading broadcast variable $id with $numBlocks pieces " + - s"(estimated total size $estimatedTotalSize)") +logInfo(log"Started reading broadcast variable ${MDC(BROADCAST_ID, id)} with ${MDC(NUM_BROADCAST_BLOCK, numBlocks)} pieces " + + log"(estimated total size ${MDC(NUM_BYTES, estimatedTotalSize)})") val startTimeNs = System.nanoTime() val blocks = readBlocks() -logInfo(s"Reading broadcast variable $id took ${Utils.getUsedTimeNs(startTimeNs)}") +logInfo(log"Reading broadcast variable ${MDC(BROADCAST_ID, id)}" + + log" took ${MDC(TOTAL_TIME, Utils.getUsedTimeNs(startTimeNs))} ms") Review Comment: This is NS instead of ms -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48256][BUILD] Add a rule to check file headers for the java side, and fix inconsistent files [spark]
panbingkun commented on PR #46557: URL: https://github.com/apache/spark/pull/46557#issuecomment-2108997622 cc @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48017] Add Spark application submission worker for operator [spark-kubernetes-operator]
jiangzho commented on code in PR #10: URL: https://github.com/apache/spark-kubernetes-operator/pull/10#discussion_r1599198732 ## gradle.properties: ## @@ -18,17 +18,23 @@ group=org.apache.spark.k8s.operator version=0.1.0 -fabric8Version=6.12.1 +# Caution: fabric8 version should be aligned with Spark dependency +fabric8Version=6.7.2 commonsLang3Version=3.14.0 commonsIOVersion=2.16.1 lombokVersion=1.18.32 +#Spark +scalaVersion=2.12 Review Comment: yup! upgrading to 2.13 along with 4.0.0-preview1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]
ianmcook commented on code in PR #46529: URL: https://github.com/apache/spark/pull/46529#discussion_r1599189646 ## python/pyspark/sql/connect/session.py: ## @@ -474,35 +475,40 @@ def createDataFrame( _table: Optional[pa.Table] = None -if isinstance(data, pd.DataFrame): +if isinstance(data, pd.DataFrame) or isinstance(data, pa.Table): # Logic was borrowed from `_create_from_pandas_with_arrow` in # `pyspark.sql.pandas.conversion.py`. Should ideally deduplicate the logics. # If no schema supplied by user then get the names of columns only if schema is None: Review Comment: I flattened the conditionals. Please take a look: https://github.com/apache/spark/pull/46529/commits/795d01a1fd96fcbe80432d74be50382d350edf57?diff=split&w=0 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48209][CORE] Common (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46493: URL: https://github.com/apache/spark/pull/46493#discussion_r1599127368 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -292,6 +301,7 @@ object LogKeys { case object LOADED_VERSION extends LogKey case object LOAD_FACTOR extends LogKey case object LOAD_TIME extends LogKey + case object LOCAL_DIRS extends LogKey Review Comment: yes, merge into PATHS -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48233][SQL][STREAMING] Tests for non-stateful streaming on columns with collations [spark]
HeartSaVioR commented on code in PR #46247: URL: https://github.com/apache/spark/pull/46247#discussion_r1599101022 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingDeduplicationSuite.scala: ## @@ -484,6 +486,52 @@ class StreamingDeduplicationSuite extends StateStoreMetricsTest { CheckLastBatch(("c", 9, "c")) ) } + + test("collation aware deduplication") { +val inputData = MemoryStream[(String, Int)] +val result = inputData.toDF() + .select(col("_1") +.try_cast(StringType("UNICODE")).as("str"), Review Comment: While we are here, I see UNICODE is binary equality but non-binary ordering. Does this still ensure that we can put this into RocksDB which key is binary sorted and find the key group based on prefix of key including this column? E.g. Say we have two columns, dept (String with UNICODE collation), session start (timestamp) as grouping key, and want to scan all grouping keys which are having dept as 'dept1'. This is required for several operations like session window aggregation. My gut feeling is yes, but I would like to double confirm. ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingDeduplicationSuite.scala: ## @@ -484,6 +486,52 @@ class StreamingDeduplicationSuite extends StateStoreMetricsTest { CheckLastBatch(("c", 9, "c")) ) } + + test("collation aware deduplication") { +val inputData = MemoryStream[(String, Int)] +val result = inputData.toDF() + .select(col("_1") +.try_cast(StringType("UNICODE")).as("str"), +col("_2").as("int")) + .dropDuplicates("str") + +testStream(result, Append)( + AddData(inputData, "a" -> 1), + CheckLastBatch("a" -> 1), + assertNumStateRows(total = 1, updated = 1, droppedByWatermark = 0), + AddData(inputData, "a" -> 2), // Dropped + CheckLastBatch(), + assertNumStateRows(total = 1, updated = 0, droppedByWatermark = 0), + // scalastyle:off + AddData(inputData, "ä" -> 1), + CheckLastBatch("ä" -> 1), + // scalastyle:on + assertNumStateRows(total = 2, updated = 1, droppedByWatermark = 0) +) + } + + test("non-binary collation aware deduplication not supported") { +val inputData = MemoryStream[(String)] +val result = inputData.toDF() + .select(col("value") +.try_cast(StringType("UTF8_BINARY_LCASE")).as("str")) Review Comment: AFAIK, LCASE means comparing as lowercase, so yes it's bound to non-binary equality. ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala: ## @@ -1364,6 +1364,35 @@ class StreamingQuerySuite extends StreamTest with BeforeAndAfter with Logging wi ) } + test("Collation aware streaming") { +withTable("parquet_streaming_tbl") { + spark.sql( +""" + |CREATE TABLE parquet_streaming_tbl + |( + | key STRING COLLATE UTF8_BINARY_LCASE, + | value_stream INTEGER + |) USING parquet""".stripMargin) + + val streamDf = spark.readStream.table("parquet_streaming_tbl") + val filteredDf = streamDf.filter("key = 'aaa'") + + val clock = new StreamManualClock() + testStream(filteredDf)( +StartStream(triggerClock = clock, trigger = Trigger.ProcessingTime(100)), +Execute { _ => + spark.createDataFrame(Seq("aaa" -> 1, "AAA" -> 2, "bbb" -> 3, "aa" -> 4)) Review Comment: Not sure I understand. streamDf has non-default collation, UTF8_BINARY_LCASE. What do you mean by "incoming" stream? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
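For context, a small spark-shell check of the equality semantics exercised by the test above (hypothetical, assuming a build with collation support and an active `SparkSession` named `spark`): under `UTF8_BINARY_LCASE` the filter `key = 'aaa'` also matches `'AAA'`, while the default binary collation does not.
```
// Default (binary) collation: case-sensitive, so this is false.
spark.sql("SELECT 'aaa' = 'AAA' AS binary_eq").show()

// Lowercase collation on one side makes the comparison case-insensitive: true.
spark.sql("SELECT collate('aaa', 'UTF8_BINARY_LCASE') = 'AAA' AS lcase_eq").show()
```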
Re: [PR] [SPARK-48253][PS][SQL] Support default mode for Pandas API on Spark [spark]
ueshin commented on code in PR #46553: URL: https://github.com/apache/spark/pull/46553#discussion_r1599081284 ## python/pyspark/pandas/config.py: ## @@ -374,8 +375,10 @@ def get_option(key: str, default: Union[Any, _NoValueType] = _NoValue) -> Any: default = _options_dict[key].default _options_dict[key].validate(default) spark_session = default_session() - -return json.loads(spark_session.conf.get(_key_format(key), default=json.dumps(default))) +if get_default_mode(): +return json.dumps(default) Review Comment: We don't need `json.dumps` here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47976][Python] Add Python UDTF DataFrame API to return result of 'analyze' method as output table [spark]
dtenedor closed pull request #46234: [SPARK-47976][Python] Add Python UDTF DataFrame API to return result of 'analyze' method as output table URL: https://github.com/apache/spark/pull/46234 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47976][Python] Add Python UDTF DataFrame API to return result of 'analyze' method as output table [spark]
dtenedor commented on PR #46234: URL: https://github.com/apache/spark/pull/46234#issuecomment-2108777240 I am going to close this since I am getting no response on the PR; will reopen later if we want to implement this feature. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44953][CORE] Log a warning when shuffle tracking is enabled along side another DA supported mechanism [spark]
holdenk commented on PR #45454: URL: https://github.com/apache/spark/pull/45454#issuecomment-2108758472 Merged to current branch -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44953][CORE] Log a warning when shuffle tracking is enabled along side another DA supported mechanism [spark]
asfgit closed pull request #45454: [SPARK-44953][CORE] Log a warning when shuffle tracking is enabled along side another DA supported mechanism URL: https://github.com/apache/spark/pull/45454 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46395][CORE] Assign Spark configs to groups for use in documentation [spark]
holdenk commented on PR #44755: URL: https://github.com/apache/spark/pull/44755#issuecomment-2108746918 My bad, GitHub staged my review comment for some reason; it should be visible now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46395][CORE] Assign Spark configs to groups for use in documentation [spark]
holdenk commented on code in PR #44755: URL: https://github.com/apache/spark/pull/44755#discussion_r1580256318 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -5704,10 +5704,9 @@ class SQLConf extends Serializable with Logging with SqlApiConf { settings.synchronized { settings.asScala.toMap } /** - * Return all the configuration definitions that have been defined in [[SQLConf]]. Each - * definition contains key, defaultValue and doc. + * Return all the public configuration definitions that have been defined in [[SQLConf]]. */ - def getAllDefinedConfs: Seq[(String, String, String, String)] = { + def getAllDefinedConfs: Seq[(String, String, String, String, Set[String])] = { Review Comment: This does change a public API's type signature. I'm not sure it's super important or widely used, but given that the stated goal of the PR is to improve the docs, do we need this part? Or could a new internal API be introduced if we need this functionality? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-27900][CORE][K8s] Add `spark.driver.killOnOOMError` flag in cluster mode [spark]
holdenk commented on PR #26161: URL: https://github.com/apache/spark/pull/26161#issuecomment-2108745319 I don't think so. We manually set kill on OOM in our config. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-36691][PYTHON] PythonRunner failed should pass error message to ApplicationMaster too [spark]
holdenk commented on code in PR #33934: URL: https://github.com/apache/spark/pull/33934#discussion_r1599043338 ## common/utils/src/main/scala/org/apache/spark/SparkException.scala: ## @@ -145,8 +148,10 @@ private[spark] class SparkDriverExecutionException(cause: Throwable) * Exception thrown when the main user code is run as a child process (e.g. pyspark) and we want * the parent SparkSubmit process to exit with the same exit code. */ -private[spark] case class SparkUserAppException(exitCode: Int) - extends SparkException(s"User application exited with $exitCode") +private[spark] case class SparkUserAppException(exitCode: Int, errorMsg: Option[String] = None) + extends SparkException( +log"User application exited with ${MDC(LogKeys.EXIT_CODE, exitCode)}".message + + errorMsg.map(error => s" and caused by\n$error").getOrElse("")) Review Comment: I'm not super familiar with MDC this is going to use a generic exit code mapping to determine what went wrong from the exit code? ## core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala: ## @@ -92,15 +92,16 @@ object PythonRunner { // see https://github.com/numpy/numpy/issues/10455 sparkConf.getOption("spark.driver.cores").foreach(env.put("OMP_NUM_THREADS", _)) } -builder.redirectErrorStream(true) // Ugly but needed for stdout and stderr to synchronize Review Comment: Why is this not needed anymore? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org