Re: [PR] [SPARK-48215][SQL] Extending support for collated strings on date_format expression [spark]
uros-db commented on code in PR #46561: URL: https://github.com/apache/spark/pull/46561#discussion_r1599471162

## sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala:
## @@ -1275,6 +1275,38 @@ class CollationSQLExpressionsSuite
     })
   }
+  test("DateFormat expression with collation") {
+    case class DateFormatTestCase[R](date: String, format: String, collation: String, result: R)
+    val testCases = Seq(
+      DateFormatTestCase("2021-01-01", "yyyy-MM-dd", "UTF8_BINARY", "2021-01-01"),
+      DateFormatTestCase("2021-01-01", "yyyy-dd", "UTF8_BINARY_LCASE", "2021-01"),
+      DateFormatTestCase("2021-01-01", "yyyy-MM-dd", "UNICODE", "2021-01-01"),
+      DateFormatTestCase("2021-01-01", "yyyy", "UNICODE_CI", "2021")
+    )
+
+    for {
+      collateDate <- Seq(true, false)
+      collateFormat <- Seq(true, false)
+    } {
+      testCases.foreach(t => {
+        val dateArg = if (collateDate) s"collate('${t.date}', '${t.collation}')" else s"'${t.date}'"
+        val formatArg =
+          if (collateFormat) {
+            s"collate('${t.format}', '${t.collation}')"
+          } else {
+            s"'${t.format}'"
+          }
+
+        withSQLConf(SqlApiConf.DEFAULT_COLLATION -> t.collation) {
+          val query = s"SELECT date_format(${dateArg}, ${formatArg})"

Review Comment: I suppose this was the reason this test was failing
Re: [PR] [SPARK-48215][SQL] Extending support for collated strings on date_format expression [spark]
uros-db commented on code in PR #46561: URL: https://github.com/apache/spark/pull/46561#discussion_r1599470164

## sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala:
## @@ -1275,6 +1275,38 @@ class CollationSQLExpressionsSuite
     })
   }
+  test("DateFormat expression with collation") {
+    case class DateFormatTestCase[R](date: String, format: String, collation: String, result: R)
+    val testCases = Seq(
+      DateFormatTestCase("2021-01-01", "yyyy-MM-dd", "UTF8_BINARY", "2021-01-01"),
+      DateFormatTestCase("2021-01-01", "yyyy-dd", "UTF8_BINARY_LCASE", "2021-01"),
+      DateFormatTestCase("2021-01-01", "yyyy-MM-dd", "UNICODE", "2021-01-01"),
+      DateFormatTestCase("2021-01-01", "yyyy", "UNICODE_CI", "2021")
+    )
+
+    for {
+      collateDate <- Seq(true, false)
+      collateFormat <- Seq(true, false)
+    } {
+      testCases.foreach(t => {
+        val dateArg = if (collateDate) s"collate('${t.date}', '${t.collation}')" else s"'${t.date}'"
+        val formatArg =
+          if (collateFormat) {
+            s"collate('${t.format}', '${t.collation}')"
+          } else {
+            s"'${t.format}'"
+          }
+
+        withSQLConf(SqlApiConf.DEFAULT_COLLATION -> t.collation) {
+          val query = s"SELECT date_format(${dateArg}, ${formatArg})"

Review Comment:
```suggestion
          val query = s"SELECT date_format('${dateArg}', '${formatArg}')"
```
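For readers following this thread, here is a standalone sketch of what the loop above interpolates for the first test case; the object name and the hard-coded values are illustrative only and are not part of the PR:

```scala
object DateFormatQuerySketch extends App {
  // Values taken from the first test case above; everything else is hypothetical.
  val date = "2021-01-01"; val format = "yyyy-MM-dd"; val collation = "UTF8_BINARY"

  for {
    collateDate <- Seq(true, false)
    collateFormat <- Seq(true, false)
  } {
    // When the flag is false the argument already carries its own single quotes,
    // so the query template must not add another pair around it.
    val dateArg = if (collateDate) s"collate('$date', '$collation')" else s"'$date'"
    val formatArg = if (collateFormat) s"collate('$format', '$collation')" else s"'$format'"
    println(s"SELECT date_format($dateArg, $formatArg)")
  }
}
```

Running it prints, among others, `SELECT date_format(collate('2021-01-01', 'UTF8_BINARY'), collate('yyyy-MM-dd', 'UTF8_BINARY'))` and `SELECT date_format('2021-01-01', 'yyyy-MM-dd')`.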
Re: [PR] [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation [spark]
mkaravel commented on code in PR #46511: URL: https://github.com/apache/spark/pull/46511#discussion_r1599455730

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
## @@ -34,6 +34,27 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  public static boolean lowercaseMatchFrom(final UTF8String l, final UTF8String r, int pos) {
+    if (pos < 0) return false;
+    for (int len = 0; len <= l.numChars() - pos; len++) {

Review Comment:
```suggestion
    for (int len = 0; len <= l.numChars() - pos; ++len) {
```

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
## @@ -34,6 +34,27 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  public static boolean lowercaseMatchFrom(final UTF8String l, final UTF8String r, int pos) {
+    if (pos < 0) return false;

Review Comment: Why do we return false in this case? Also why not make this an assert and make sure we never call this method with a negative `pos`?

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
## @@ -34,6 +34,27 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  public static boolean lowercaseMatchFrom(final UTF8String l, final UTF8String r, int pos) {
+    if (pos < 0) return false;
+    for (int len = 0; len <= l.numChars() - pos; len++) {
+      if (l.substring(pos, pos + len).toLowerCase().equals(r)) {
+        return true;
+      }
+    }
+    return false;
+  }
+
+  public static boolean lowercaseMatchUntil(final UTF8String l, final UTF8String r, int pos) {
+    if (pos > l.numChars()) return false;

Review Comment: Same comment regarding the bounds for `pos`.

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
## @@ -183,6 +204,19 @@ public static int findInSet(final UTF8String match, final UTF8String set, int co
     return 0;
   }
+  public static int lowercaseIndexOf(final UTF8String target, final UTF8String pattern,
+      final int start) {
+    if (pattern.numChars() == 0) return 0;

Review Comment: I would expect `start` to be returned here, not `0`.

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
## @@ -34,6 +34,27 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  public static boolean lowercaseMatchFrom(final UTF8String l, final UTF8String r, int pos) {
+    if (pos < 0) return false;
+    for (int len = 0; len <= l.numChars() - pos; len++) {
+      if (l.substring(pos, pos + len).toLowerCase().equals(r)) {

Review Comment:
```suggestion
      if (l.substring(pos, pos + len).toLowerCase().equals(r.toLowerCase())) {
```
Somehow we need to state that `r` is already in lowercase (comment and/or change the name of the variable).

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
## @@ -183,6 +204,19 @@ public static int findInSet(final UTF8String match, final UTF8String set, int co
     return 0;
   }
+  public static int lowercaseIndexOf(final UTF8String target, final UTF8String pattern,
+      final int start) {
+    if (pattern.numChars() == 0) return 0;
+    int lenHaystack = target.numChars(), lenNeedle = pattern.numChars();
+    final UTF8String needle = pattern.toLowerCase();
+    for (int i = start; i <= (lenHaystack - lenNeedle); i++) {

Review Comment:
```suggestion
    for (int i = start; i <= (lenHaystack - lenNeedle); ++i) {
```

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
## @@ -118,7 +118,8 @@ public static boolean execBinary(final UTF8String l, final UTF8String r) {
       return l.contains(r);
     }
     public static boolean execLowercase(final UTF8String l, final UTF8String r) {
-      return l.containsInLowerCase(r);
+      if (r.numBytes() == 0) return true;

Review Comment: Do you need this? There is special case inside `lowercaseIndexOf` that handles it.

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
## @@ -156,7 +157,8 @@ public static boolean execBinary(final UTF8String l, final UTF8String r) {
       return l.startsWith(r);
     }
     public static boolean execLowercase(final UTF8String l, final UTF8String r) {
-      return l.startsWithInLowerCase(r);
+      if (r.numBytes() == 0) return true;

Review Comment: Same here.

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwar
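For readers following this review, here is a rough sketch of the matching semantics under discussion, written against plain `String` rather than `UTF8String`. The method names mirror the diff above, but the bodies are only an illustration of the intent (pattern pre-lowercased, variable-length match windows), not the actual Spark implementation:

```scala
object LowercaseMatchSketch {
  // Does any substring of `l` starting at `pos` lowercase to the (already
  // lowercased) pattern `r`? Lengths can differ because lowercasing some
  // characters (e.g. 'İ') changes the number of code points.
  def lowercaseMatchFrom(l: String, r: String, pos: Int): Boolean = {
    require(pos >= 0, "pos must be non-negative")
    (0 to l.length - pos).exists { len =>
      l.substring(pos, pos + len).toLowerCase == r
    }
  }

  // First position in `target` at or after `start` where the lowercased pattern matches.
  def lowercaseIndexOf(target: String, pattern: String, start: Int): Int = {
    if (pattern.isEmpty) return start // the review asks for `start` here, not 0
    val needle = pattern.toLowerCase
    (start to target.length).find(pos => lowercaseMatchFrom(target, needle, pos)).getOrElse(-1)
  }

  def main(args: Array[String]): Unit = {
    println(lowercaseMatchFrom("The KKelvin", "kkelvin", 4)) // true: "KKelvin" lowercases to "kkelvin"
    println(lowercaseIndexOf("The KKelvin", "KKELVIN", 0))   // 4
  }
}
```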
Re: [PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR commented on PR #46569: URL: https://github.com/apache/spark/pull/46569#issuecomment-2109409434 Also cherry-picked to 3.5 as well, as it's a clean cherry-pick.
Re: [PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR closed pull request #46569: [SPARK-48267][SS] Regression e2e test with SPARK-47305 URL: https://github.com/apache/spark/pull/46569
Re: [PR] [SPARK-47599][MLLIB] MLLib: Migrate logWarn with variables to structured logging framework [spark]
panbingkun commented on code in PR #46527: URL: https://github.com/apache/spark/pull/46527#discussion_r1599456251

## mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala:
## @@ -179,8 +179,8 @@ class LinearSVC @Since("2.2.0") (
       maxBlockSizeInMB)
     if (dataset.storageLevel != StorageLevel.NONE) {
-      instr.logWarning(s"Input instances will be standardized, blockified to blocks, and " +

Review Comment: Delete the unnecessary string interpolator `s"..."`.
Re: [PR] [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation [spark]
mkaravel commented on code in PR #46511: URL: https://github.com/apache/spark/pull/46511#discussion_r1599447728

## common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java:
## @@ -102,20 +102,30 @@ public void testContains() throws SparkException {
     assertContains("äbćδe", "ÄbćδE", "UNICODE_CI", true);
     assertContains("äbćδe", "ÄBcΔÉ", "UNICODE_CI", false);
     // Case-variable character length
-    assertContains("abİo12", "i̇o", "UNICODE_CI", true);
-    assertContains("abi̇o12", "İo", "UNICODE_CI", true);
-    assertContains("the İodine", "the i̇odine", "UTF8_BINARY_LCASE", true);
-    assertContains("the i̇odine", "the İodine", "UTF8_BINARY_LCASE", true);
-    assertContains("The İodiNe", " i̇oDin", "UTF8_BINARY_LCASE", true);
-    assertContains("İodiNe", "i̇oDine", "UTF8_BINARY_LCASE", true);
-    assertContains("İodiNe", " i̇oDin", "UTF8_BINARY_LCASE", false);
-    // Characters with the same binary lowercase representation
-    assertContains("The Kelvin.", "Kelvin", "UTF8_BINARY_LCASE", true);
-    assertContains("The Kelvin.", "Kelvin", "UTF8_BINARY_LCASE", true);
-    assertContains("The KKelvin.", "KKelvin", "UTF8_BINARY_LCASE", true);
-    assertContains("2 Kelvin.", "2 Kelvin", "UTF8_BINARY_LCASE", true);
-    assertContains("2 Kelvin.", "2 Kelvin", "UTF8_BINARY_LCASE", true);
-    assertContains("The KKelvin.", "KKelvin,", "UTF8_BINARY_LCASE", false);
+    assertContains("i̇", "i", "UNICODE_CI", false);
+    assertContains("i̇", "̇", "UNICODE_CI", false);

Review Comment:
```suggestion
    assertContains("i̇", "\u0307", "UNICODE_CI", false);
```

## common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java:
## @@ -279,20 +313,44 @@ public void testEndsWith() throws SparkException {
     assertEndsWith("ab世De", "AB世dE", "UNICODE_CI", true);
     assertEndsWith("äbćδe", "ÄbćδE", "UNICODE_CI", true);
     assertEndsWith("äbćδe", "ÄBcΔÉ", "UNICODE_CI", false);
-    // Case-variable character length
-    assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);
-    assertEndsWith("The i̇o", "İo", "UNICODE_CI", true);
-    assertEndsWith("the İodine", "the i̇odine", "UTF8_BINARY_LCASE", true);
-    assertEndsWith("the i̇odine", "the İodine", "UTF8_BINARY_LCASE", true);
-    assertEndsWith("The İodiNe", "i̇oDine", "UTF8_BINARY_LCASE", true);
-    assertEndsWith("The İodiNe", "i̇oDin", "UTF8_BINARY_LCASE", false);
     // Characters with the same binary lowercase representation
     assertEndsWith("The Kelvin", "Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The Kelvin", "Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The KKelvin", "KKelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The 2 Kelvin", "2 Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The 2 Kelvin", "2 Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The KKelvin", "KKelvin,", "UTF8_BINARY_LCASE", false);
+    // Case-variable character length
+    assertEndsWith("i̇", "̇", "UNICODE_CI", false);
+    assertEndsWith("i̇", "İ", "UNICODE_CI", true);
+    assertEndsWith("İ", "i", "UNICODE_CI", false);
+    assertEndsWith("İİİ", "i̇i̇", "UNICODE_CI", true);
+    assertEndsWith("İİİ", "ii̇", "UNICODE_CI", false);
+    assertEndsWith("İi̇İ", "İi̇", "UNICODE_CI", true);
+    assertEndsWith("i̇İi̇i̇", "̇İi̇İ", "UNICODE_CI", false);

Review Comment:
```suggestion
    assertEndsWith("i̇İi̇i̇", "\u0307İi̇İ", "UNICODE_CI", false);
```

## common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java:
## @@ -279,20 +313,44 @@ public void testEndsWith() throws SparkException {
     assertEndsWith("ab世De", "AB世dE", "UNICODE_CI", true);
     assertEndsWith("äbćδe", "ÄbćδE", "UNICODE_CI", true);
     assertEndsWith("äbćδe", "ÄBcΔÉ", "UNICODE_CI", false);
-    // Case-variable character length
-    assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);
-    assertEndsWith("The i̇o", "İo", "UNICODE_CI", true);
-    assertEndsWith("the İodine", "the i̇odine", "UTF8_BINARY_LCASE", true);
-    assertEndsWith("the i̇odine", "the İodine", "UTF8_BINARY_LCASE", true);
-    assertEndsWith("The İodiNe", "i̇oDine", "UTF8_BINARY_LCASE", true);
-    assertEndsWith("The İodiNe", "i̇oDin", "UTF8_BINARY_LCASE", false);
     // Characters with the same binary lowercase representation
     assertEndsWith("The Kelvin", "Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The Kelvin", "Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The KKelvin", "KKelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The 2 Kelvin", "2 Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The 2 Kelvin", "2 Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The KKelvin", "KKelvin,", "UTF8_BINARY_LCASE", false);
+    // Case-variable character length
+    assertEndsWith("i̇", "̇", "UNICODE_CI", false);
+    assertEndsWith("i̇", "İ", "UNICODE_CI", true);
+    asser
Re: [PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR commented on PR #46569: URL: https://github.com/apache/spark/pull/46569#issuecomment-2109390060 Thanks @viirya for quick reviewing! Merging to master.
Re: [PR] [SPARK-47599][MLLIB] MLLib: Migrate logWarn with variables to structured logging framework [spark]
panbingkun commented on code in PR #46527: URL: https://github.com/apache/spark/pull/46527#discussion_r1599449547

## mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala:
## @@ -451,8 +451,8 @@ class KMeans @Since("1.5.0") (
   private def trainWithBlock(dataset: Dataset[_], instr: Instrumentation) = {
     if (dataset.storageLevel != StorageLevel.NONE) {
-      instr.logWarning(s"Input vectors will be blockified to blocks, and " +

Review Comment: Delete the unnecessary string interpolator `s"..."`.
Re: [PR] [SPARK-48157][SQL] Add collation support for CSV expressions [spark]
cloud-fan closed pull request #46504: [SPARK-48157][SQL] Add collation support for CSV expressions URL: https://github.com/apache/spark/pull/46504
Re: [PR] [SPARK-48157][SQL] Add collation support for CSV expressions [spark]
cloud-fan commented on PR #46504: URL: https://github.com/apache/spark/pull/46504#issuecomment-2109363247 thanks, merging to master!
Re: [PR] [SPARK-48229][SQL] Add collation support for inputFile expressions [spark]
cloud-fan closed pull request #46503: [SPARK-48229][SQL] Add collation support for inputFile expressions URL: https://github.com/apache/spark/pull/46503
Re: [PR] [SPARK-48229][SQL] Add collation support for inputFile expressions [spark]
cloud-fan commented on PR #46503: URL: https://github.com/apache/spark/pull/46503#issuecomment-2109358482 thanks, merging to master!
Re: [PR] [SPARK-48265][SQL] Infer window group limit batch should do constant folding [spark]
beliefer commented on PR #46568: URL: https://github.com/apache/spark/pull/46568#issuecomment-2109342800 LGTM later.
Re: [PR] [SPARK-48251][BUILD] Disable `maven local cache` on GA's step `MIMA test` [spark]
panbingkun commented on code in PR #46551: URL: https://github.com/apache/spark/pull/46551#discussion_r1599416657

## project/SparkBuild.scala:
## @@ -273,10 +273,9 @@ object SparkBuild extends PomBuild {
     // Google Mirror of Maven Central, placed first so that it's used instead of flaky Maven Central.
     // See https://storage-download.googleapis.com/maven-central/index.html for more info.
     "gcs-maven-central-mirror" at "https://maven-central.storage-download.googleapis.com/maven2/",
-    DefaultMavenRepository,
-    Resolver.mavenLocal,
-    Resolver.file("ivyLocal", file(Path.userHome.absolutePath + "/.ivy2/local"))(Resolver.ivyStylePatterns)
-  ),
+    DefaultMavenRepository) ++
+    { if (sys.env.contains("SKIP_LOCAL_M2")) Nil else Seq(Resolver.mavenLocal) } :+

Review Comment: Let's make the conditions stricter. Updated.
[PR] [SPARK-46707][SQL][FOLLOWUP] Push down throwable predicate through aggregates [spark]
zml1206 opened a new pull request, #44975: URL: https://github.com/apache/spark/pull/44975

### What changes were proposed in this pull request?
Push down throwable predicates through aggregates and add a UT for "can't push down nondeterministic filter through aggregate".

### Why are the changes needed?
If we can push down a filter through Aggregate, it means the filter only references the grouping keys. The Aggregate operator can't reduce the grouping keys, so the filter won't see any new data after being pushed down. Therefore pushing a throwable filter through an aggregate does not change which exceptions are thrown.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT

### Was this patch authored or co-authored using generative AI tooling?
No.
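As a quick illustration of the argument above (the table, column names, and the ANSI division are hypothetical and not code from the PR): the filter below references only the grouping key and can throw under ANSI mode, yet pushing it beneath the aggregate exposes it to exactly the same key values the scan already produced.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ThrowableFilterPushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pushdown-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 10), (2, 20), (0, 30)).toDF("key", "value")

    // "100 / key" can throw (division by zero) when ANSI mode is enabled, but it
    // only reads the grouping key. Evaluating it below or above the Aggregate sees
    // the same set of key values, so pushing it down cannot introduce new failures.
    val result = df.groupBy($"key")
      .agg(sum($"value").as("total"))
      .filter(expr("100 / key > 10"))

    result.explain(true) // with the rule, the Filter can sit below the Aggregate
    spark.stop()
  }
}
```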
Re: [PR] [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory [spark]
LuciferYang commented on PR #46567: URL: https://github.com/apache/spark/pull/46567#issuecomment-2109338837 Thanks @HyukjinKwon @zhengruifeng @amaliujia
Re: [PR] [SPARK-48265][SQL] Infer window group limit batch should do constant folding [spark]
cloud-fan closed pull request #46568: [SPARK-48265][SQL] Infer window group limit batch should do constant folding URL: https://github.com/apache/spark/pull/46568
Re: [PR] [SPARK-48265][SQL] Infer window group limit batch should do constant folding [spark]
cloud-fan commented on PR #46568: URL: https://github.com/apache/spark/pull/46568#issuecomment-2109331575 thanks, merging to master/3.5!
Re: [PR] [SPARK-48027][SQL][FOLLOWUP] Add comments for the other code branch [spark]
beliefer commented on PR #46536: URL: https://github.com/apache/spark/pull/46536#issuecomment-2109330131 @cloud-fan Thank you!
Re: [PR] [SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns [spark]
cloud-fan closed pull request #46565: [SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns URL: https://github.com/apache/spark/pull/46565
Re: [PR] [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status [spark]
cloud-fan commented on code in PR #36564: URL: https://github.com/apache/spark/pull/36564#discussion_r1599405027

## core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala:
## @@ -155,9 +158,9 @@
       val taskId = TaskIdentifier(stageAttempt, attemptNumber)
       stageState.failures.getOrElseUpdate(partition, mutable.Set()) += taskId
       if (stageState.authorizedCommitters(partition) == taskId) {
-        logDebug(s"Authorized committer (attemptNumber=$attemptNumber, stage=$stage, " +
-          s"partition=$partition) failed; clearing lock")
-        stageState.authorizedCommitters(partition) = null
+        sc.foreach(_.dagScheduler.stageFailed(stage, s"Authorized committer " +
+          s"(attemptNumber=$attemptNumber, stage=$stage, partition=$partition) failed; " +
+          s"but task commit success, data duplication may happen."))
       }

Review Comment: +1. @AngersZh can you refine it?
Re: [PR] [SPARK-47599][MLLIB] MLLib: Migrate logWarn with variables to structured logging framework [spark]
panbingkun commented on PR #46527: URL: https://github.com/apache/spark/pull/46527#issuecomment-2109320498
> Let's continue on this one :)

Okay, let me check it first, 😄
Re: [PR] [SPARK-48027][SQL][FOLLOWUP] Add comments for the other code branch [spark]
cloud-fan closed pull request #46536: [SPARK-48027][SQL][FOLLOWUP] Add comments for the other code branch URL: https://github.com/apache/spark/pull/46536
Re: [PR] [SPARK-48027][SQL][FOLLOWUP] Add comments for the other code branch [spark]
cloud-fan commented on PR #46536: URL: https://github.com/apache/spark/pull/46536#issuecomment-2109312325 thanks, merging to master!
Re: [PR] [SPARK-48155][SQL] AQEPropagateEmptyRelation for join should check if remain child is just BroadcastQueryStageExec [spark]
cloud-fan commented on code in PR #46523: URL: https://github.com/apache/spark/pull/46523#discussion_r1599396352

## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEPropagateEmptyRelation.scala:
## @@ -82,6 +82,21 @@ object AQEPropagateEmptyRelation extends PropagateEmptyRelationBase {
     case _ => false
   }
+  // In AQE, query stage will be wrapped with LogicalQueryStage, if it's child is
+  // BroadcastQueryStage and join other side is empty relation like the follow plan:
+  //   Join
+  //   :- LocalTableScan , [a#23]
+  //   +- LogicalQueryStage(_, BroadcastQueryStage)
+  // After AQEPropagateEmptyRelation, the plan will be
+  //   Project
+  //   +- LogicalQueryStage(_, BroadcastQueryStage)
+  // Then after LogicalQueryStageStrategy, will only remain BroadcastQueryStage after project,
+  // the plan can't execute.

Review Comment: how hard is it to return the original query plan? Seems not hard as we just need to add a new `def returnSingleJoinSide` function in the base class, and unwrap broadcast stage in the AQE rule.
Re: [PR] [SPARK-48155][SQL] AQEPropagateEmptyRelation for join should check if remain child is just BroadcastQueryStageExec [spark]
cloud-fan commented on code in PR #46523: URL: https://github.com/apache/spark/pull/46523#discussion_r1599395049

## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEPropagateEmptyRelation.scala:
## @@ -82,6 +82,21 @@ object AQEPropagateEmptyRelation extends PropagateEmptyRelationBase {
     case _ => false
   }
+  // In AQE, query stage will be wrapped with LogicalQueryStage, if it's child is
+  // BroadcastQueryStage and join other side is empty relation like the follow plan:
+  //   Join
+  //   :- LocalTableScan , [a#23]
+  //   +- LogicalQueryStage(_, BroadcastQueryStage)
+  // After AQEPropagateEmptyRelation, the plan will be
+  //   Project
+  //   +- LogicalQueryStage(_, BroadcastQueryStage)
+  // Then after LogicalQueryStageStrategy, will only remain BroadcastQueryStage after project,
+  // the plan can't execute.

Review Comment: We can simply say
```
// A broadcast query stage can't be executed without the join operator.
// TODO: we can return the original query plan before broadcast.
```
Re: [PR] [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory [spark]
HyukjinKwon closed pull request #46567: [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory URL: https://github.com/apache/spark/pull/46567
Re: [PR] [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory [spark]
HyukjinKwon commented on PR #46567: URL: https://github.com/apache/spark/pull/46567#issuecomment-2109296872 Merged to master.
Re: [PR] [SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns [spark]
cloud-fan commented on PR #46565: URL: https://github.com/apache/spark/pull/46565#issuecomment-2109293728 thanks, merging to 3.5!
Re: [PR] [SPARK-48229][SQL] Add collation support for inputFile expressions [spark]
uros-db commented on code in PR #46503: URL: https://github.com/apache/spark/pull/46503#discussion_r1597873492

## sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala:
## @@ -1151,6 +1151,23 @@ class CollationSQLExpressionsSuite
     })
   }
+  test("Support InputFileName expression with collation") {
+    // Supported collations
+    Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI").foreach(collationName => {

Review Comment: I'd say most of the time tests are custom, depending on the function, and only rarely do we have functions with no parameters - so this is not very common (actually, this may be the only such function in this test suite). So I'd say there's no need for a preset suite-level collation name collection. On the other hand, we will soon have hundreds/thousands of collations in Spark, so these 4 collations will soon appear to be very random (i.e. we won't run this test for **all** supported collations, but these 4 should still be enough to cover all cases for all functions).
Re: [PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR commented on PR #46569: URL: https://github.com/apache/spark/pull/46569#issuecomment-2109270674 Looks like the behavior of PruneFilters is somewhat interesting in combination with filter pushdown... The more aggressively the filter can be pushed down, the smaller the subtree that PruneFilters ends up pruning. E.g.
```
val df = input1.toDF()
  .observe("myEvent", count(lit(1)).as("rows"))
  .filter(expr("false"))
```
to
```
== Analyzed Logical Plan ==
WriteToMicroBatchDataSource MemorySink, 5afa5bf2-9757-4e90-a604-bb2c4438ebec, Update, 0
+- Filter false
   +- CollectMetrics myEvent, [count(1) AS rows#3L], 0
      +- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource

== Optimized Logical Plan ==
WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: org.apache.spark.sql.execution.streaming.sources.MemoryStreamingWrite@64b874f0]
+- LocalRelation , [value#1]
```
vs
```
val df = input1.toDF()
  .withColumn("eventTime", timestamp_seconds($"value"))
  .withWatermark("eventTime", "0 second")
  .filter(expr("false"))
```
to
```
== Analyzed Logical Plan ==
WriteToMicroBatchDataSource MemorySink, 371ac05f-8c25-43b1-b82a-ed11e9a16c29, Update, 0
+- Filter false
   +- EventTimeWatermark eventTime#3: timestamp, 0 seconds
      +- Project [value#1, timestamp_seconds(value#1) AS eventTime#3]
         +- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource

== Optimized Logical Plan ==
WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: org.apache.spark.sql.execution.streaming.sources.MemoryStreamingWrite@4af7]
+- EventTimeWatermark eventTime#3: timestamp, 0 seconds
   +- LocalRelation , [value#1, eventTime#3]
```
For the former, the filter can't be pushed down under CollectMetrics, so PruneFilters takes effect and replaces both CollectMetrics and the relation with an empty relation. For the latter, EventTimeWatermark supports pushdown when the predicate doesn't compare against the event time column. In that case the filter is pushed down under EventTimeWatermark, and PruneFilters replaces only the relation while EventTimeWatermark is left as it is.
Re: [PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR commented on code in PR #46569: URL: https://github.com/apache/spark/pull/46569#discussion_r1599352970

## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryOptimizationCorrectnessSuite.scala:
## @@ -416,4 +416,39 @@ class StreamingQueryOptimizationCorrectnessSuite extends StreamTest {
     )
   }
 }
+
+  test("SPARK-48267: regression test, stream-stream union followed by stream-batch join") {
+    withTempDir { dir =>
+      val input1 = MemoryStream[Int]
+      val input2 = MemoryStream[Int]
+
+      val df1 = input1.toDF().withColumn("code", lit(1))

Review Comment: Given the sequence of optimizations in reproducer, it wouldn't happen if the union wasn't eliminated, as the value of 'code' won't be known during optimization phase. That said, if we call observe() or withWatermark() or etc in streaming DF to prevent optimizer to collapse everything at one side, the bug wouldn't be triggered.
Re: [PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR commented on code in PR #46569: URL: https://github.com/apache/spark/pull/46569#discussion_r1599352970

## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryOptimizationCorrectnessSuite.scala:
## @@ -416,4 +416,39 @@ class StreamingQueryOptimizationCorrectnessSuite extends StreamTest {
     )
   }
 }
+
+  test("SPARK-48267: regression test, stream-stream union followed by stream-batch join") {
+    withTempDir { dir =>
+      val input1 = MemoryStream[Int]
+      val input2 = MemoryStream[Int]
+
+      val df1 = input1.toDF().withColumn("code", lit(1))

Review Comment: Given the sequence of optimizations in reproducer, it wouldn't happen if the union wasn't eliminated, as the value of 'code' won't be known during optimization phase. That said, if we call observe() or withWatermark() or etc in these streaming DF, the bug wouldn't be triggered.
[PR] [[SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]
HyukjinKwon opened a new pull request, #46570: URL: https://github.com/apache/spark/pull/46570

### What changes were proposed in this pull request?
TBD

### Why are the changes needed?
TBD

### Does this PR introduce _any_ user-facing change?
TBD

### How was this patch tested?
TBD

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-47599][MLLIB] MLLib: Migrate logWarn with variables to structured logging framework [spark]
gengliangwang commented on PR #46527: URL: https://github.com/apache/spark/pull/46527#issuecomment-2109242560 Let's continue on this one :)
Re: [PR] [SPARK-48209][CORE] Common (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]
gengliangwang closed pull request #46493: [SPARK-48209][CORE] Common (java side): Migrate `error/warn/info` with variables to structured logging framework URL: https://github.com/apache/spark/pull/46493
Re: [PR] [SPARK-48209][CORE] Common (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]
gengliangwang commented on PR #46493: URL: https://github.com/apache/spark/pull/46493#issuecomment-2109237678 Thanks, merging to master
Re: [PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR commented on PR #46569: URL: https://github.com/apache/spark/pull/46569#issuecomment-2109235633 cc. @cloud-fan @viirya PTAL, thanks!
[PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]
HeartSaVioR opened a new pull request, #46569: URL: https://github.com/apache/spark/pull/46569

### What changes were proposed in this pull request?
This PR proposes to add a regression test (e2e) for SPARK-47305 (https://issues.apache.org/jira/browse/SPARK-47305). As of commit cae2248bc13 (pre-Spark 4.0), the test query is represented by the logical plans below:

> Batch 0
>> analyzed plan
```
WriteToMicroBatchDataSource MemorySink, 5067923b-e1d0-484c-914c-b111c9e60aac, Append, 0
+- Project [value#1]
   +- Join Inner, (cast(code#5 as bigint) = ref_code#14L)
      :- Union false, false
      :  :- Project [value#1, 1 AS code#5]
      :  :  +- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource
      :  +- Project [value#3, cast(code#9 as int) AS code#16]
      :     +- Project [value#3, null AS code#9]
      :        +- LocalRelation , [value#3]
      +- Project [id#12L AS ref_code#14L]
         +- Range (1, 5, step=1, splits=Some(2))
```
>> optimized plan
```
WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: ...]
+- Join Inner
   :- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource
   +- Project
      +- Filter (1 = id#12L)
         +- Range (1, 5, step=1, splits=Some(2))
```
> Batch 1
>> analyzed plan
```
WriteToMicroBatchDataSource MemorySink, d1c8be66-88e7-437a-9f25-6b87db8efe17, Append, 1
+- Project [value#1]
   +- Join Inner, (cast(code#5 as bigint) = ref_code#14L)
      :- Union false, false
      :  :- Project [value#1, 1 AS code#5]
      :  :  +- LocalRelation , [value#1]
      :  +- Project [value#3, cast(code#9 as int) AS code#16]
      :     +- Project [value#3, null AS code#9]
      :        +- StreamingDataSourceV2ScanRelation[value#3] MemoryStreamDataSource
      +- Project [id#12L AS ref_code#14L]
         +- Range (1, 5, step=1, splits=Some(2))
```
>> optimized plan
```
WriteToDataSourceV2 MicroBatchWrite[epoch: 1, writer: ...]
+- Join Inner
   :- StreamingDataSourceV2ScanRelation[value#3] MemoryStreamDataSource
   +- LocalRelation
```
Notice the difference in the optimized plan between batch 0 and batch 1. In the optimized plan for batch 1, the batch side is pruned out, which goes through the path of PruneFilters. The sequence of optimizations is:
1) the left stream side is collapsed into an empty local relation
2) the union is replaced with the subtree for the right stream side, as the left stream side is simply an empty local relation
3) the value of the 'code' column is now known to be 'null' and it's propagated to the join criteria (`null = ref_code`)
4) the join criteria is extracted out of the join and pushed to the batch side
5) the value of the 'ref_code' column can never be null, hence the filter is optimized to `filter false`
6) `filter false` triggers PruneFilters (where we fix a bug in SPARK-47305)

Before SPARK-47305, the new empty local relation was incorrectly marked as streaming.

### Why are the changes needed?
In the PR for SPARK-47305 we only added a unit test to verify the fix, but it was not an e2e test of the workload where we encountered the issue. Given the complexity of QO, it'd be ideal to have an e2e reproducer (even if simplified) as a regression test.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New UT.

### Was this patch authored or co-authored using generative AI tooling?
No.
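For reference, a rough sketch of the kind of reproducer described above, written against the `StreamTest` harness; the second input's `code` column, the range-based batch side, and the assertions are inferred from the plans shown and are illustrative, not the exact test added by the PR:

```scala
// Sketch only: assumes it runs inside a StreamTest suite, so testStream, AddData,
// ProcessAllAvailable, CheckAnswer and MemoryStream come from that harness.
val input1 = MemoryStream[Int]
val input2 = MemoryStream[Int]

val df1 = input1.toDF().withColumn("code", lit(1))
val df2 = input2.toDF().withColumn("code", lit(null))
val union = df1.union(df2)

// Batch side: ref_code 1..4, so only rows with code = 1 can ever match.
val refDf = spark.range(1, 5).select($"id".as("ref_code"))
val joined = union.join(refDf, union("code") === refDf("ref_code")).select("value")

testStream(joined)(
  AddData(input1, 1, 2, 3),
  ProcessAllAvailable(),   // batch 0: only the left stream has data
  AddData(input2, 4, 5),
  ProcessAllAvailable(),   // batch 1: the left side collapses to an empty local relation
  CheckAnswer(1, 2, 3)     // input2 rows carry code = null and never match
)
```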
Re: [PR] [SPARK-41794][FOLLOWUP] Add `try_remainder` to python API references [spark]
zhengruifeng commented on PR #46566: URL: https://github.com/apache/spark/pull/46566#issuecomment-2109228922 merged to master
Re: [PR] [SPARK-41794][FOLLOWUP] Add `try_remainder` to python API references [spark]
zhengruifeng closed pull request #46566: [SPARK-41794][FOLLOWUP] Add `try_remainder` to python API references URL: https://github.com/apache/spark/pull/46566
Re: [PR] [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46494: URL: https://github.com/apache/spark/pull/46494#discussion_r1599334395

## core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala:
## @@ -117,10 +117,11 @@ private[deploy] class HadoopFSDelegationTokenProvider
     filesystems.foreach { fs =>
       if (fsToExclude.contains(fs.getUri.getHost)) {
         // YARN RM skips renewing token with empty renewer
-        logInfo(s"getting token for: $fs with empty renewer to skip renewal")
+        logInfo(log"getting token for: ${MDC(FILE_SYSTEM, fs)} with empty renewer to skip renewal")
         Utils.tryLogNonFatalError { fs.addDelegationTokens("", creds) }
       } else {
-        logInfo(s"getting token for: $fs with renewer $renewer")
+        logInfo(log"getting token for: ${MDC(FILE_SYSTEM, fs)} with" +
+          log" renewer ${MDC(TOKEN_RE_NEWER, renewer)}")

Review Comment: @zeotuan one last comment here.
Re: [PR] [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46494: URL: https://github.com/apache/spark/pull/46494#discussion_r1599334199

## core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala:
## @@ -117,10 +117,11 @@ private[deploy] class HadoopFSDelegationTokenProvider
     filesystems.foreach { fs =>
      if (fsToExclude.contains(fs.getUri.getHost)) {
         // YARN RM skips renewing token with empty renewer
-        logInfo(s"getting token for: $fs with empty renewer to skip renewal")
+        logInfo(log"getting token for: ${MDC(FILE_SYSTEM, fs)} with empty renewer to skip renewal")
         Utils.tryLogNonFatalError { fs.addDelegationTokens("", creds) }
       } else {
-        logInfo(s"getting token for: $fs with renewer $renewer")
+        logInfo(log"getting token for: ${MDC(FILE_SYSTEM, fs)} with" +
+          log" renewer ${MDC(TOKEN_RE_NEWER, renewer)}")

Review Comment:
```suggestion
          log" renewer ${MDC(TOKEN_RENEWER, renewer)}")
```
Re: [PR] [SPARK-48251][BUILD] Disable `maven local cache` on GA's step `MIMA test` [spark]
LuciferYang commented on code in PR #46551: URL: https://github.com/apache/spark/pull/46551#discussion_r1599332094

## project/SparkBuild.scala:
## @@ -273,10 +273,9 @@ object SparkBuild extends PomBuild {
     // Google Mirror of Maven Central, placed first so that it's used instead of flaky Maven Central.
     // See https://storage-download.googleapis.com/maven-central/index.html for more info.
     "gcs-maven-central-mirror" at "https://maven-central.storage-download.googleapis.com/maven2/",
-    DefaultMavenRepository,
-    Resolver.mavenLocal,
-    Resolver.file("ivyLocal", file(Path.userHome.absolutePath + "/.ivy2/local"))(Resolver.ivyStylePatterns)
-  ),
+    DefaultMavenRepository) ++
+    { if (sys.env.contains("SKIP_LOCAL_M2")) Nil else Seq(Resolver.mavenLocal) } :+

Review Comment: hmm... so, even if the value of `SKIP_LOCAL_M2` is "", `Resolver.mavenLocal` will be disabled?
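A minimal sketch of the stricter check being discussed, assuming it lives in SparkBuild.scala where sbt's `Resolver` is already in scope; the helper name and the accepted values are illustrative, not necessarily what the PR ended up with:

```scala
// Skip the local Maven cache only when the variable is explicitly set to a
// truthy value, so SKIP_LOCAL_M2="" (or unset) keeps the default resolver chain.
def skipLocalM2: Boolean =
  sys.env.get("SKIP_LOCAL_M2").map(_.trim.toLowerCase).exists(v => v == "1" || v == "true")

val extraLocalResolvers =
  if (skipLocalM2) Nil
  else Seq(Resolver.mavenLocal)
```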
Re: [PR] [SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns [spark]
yaooqinn commented on PR #46565: URL: https://github.com/apache/spark/pull/46565#issuecomment-2109177057 Can we port the PR description here too?
Re: [PR] [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory [spark]
LuciferYang commented on PR #46567: URL: https://github.com/apache/spark/pull/46567#issuecomment-2109176293
> can you write from where to where in the pr description

done
Re: [PR] [SPARK-48241][SQL] CSV parsing failure with char/varchar type columns [spark]
liujiayi771 commented on PR #46537: URL: https://github.com/apache/spark/pull/46537#issuecomment-2109168378
> it has conflicts with 3.5, can you create a new backport PR?

Created a backport PR in https://github.com/apache/spark/pull/46565.
Re: [PR] [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory [spark]
HyukjinKwon commented on PR #46567: URL: https://github.com/apache/spark/pull/46567#issuecomment-2109164398 can you write from where to where in the pr description -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory [spark]
LuciferYang commented on PR #46567: URL: https://github.com/apache/spark/pull/46567#issuecomment-2109161934 cc @zhengruifeng @amaliujia Let's test first -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48265][SQL] Infer window group limit batch should do constant folding [spark]
AngersZh commented on PR #46568: URL: https://github.com/apache/spark/pull/46568#issuecomment-2109161898 ping @cloud-fan cc @beliefer -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-48265][SQL] Infer window group limit batch should do constant folding [spark]
AngersZh opened a new pull request, #46568: URL: https://github.com/apache/spark/pull/46568 ### What changes were proposed in this pull request? After `PropagateEmptyRelation`, a plan may end up with a doubled local limit. For example,
```
GlobalLimit 21
+- LocalLimit 21
!  +- Union false, false
!     :- LocalLimit 21
!     :  +- Project [item_id#647L]
!     :     +- Filter ()
!     :        +- Relation db.table[,... 91 more fields] parquet
!     +- LocalLimit 21
!        +- Project [item_id#738L]
!           +- LocalRelation , [, ... 91 more fields]
```
becomes
```
GlobalLimit 21
+- LocalLimit 21
   +- LocalLimit 21
      +- Project [item_id#647L]
         +- Filter ()
            +- Relation db.table[,... 91 more fields] parquet
```
and after the `Infer window group limit batch` batch's `EliminateLimits` it becomes
```
GlobalLimit 21
+- LocalLimit least(21, 21)
   +- Project [item_id#647L]
      +- Filter ()
         +- Relation db.table[,... 91 more fields] parquet
```
This cannot be simplified further because the batch is missing a `ConstantFolding` rule to fold `least(21, 21)` back into a literal. ### Why are the changes needed? Fix a bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
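To make the gap concrete, a small Catalyst sketch (an illustration only, not the PR's actual rule wiring): the `least(21, 21)` expression left behind by `EliminateLimits` is foldable but not yet a `Literal`, so rules that expect a literal limit cannot fire until `ConstantFolding` rewrites it.
```
import org.apache.spark.sql.catalyst.expressions.{Least, Literal}
import org.apache.spark.sql.catalyst.optimizer.ConstantFolding
import org.apache.spark.sql.catalyst.plans.logical.{LocalLimit, OneRowRelation}

// Shape of what EliminateLimits leaves after merging the two LocalLimit(21) nodes.
val merged = LocalLimit(Least(Seq(Literal(21), Literal(21))), OneRowRelation())

// ConstantFolding evaluates the foldable least(21, 21) and turns it back into 21.
val folded = ConstantFolding(merged)
println(folded) // LocalLimit 21, +- OneRowRelation
```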
[PR] [SPARK-48266][CONNECT] Move package object `org.apache.spark.sql.connect.dsl` to test directory [spark]
LuciferYang opened a new pull request, #46567: URL: https://github.com/apache/spark/pull/46567 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-41794][SQL] Add `try_remainder` function and re-enable column tests [spark]
zhengruifeng commented on PR #46434: URL: https://github.com/apache/spark/pull/46434#issuecomment-2109157257 add it to python API references in https://github.com/apache/spark/pull/46566 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-41794][FOLLOWUP] Add `try_remainder` to python API references [spark]
zhengruifeng opened a new pull request, #46566: URL: https://github.com/apache/spark/pull/46566 ### What changes were proposed in this pull request? Add `try_remainder` to python API references ### Why are the changes needed? new methods should be added to API references ### Does this PR introduce _any_ user-facing change? doc changes ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-36691][PYTHON] PythonRunner failed should pass error message to ApplicationMaster too [spark]
AngersZh commented on code in PR #33934: URL: https://github.com/apache/spark/pull/33934#discussion_r1599293672 ## core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala: ## @@ -92,15 +92,16 @@ object PythonRunner { // see https://github.com/numpy/numpy/issues/10455 sparkConf.getOption("spark.driver.cores").foreach(env.put("OMP_NUM_THREADS", _)) } -builder.redirectErrorStream(true) // Ugly but needed for stdout and stderr to synchronize Review Comment: since we do this in `ProcessRedirectThread` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
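For readers following along, a rough sketch of the idea behind a dedicated redirect thread (generic Scala, not the actual `ProcessRedirectThread` helper added in this PR and not its real signature): instead of merging stderr into stdout with `redirectErrorStream(true)`, each stream is drained on its own thread, so the error text can also be captured and attached to the failure that is reported upstream.
```
import java.io.{BufferedReader, InputStream, InputStreamReader}

// Drain one stream of a child process on a dedicated daemon thread.
def drain(in: InputStream, sink: String => Unit): Thread = {
  val t = new Thread(() => {
    val reader = new BufferedReader(new InputStreamReader(in))
    Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach(sink)
  })
  t.setDaemon(true)
  t.start()
  t
}

val process = new ProcessBuilder("python3", "app.py").start()  // hypothetical command
val stderrBuffer = new StringBuilder
drain(process.getInputStream, println)                                   // forward stdout
drain(process.getErrorStream, l => stderrBuffer.append(l).append('\n'))  // keep stderr for the error message
```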
Re: [PR] [SPARK-48259][CONNECT][TESTS] Add 3 missing methods in dsl [spark]
zhengruifeng commented on code in PR #46559: URL: https://github.com/apache/spark/pull/46559#discussion_r1599292541 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala: ## @@ -513,6 +513,25 @@ package object dsl { freqItems(cols.toArray, support) def freqItems(cols: Seq[String]): Relation = freqItems(cols, 0.01) + + def sampleBy(col: String, fractions: Map[Any, Double], seed: Long): Relation = { Review Comment: I am fine to move `dsl` to `test` if nothing is broken -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48259][CONNECT][TESTS] Add 3 missing methods in dsl [spark]
zhengruifeng commented on PR #46559: URL: https://github.com/apache/spark/pull/46559#issuecomment-2109148423 thanks, merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48259][CONNECT][TESTS] Add 3 missing methods in dsl [spark]
zhengruifeng closed pull request #46559: [SPARK-48259][CONNECT][TESTS] Add 3 missing methods in dsl URL: https://github.com/apache/spark/pull/46559 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47301][SQL][TESTS] Fix flaky ParquetIOSuite [spark]
yaooqinn commented on PR #45403: URL: https://github.com/apache/spark/pull/45403#issuecomment-2109146927 +1 @cloud-fan Since the LOCs have been moved to `ParquetIOWithoutOutputCommitCoordinationSuite`, we need a follow-up to revert them -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]
AngersZh commented on PR #44767: URL: https://github.com/apache/spark/pull/44767#issuecomment-2109143816 ping @cloud-fan @yaooqinn -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47172][CORE] Add support for AES-GCM for RPC encryption [spark]
mridulm commented on PR #46515: URL: https://github.com/apache/spark/pull/46515#issuecomment-2109142853 Took a quick pass through it, sorry for the delay. +CC @JoshRosen as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47172][CORE] Add support for AES-GCM for RPC encryption [spark]
mridulm commented on code in PR #46515: URL: https://github.com/apache/spark/pull/46515#discussion_r1598897504 ## docs/security.md: ## @@ -207,6 +207,14 @@ The following table describes the different options available for configuring th 2.2.0 + + spark.network.crypto.cipher + AES/CTR/NoPadding + +Cipher mode to use. Defaults "AES/CTR/NoPadding". Also supports "AES/GCM/NoPadding" Review Comment: Can we add more details on this ? What value should users set this to, why/impact, backward compatibility implication, ... ## common/network-common/src/main/java/org/apache/spark/network/crypto/GcmTransportCipher.java: ## @@ -0,0 +1,147 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.network.crypto; + +import com.google.common.annotations.VisibleForTesting; +import com.google.crypto.tink.subtle.AesGcmJce; +import io.netty.buffer.Unpooled; +import io.netty.channel.*; +import org.apache.spark.network.util.AbstractFileRegion; +import io.netty.buffer.ByteBuf; + +import javax.crypto.spec.SecretKeySpec; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.channels.ClosedChannelException; +import java.nio.channels.WritableByteChannel; +import java.security.GeneralSecurityException; + +public class GcmTransportCipher implements TransportCipher { +private static final byte[] DEFAULT_AAD = new byte[0]; + +private final SecretKeySpec aesKey; +public GcmTransportCipher(SecretKeySpec aesKey) { +this.aesKey = aesKey; +} + +@VisibleForTesting +EncryptionHandler getEncryptionHandler() { +return new EncryptionHandler(); +} + +@VisibleForTesting +DecryptionHandler getDecryptionHandler() { +return new DecryptionHandler(); +} + +public void addToChannel(Channel ch) { +ch.pipeline() +.addFirst("GcmTransportEncryption", getEncryptionHandler()) +.addFirst("GcmTransportDecryption", getDecryptionHandler()); +} + +@VisibleForTesting +class EncryptionHandler extends ChannelOutboundHandlerAdapter { +@Override +public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) +throws Exception { +ByteBuffer inputBuffer; +int bytesToRead; +if (msg instanceof ByteBuf byteBuf) { +bytesToRead = byteBuf.readableBytes(); +// This is allocating a buffer that is the size of the input +inputBuffer = ByteBuffer.allocate(bytesToRead); +// This will block while copying +while (inputBuffer.position() < bytesToRead) { +byteBuf.readBytes(inputBuffer); +} +} else if (msg instanceof AbstractFileRegion fileRegion) { +bytesToRead = (int) fileRegion.count(); +// This is allocating a buffer that is the size of the input +inputBuffer = ByteBuffer.allocate(bytesToRead); +ByteBufferWriteableChannel writeableChannel = +new ByteBufferWriteableChannel(inputBuffer); +long transferred = 0; +// This will block while copying +while 
(transferred < bytesToRead) { +transferred += +fileRegion.transferTo(writeableChannel, fileRegion.transferred()); +} Review Comment: This can be quite large - and result in a nontrivial memory consumption. We will need to do something similar to `CtrTransportCipher.EncryptedMessage` and transfer in chunks. ## common/network-common/src/main/java/org/apache/spark/network/crypto/GcmTransportCipher.java: ## @@ -0,0 +1,147 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License
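As a generic illustration of the chunked-transfer suggestion in the review above (a hypothetical sketch, not the actual `GcmTransportCipher` or `CtrTransportCipher` code), the copy can go through a small fixed-size buffer so memory use stays bounded regardless of the payload size:
```
import java.nio.ByteBuffer
import java.nio.channels.{ReadableByteChannel, WritableByteChannel}

// Copy src to dst through a bounded buffer instead of allocating one buffer
// that is as large as the whole message or file region.
def copyInChunks(src: ReadableByteChannel, dst: WritableByteChannel,
                 chunkSize: Int = 32 * 1024): Long = {
  val buf = ByteBuffer.allocate(chunkSize)
  var total = 0L
  var read = src.read(buf)
  while (read >= 0) {
    buf.flip()
    while (buf.hasRemaining) total += dst.write(buf)
    buf.clear()
    read = src.read(buf)
  }
  total
}
```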
[PR] [SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns [spark]
liujiayi771 opened a new pull request, #46565: URL: https://github.com/apache/spark/pull/46565 ### What changes were proposed in this pull request? Backport the fix for the CSV parsing failure with char/varchar type columns to Spark 3.5. ### Why are the changes needed? For char and varchar types, Spark will convert them to `StringType` in `CharVarcharUtils.replaceCharVarcharWithStringInSchema` and record `__CHAR_VARCHAR_TYPE_STRING` in the metadata. The reason for the parsing failure is that the `StringType` columns in the `dataSchema` and `requiredSchema` of `UnivocityParser` are not consistent. The `StringType` in the `dataSchema` has metadata, while the metadata in the `requiredSchema` is empty. We need to retain the metadata when resolving the schema. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add a new test case in `CSVSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [Only Test] Test label profiler [spark]
panbingkun opened a new pull request, #46564: URL: https://github.com/apache/spark/pull/46564 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48155][SQL] AQEPropagateEmptyRelation for join should check if remain child is just BroadcastQueryStageExec [spark]
AngersZh commented on code in PR #46523: URL: https://github.com/apache/spark/pull/46523#discussion_r1599275544 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEPropagateEmptyRelation.scala: ## @@ -82,6 +82,11 @@ object AQEPropagateEmptyRelation extends PropagateEmptyRelationBase { case _ => false } + override protected def canPropagate(plan: LogicalPlan): Boolean = plan match { +case LogicalQueryStage(_, _: BroadcastQueryStageExec) => false Review Comment: Done ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala: ## @@ -65,6 +65,8 @@ abstract class PropagateEmptyRelationBase extends Rule[LogicalPlan] with CastSup private def nullValueProjectList(plan: LogicalPlan): Seq[NamedExpression] = plan.output.map{ a => Alias(cast(Literal(null), a.dataType), a.name)(a.exprId) } + protected def canPropagate(plan: LogicalPlan): Boolean = true Review Comment: Done ## sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala: ## @@ -2772,6 +2778,30 @@ class AdaptiveQueryExecSuite } } + test("SPARK-48155: AQEPropagateEmptyRelation check remained child for join") { +withSQLConf( + SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") { + val (_, adaptivePlan) = runAdaptiveAndVerifyResult( +""" + |SELECT /*+ BROADCAST(t3) */ t3.b, count(t3.a) FROM testData2 t1 Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47301][SQL][TESTS] Fix flaky ParquetIOSuite [spark]
cloud-fan commented on PR #45403: URL: https://github.com/apache/spark/pull/45403#issuecomment-2109105455 I think https://github.com/apache/spark/pull/46562 is a better fix, can we revert this workaround now? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48027][SQL][FOLLOWUP] Add comments for the other code branch [spark]
beliefer commented on code in PR #46536: URL: https://github.com/apache/spark/pull/46536#discussion_r1599263801 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala: ## @@ -152,7 +152,15 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J } else if (right.output.exists(_.semanticEquals(targetKey))) { extract(right, AttributeSet.empty, hasHitFilter = false, hasHitSelectiveFilter = false, currentPlan = right, targetKey = targetKey).orElse { -// We can also extract from the left side if the join keys are transitive. +// We can also extract from the left side if the join keys are transitive, and +// the left side always produces a superset output of join right keys. +// Let's look at an example +// left table: 1, 2, 3 +// right table, 3, 4 +// right outer join output: (3, 3), (null, 4) +// right key output: 3, 4 +// Any join side always produce a superset output of its corresponding +// join keys, but for transitive join keys we need to check the join type. Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
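A tiny spark-shell illustration of the example in the comment above (hypothetical, assuming an active `SparkSession` named `spark`): with a right outer join, the left-side join column does not cover every right key value, which is why the rule must check the join type before extracting a runtime filter from the other side of a transitive key.
```
import spark.implicits._

val left  = Seq(1, 2, 3).toDF("a")
val right = Seq(3, 4).toDF("b")

// Right outer join output: (3, 3) and (null, 4). The right keys {3, 4} are not
// covered by the left column, so a filter built from `a` could wrongly prune b = 4.
left.join(right, $"a" === $"b", "right_outer").show()
```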
[PR] [SPARK-48264][BUILD] Upgrade `datasketches-java` to 6.0.0 [spark]
panbingkun opened a new pull request, #46563: URL: https://github.com/apache/spark/pull/46563 ### What changes were proposed in this pull request? The PR aims to upgrade `datasketches-java` from `5.0.1` to `6.0.0` ### Why are the changes needed? The full release notes: - https://github.com/apache/datasketches-java/releases/tag/6.0.0 - https://github.com/apache/datasketches-java/releases/tag/5.0.2 (screenshot: https://github.com/apache/spark/assets/15246973/fff5905a-25e8-4e2f-9492-1b6099b2bd05) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]
alippai commented on PR #46529: URL: https://github.com/apache/spark/pull/46529#issuecomment-2109086611 This makes the usage so much easier, thanks! What will happen with the nanosecond timestamps? Truncated to milliseconds? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48253][PS][SQL] Support default mode for Pandas API on Spark [spark]
itholic closed pull request #46553: [SPARK-48253][PS][SQL] Support default mode for Pandas API on Spark URL: https://github.com/apache/spark/pull/46553 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48253][PS][SQL] Support default mode for Pandas API on Spark [spark]
itholic commented on PR #46553: URL: https://github.com/apache/spark/pull/46553#issuecomment-2109072613 On second thought, the performance benefits would not be very meaningful. Rather, it may pollute the config surface, so let me close it for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48253][PS][SQL] Support default mode for Pandas API on Spark [spark]
itholic commented on code in PR #46553: URL: https://github.com/apache/spark/pull/46553#discussion_r1599234279 ## python/pyspark/pandas/utils.py: ## @@ -492,6 +504,10 @@ def default_session() -> SparkSession: "from pandas API on Spark since pandas API on Spark follows " "the behavior of pandas, not SQL." ) +default_mode_value = get_default_mode() +if default_mode_value is None: +default_mode_from_conf = spark.conf.get("spark.sql.pyspark.pandas.defaultMode") == "true" Review Comment: Sure, let me set the default as "false" -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46707][SQL][FOLLOWUP] Push down throwable predicate through aggregates [spark]
github-actions[bot] closed pull request #44975: [SPARK-46707][SQL][FOLLOWUP] Push down throwable predicate through aggregates URL: https://github.com/apache/spark/pull/44975 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-42727][CORE] Fix can't executing spark commands in the root directory when local mode is specified [spark]
github-actions[bot] closed pull request #44990: [SPARK-42727][CORE] Fix can't executing spark commands in the root directory when local mode is specified URL: https://github.com/apache/spark/pull/44990 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48259][CONNECT][TESTS] Add 3 missing methods in dsl [spark]
amaliujia commented on code in PR #46559: URL: https://github.com/apache/spark/pull/46559#discussion_r1599231832 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala: ## @@ -513,6 +513,25 @@ package object dsl { freqItems(cols.toArray, support) def freqItems(cols: Seq[String]): Relation = freqItems(cols, 0.01) + + def sampleBy(col: String, fractions: Map[Any, Double], seed: Long): Relation = { Review Comment: This file used to be here because we assumed that the DSL here could be used to build the Scala client. But now this DSL is purely used for testing, so we can probably move it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48209][CORE] Common (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]
panbingkun commented on code in PR #46493: URL: https://github.com/apache/spark/pull/46493#discussion_r1599229097 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -292,6 +301,7 @@ object LogKeys { case object LOADED_VERSION extends LogKey case object LOAD_FACTOR extends LogKey case object LOAD_TIME extends LogKey + case object LOCAL_DIRS extends LogKey Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework [spark]
zeotuan commented on code in PR #46494: URL: https://github.com/apache/spark/pull/46494#discussion_r1599223748 ## core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala: ## @@ -278,11 +278,12 @@ private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long, serializedO } case None => val estimatedTotalSize = Utils.bytesToString(numBlocks.toLong * blockSize) -logInfo(s"Started reading broadcast variable $id with $numBlocks pieces " + - s"(estimated total size $estimatedTotalSize)") +logInfo(log"Started reading broadcast variable ${MDC(BROADCAST_ID, id)} with ${MDC(NUM_BROADCAST_BLOCK, numBlocks)} pieces " + + log"(estimated total size ${MDC(NUM_BYTES, estimatedTotalSize)})") val startTimeNs = System.nanoTime() val blocks = readBlocks() -logInfo(s"Reading broadcast variable $id took ${Utils.getUsedTimeNs(startTimeNs)}") +logInfo(log"Reading broadcast variable ${MDC(BROADCAST_ID, id)}" + + log" took ${MDC(TOTAL_TIME, Utils.getUsedTimeNs(startTimeNs))} ms") Review Comment: My bad, that is actually ms since `getUsedTimeNs` convert it to ms. But the function already provided the unit value -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite [spark]
gengliangwang closed pull request #46562: [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite URL: https://github.com/apache/spark/pull/46562 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite [spark]
gengliangwang commented on PR #46562: URL: https://github.com/apache/spark/pull/46562#issuecomment-2109015398 @viirya @cloud-fan thanks for the review. Merging to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46494: URL: https://github.com/apache/spark/pull/46494#discussion_r1599216897 ## core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala: ## @@ -100,7 +99,7 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S // Property files may contain sensitive information, so redact before printing if (verbose) { Utils.redact(properties).foreach { case (k, v) => - logInfo(s"Adding default property: $k=$v") + logInfo(log"Adding default property: ${MDC(PROPERTY_NAME, k)}=${MDC(VALUE, v)}") Review Comment: ```suggestion logInfo(log"Adding default property: ${MDC(KEY, k)}=${MDC(VALUE, v)}") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46494: URL: https://github.com/apache/spark/pull/46494#discussion_r1599216687 ## core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala: ## @@ -90,7 +89,7 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S lazy val defaultSparkProperties: HashMap[String, String] = { val defaultProperties = new HashMap[String, String]() if (verbose) { - logInfo(s"Using properties file: $propertiesFile") + logInfo(log"Using properties file: ${MDC(PROPERTY_FILE, propertiesFile)}") Review Comment: Let's simply use PATH -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46494: URL: https://github.com/apache/spark/pull/46494#discussion_r1599216356 ## core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala: ## @@ -955,11 +959,12 @@ private[spark] class SparkSubmit extends Logging { } if (args.verbose) { - logInfo(s"Main class:\n$childMainClass") - logInfo(s"Arguments:\n${childArgs.mkString("\n")}") + logInfo(log"Main class:\n${MDC(LogKeys.CLASS_NAME, childMainClass)}") + logInfo(log"Arguments:\n${MDC(LogKeys.ARGS, childArgs.mkString("\n"))}") // sysProps may contain sensitive information, so redact before printing - logInfo(s"Spark config:\n${Utils.redact(sparkConf.getAll.toMap).sorted.mkString("\n")}") - logInfo(s"Classpath elements:\n${childClasspath.mkString("\n")}") + logInfo(log"Spark config:\n" + + log"${MDC(LogKeys.CONFIG, Utils.redact(sparkConf.getAll.toMap).sorted.mkString("\n"))}") + logInfo(log"Classpath elements:\n${MDC(LogKeys.CLASS_PATH, childClasspath.mkString("\n"))}") Review Comment: ```suggestion logInfo(log"Classpath elements:\n${MDC(LogKeys.CLASS_PATHS, childClasspath.mkString("\n"))}") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite [spark]
cloud-fan commented on PR #46560: URL: https://github.com/apache/spark/pull/46560#issuecomment-2109004973 closing in favor of https://github.com/apache/spark/pull/46562 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite [spark]
cloud-fan closed pull request #46560: [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite URL: https://github.com/apache/spark/pull/46560 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46494: URL: https://github.com/apache/spark/pull/46494#discussion_r1599208898 ## core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala: ## @@ -278,11 +278,12 @@ private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long, serializedO } case None => val estimatedTotalSize = Utils.bytesToString(numBlocks.toLong * blockSize) -logInfo(s"Started reading broadcast variable $id with $numBlocks pieces " + - s"(estimated total size $estimatedTotalSize)") +logInfo(log"Started reading broadcast variable ${MDC(BROADCAST_ID, id)} with ${MDC(NUM_BROADCAST_BLOCK, numBlocks)} pieces " + + log"(estimated total size ${MDC(NUM_BYTES, estimatedTotalSize)})") val startTimeNs = System.nanoTime() val blocks = readBlocks() -logInfo(s"Reading broadcast variable $id took ${Utils.getUsedTimeNs(startTimeNs)}") +logInfo(log"Reading broadcast variable ${MDC(BROADCAST_ID, id)}" + + log" took ${MDC(TOTAL_TIME, Utils.getUsedTimeNs(startTimeNs))} ms") Review Comment: This is NS instead of ms -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48256][BUILD] Add a rule to check file headers for the java side, and fix inconsistent files [spark]
panbingkun commented on PR #46557: URL: https://github.com/apache/spark/pull/46557#issuecomment-2108997622 cc @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48017] Add Spark application submission worker for operator [spark-kubernetes-operator]
jiangzho commented on code in PR #10: URL: https://github.com/apache/spark-kubernetes-operator/pull/10#discussion_r1599198732 ## gradle.properties: ## @@ -18,17 +18,23 @@ group=org.apache.spark.k8s.operator version=0.1.0 -fabric8Version=6.12.1 +# Caution: fabric8 version should be aligned with Spark dependency +fabric8Version=6.7.2 commonsLang3Version=3.14.0 commonsIOVersion=2.16.1 lombokVersion=1.18.32 +#Spark +scalaVersion=2.12 Review Comment: yup! upgrading to 2.13 along with 4.0.0-preview1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]
ianmcook commented on code in PR #46529: URL: https://github.com/apache/spark/pull/46529#discussion_r1599189646 ## python/pyspark/sql/connect/session.py: ## @@ -474,35 +475,40 @@ def createDataFrame( _table: Optional[pa.Table] = None -if isinstance(data, pd.DataFrame): +if isinstance(data, pd.DataFrame) or isinstance(data, pa.Table): # Logic was borrowed from `_create_from_pandas_with_arrow` in # `pyspark.sql.pandas.conversion.py`. Should ideally deduplicate the logics. # If no schema supplied by user then get the names of columns only if schema is None: Review Comment: I flattened the conditionals. Please take a look: https://github.com/apache/spark/pull/46529/commits/795d01a1fd96fcbe80432d74be50382d350edf57?diff=split&w=0 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48209][CORE] Common (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46493: URL: https://github.com/apache/spark/pull/46493#discussion_r1599127368 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -292,6 +301,7 @@ object LogKeys { case object LOADED_VERSION extends LogKey case object LOAD_FACTOR extends LogKey case object LOAD_TIME extends LogKey + case object LOCAL_DIRS extends LogKey Review Comment: yes, merge into PATHS -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48233][SQL][STREAMING] Tests for non-stateful streaming on columns with collations [spark]
HeartSaVioR commented on code in PR #46247: URL: https://github.com/apache/spark/pull/46247#discussion_r1599101022 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingDeduplicationSuite.scala: ## @@ -484,6 +486,52 @@ class StreamingDeduplicationSuite extends StateStoreMetricsTest { CheckLastBatch(("c", 9, "c")) ) } + + test("collation aware deduplication") { +val inputData = MemoryStream[(String, Int)] +val result = inputData.toDF() + .select(col("_1") +.try_cast(StringType("UNICODE")).as("str"), Review Comment: While we are here, I see UNICODE is binary equality but non-binary ordering. Does this still ensure that we can put this into RocksDB which key is binary sorted and find the key group based on prefix of key including this column? E.g. Say we have two columns, dept (String with UNICODE collation), session start (timestamp) as grouping key, and want to scan all grouping keys which are having dept as 'dept1'. This is required for several operations like session window aggregation. My gut feeling is yes, but I would like to double confirm. ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingDeduplicationSuite.scala: ## @@ -484,6 +486,52 @@ class StreamingDeduplicationSuite extends StateStoreMetricsTest { CheckLastBatch(("c", 9, "c")) ) } + + test("collation aware deduplication") { +val inputData = MemoryStream[(String, Int)] +val result = inputData.toDF() + .select(col("_1") +.try_cast(StringType("UNICODE")).as("str"), +col("_2").as("int")) + .dropDuplicates("str") + +testStream(result, Append)( + AddData(inputData, "a" -> 1), + CheckLastBatch("a" -> 1), + assertNumStateRows(total = 1, updated = 1, droppedByWatermark = 0), + AddData(inputData, "a" -> 2), // Dropped + CheckLastBatch(), + assertNumStateRows(total = 1, updated = 0, droppedByWatermark = 0), + // scalastyle:off + AddData(inputData, "ä" -> 1), + CheckLastBatch("ä" -> 1), + // scalastyle:on + assertNumStateRows(total = 2, updated = 1, droppedByWatermark = 0) +) + } + + test("non-binary collation aware deduplication not supported") { +val inputData = MemoryStream[(String)] +val result = inputData.toDF() + .select(col("value") +.try_cast(StringType("UTF8_BINARY_LCASE")).as("str")) Review Comment: AFAIK, LCASE means comparing as lowercase, so yes it's bound to non-binary equality. ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala: ## @@ -1364,6 +1364,35 @@ class StreamingQuerySuite extends StreamTest with BeforeAndAfter with Logging wi ) } + test("Collation aware streaming") { +withTable("parquet_streaming_tbl") { + spark.sql( +""" + |CREATE TABLE parquet_streaming_tbl + |( + | key STRING COLLATE UTF8_BINARY_LCASE, + | value_stream INTEGER + |) USING parquet""".stripMargin) + + val streamDf = spark.readStream.table("parquet_streaming_tbl") + val filteredDf = streamDf.filter("key = 'aaa'") + + val clock = new StreamManualClock() + testStream(filteredDf)( +StartStream(triggerClock = clock, trigger = Trigger.ProcessingTime(100)), +Execute { _ => + spark.createDataFrame(Seq("aaa" -> 1, "AAA" -> 2, "bbb" -> 3, "aa" -> 4)) Review Comment: Not sure I understand. streamDf has non-default collation, UTF8_BINARY_LCASE. What do you mean by "incoming" stream? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
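For context, a small spark-shell check of the equality semantics exercised by the test above (hypothetical, assuming a build with collation support and an active `SparkSession` named `spark`): under `UTF8_BINARY_LCASE` the filter `key = 'aaa'` also matches `'AAA'`, while the default binary collation does not.
```
// Default (binary) collation: case-sensitive, so this is false.
spark.sql("SELECT 'aaa' = 'AAA' AS binary_eq").show()

// Lowercase collation on one side makes the comparison case-insensitive: true.
spark.sql("SELECT collate('aaa', 'UTF8_BINARY_LCASE') = 'AAA' AS lcase_eq").show()
```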
Re: [PR] [SPARK-48253][PS][SQL] Support default mode for Pandas API on Spark [spark]
ueshin commented on code in PR #46553: URL: https://github.com/apache/spark/pull/46553#discussion_r1599081284 ## python/pyspark/pandas/config.py: ## @@ -374,8 +375,10 @@ def get_option(key: str, default: Union[Any, _NoValueType] = _NoValue) -> Any: default = _options_dict[key].default _options_dict[key].validate(default) spark_session = default_session() - -return json.loads(spark_session.conf.get(_key_format(key), default=json.dumps(default))) +if get_default_mode(): +return json.dumps(default) Review Comment: We don't need `json.dumps` here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47976][Python] Add Python UDTF DataFrame API to return result of 'analyze' method as output table [spark]
dtenedor closed pull request #46234: [SPARK-47976][Python] Add Python UDTF DataFrame API to return result of 'analyze' method as output table URL: https://github.com/apache/spark/pull/46234 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47976][Python] Add Python UDTF DataFrame API to return result of 'analyze' method as output table [spark]
dtenedor commented on PR #46234: URL: https://github.com/apache/spark/pull/46234#issuecomment-2108777240 I am going to close this since I am getting no response on the PR; will reopen later if we want to implement this feature. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44953][CORE] Log a warning when shuffle tracking is enabled along side another DA supported mechanism [spark]
holdenk commented on PR #45454: URL: https://github.com/apache/spark/pull/45454#issuecomment-2108758472 Merged to current branch -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44953][CORE] Log a warning when shuffle tracking is enabled along side another DA supported mechanism [spark]
asfgit closed pull request #45454: [SPARK-44953][CORE] Log a warning when shuffle tracking is enabled along side another DA supported mechanism URL: https://github.com/apache/spark/pull/45454 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46395][CORE] Assign Spark configs to groups for use in documentation [spark]
holdenk commented on PR #44755: URL: https://github.com/apache/spark/pull/44755#issuecomment-2108746918 My bad, GitHub staged my review comment for some reason; it should be visible now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46395][CORE] Assign Spark configs to groups for use in documentation [spark]
holdenk commented on code in PR #44755: URL: https://github.com/apache/spark/pull/44755#discussion_r1580256318 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -5704,10 +5704,9 @@ class SQLConf extends Serializable with Logging with SqlApiConf { settings.synchronized { settings.asScala.toMap } /** - * Return all the configuration definitions that have been defined in [[SQLConf]]. Each - * definition contains key, defaultValue and doc. + * Return all the public configuration definitions that have been defined in [[SQLConf]]. */ - def getAllDefinedConfs: Seq[(String, String, String, String)] = { + def getAllDefinedConfs: Seq[(String, String, String, String, Set[String])] = { Review Comment: This does change a public API's type signature. I'm not sure it's super important or widely used, but given that the stated goal of the PR is to improve the docs, do we need this part? Or could a new internal API be introduced if we need this functionality? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-27900][CORE][K8s] Add `spark.driver.killOnOOMError` flag in cluster mode [spark]
holdenk commented on PR #26161: URL: https://github.com/apache/spark/pull/26161#issuecomment-2108745319 I don't think so. We manually set kill on OOM in our config. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-36691][PYTHON] PythonRunner failed should pass error message to ApplicationMaster too [spark]
holdenk commented on code in PR #33934: URL: https://github.com/apache/spark/pull/33934#discussion_r1599043338 ## common/utils/src/main/scala/org/apache/spark/SparkException.scala: ## @@ -145,8 +148,10 @@ private[spark] class SparkDriverExecutionException(cause: Throwable) * Exception thrown when the main user code is run as a child process (e.g. pyspark) and we want * the parent SparkSubmit process to exit with the same exit code. */ -private[spark] case class SparkUserAppException(exitCode: Int) - extends SparkException(s"User application exited with $exitCode") +private[spark] case class SparkUserAppException(exitCode: Int, errorMsg: Option[String] = None) + extends SparkException( +log"User application exited with ${MDC(LogKeys.EXIT_CODE, exitCode)}".message + + errorMsg.map(error => s" and caused by\n$error").getOrElse("")) Review Comment: I'm not super familiar with MDC this is going to use a generic exit code mapping to determine what went wrong from the exit code? ## core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala: ## @@ -92,15 +92,16 @@ object PythonRunner { // see https://github.com/numpy/numpy/issues/10455 sparkConf.getOption("spark.driver.cores").foreach(env.put("OMP_NUM_THREADS", _)) } -builder.redirectErrorStream(true) // Ugly but needed for stdout and stderr to synchronize Review Comment: Why is this not needed anymore? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org