[GitHub] [spark] SelfImpr001 commented on a diff in pull request #37732: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…

2022-09-03 Thread GitBox


SelfImpr001 commented on code in PR #37732:
URL: https://github.com/apache/spark/pull/37732#discussion_r962252317


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##
@@ -76,8 +76,19 @@ object Literal {
   val decimal = Decimal(d)
   Literal(decimal, DecimalType.fromDecimal(decimal))
 case d: JavaBigDecimal =>
-  val decimal = Decimal(d)
-  Literal(decimal, DecimalType.fromDecimal(decimal))
+  Literal(Decimal(d), DecimalType(Math.max(d.precision, d.scale), d.scale()))
+  if (d.abs().compareTo(new JavaBigDecimal("0")) == 0) {

Review Comment:
   Yes, we found that this only occurs with Spark 2.4 + Hive 1.x, but since many people are still on 2.4+, it is recommended to provide a patch.
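   For context, here is a minimal sketch (plain Java, standard `java.math.BigDecimal` semantics, not the PR's Scala code) of why a literal like `0.00` is awkward: its unscaled value is 0, so its precision (1) is smaller than its scale (2), which is why the patch takes `Math.max(d.precision, d.scale)` for the type's precision:

   ```java
   import java.math.BigDecimal;

   public class DecimalScaleDemo {
       public static void main(String[] args) {
           BigDecimal d = new BigDecimal("0.00");
           // For "0.00" the unscaled value is 0, so precision is 1 while scale is 2.
           System.out.println(d.precision()); // 1
           System.out.println(d.scale());     // 2
           // A decimal type with precision 1 and scale 2 would be invalid
           // (precision < scale), hence taking the max as the precision:
           System.out.println(Math.max(d.precision(), d.scale())); // 2
       }
   }
   ```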



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #37770: [SPARK-40314][SQL] Add scala and python bindings for inline and inline_outer

2022-09-03 Thread GitBox


AmplabJenkins commented on PR #37770:
URL: https://github.com/apache/spark/pull/37770#issuecomment-1236246586

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #37771: [SPARK-40315][SQL] Add equals() and hashCode() to ArrayBasedMapData

2022-09-03 Thread GitBox


AmplabJenkins commented on PR #37771:
URL: https://github.com/apache/spark/pull/37771#issuecomment-1236246575

   Can one of the admins verify this patch?





[GitHub] [spark] LuciferYang commented on pull request #37624: [SPARK-40186][CORE][YARN] Ensure `mergedShuffleCleaner` have been shutdown before `db` close

2022-09-03 Thread GitBox


LuciferYang commented on PR #37624:
URL: https://github.com/apache/spark/pull/37624#issuecomment-1236241317

   friendly ping @tgravescs , @Ngone51





[GitHub] [spark] LuciferYang commented on pull request #37648: [SPARK-38909][BUILD][CORE][YARN][FOLLOWUP] Make some code cleanup related to shuffle state db

2022-09-03 Thread GitBox


LuciferYang commented on PR #37648:
URL: https://github.com/apache/spark/pull/37648#issuecomment-1236241254

   Should we merge this one? @Ngone51 @mridulm 





[GitHub] [spark] dongjoon-hyun commented on pull request #37729: Revert "[SPARK-33861][SQL] Simplify conditional in predicate"

2022-09-03 Thread GitBox


dongjoon-hyun commented on PR #37729:
URL: https://github.com/apache/spark/pull/37729#issuecomment-1236236113

   Thank you for reverting, @wangyum .





[GitHub] [spark] dongjoon-hyun closed pull request #37787: [SPARK-40323][BUILD] Update ORC to 1.8.0

2022-09-03 Thread GitBox


dongjoon-hyun closed pull request #37787: [SPARK-40323][BUILD] Update ORC to 
1.8.0
URL: https://github.com/apache/spark/pull/37787





[GitHub] [spark] AmplabJenkins commented on pull request #37777: [WIP][SPARK-40309][PYTHON][PS] Introduce `sql_conf` context manager for `pyspark.sql`

2022-09-03 Thread GitBox


AmplabJenkins commented on PR #37777:
URL: https://github.com/apache/spark/pull/37777#issuecomment-1236230648

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #37779: [SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error

2022-09-03 Thread GitBox


AmplabJenkins commented on PR #37779:
URL: https://github.com/apache/spark/pull/37779#issuecomment-1236230647

   Can one of the admins verify this patch?





[GitHub] [spark] IrishBird commented on pull request #34684: [SPARK-37442][SQL] InMemoryRelation statistics bug causing broadcast join failures with AQE enabled

2022-09-03 Thread GitBox


IrishBird commented on PR #34684:
URL: https://github.com/apache/spark/pull/34684#issuecomment-1236227362

   I have a question about this.
   a. I have one text file of about 12 GB with 5 columns. For my broadcast table I only need to read 1 column, so I cached that subset DF into memory and measured its size with the code below; it is only around 2 GB, so theoretically this DF should be allowed to broadcast. How does Spark estimate the size of the DataFrame?
   
   df.cache.foreach(_ => ())
   val catalyst_plan = df.queryExecution.logical
   val df_size_in_bytes =
     spark.sessionState.executePlan(catalyst_plan).optimizedPlan.stats.sizeInBytes





[GitHub] [spark] williamhyun opened a new pull request, #37787: [SPARK-40323][BUILD] Update ORC to 1.8.0

2022-09-03 Thread GitBox


williamhyun opened a new pull request, #37787:
URL: https://github.com/apache/spark/pull/37787

   ### What changes were proposed in this pull request?
   This PR aims to update ORC to 1.8.0. 
   
   
   ### Why are the changes needed?
   This will bring the latest changes and bug fixes. 
   - https://github.com/apache/orc/releases/tag/v1.8.0
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Pass the CIs. 





[GitHub] [spark] AmplabJenkins commented on pull request #37785: [SPARK-40288][SQL]After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use co

2022-09-03 Thread GitBox


AmplabJenkins commented on PR #37785:
URL: https://github.com/apache/spark/pull/37785#issuecomment-1236215730

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #37786: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 5, ~28 functions)

2022-09-03 Thread GitBox


AmplabJenkins commented on PR #37786:
URL: https://github.com/apache/spark/pull/37786#issuecomment-1236215711

   Can one of the admins verify this patch?





[GitHub] [spark] santosh-d3vpl3x commented on pull request #37761: [SPARK-40311][SQL][CORE][Python] Add withColumnsRenamed to scala and pyspark API

2022-09-03 Thread GitBox


santosh-d3vpl3x commented on PR #37761:
URL: https://github.com/apache/spark/pull/37761#issuecomment-1236191695

   @zhengruifeng would you like to have a look at this PR?





[GitHub] [spark] MaxGekk closed pull request #37763: [SPARK-40308][SQL] Allow non-foldable delimiter arguments to `str_to_map` function

2022-09-03 Thread GitBox


MaxGekk closed pull request #37763: [SPARK-40308][SQL] Allow non-foldable 
delimiter arguments to `str_to_map` function
URL: https://github.com/apache/spark/pull/37763





[GitHub] [spark] MaxGekk commented on pull request #37763: [SPARK-40308][SQL] Allow non-foldable delimiter arguments to `str_to_map` function

2022-09-03 Thread GitBox


MaxGekk commented on PR #37763:
URL: https://github.com/apache/spark/pull/37763#issuecomment-1236191274

   +1, LGTM. Merging to master.
   Thank you, @bersprockets and @HyukjinKwon for review.





[GitHub] [spark] khalidmammadov opened a new pull request, #37786: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 5, ~28 functions)

2022-09-03 Thread GitBox


khalidmammadov opened a new pull request, #37786:
URL: https://github.com/apache/spark/pull/37786

   ### What changes were proposed in this pull request?
   It's part of the Pyspark docstrings improvement series 
(https://github.com/apache/spark/pull/37592, 
https://github.com/apache/spark/pull/37662, 
https://github.com/apache/spark/pull/37686)
   
   In this PR I mainly covered missing parts in the docstrings, adding more examples where needed.
   
   ### Why are the changes needed?
   To improve PySpark documentation
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, documentation
   
   ### How was this patch tested?
   PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
   ./python/run-tests --testnames pyspark.sql.functions





[GitHub] [spark] JoshRosen commented on pull request #37413: [SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver

2022-09-03 Thread GitBox


JoshRosen commented on PR #37413:
URL: https://github.com/apache/spark/pull/37413#issuecomment-1236173944

   Hi @sos3k,
   
   We investigated your bug report and determined that the root-cause was a 
latent bug in `UnsafeHashedRelation` that was triggered more frequently 
following the backport of this PR.
   
   PR https://github.com/apache/spark/pull/35836 fixed a bug related to 
driver-side deserialization of `UnsafeHashedRelation`, but that fix was missing 
from DBR 9.1 and 10.4 (but is present in all newer versions). We have deployed 
a hotfix to backport https://github.com/apache/spark/pull/35836 into DBR 9.1 
and 10.4.





[GitHub] [spark] hgs19921112 commented on pull request #37785: [SPARK-40288][SQL]After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use comp

2022-09-03 Thread GitBox


hgs19921112 commented on PR #37785:
URL: https://github.com/apache/spark/pull/37785#issuecomment-1236172407

   Can you have a look at this? @cloud-fan 





[GitHub] [spark] hgs19921112 opened a new pull request, #37785: [SPARK-40288][SQL]After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use com

2022-09-03 Thread GitBox


hgs19921112 opened a new pull request, #37785:
URL: https://github.com/apache/spark/pull/37785

   
   
   ### What changes were proposed in this pull request?
   
   After the RemoveRedundantAggregates rule, we should pull the complex group-by expressions out.
   
   ### Why are the changes needed?
   
   This will fix the issue SPARK-40288.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   unit test.





[GitHub] [spark] hgs19921112 closed pull request #37784: [SPARK-40288][SQL]After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex ex

2022-09-03 Thread GitBox


hgs19921112 closed pull request #37784: [SPARK-40288][SQL]After 
`RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to 
avoid attribute missing when use complex expression
URL: https://github.com/apache/spark/pull/37784





[GitHub] [spark] hgs19921112 opened a new pull request, #37784: [SPARK-40288][SQL]After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use com

2022-09-03 Thread GitBox


hgs19921112 opened a new pull request, #37784:
URL: https://github.com/apache/spark/pull/37784

   
   
   ### What changes were proposed in this pull request?
   
   After the RemoveRedundantAggregates rule, we should pull the complex group-by expressions out.
   
   ### Why are the changes needed?
   
   This will fix the issue SPARK-40288.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   unit test.





[GitHub] [spark] srowen commented on a diff in pull request #37771: [SPARK-40315][SQL] Add equals() and hashCode() to ArrayBasedMapData

2022-09-03 Thread GitBox


srowen commented on code in PR #37771:
URL: https://github.com/apache/spark/pull/37771#discussion_r962166466


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapData.scala:
##
@@ -35,6 +37,29 @@ class ArrayBasedMapData(val keyArray: ArrayData, val valueArray: ArrayData) exte
   override def toString: String = {
 s"keys: $keyArray, values: $valueArray"
   }
+
+  override def equals(obj: Any): Boolean = {
+if (obj == null && this == null) {
+  return true
+}
+
+if (obj == null || !obj.isInstanceOf[ArrayBasedMapData]) {
+  return false
+}
+
+val other = obj.asInstanceOf[ArrayBasedMapData]
+
+keyArray.equals(other.keyArray) && valueArray.equals(other.valueArray)
+  }

Review Comment:
   Not sure how we do it in other code, but there is Arrays.deepEquals for 
element-wise equality
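   For reference, `java.util.Arrays.deepEquals` (with its companion `Arrays.deepHashCode`) recurses into nested arrays, unlike `equals` or `Arrays.equals`; a minimal plain-Java sketch, separate from the PR's Scala code:

   ```java
   import java.util.Arrays;

   public class DeepEqualsDemo {
       public static void main(String[] args) {
           int[][] a = {{1, 2}, {3, 4}};
           int[][] b = {{1, 2}, {3, 4}};

           // Object.equals on arrays is reference equality:
           System.out.println(a.equals(b));             // false
           // Arrays.equals compares top-level elements with equals(),
           // so nested arrays are still compared by reference:
           System.out.println(Arrays.equals(a, b));     // false
           // Arrays.deepEquals recurses into nested arrays element-wise:
           System.out.println(Arrays.deepEquals(a, b)); // true
           // Arrays.deepHashCode is the matching hash function:
           System.out.println(Arrays.deepHashCode(a) == Arrays.deepHashCode(b)); // true
       }
   }
   ```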






[GitHub] [spark] srowen commented on a diff in pull request #37771: [SPARK-40315][SQL] Add equals() and hashCode() to ArrayBasedMapData

2022-09-03 Thread GitBox


srowen commented on code in PR #37771:
URL: https://github.com/apache/spark/pull/37771#discussion_r962166373


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilderSuite.scala:
##
@@ -142,4 +142,36 @@ class ArrayBasedMapBuilderSuite extends SparkFunSuite with 
SQLHelper {
 Map(new GenericArrayData(Seq(1, 1)) -> 3, new GenericArrayData(Seq(2, 
2)) -> 2))
 }
   }
+
+  test("SPARK-40315: simple equal() and hashCode() semantics") {
+val dataToAdd: Map[Int, Int] = Map(0 -> -7, 1 -> 3, 10 -> 4, 20 -> 5)
+val builder1 = new ArrayBasedMapBuilder(IntegerType, IntegerType)
+val builder2 = new ArrayBasedMapBuilder(IntegerType, IntegerType)
+val builder3 = new ArrayBasedMapBuilder(IntegerType, IntegerType)
+dataToAdd.foreach { case (key, value) =>
+  builder1.put(key, value)
+  builder2.put(key, value)
+  // Replace the value by something slightly different in builder3 for one of the keys.
+  if (key == 20) {
+builder3.put(key, value - 1)
+  } else {
+builder3.put(key, value)
+  }
+}
+val arrayBasedMapData1 = builder1.build()
+val arrayBasedMapData2 = builder2.build()
+val arrayBasedMapData3 = builder3.build()
+
+// We expect two objects to be equal and to have the same hashCode if they have the same
+// elements.
+assert(arrayBasedMapData1 == arrayBasedMapData2)

Review Comment:
   I think this is redundant with the next line? They aren't the same object, so this isn't reference equality.






[GitHub] [spark] srowen commented on pull request #37762: [SPARK-39996][BUILD] Upgrade `postgresql` to 42.5.0

2022-09-03 Thread GitBox


srowen commented on PR #37762:
URL: https://github.com/apache/spark/pull/37762#issuecomment-1236138369

   It seems fine; the CVE almost surely doesn't affect Spark as it's a 
test-only dependency but hey.





[GitHub] [spark] srowen closed pull request #37757: Branch 3.3 sam

2022-09-03 Thread GitBox


srowen closed pull request #37757: Branch 3.3 sam
URL: https://github.com/apache/spark/pull/37757





[GitHub] [spark] srowen commented on pull request #37778: Unsafe loop

2022-09-03 Thread GitBox


srowen commented on PR #37778:
URL: https://github.com/apache/spark/pull/37778#issuecomment-1236138062

   https://spark.apache.org/contributing.html





[GitHub] [spark] srowen closed pull request #37778: Unsafe loop

2022-09-03 Thread GitBox


srowen closed pull request #37778: Unsafe loop
URL: https://github.com/apache/spark/pull/37778





[GitHub] [spark] srowen commented on a diff in pull request #37724: [SPARK-40273][PYTHON][DOCS] Fix the documents "Contributing and Maintaining Type Hints".

2022-09-03 Thread GitBox


srowen commented on code in PR #37724:
URL: https://github.com/apache/spark/pull/37724#discussion_r962165446


##
python/docs/source/development/contributing.rst:
##
@@ -189,9 +186,6 @@ Annotations should, when possible:
 
 * Be compatible with the current stable MyPy release.
 
-
-Complex supporting type definitions, should be placed in dedicated 
``_typing.pyi`` stubs. See for example `pyspark.sql._typing.pyi 
`_.

Review Comment:
   Checking in on this last question @itholic 






[GitHub] [spark] hgs19921112 closed pull request #37782: [SPARK-40288][SQL]After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex ex

2022-09-03 Thread GitBox


hgs19921112 closed pull request #37782: [SPARK-40288][SQL]After 
`RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to 
avoid attribute missing when use complex expression
URL: https://github.com/apache/spark/pull/37782





[GitHub] [spark] LuciferYang opened a new pull request, #37783: [SPARK-40321][BUILD] Upgrade rocksdbjni to 7.5.3

2022-09-03 Thread GitBox


LuciferYang opened a new pull request, #37783:
URL: https://github.com/apache/spark/pull/37783

   ### What changes were proposed in this pull request?
   This PR aims to upgrade RocksDB JNI library from 7.4.5 to 7.5.3.
   
   
   
   ### Why are the changes needed?
   This version brings performance improvements (related to write throughput and compaction) and some bug fixes; the release notes are as follows:
   
   - https://github.com/facebook/rocksdb/releases/tag/v7.5.3
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   
   - Pass GA
   - The benchmark result :
   
   **Before (7.4.5)**
   
   ```
   [INFO] Running org.apache.spark.util.kvstore.RocksDBBenchmark
                                         count  mean    min    max      95th
   dbClose                               4      0.340   0.265  0.510    0.510
   dbCreation                            4      78.628  3.499  303.732  303.732
   naturalIndexCreateIterator            1024   0.005   0.002  1.473    0.006
   naturalIndexDescendingCreateIterator  1024   0.006   0.005  0.062    0.007
   naturalIndexDescendingIteration       1024   0.006   0.004  0.268    0.008
   naturalIndexIteration                 1024   0.006   0.004  0.053    0.009
   randomDeleteIndexed                   1024   0.025   0.019  0.183    0.033
   randomDeletesNoIndex                  1024   0.015   0.012  0.037    0.017
   randomUpdatesIndexed                  1024   0.081   0.033  30.773   0.084
   randomUpdatesNoIndex                  1024   0.034   0.031  0.580    0.037
   randomWritesIndexed                   1024   0.117   0.035  52.383   0.120
   randomWritesNoIndex                   1024   0.039   0.034  1.556    0.044
   refIndexCreateIterator                1024   0.005   0.004  0.019    0.006
   refIndexDescendingCreateIterator      1024   0.003   0.002  0.028    0.004
   refIndexDescendingIteration           1024   0.006   0.005  0.043    0.008
   refIndexIteration                     1024   0.007   0.005  0.068    0.009
   sequentialDeleteIndexed               1024   0.021   0.017  0.096    0.026
   sequentialDeleteNoIndex               1024   0.015   0.012  0.041    0.019
   sequentialUpdatesIndexed              1024   0.044   0.037  0.747    0.051
   sequentialUpdatesNoIndex              1024   0.039   0.032  0.771    0.046
   sequentialWritesIndexed               1024   0.049   0.042  1.766    0.056
   sequentialWritesNoIndex               1024   0.039   0.032  2.282    0.041
   ```
   
   
   
   **After 7.5.3**
   
   ```
   [INFO] Running org.apache.spark.util.kvstore.RocksDBBenchmark
                                         count  mean    min    max      95th
   dbClose                               4     0.342   0.212  0.683    0.683
   dbCreation                            4     77.603  3.403  301.357  301.357
   naturalIndexCreateIterator            1024  0.005   0.002  1.507    0.007
   naturalIndexDescendingCreateIterator  1024  0.005   0.005  0.076    0.007
   naturalIndexDescendingIteration       1024  0.006   0.004  0.244    0.010
   naturalIndexIteration                 1024  0.006   0.004  0.065    0.009
   randomDeleteIndexed                   1024  0.028   0.019  1.385    0.038
   randomDeletesNoIndex                  1024  0.016   0.013  0.042    0.019
   randomUpdatesIndexed                  1024  0.083   0.033  32.094   0.086
   randomUpdatesNoIndex                  1024  0.038   0.034  0.469    0.042
   randomWritesIndexed                   1024  0.122   0.034  54.517   0.126
   randomWritesNoIndex                   1024  0.042   0.037  1.369    0.048
   refIndexCreateIterator                1024  0.005   0.005  0.019    0.007
   refIndexDescendingCreateIterator      1024  0.003   0.003  0.034    0.005
   refIndexDescendingIteration           1024  0.006   0.005  0.044    0.008
   refIndexIteration                     1024  0.007   0.005  0.072    0.011
   sequentialDeleteIndexed               1024  0.022   0.017  0.104    0.027
   sequentialDeleteNoIndex               1024  0.015   0.012  0.038    0.019
   sequentialUpdatesIndexed              1024  0.046   0.039  0.862    0.054
   sequentialUpdatesNoIndex              1024  0.043   0.030  0.752    0.055
   sequentialWritesIndexed               1024  0.051   0.039  1.949    0.060
   sequentialWritesNoIndex               1024  0.038   0.032  2.858    0.042
   
   ```
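To compare the two runs above at a glance, a small throwaway helper can compute the per-operation change in mean latency (this is not part of the Spark tree; the values are copied from the `mean` columns, in milliseconds):

```python
# Before/after mean latencies (ms) copied from a few rows of the tables above.
before = {"randomWritesIndexed": 0.117, "randomUpdatesIndexed": 0.081,
          "sequentialWritesNoIndex": 0.039}
after = {"randomWritesIndexed": 0.122, "randomUpdatesIndexed": 0.083,
         "sequentialWritesNoIndex": 0.038}

def relative_change(before, after):
    """Percent change in mean latency per operation; positive means slower after."""
    return {op: round(100.0 * (after[op] - before[op]) / before[op], 1)
            for op in before}

print(relative_change(before, after))
# → {'randomWritesIndexed': 4.3, 'randomUpdatesIndexed': 2.5, 'sequentialWritesNoIndex': -2.6}
```

On these rows the 7.5.3 upgrade costs a few percent on indexed random writes/updates and is roughly neutral elsewhere, which matches the overall shape of the two tables.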


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zero323 commented on pull request #37774: [SPARK-40210][PYTHON][FOLLOW-UP][TEST] Speed up new tests using one action instead of many

2022-09-03 Thread GitBox


zero323 commented on PR #37774:
URL: https://github.com/apache/spark/pull/37774#issuecomment-1236108467

   Merged to master.





[GitHub] [spark] zero323 closed pull request #37774: [SPARK-40210][PYTHON][FOLLOW-UP][TEST] Speed up new tests using one action instead of many

2022-09-03 Thread GitBox


zero323 closed pull request #37774: [SPARK-40210][PYTHON][FOLLOW-UP][TEST] 
Speed up new tests using one action instead of many
URL: https://github.com/apache/spark/pull/37774





[GitHub] [spark] hgs19921112 opened a new pull request, #37782: [SPARK-40288][SQL] After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should be applied to avoid missing attributes

2022-09-03 Thread GitBox


hgs19921112 opened a new pull request, #37782:
URL: https://github.com/apache/spark/pull/37782

   
   
   ### What changes were proposed in this pull request?
   
   After the `RemoveRedundantAggregates` rule, we should pull the complex group-by 
expressions out.
   
   ### Why are the changes needed?
   
   This will fix the issue SPARK-40288.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   unit test.
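
   A simplified model of the pull-out the description refers to (an illustration 
only, not the Catalyst rule itself; the alias naming mimics Spark's 
`_groupingexpression` convention):

   ```python
   # Simplified model of PullOutGroupingExpressions: complex grouping
   # expressions are replaced by aliases that a child projection computes,
   # so the aggregate only ever groups on plain attribute references.
   def pull_out_grouping_expressions(grouping_exprs):
       """Return (projection, rewritten_grouping) for expression strings.

       A bare column name counts as 'simple' and is left alone; anything
       else is pulled into the projection under a generated alias.
       """
       projection, rewritten = [], []
       for i, expr in enumerate(grouping_exprs):
           if expr.isidentifier():          # simple attribute reference
               projection.append(expr)
               rewritten.append(expr)
           else:                            # complex expression: pull it out
               alias = f"_groupingexpression{i}"
               projection.append(f"{expr} AS {alias}")
               rewritten.append(alias)
       return projection, rewritten

   proj, group = pull_out_grouping_expressions(["a", "b + c"])
   # proj  -> ['a', 'b + c AS _groupingexpression1']
   # group -> ['a', '_groupingexpression1']
   ```

   Once the complex expression lives in the projection, later rewrites of the 
aggregate can refer to the alias without losing the underlying attributes.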





[GitHub] [spark] khalidmammadov commented on pull request #37774: [SPARK-40210][PYTHON][FOLLOW-UP][TEST] Speed up new tests using one action instead of many

2022-09-03 Thread GitBox


khalidmammadov commented on PR #37774:
URL: https://github.com/apache/spark/pull/37774#issuecomment-1236091611

   > @khalidmammadov Could you adjust the title to 
`[SPARK-40210][PYTHON][FOLLOW-UP][TEST] Speed up new tests using one action 
instead of many`? Thanks.
   
   Done





[GitHub] [spark] zero323 commented on pull request #37774: [SPARK-40210][PYTHON][TEST]Follow-up to speed up new tests using one action instead of many

2022-09-03 Thread GitBox


zero323 commented on PR #37774:
URL: https://github.com/apache/spark/pull/37774#issuecomment-1236086072

   @khalidmammadov Could you adjust the title to 
`[SPARK-40210][PYTHON][FOLLOW-UP][TEST] Speed up new tests using one action 
instead of many`? Thanks.





[GitHub] [spark] wangyum commented on pull request #37729: Revert "[SPARK-33861][SQL] Simplify conditional in predicate"

2022-09-03 Thread GitBox


wangyum commented on PR #37729:
URL: https://github.com/apache/spark/pull/37729#issuecomment-1236077496

   Merged to master, branch-3.3 and branch-3.2





[GitHub] [spark] wangyum closed pull request #37729: Revert "[SPARK-33861][SQL] Simplify conditional in predicate"

2022-09-03 Thread GitBox


wangyum closed pull request #37729: Revert "[SPARK-33861][SQL] Simplify 
conditional in predicate"
URL: https://github.com/apache/spark/pull/37729





[GitHub] [spark] Yikun closed pull request #37781: static image

2022-09-03 Thread GitBox


Yikun closed pull request #37781: static image 
URL: https://github.com/apache/spark/pull/37781





[GitHub] [spark] wangyum commented on pull request #37780: [SPARK-39414][BUILD][FOLLOWUP] Update Scala to 2.12.16 in doc

2022-09-03 Thread GitBox


wangyum commented on PR #37780:
URL: https://github.com/apache/spark/pull/37780#issuecomment-1236061305

   Merged to master.





[GitHub] [spark] wangyum closed pull request #37780: [SPARK-39414][BUILD][FOLLOWUP] Update Scala to 2.12.16 in doc

2022-09-03 Thread GitBox


wangyum closed pull request #37780: [SPARK-39414][BUILD][FOLLOWUP] Update Scala 
to 2.12.16 in doc
URL: https://github.com/apache/spark/pull/37780





[GitHub] [spark] viirya commented on pull request #37463: [SPARK-40033][SQL] Nested schema pruning support through element_at

2022-09-03 Thread GitBox


viirya commented on PR #37463:
URL: https://github.com/apache/spark/pull/37463#issuecomment-1236060312

   Thanks. Merged to master.





[GitHub] [spark] wankunde commented on a diff in pull request #37533: [SPARK-40096]Fix finalize shuffle stage slow due to connection creation slow

2022-09-03 Thread GitBox


wankunde commented on code in PR #37533:
URL: https://github.com/apache/spark/pull/37533#discussion_r962112740


##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -2242,70 +2252,110 @@ private[spark] class DAGScheduler(
 val numMergers = stage.shuffleDep.getMergerLocs.length
 val results = (0 until numMergers).map(_ => 
SettableFuture.create[Boolean]())
 externalShuffleClient.foreach { shuffleClient =>
-  if (!registerMergeResults) {
-results.foreach(_.set(true))
-// Finalize in separate thread as shuffle merge is a no-op in this case
-shuffleMergeFinalizeScheduler.schedule(new Runnable {
-  override def run(): Unit = {
-stage.shuffleDep.getMergerLocs.foreach {
-  case shuffleServiceLoc =>
-// Sends async request to shuffle service to finalize shuffle 
merge on that host.
-// Since merge statuses will not be registered in this case,
-// we pass a no-op listener.
-shuffleClient.finalizeShuffleMerge(shuffleServiceLoc.host,
-  shuffleServiceLoc.port, shuffleId, shuffleMergeId,
-  new MergeFinalizerListener {
-override def onShuffleMergeSuccess(statuses: 
MergeStatuses): Unit = {
-}
+  val scheduledFutures =
+if (!registerMergeResults) {
+  results.foreach(_.set(true))
+  // Finalize in separate thread as shuffle merge is a no-op in this 
case
+  stage.shuffleDep.getMergerLocs.map {
+case shuffleServiceLoc =>
+  // Sends async request to shuffle service to finalize shuffle 
merge on that host.
+  // Since merge statuses will not be registered in this case,
+  // we pass a no-op listener.
+  shuffleSendFinalizeRpcExecutor.submit(new Runnable() {
+override def run(): Unit = {
+  shuffleClient.finalizeShuffleMerge(shuffleServiceLoc.host,
+shuffleServiceLoc.port, shuffleId, shuffleMergeId,
+new MergeFinalizerListener {
+  override def onShuffleMergeSuccess(statuses: 
MergeStatuses): Unit = {
+  }
 
-override def onShuffleMergeFailure(e: Throwable): Unit = {
-}
-  })
-}
-  }
-}, 0, TimeUnit.SECONDS)
-  } else {
-stage.shuffleDep.getMergerLocs.zipWithIndex.foreach {
-  case (shuffleServiceLoc, index) =>
-// Sends async request to shuffle service to finalize shuffle 
merge on that host
-// TODO: SPARK-35536: Cancel finalizeShuffleMerge if the stage is 
cancelled
-// TODO: during shuffleMergeFinalizeWaitSec
-shuffleClient.finalizeShuffleMerge(shuffleServiceLoc.host,
-  shuffleServiceLoc.port, shuffleId, shuffleMergeId,
-  new MergeFinalizerListener {
-override def onShuffleMergeSuccess(statuses: MergeStatuses): 
Unit = {
-  assert(shuffleId == statuses.shuffleId)
-  eventProcessLoop.post(RegisterMergeStatuses(stage, 
MergeStatus.
-convertMergeStatusesToMergeStatusArr(statuses, 
shuffleServiceLoc)))
-  results(index).set(true)
+  override def onShuffleMergeFailure(e: Throwable): Unit = 
{
+if (e.isInstanceOf[IOException]) {
+  logInfo(s"Failed to connect external shuffle service 
" +
+s"${shuffleServiceLoc.hostPort}")
+  
blockManagerMaster.removeShufflePushMergerLocation(shuffleServiceLoc.host)
+}
+  }
+})
 }
+  })
+  }
+} else {
+  stage.shuffleDep.getMergerLocs.zipWithIndex.map {
+case (shuffleServiceLoc, index) =>
+  // Sends async request to shuffle service to finalize shuffle 
merge on that host
+  // TODO: SPARK-35536: Cancel finalizeShuffleMerge if the stage 
is cancelled
+  // TODO: during shuffleMergeFinalizeWaitSec
+  shuffleSendFinalizeRpcExecutor.submit(new Runnable() {
+override def run(): Unit = {
+  shuffleClient.finalizeShuffleMerge(shuffleServiceLoc.host,
+shuffleServiceLoc.port, shuffleId, shuffleMergeId,
+new MergeFinalizerListener {
+  override def onShuffleMergeSuccess(statuses: 
MergeStatuses): Unit = {
+assert(shuffleId == statuses.shuffleId)
+eventProcessLoop.post(RegisterMergeStatuses(stage, 
MergeStatus.
+  convertMergeStatusesToMergeStatusArr(statuses, 
shuffleServiceLoc)))
+results(index).set(true)
+  

[GitHub] [spark] viirya closed pull request #37463: [SPARK-40033][SQL] Nested schema pruning support through element_at

2022-09-03 Thread GitBox


viirya closed pull request #37463: [SPARK-40033][SQL] Nested schema pruning 
support through element_at
URL: https://github.com/apache/spark/pull/37463





[GitHub] [spark] zhengruifeng commented on pull request #37752: [SPARK-40301][PYTHON] Add parameter validations in pyspark.rdd

2022-09-03 Thread GitBox


zhengruifeng commented on PR #37752:
URL: https://github.com/apache/spark/pull/37752#issuecomment-1236059439

   > Thanks!
   > 
   > Shall we mention that ValueError will be raised for invalid inputs in 
the`Does this PR introduce any user-facing change` of PR description?
   
   Sure, thanks for pointing it out.
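
   For context, a minimal sketch of the validation style being discussed 
(illustrative only; the actual checks live in `pyspark/rdd.py`, and the name 
`take_sample` here is a stand-in, not the real API):

   ```python
   # Sketch of argument validation of the kind added to pyspark.rdd:
   # reject bad types with TypeError and bad values with ValueError
   # before any distributed work starts.
   def take_sample(with_replacement, num, seed=None):
       if not isinstance(with_replacement, bool):
           raise TypeError(
               f"with_replacement should be a bool, "
               f"got {type(with_replacement).__name__}")
       if num < 0:
           raise ValueError(f"num should be non-negative, got {num}")
       return []  # placeholder for the actual sampling logic
   ```

   Failing fast on the driver like this turns an obscure executor-side error 
into an immediate, readable exception, which is the user-facing change the 
review asks to call out.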





[GitHub] [spark] wankunde commented on a diff in pull request #37533: [SPARK-40096]Fix finalize shuffle stage slow due to connection creation slow

2022-09-03 Thread GitBox


wankunde commented on code in PR #37533:
URL: https://github.com/apache/spark/pull/37533#discussion_r962111737


##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -2242,70 +2252,110 @@ private[spark] class DAGScheduler(
 val numMergers = stage.shuffleDep.getMergerLocs.length
 val results = (0 until numMergers).map(_ => 
SettableFuture.create[Boolean]())
 externalShuffleClient.foreach { shuffleClient =>
-  if (!registerMergeResults) {
-results.foreach(_.set(true))
-// Finalize in separate thread as shuffle merge is a no-op in this case
-shuffleMergeFinalizeScheduler.schedule(new Runnable {
-  override def run(): Unit = {
-stage.shuffleDep.getMergerLocs.foreach {
-  case shuffleServiceLoc =>
-// Sends async request to shuffle service to finalize shuffle 
merge on that host.
-// Since merge statuses will not be registered in this case,
-// we pass a no-op listener.
-shuffleClient.finalizeShuffleMerge(shuffleServiceLoc.host,
-  shuffleServiceLoc.port, shuffleId, shuffleMergeId,
-  new MergeFinalizerListener {
-override def onShuffleMergeSuccess(statuses: 
MergeStatuses): Unit = {
-}
+  val scheduledFutures =
+if (!registerMergeResults) {
+  results.foreach(_.set(true))
+  // Finalize in separate thread as shuffle merge is a no-op in this 
case
+  stage.shuffleDep.getMergerLocs.map {
+case shuffleServiceLoc =>
+  // Sends async request to shuffle service to finalize shuffle 
merge on that host.
+  // Since merge statuses will not be registered in this case,
+  // we pass a no-op listener.
+  shuffleSendFinalizeRpcExecutor.submit(new Runnable() {
+override def run(): Unit = {
+  shuffleClient.finalizeShuffleMerge(shuffleServiceLoc.host,
+shuffleServiceLoc.port, shuffleId, shuffleMergeId,
+new MergeFinalizerListener {
+  override def onShuffleMergeSuccess(statuses: 
MergeStatuses): Unit = {
+  }
 
-override def onShuffleMergeFailure(e: Throwable): Unit = {
-}
-  })
-}
-  }
-}, 0, TimeUnit.SECONDS)
-  } else {
-stage.shuffleDep.getMergerLocs.zipWithIndex.foreach {
-  case (shuffleServiceLoc, index) =>
-// Sends async request to shuffle service to finalize shuffle 
merge on that host
-// TODO: SPARK-35536: Cancel finalizeShuffleMerge if the stage is 
cancelled
-// TODO: during shuffleMergeFinalizeWaitSec
-shuffleClient.finalizeShuffleMerge(shuffleServiceLoc.host,
-  shuffleServiceLoc.port, shuffleId, shuffleMergeId,
-  new MergeFinalizerListener {
-override def onShuffleMergeSuccess(statuses: MergeStatuses): 
Unit = {
-  assert(shuffleId == statuses.shuffleId)
-  eventProcessLoop.post(RegisterMergeStatuses(stage, 
MergeStatus.
-convertMergeStatusesToMergeStatusArr(statuses, 
shuffleServiceLoc)))
-  results(index).set(true)
+  override def onShuffleMergeFailure(e: Throwable): Unit = 
{
+if (e.isInstanceOf[IOException]) {
+  logInfo(s"Failed to connect external shuffle service 
" +
+s"${shuffleServiceLoc.hostPort}")
+  
blockManagerMaster.removeShufflePushMergerLocation(shuffleServiceLoc.host)

Review Comment:
   Because we don't want nodes that failed to connect to be selected as merge 
locations again. 
   The merge locations come from two parts: the active block managers, and the 
shuffle services with which the current app registered an executor in the past. 
We remove a node that failed to connect from the second part.
   So if the merger ESS is still active, it can still be selected as a merge 
location through the first part; if it is down, it will no longer be selected 
as a merge location.
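
   The approach in the diff, fanning the per-merger finalize RPCs out to a 
dedicated pool and dropping hosts whose connection fails, can be sketched 
outside Spark roughly like this (`finalize_merge` and `remove_merger` are 
illustrative callables, not Spark APIs):

   ```python
   import concurrent.futures

   def finalize_all(mergers, finalize_merge, remove_merger,
                    threads=8, timeout=10):
       """Send one finalize RPC per merger host on a thread pool.

       finalize_merge(host) performs the (blocking) RPC; on IOError the
       host is handed to remove_merger so it is not picked as a merge
       location again, mirroring the PR's behavior. Returns the hosts
       that finalized successfully.
       """
       ok = []
       with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as pool:
           futures = {pool.submit(finalize_merge, h): h for h in mergers}
           for fut in concurrent.futures.as_completed(futures, timeout=timeout):
               host = futures[fut]
               try:
                   fut.result()
                   ok.append(host)
               except IOError:
                   remove_merger(host)  # do not pick this host again
       return ok
   ```

   Running one RPC per thread keeps a single slow connection setup from 
serializing the whole finalize step, which is the slowdown the PR title 
describes.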






[GitHub] [spark] wankunde commented on a diff in pull request #37533: [SPARK-40096]Fix finalize shuffle stage slow due to connection creation slow

2022-09-03 Thread GitBox


wankunde commented on code in PR #37533:
URL: https://github.com/apache/spark/pull/37533#discussion_r962110338


##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -282,13 +286,19 @@ private[spark] class DAGScheduler(
   None
 }
 
-  // Use multi-threaded scheduled executor. The merge finalization task could 
take some time,
-  // depending on the time to establish connections to mergers, and the time 
to get MergeStatuses
-  // from all the mergers.
+  // Use multi-threaded scheduled executor. The merge finalization task (per 
stage) takes up to
+  // PUSH_BASED_SHUFFLE_MERGE_RESULTS_TIMEOUT.
   private val shuffleMergeFinalizeScheduler =
 ThreadUtils.newDaemonThreadPoolScheduledExecutor("shuffle-merge-finalizer",
   shuffleMergeFinalizeNumThreads)
 
+  // Send finalize RPC tasks to merger ESS, one thread per RPC and will be 
cancelled after
+  // PUSH_BASED_SHUFFLE_MERGE_RESULTS_TIMEOUT. Please close the opened files 
in the merger ESS

Review Comment:
   Remove this comment






[GitHub] [spark] wankunde commented on a diff in pull request #37533: [SPARK-40096]Fix finalize shuffle stage slow due to connection creation slow

2022-09-03 Thread GitBox


wankunde commented on code in PR #37533:
URL: https://github.com/apache/spark/pull/37533#discussion_r962110217


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -2309,7 +2309,16 @@ package object config {
 " shuffle is enabled.")
   .version("3.3.0")
   .intConf
-  .createWithDefault(3)
+  .createWithDefault(8)
+
+  private[spark] val PUSH_BASED_SHUFFLE_SEND_FINALIZE_RPC_THREADS =
+ConfigBuilder("spark.shuffle.push.sendFinalizeRPCThreads")
+  .doc("Number of threads used by driver to send finalize shuffle RPC to 
the merger" +

Review Comment:
   Updated



##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -2309,7 +2309,16 @@ package object config {
 " shuffle is enabled.")
   .version("3.3.0")
   .intConf
-  .createWithDefault(3)
+  .createWithDefault(8)
+
+  private[spark] val PUSH_BASED_SHUFFLE_SEND_FINALIZE_RPC_THREADS =

Review Comment:
   Updated



##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -273,6 +274,9 @@ private[spark] class DAGScheduler(
   private val shuffleMergeFinalizeNumThreads =
 sc.getConf.get(config.PUSH_BASED_SHUFFLE_MERGE_FINALIZE_THREADS)
 
+  private val shuffleSendFinalizeRpcThreads =

Review Comment:
   Updated






[GitHub] [spark] MaxGekk commented on pull request #37746: [SPARK-40293][SQL] Make the V2 table error message more meaningful

2022-09-03 Thread GitBox


MaxGekk commented on PR #37746:
URL: https://github.com/apache/spark/pull/37746#issuecomment-1236057380

   @huaxingao Could you fix PlanResolutionSuite and SparkThrowableSuite, please.





[GitHub] [spark] wankunde commented on a diff in pull request #37533: [SPARK-40096]Fix finalize shuffle stage slow due to connection creation slow

2022-09-03 Thread GitBox


wankunde commented on code in PR #37533:
URL: https://github.com/apache/spark/pull/37533#discussion_r962110234


##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -282,13 +286,19 @@ private[spark] class DAGScheduler(
   None
 }
 
-  // Use multi-threaded scheduled executor. The merge finalization task could 
take some time,
-  // depending on the time to establish connections to mergers, and the time 
to get MergeStatuses
-  // from all the mergers.
+  // Use multi-threaded scheduled executor. The merge finalization task (per 
stage) takes up to
+  // PUSH_BASED_SHUFFLE_MERGE_RESULTS_TIMEOUT.

Review Comment:
   Updated


