Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

2023-11-01 Thread via GitHub


beliefer commented on PR #43620:
URL: https://github.com/apache/spark/pull/43620#issuecomment-1790151803

   The GA failure is unrelated.





Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

2023-11-01 Thread via GitHub


beliefer commented on PR #43620:
URL: https://github.com/apache/spark/pull/43620#issuecomment-1790151607

   ping @dongjoon-hyun @srowen @LuciferYang 





Re: [PR] [MINOR][DOCS] Fixed typo [spark]

2023-11-01 Thread via GitHub


YuanHanzhong commented on PR #43634:
URL: https://github.com/apache/spark/pull/43634#issuecomment-179050

   Fixed two typos, let's make Spark perfect together.





Re: [PR] [MINOR][DOCS] Fix typo [spark]

2023-11-01 Thread via GitHub


YuanHanzhong closed pull request #43626: [MINOR][DOCS] Fix typo
URL: https://github.com/apache/spark/pull/43626





[PR] [MINOR][DOCS] Fixed typo [spark]

2023-11-01 Thread via GitHub


YuanHanzhong opened a new pull request, #43634:
URL: https://github.com/apache/spark/pull/43634

   Fixed typo.





Re: [PR] Keep the taskBinary in Stage to avoid deserializing it multiple times, and send binary to executor instead of task. [spark]

2023-11-01 Thread via GitHub


hui730 commented on PR #43621:
URL: https://github.com/apache/spark/pull/43621#issuecomment-1790095897

   > Each task has a reference to a broadcast'ed task binary `Array[Byte]` - 
not the object. This `Array[Byte]` is pulled to each executor once - and for 
each subsequent task, this `Array[Byte]` is reused to deserialize and create a 
new task closure. So essentially we are trying to save on a jvm local 
deserialization cost.
   > 
   > Explicit use of broadcast variables is documented to be read only - but 
with task closures, I would be very reluctant to assume this behavior. If the 
instance is reused, task closures can start having side effects on other tasks 
in the same executor.
   
   Okay, I get what you mean, especially in Hadoop.
   My plan is to create a new binary from the executor binary using 
System.arraycopy().
   How do you like it, sir? :)
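   For illustration, a defensive copy along those lines might look like this 
(a minimal sketch; the method and parameter names are hypothetical, not from 
the PR):
   
   ```scala
   // Minimal sketch (hypothetical names): make a per-task copy of the
   // executor-level binary so deserialized task closures never share the array.
   def copyBinary(executorBinary: Array[Byte]): Array[Byte] = {
     val taskBinary = new Array[Byte](executorBinary.length)
     System.arraycopy(executorBinary, 0, taskBinary, 0, executorBinary.length)
     taskBinary
   }
   ```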





Re: [PR] [SPARK-45742][CORE][CONNECT][MLLIB][PYTHON] Introduce an implicit function for Scala Array to wrap into `immutable.ArraySeq`. [spark]

2023-11-01 Thread via GitHub


LuciferYang commented on PR #43607:
URL: https://github.com/apache/spark/pull/43607#issuecomment-1790073890

   Thanks @dongjoon-hyun ~





Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun commented on PR #43629:
URL: https://github.com/apache/spark/pull/43629#issuecomment-1790056890

   What about other SQL engine behavior like Hive, Presto, Snowflake?
   
   1. Is there no existing SYNTAX for this proposal?
   2. If the proposed behavior is consistent with other SQL engines, we may be 
able to accept this PR simply as a bug fix.
   3. If the existing behavior is consistent with other SQL engines, also +1 
for introducing new syntax for new improvement while keeping the existing 
behavior.
   
   Thank you for the suggestion and the improvement.





Re: [PR] [SPARK-45511][SS] State Data Source - Reader [spark]

2023-11-01 Thread via GitHub


HeartSaVioR commented on PR #43425:
URL: https://github.com/apache/spark/pull/43425#issuecomment-1790056322

   Fixed test, added new tests as commented. @anishshri-db Thanks!





Re: [PR] [SPARK-45639][SQL][PYTHON] Support loading Python data sources in DataFrameReader [spark]

2023-11-01 Thread via GitHub


allisonwang-db commented on PR #43630:
URL: https://github.com/apache/spark/pull/43630#issuecomment-1790052524

   cc @HyukjinKwon @cloud-fan 





Re: [PR] [SPARK-45639][SQL][PYTHON] Support loading Python data sources in DataFrameReader [spark]

2023-11-01 Thread via GitHub


allisonwang-db commented on code in PR #43630:
URL: https://github.com/apache/spark/pull/43630#discussion_r1379577068


##
sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##
@@ -231,6 +231,7 @@ class SparkSession private(
   /**
* A collection of methods for registering user-defined data sources.
*/
+  @Unstable

Review Comment:
   I'll make it `private[sql]` for now before we decide to expose this as an 
API.






Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-01 Thread via GitHub


sunchao commented on PR #43629:
URL: https://github.com/apache/spark/pull/43629#issuecomment-1790038555

   > AFAIK, users are used to using REPAIR TABLE to update partition statistics.
   
   Hmm, sorry, I'm not aware that this is a common pattern among Spark users. 
However, `REPAIR TABLE` seems a bit more expensive than `ANALYZE TABLE`, since 
it needs to list all the partitions under the table directory first and then 
process and validate them. In addition, it doesn't seem able to update the row 
count for each partition either.
   
   > I think `ANALYZE TABLE` should update the whole-table statistics instead 
of partition statistics. If we accept this PR, how can one update only the 
whole-table statistics without the partition statistics?
   
   Yea that's a valid question. I wonder what's the reason for users to only 
want to update table stats but not partition stats though: is it because 
updating the latter is significantly more expensive? In the `ANALYZE TABLE .. 
COMPUTE STATISTICS NOSCAN` case, the current implementation already collects 
the size in bytes for each partition and we just need to incur one extra HMS 
call (`alterPartitions`) to update the stats for these partitions.
   
   Alternatively, maybe we can introduce a new syntax `ANALYZE TABLE 
<table> PARTITIONS COMPUTE STATISTICS [NOSCAN]` to update both table and 
partition stats? 
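   For reference, the two commands being compared could be exercised like this 
(a hedged sketch; the table and partition names are hypothetical, and `spark` 
is assumed to be an active SparkSession):
   
   ```scala
   // Existing per-partition syntax (hypothetical table/partition names):
   spark.sql("ANALYZE TABLE sales PARTITION (dt = '2023-11-01') COMPUTE STATISTICS NOSCAN")
   // Table-level command whose behavior this PR would extend to also
   // refresh per-partition statistics:
   spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS NOSCAN")
   ```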
   
   
   
   
   





Re: [PR] [SPARK-45718][PS] Remove remaining deprecated Pandas features from Spark 3.4.0 [spark]

2023-11-01 Thread via GitHub


HyukjinKwon commented on PR #43581:
URL: https://github.com/apache/spark/pull/43581#issuecomment-1790035094

   It actually seems like a real error:
   
   ```
   Warning, treated as error:
   [autosummary] failed to import 'pyspark.pandas.Index.is_all_dates': no 
module named pyspark.pandas.Index.is_all_dates
   make: *** [Makefile:35: html] Error 2
   
 Jekyll 4.3.2   Please append `--trace` to the `build` command 
for any additional information or backtrace. 
   
   /__w/spark/spark/docs/_plugins/copy_api_dirs.rb:130:in `': 
Python doc generation failed (RuntimeError)
   
   ```
   





Re: [PR] [SPARK-45718][PS] Remove remaining deprecated Pandas features from Spark 3.4.0 [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun commented on PR #43581:
URL: https://github.com/apache/spark/pull/43581#issuecomment-1790034093

   It seems to be fixed. Could you re-run the `linter` pipeline?





Re: [PR] [SPARK-45742][CORE][CONNECT][MLLIB][PYTHON] Introduce an implicit function for Scala Array to wrap into `immutable.ArraySeq`. [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun commented on PR #43607:
URL: https://github.com/apache/spark/pull/43607#issuecomment-1790033353

   Merged to master.





Re: [PR] [SPARK-45742][CORE][CONNECT][MLLIB][PYTHON] Introduce an implicit function for Scala Array to wrap into `immutable.ArraySeq`. [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun closed pull request #43607: 
[SPARK-45742][CORE][CONNECT][MLLIB][PYTHON] Introduce an implicit function for 
Scala Array to wrap into `immutable.ArraySeq`.
URL: https://github.com/apache/spark/pull/43607





Re: [PR] [SPARK-45719][K8S] Upgrade AWS SDK to v2 for Kubernetes integration tests module [spark]

2023-11-01 Thread via GitHub


junyuc25 commented on code in PR #43510:
URL: https://github.com/apache/spark/pull/43510#discussion_r1379563662


##
pom.xml:
##
@@ -162,6 +162,7 @@
 1.12.0
 
 1.11.655
+2.20.128

Review Comment:
   Thanks. I've updated the version.






Re: [PR] [SPARK-45767][CORE] Delete `TimeStampedHashMap` and its UT [spark]

2023-11-01 Thread via GitHub


panbingkun commented on PR #43633:
URL: https://github.com/apache/spark/pull/43633#issuecomment-1790018110

   > I think the description of the PR should be clearer, e.g., when this 
class was introduced and why it is no longer used, instead of just referencing 
a comment.
   
   Done.





Re: [PR] [SPARK-45767][CORE] Delete `TimeStampedHashMap` and its UT [spark]

2023-11-01 Thread via GitHub


LuciferYang commented on PR #43633:
URL: https://github.com/apache/spark/pull/43633#issuecomment-1790014788

   I think the description of the PR should be clearer, e.g., when this class 
was introduced and why it is no longer used, instead of just referencing a 
comment.





Re: [PR] [SPARK-45767][CORE] Delete `TimeStampedHashMap` and its UT [spark]

2023-11-01 Thread via GitHub


panbingkun commented on PR #43633:
URL: https://github.com/apache/spark/pull/43633#issuecomment-1790014612

   The `appveyor` issue is not related to this PR:
   https://github.com/apache/spark/pull/43631





Re: [PR] [SPARK-45767][CORE] Delete `TimeStampedHashMap` and its UT [spark]

2023-11-01 Thread via GitHub


panbingkun commented on PR #43633:
URL: https://github.com/apache/spark/pull/43633#issuecomment-1790011918

   cc @LuciferYang @dongjoon-hyun 





[PR] [SPARK-45767][CORE] Delete `TimeStampedHashMap` and its UT [spark]

2023-11-01 Thread via GitHub


panbingkun opened a new pull request, #43633:
URL: https://github.com/apache/spark/pull/43633

   ### What changes were proposed in this pull request?
   This PR aims to delete `TimeStampedHashMap` and its UT.
   
   ### Why are the changes needed?
   During PR https://github.com/apache/spark/pull/43578, we found that the 
class `TimeStampedHashMap` is no longer in use. Based on the suggestion there, 
we have removed it.
   https://github.com/apache/spark/pull/43578#discussion_r1378687555
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Pass GA.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





Re: [PR] Keep the taskBinary in Stage to avoid deserializing it multiple times, and send binary to executor instead of task. [spark]

2023-11-01 Thread via GitHub


mridulm commented on PR #43621:
URL: https://github.com/apache/spark/pull/43621#issuecomment-1790008994

   Each task has a reference to a broadcast'ed task binary `Array[Byte]` - not 
the object.
   This `Array[Byte]` is pulled to each executor once - and for each subsequent 
task, this `Array[Byte]` is reused to deserialize and create a new task closure.
   
   Explicit use of broadcast variables is documented to be read only - but with 
task closures, we cannot make a change in this behavior. If the instance is 
reused, task closures can start having side effects on other tasks in the same 
executor.
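   To make that flow concrete, a simplified sketch (this is not Spark's actual 
internals; names are illustrative):
   
   ```scala
   import java.io.{ByteArrayInputStream, ObjectInputStream}
   
   // Simplified sketch of the flow described above: the broadcast payload is
   // raw bytes, and every task deserializes a fresh closure instance from the
   // same reused Array[Byte].
   def newClosureForTask(taskBinary: Array[Byte]): AnyRef = {
     val in = new ObjectInputStream(new ByteArrayInputStream(taskBinary))
     try in.readObject() finally in.close()
   }
   ```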





Re: [PR] [SPARK-45533][CORE] Use j.l.r.Cleaner instead of finalize for RocksDBIterator/LevelDBIterator [spark]

2023-11-01 Thread via GitHub


LuciferYang commented on code in PR #43502:
URL: https://github.com/apache/spark/pull/43502#discussion_r1379542967


##
common/kvstore/src/main/java/org/apache/spark/util/kvstore/RocksDBIterator.java:
##
@@ -272,4 +287,32 @@ static int compare(byte[] a, byte[] b) {
 return a.length - b.length;
   }
 
+  static class ResourceCleaner implements Runnable {
+
+private final RocksIterator rocksIterator;
+private final RocksDB rocksDB;
+private final AtomicBoolean status = new AtomicBoolean(true);
+
+public ResourceCleaner(RocksIterator rocksIterator, RocksDB rocksDB) {
+  this.rocksIterator = rocksIterator;
+  this.rocksDB = rocksDB;
+}
+
+@Override
+public void run() {
+  if (status.compareAndSet(true, false)) {
+rocksDB.notifyIteratorClosed(rocksIterator);
+rocksIterator.close();

Review Comment:
   I think this close still needs to be synchronized on the db...
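   A hedged Scala rendering of that suggestion (the actual class is Java and 
the types are simplified here): take the DB lock before closing so the cleaner 
cannot race with a concurrent `db.close()`.
   
   ```scala
   import java.util.concurrent.atomic.AtomicBoolean
   
   // Hedged sketch (simplified types): guard the iterator close with the DB
   // lock so it cannot race with the database being closed concurrently.
   class IteratorCleaner(iterator: AutoCloseable, dbLock: AnyRef) extends Runnable {
     private val open = new AtomicBoolean(true)
     override def run(): Unit = {
       if (open.compareAndSet(true, false)) {
         dbLock.synchronized {
           iterator.close()
         }
       }
     }
   }
   ```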
   
   







[PR] [WIP][CONNECT] Fix column resolution in DataFrame.drop [spark]

2023-11-01 Thread via GitHub


zhengruifeng opened a new pull request, #43632:
URL: https://github.com/apache/spark/pull/43632

   ### What changes were proposed in this pull request?
   Fix column resolution in DataFrame.drop
   
   
   ### Why are the changes needed?
   ```
   from pyspark.sql.functions import col
   
   # create first dataframe
   left_df = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['join_key', 
'value1'])
   # create second dataframe
   right_df = spark.createDataFrame([(1, 'aa'), (2, 'bb'), (4, 'dd')], 
['join_key', 'value2'])
   joined_df = left_df.join(right_df, on=left_df['join_key'] == 
right_df['join_key'], how='left')
   display(joined_df)
   cleaned_df = joined_df.drop(left_df['join_key'])
   display(cleaned_df) # error here
   
   JVM stacktrace:
   org.apache.spark.sql.AnalysisException: [AMBIGUOUS_REFERENCE] Reference 
`join_key` is ambiguous, could be: [`join_key`, `join_key`]. SQLSTATE: 42704
at 
org.apache.spark.sql.errors.QueryCompilationErrors$.ambiguousReferenceError(QueryCompilationErrors.scala:1957)
at 
org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:377)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:156)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:167)
at org.apache.spark.sql.Dataset.$anonfun$drop$4(Dataset.scala:3071)
   ```
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   added ut
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no





Re: [PR] Keep the taskBinary in Stage to avoid deserializing it multiple times, and send binary to executor instead of task. [spark]

2023-11-01 Thread via GitHub


hui730 commented on PR #43621:
URL: https://github.com/apache/spark/pull/43621#issuecomment-1789940175

   > Mind filing a JIRA please?
   
   Sure, I'll complete it today.





Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-01 Thread via GitHub


beliefer commented on PR #43629:
URL: https://github.com/apache/spark/pull/43629#issuecomment-1789938018

   AFAIK, users are used to using `REPAIR TABLE` to update partition statistics.
   
   I think `ANALYZE TABLE` should update the whole-table statistics instead of 
partition statistics. If we accept this PR, how can one update only the 
whole-table statistics without the partition statistics?





Re: [PR] Temporarily pin R-win version to 4.3.2 [spark]

2023-11-01 Thread via GitHub


panbingkun commented on PR #43631:
URL: https://github.com/apache/spark/pull/43631#issuecomment-1789936060

   
https://github.com/apache/spark/blob/a04d4e2233c0d20c6a86c64391b1e1a6071b4550/dev/appveyor-install-dependencies.ps1#L34
   @HyukjinKwon Do we need to add a log statement below this line that prints 
the R-win download URL?





Re: [PR] [SPARK-45748][SQL] Add a .fromSQL helper function for Literals [spark]

2023-11-01 Thread via GitHub


beliefer commented on code in PR #43612:
URL: https://github.com/apache/spark/pull/43612#discussion_r1379492080


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralExpressionSuite.scala:
##
@@ -477,4 +477,299 @@ class LiteralExpressionSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   Literal.create(UTF8String.fromString("Spark SQL"), 
ObjectType(classOf[UTF8String])),
   UTF8String.fromString("Spark SQL"))
   }
+
+  private def serializationRoundTripInfo(
+  before: Literal,
+  serializedValue: String,
+  serializedDataType: String,
+  after: Literal): String = {
+s"""- Initial Literal: $before with type ${before.dataType}
+   |- String after serialization: $serializedValue with type 
$serializedDataType
+   |- Deserialized Literal: $after with type ${after.dataType}
+   |""".stripMargin
+  }
+
+  test(".sql and fromSQL round trip: basic Literals") {
+Seq(
+  // null
+  Literal.create(null, BooleanType),
+  Literal.create(null, ByteType),
+  Literal.create(null, ShortType),
+  Literal.create(null, IntegerType),
+  Literal.create(null, LongType),
+  Literal.create(null, FloatType),
+  Literal.create(null, StringType),
+  Literal.create(null, BinaryType),
+  Literal.create(null, DecimalType.USER_DEFAULT),
+  Literal.create(null, DecimalType(20, 2)),
+  Literal.create(null, DateType),
+  Literal.create(null, TimestampType),
+  Literal.create(null, CalendarIntervalType),
+  Literal.create(null, YearMonthIntervalType()),
+  Literal.create(null, DayTimeIntervalType(1, 2)),
+  Literal.create(null, ArrayType(ByteType, true)),
+  Literal.create(null, MapType(StringType, IntegerType)),
+  Literal.create(null, StructType(Seq.empty)),
+
+  // boolean
+  Literal(true),
+  Literal(false),
+
+  // int, long, short, byte
+  Literal(0),
+  Literal(1.toLong),
+  Literal(0.toShort),
+  Literal(1.toByte),
+  Literal(Int.MinValue),
+  Literal(Int.MinValue.toShort),
+  Literal(Int.MaxValue),
+  Literal(Int.MaxValue.toByte),
+  Literal(Short.MinValue),
+  Literal(Byte.MaxValue),
+  Literal(Long.MinValue),
+  Literal(Long.MaxValue),
+
+  // float
+  Literal(0.0.toFloat),
+  Literal(-0.0.toFloat),
+  Literal(Float.PositiveInfinity),
+  Literal(Float.NegativeInfinity),
+  Literal(Float.MinPositiveValue),
+  Literal(Float.MaxValue),
+  Literal(Float.MinValue),
+  Literal(Float.NaN),
+
+  // double
+  Literal(0.0),
+  Literal(-0.0),
+  Literal(Double.NegativeInfinity),
+  Literal(Double.PositiveInfinity),
+  Literal(Double.MinValue),
+  Literal(Double.MaxValue),
+  Literal(Double.NaN),
+  Literal(Double.MinPositiveValue),
+
+  // Decimal -> without type it's problematic?
+  Literal(Decimal(-0.0001)),
+  Literal(Decimal(0.0)),
+  Literal(Decimal(0.001)),
+  Literal(Decimal(1.)),
+  Literal(Decimal(5.toLong, 10, 3)),
+  Literal(BigDecimal((-0.0001).toString)),
+  Literal(new java.math.BigDecimal(0.0.toString)),
+
+  // binary
+  Literal.create(new Array[Byte](0), BinaryType),
+  Literal.create(new Array[Byte](2), BinaryType),
+
+  // string
+  Literal(""),
+  Literal("a"),
+  Literal("\u"),
+  Literal("a b"),
+
+  // DayTimeInterval (Duration)
+  Literal(Duration.ofNanos(0)),
+  Literal(Duration.ofSeconds(-1)),
+  Literal(Duration.ofMinutes(62)),
+  Literal.create(Duration.ofMinutes(62), DayTimeIntervalType(2, 3)),
+  Literal(Duration.ofHours(10)),
+  Literal.create(Duration.ofDays(12345600), DayTimeIntervalType(0, 1)),
+
+  // YearMonthInterval (Period)
+  Literal(Period.ofYears(0)),
+  Literal.create(Period.ofYears(1), YearMonthIntervalType(0, 0)),
+  Literal.create(Period.ofYears(1), YearMonthIntervalType(1, 1)),
+  Literal(Period.of(-1, 11, 0)),
+  Literal(Period.ofMonths(Int.MinValue)),
+
+
+
+  // array
+  Literal(Array(1, 2, 3)),
+  Literal.create(Array(1.0, 2.0), ArrayType(DoubleType, false)),
+  Literal.create(Array(1.0, 2.0), ArrayType(DoubleType, true)),
+  Literal.create(Array(1.0, 2.0, null), ArrayType(DoubleType, true)),
+  Literal(Array("a")),
+
+  // array of struct
+  Literal(
+new GenericArrayData(
+  Array(
+InternalRow(UTF8String.fromString("a"), 1),
+InternalRow(UTF8String.fromString("b"), 2)
+  )
+),
+ArrayType(
+  StructType(Seq(StructField("col1", StringType), StructField("col2", 
IntegerType))),
+  containsNull = false
+)
+  ),
+
+  // struct
+  Literal.create((1, 3.0, "abc")),
+  Literal(
+InternalRow(true, 1.toLong),
+StructType(
+  Seq(
+StructField("col1", BooleanType, nullable = 
false).withComment("some-com

Re: [PR] Temporarily pin R-win version to 4.3.2 [spark]

2023-11-01 Thread via GitHub


panbingkun closed pull request #43631: Temporarily pin R-win version to 4.3.2
URL: https://github.com/apache/spark/pull/43631





Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]

2023-11-01 Thread via GitHub


panbingkun commented on code in PR #37588:
URL: https://github.com/apache/spark/pull/37588#discussion_r1379487429


##
dev/appveyor-install-dependencies.ps1:
##
@@ -27,11 +27,13 @@ Function InstallR {
 
   $urlPath = ""
   $latestVer = $(ConvertFrom-JSON $(Invoke-WebRequest 
https://rversions.r-pkg.org/r-release-win).Content).version
+  $latestVer = "4.3.2"

Review Comment:
   This modification is not related to this PR; it is caused by another issue: 
https://github.com/apache/spark/pull/43631. It is only here to temporarily 
pass the AppVeyor integration; I will remove it once that is resolved.
   






Re: [PR] Temporarily pin R-win version to 4.3.2 [spark]

2023-11-01 Thread via GitHub


panbingkun commented on PR #43631:
URL: https://github.com/apache/spark/pull/43631#issuecomment-1789919487

   > It will be back up pretty much quickly. Let's just leave it out so 
https://rversions.r-pkg.org/r-release-win can return 4.3.2 properly.
   
   Okay, let me close this PR first. Another PR of mine has been consistently 
failing, so I investigated the reason a bit, haha.





Re: [PR] Temporarily set R-win version to 4.3.2 [spark]

2023-11-01 Thread via GitHub


HyukjinKwon commented on PR #43631:
URL: https://github.com/apache/spark/pull/43631#issuecomment-1789915775

   It will be back up pretty much quickly. Let's just leave it out so 
https://rversions.r-pkg.org/r-release-win can return 4.3.2 properly.





[PR] Temporarily set R-win version to 4.3.2 [spark]

2023-11-01 Thread via GitHub


panbingkun opened a new pull request, #43631:
URL: https://github.com/apache/spark/pull/43631

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-45761][K8S][INFRA][DOCS] Upgrade `Volcano` to 1.8.1 [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun commented on PR #43624:
URL: https://github.com/apache/spark/pull/43624#issuecomment-1789907744

   Thank you, @HyukjinKwon !





Re: [PR] [SPARK-45742][CORE][CONNECT][MLLIB][PYTHON] Introduce an implicit function for Scala Array to wrap into `immutable.ArraySeq`. [spark]

2023-11-01 Thread via GitHub


LuciferYang commented on PR #43607:
URL: https://github.com/apache/spark/pull/43607#issuecomment-1789899834

   Could you help review this PR if you have time? @dongjoon-hyun Thanks ~





Re: [PR] [MINOR][DOCS] Fix typo [spark]

2023-11-01 Thread via GitHub


YuanHanzhong commented on code in PR #43626:
URL: https://github.com/apache/spark/pull/43626#discussion_r1379459881


##
hadoop-cloud/README.md:
##
@@ -16,5 +16,5 @@ Integration tests will have some extra configurations for 
example selecting the
 run the test against. Those configs are passed as environment variables and 
the existence of these
 variables must be checked by the test.
 Like for `AwsS3AbortableStreamBasedCheckpointFileManagerSuite` the S3 bucket 
used for testing
-is passed in the `S3A_PATH` and the credetinals to access AWS S3 are 
AWS_ACCESS_KEY_ID and
+is passed in the `S3A_PATH` and the credentials to access AWS S3 are 
AWS_ACCESS_KEY_ID and

Review Comment:
   Yes, of course, I will contribute more. Actually, this typo took me several 
hours to locate. In the future, I would like to make more technical 
contributions. Let me participate, and let's work together to make Spark more 
beautiful.






Re: [PR] [hotfix] [docs] Fix typo [spark]

2023-11-01 Thread via GitHub


HyukjinKwon closed pull request #43625: [hotfix] [docs] Fix typo 
URL: https://github.com/apache/spark/pull/43625





Re: [PR] [MINOR][DOCS] Fix typo [spark]

2023-11-01 Thread via GitHub


YuanHanzhong commented on PR #43626:
URL: https://github.com/apache/spark/pull/43626#issuecomment-1789873588

   > Mind keeping the PR template 
(https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE) 
please
   
   Sure, I will,  thanks for your reminder^_^





Re: [PR] [hotfix] [docs] Fix typo [spark]

2023-11-01 Thread via GitHub


YuanHanzhong commented on code in PR #43625:
URL: https://github.com/apache/spark/pull/43625#discussion_r1379455103


##
docs/README.md:
##
@@ -86,7 +86,7 @@ the source code and be captured by revision control 
(currently git). This way th
 includes the version of the documentation that is relevant regardless of which 
version or release
 you have checked out or downloaded.
 
-In this directory you will find text files formatted using Markdown, with an 
".md" suffix. You can
+In this directory you will find text files formatted using Markdown, with a 
".md" suffix. You can

Review Comment:
   Thanks for your review. I'm just trying to find an open-source project to 
contribute to. I will contribute more. 






Re: [PR] [SPARK-45639][SQL][PYTHON] Support loading Python data sources in DataFrameReader [spark]

2023-11-01 Thread via GitHub


HyukjinKwon commented on code in PR #43630:
URL: https://github.com/apache/spark/pull/43630#discussion_r1379454389


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceManager.scala:
##
@@ -20,35 +20,60 @@ package org.apache.spark.sql.execution.datasources
 import java.util.Locale
 import java.util.concurrent.ConcurrentHashMap
 
+import org.apache.spark.annotation.Unstable
 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
 import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.types.StructType
-import org.apache.spark.sql.util.CaseInsensitiveStringMap
 
+/**
+ * A manager for user-defined data sources. It is used to register and lookup 
data sources by
+ * their short names or fully qualified names.
+ */
+@Unstable

Review Comment:
   This too. Let's just remove this `@Unstable`.






Re: [PR] [SPARK-45639][SQL][PYTHON] Support loading Python data sources in DataFrameReader [spark]

2023-11-01 Thread via GitHub


HyukjinKwon commented on code in PR #43630:
URL: https://github.com/apache/spark/pull/43630#discussion_r1379454587


##
sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##
@@ -231,6 +231,7 @@ class SparkSession private(
   /**
* A collection of methods for registering user-defined data sources.
*/
+  @Unstable

Review Comment:
   We should either remove `@Unstable` or `private[sql]` if we want this as an 
API.






Re: [PR] [SPARK-45639][SQL][PYTHON] Support loading Python data sources in DataFrameReader [spark]

2023-11-01 Thread via GitHub


HyukjinKwon commented on code in PR #43630:
URL: https://github.com/apache/spark/pull/43630#discussion_r1379454153


##
sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala:
##
@@ -110,11 +111,13 @@ private[sql] class SharedState(
   /**
* A data source manager shared by all sessions.
*/
+  @Unstable
   lazy val dataSourceManager = new DataSourceManager()

Review Comment:
   I think we actually don't need to expose this and below as an API?






Re: [PR] Keep the taskBinary in Stage to avoid deserializing it multiple times, and send binary to executor instead of task. [spark]

2023-11-01 Thread via GitHub


HyukjinKwon commented on PR #43621:
URL: https://github.com/apache/spark/pull/43621#issuecomment-1789867256

   Mind filing a JIRA please?





Re: [PR] [SPARK-45761][K8S][INFRA][DOCS] Upgrade `Volcano` to 1.8.1 [spark]

2023-11-01 Thread via GitHub


HyukjinKwon closed pull request #43624: [SPARK-45761][K8S][INFRA][DOCS] Upgrade 
`Volcano` to 1.8.1
URL: https://github.com/apache/spark/pull/43624





Re: [PR] [SPARK-45761][K8S][INFRA][DOCS] Upgrade `Volcano` to 1.8.1 [spark]

2023-11-01 Thread via GitHub


HyukjinKwon commented on PR #43624:
URL: https://github.com/apache/spark/pull/43624#issuecomment-1789866104

   Merged to master.





Re: [PR] [SPARK-44419][SQL] Support to extract partial filters of datasource v2 table and push them down [spark]

2023-11-01 Thread via GitHub


github-actions[bot] commented on PR #42000:
URL: https://github.com/apache/spark/pull/42000#issuecomment-1789865589

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





Re: [PR] [SPARK-44426][SQL] Optimize adaptive skew join for ExistenceJoin [spark]

2023-11-01 Thread via GitHub


github-actions[bot] commented on PR #42003:
URL: https://github.com/apache/spark/pull/42003#issuecomment-1789865574

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





Re: [PR] [MINOR][DOCS] Fix typo [spark]

2023-11-01 Thread via GitHub


HyukjinKwon commented on PR #43626:
URL: https://github.com/apache/spark/pull/43626#issuecomment-1789865527

   Mind keeping the PR template 
(https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE) 
please





Re: [PR] [hotfix] [docs] Fix typo [spark]

2023-11-01 Thread via GitHub


HyukjinKwon commented on code in PR #43626:
URL: https://github.com/apache/spark/pull/43626#discussion_r1379451456


##
hadoop-cloud/README.md:
##
@@ -16,5 +16,5 @@ Integration tests will have some extra configurations for 
example selecting the
 run the test against. Those configs are passed as environment variables and 
the existence of these
 variables must be checked by the test.
 Like for `AwsS3AbortableStreamBasedCheckpointFileManagerSuite` the S3 bucket 
used for testing
-is passed in the `S3A_PATH` and the credetinals to access AWS S3 are 
AWS_ACCESS_KEY_ID and
+is passed in the `S3A_PATH` and the credentials to access AWS S3 are 
AWS_ACCESS_KEY_ID and

Review Comment:
   This is fine. Mind taking a look for other typos since we're here? Let's 
fix them in a batch in this PR.






Re: [PR] [hotfix] [docs] Fix typo [spark]

2023-11-01 Thread via GitHub


HyukjinKwon commented on code in PR #43625:
URL: https://github.com/apache/spark/pull/43625#discussion_r1379451081


##
docs/README.md:
##
@@ -86,7 +86,7 @@ the source code and be captured by revision control 
(currently git). This way th
 includes the version of the documentation that is relevant regardless of which 
version or release
 you have checked out or downloaded.
 
-In this directory you will find text files formatted using Markdown, with an 
".md" suffix. You can
+In this directory you will find text files formatted using Markdown, with a 
".md" suffix. You can

Review Comment:
   I think it's really no biggie. We call it the MD suffix, so "an" can be 
legitimate?






Re: [PR] [SPARK-45680][CONNECT] Release session [spark]

2023-11-01 Thread via GitHub


HyukjinKwon closed pull request #43546: [SPARK-45680][CONNECT] Release session
URL: https://github.com/apache/spark/pull/43546





Re: [PR] [SPARK-45680][CONNECT] Release session [spark]

2023-11-01 Thread via GitHub


HyukjinKwon commented on PR #43546:
URL: https://github.com/apache/spark/pull/43546#issuecomment-1789860912

   Merged to master.





Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-01 Thread via GitHub


sunchao commented on PR #43629:
URL: https://github.com/apache/spark/pull/43629#issuecomment-1789852270

   Thanks @dongjoon-hyun for the quick reply.
   
   > According to the title and first sentence of PR description, is this 
related to another JIRA
   
   Not really. The title means this PR proposes to, in addition to updating 
table stats, also update partition stats with the `ANALYZE TABLE` command.
   
   > Just a question. Why don't we use `REPAIR TABLE` before this?
   
   Hmm, I think `REPAIR TABLE` serves a different purpose: it is used to recover partitions for an existing table that is created from a directory which contains sub-directories for partitions. On the other hand, `ANALYZE TABLE` can be used to update table & partition stats. For instance, a partition could already exist for a table, but its stats could be out of sync, for reasons such as data being written to the partition directory without going through Spark.
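   
   As a hedged illustration of the distinction (the `sales` table name and the `spark` SparkSession are assumptions, not from this PR):
   
   ```scala
   // MSCK REPAIR TABLE registers partition directories that exist on storage
   // but are missing from the catalog; ANALYZE TABLE refreshes statistics for
   // partitions the catalog already knows about.
   spark.sql("MSCK REPAIR TABLE sales")
   spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
   ```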
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45748][SQL] Add a .fromSQL helper function for Literals [spark]

2023-11-01 Thread via GitHub


anchovYu commented on code in PR #43612:
URL: https://github.com/apache/spark/pull/43612#discussion_r1379437089


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##
@@ -247,6 +251,67 @@ object Literal {
   s"Literal must have a corresponding value to ${dataType.catalogString}, 
" +
   s"but class ${Utils.getSimpleName(value.getClass)} found.")
   }
+
+  /**
+   * Parse and analyze the .sql string of a Literal, construct the Literal 
given the data type json.
+   * Example usage:
+   *   val serializedValue = lit.sql
+   *   val dataTypeJson = lit.dataType.json
+   *   val deserializedLit: Literal = Literal.fromSQL(serializedValue, 
dataTypeJson)
+   *
+   * @param valueSqlStr the .sql string of the Literal
+   * @param dataTypeJson the json format data type of the Literal
+   * @throws AnalysisException 
INVALID_LITERAL_VALUE_SQL_STRING_FOR_DESERIALIZATION
+   */
+  private[sql] def fromSQL(valueSqlStr: String, dataTypeJson: String): Literal 
= {

Review Comment:
   DDL does not preserve those `containsNull` fields, while JSON does.
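   
   A minimal sketch of the difference using the public type APIs (assumed behavior, not code from this PR):
   
   ```scala
   import org.apache.spark.sql.types._
   
   val schema = StructType(Seq(
     StructField("a", ArrayType(IntegerType, containsNull = false))))
   
   // JSON round-trips the containsNull flag:
   DataType.fromJson(schema.json) == schema    // true
   
   // DDL ("a ARRAY<INT>") has nowhere to record it, so it defaults back to true:
   StructType.fromDDL(schema.toDDL) == schema  // false
   ```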



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45527][core] Use fraction to do the resource calculation [spark]

2023-11-01 Thread via GitHub


wbo4958 commented on code in PR #43494:
URL: https://github.com/apache/spark/pull/43494#discussion_r1379434828


##
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala:
##
@@ -273,7 +273,8 @@ private[spark] class CoarseGrainedExecutorBackend(
   override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer): 
Unit = {
 val resources = taskResources.getOrDefault(taskId, Map.empty[String, 
ResourceInformation])

Review Comment:
   got it. I will fix them in this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-45639][SQL][PYTHON] Support loading Python data sources in DataFrameReader [spark]

2023-11-01 Thread via GitHub


allisonwang-db opened a new pull request, #43630:
URL: https://github.com/apache/spark/pull/43630

   
   
   ### What changes were proposed in this pull request?
   
   This PR supports `spark.read.format(...).load()` for Python data sources.
   
   After this PR, users can use a Python data source directly like this:
   ```python
   from pyspark.sql.datasource import DataSource, DataSourceReader
   
   class MyReader(DataSourceReader):
   def read(self, partition):
   yield (0, 1)
   
   class MyDataSource(DataSource):
   @classmethod
   def name(cls):
   return "my-source"
   
   def schema(self):
   return "id INT, value INT"
   
   def reader(self, schema):
   return MyReader()
   
   spark.dataSource.register(MyDataSource)
   
   df = spark.read.format("my-source").load()
   df.show()
   +---+-----+
   | id|value|
   +---+-----+
   |  0|    1|
   +---+-----+
   ```
   
   ### Why are the changes needed?
   
   To support Python data sources.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. After this PR, users can load a custom Python data source using 
`spark.read.format(...).load()`.
   
   ### How was this patch tested?
   
   New unit tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45762][CORE]: Support shuffle managers defined in user jars by changing startup order [spark]

2023-11-01 Thread via GitHub


abellina commented on PR #43627:
URL: https://github.com/apache/spark/pull/43627#issuecomment-1789727698

   @tgravescs fyi


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45763][CORE][UI] Improve `MasterPage` to show `Resource` column only when it exists [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun commented on PR #43628:
URL: https://github.com/apache/spark/pull/43628#issuecomment-1789634045

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45763][CORE][UI] Improve `MasterPage` to show `Resource` column only when it exists [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun closed pull request #43628: [SPARK-45763][CORE][UI] Improve 
`MasterPage` to show `Resource` column only when it exists
URL: https://github.com/apache/spark/pull/43628


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45763][CORE][UI] Improve `MasterPage` to show `Resource` column only when it exists [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun commented on PR #43628:
URL: https://github.com/apache/spark/pull/43628#issuecomment-1789626309

   Thank you, @viirya !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-01 Thread via GitHub


sunchao opened a new pull request, #43629:
URL: https://github.com/apache/spark/pull/43629

   
   
   ### What changes were proposed in this pull request?
   
   
   Also update partition statistics (e.g., total size in bytes, row count) with 
`ANALYZE TABLE` command.
   
   ### Why are the changes needed?
   
   
   Currently, when an `ANALYZE TABLE` command is triggered against a partitioned table, only table stats are updated, but not partition stats. Spark users who want to update the latter have to use a different syntax, `ANALYZE TABLE ... PARTITION(...)`, which is more verbose.
   
   Given `ANALYZE TABLE` internally already calculates total size for all the 
partitions, it makes sense to also update partition stats using the result. In 
this way, Spark users do not need to remember two different syntaxes.
   
   In addition, when using `ANALYZE TABLE` with a scan, i.e., when `NOSCAN` is NOT specified, we can also calculate row counts for all the partitions and update the stats accordingly.
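   
   A small sketch of the two modes (the `events` table and the `spark` session are hypothetical):
   
   ```scala
   spark.sql("ANALYZE TABLE events COMPUTE STATISTICS NOSCAN") // sizes only, no data scan
   spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")        // scans data, so row counts too
   ```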
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   Yes, now with `ANALYZE TABLE` command on a partition table, the partition 
stats will also get updated.
   
   ### How was this patch tested?
   
   
   Added a unit test for this feature.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45748][SQL] Add a .fromSQL helper function for Literals [spark]

2023-11-01 Thread via GitHub


anchovYu commented on code in PR #43612:
URL: https://github.com/apache/spark/pull/43612#discussion_r1379276234


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -1866,6 +1866,29 @@
 },
 "sqlState" : "42K0E"
   },
+  "INVALID_LITERAL_VALUE_SQL_STRING_FOR_DESERIALIZATION" : {
+"message" : [
+  "The given Literal value SQL string  (dataType: ) is 
invalid and can't be deserialized."
+],
+"subClass" : {
+  "ANALYSIS_FAILURE" : {
+"message" : [
+  "It cannot be analyzed successfully."
+]
+  },
+  "NOT_EXPECTED_LITERAL_TYPE" : {
+"message" : [
+  "The deserialized expression  is not an Literal with data type 
."
+]
+  },
+  "PARSE_FAILURE" : {
+"message" : [
+  "It cannot be parsed successfully."
+]
+  }
+},
+"sqlState" : "42894"

Review Comment:
   It's a bit tricky: in this PR it's a helper function that accepts any string, so it could be a user error where the user passes in a random string.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45748][SQL] Add a .fromSQL helper function for Literals [spark]

2023-11-01 Thread via GitHub


srielau commented on code in PR #43612:
URL: https://github.com/apache/spark/pull/43612#discussion_r1379272566


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -1866,6 +1866,29 @@
 },
 "sqlState" : "42K0E"
   },
+  "INVALID_LITERAL_VALUE_SQL_STRING_FOR_DESERIALIZATION" : {
+"message" : [
+  "The given Literal value SQL string  (dataType: ) is 
invalid and can't be deserialized."
+],
+"subClass" : {
+  "ANALYSIS_FAILURE" : {
+"message" : [
+  "It cannot be analyzed successfully."
+]
+  },
+  "NOT_EXPECTED_LITERAL_TYPE" : {
+"message" : [
+  "The deserialized expression  is not an Literal with data type 
."
+]
+  },
+  "PARSE_FAILURE" : {
+"message" : [
+  "It cannot be parsed successfully."
+]
+  }
+},
+"sqlState" : "42894"

Review Comment:
   Is this a user error? Shouldn't it be an internal error? I.e., we expect it to work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45763][CORE][UI] Improve `MasterPage` to show `Resource` column only when it exists [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun commented on PR #43628:
URL: https://github.com/apache/spark/pull/43628#issuecomment-1789572705

   Could you review this PR, @viirya ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45760][SQL] Add With expression to avoid duplicating expressions [spark]

2023-11-01 Thread via GitHub


viirya commented on code in PR #43623:
URL: https://github.com/apache/spark/pull/43623#discussion_r1379249466


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpression.scala:
##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeMap, CommonExpressionRef, Expression, With}
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{COMMON_EXPR_REF, 
WITH_EXPRESSION}
+
+/**
+ * Rewrites the `With` expressions by adding a `Project` to pre-evaluate the 
common expressions, or
+ * just inline them if they are cheap.
+ *
+ * Note: For now we only use `With` in a few `RuntimeReplaceable` expressions. 
If we expand its
+ *   usage, we should support aggregate/window functions as well.
+ */
+object RewriteWithExpression extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = {
+plan.transformWithPruning(_.containsPattern(WITH_EXPRESSION)) {
+  case p if p.expressions.exists(_.containsPattern(WITH_EXPRESSION)) =>
+val commonExprs = mutable.ArrayBuffer.empty[Alias]
+// `With` can be nested, we should only rewrite the leaf `With` 
expression, as the outer
+// `With` needs to add its own Project, in the next iteration when it becomes a leaf.
+// This is done via "transform down", checking that the common expression definitions do not
+// contain nested `With`.
+var newPlan: LogicalPlan = p.transformExpressionsDown {
+  case With(child, defs) if 
defs.forall(!_.containsPattern(WITH_EXPRESSION)) =>
+val idToCheapExpr = mutable.HashMap.empty[Long, Expression]
+val idToNonCheapExpr = mutable.HashMap.empty[Long, Alias]
+defs.foreach { commonExprDef =>
+  if (CollapseProject.isCheap(commonExprDef.child)) {
+idToCheapExpr(commonExprDef.id) = commonExprDef.child
+  } else {
+// TODO: we should calculate the ref count and also inline the 
common expression
+//   if its ref count is 1.
+val alias = Alias(commonExprDef.child, 
s"_common_expr_${commonExprDef.id}")()
+commonExprs += alias
+idToNonCheapExpr(commonExprDef.id) = alias
+  }
+}
+
+child.transformWithPruning(_.containsPattern(COMMON_EXPR_REF)) {
+  case ref: CommonExpressionRef =>
+idToCheapExpr.getOrElse(ref.id, 
idToNonCheapExpr(ref.id).toAttribute)
+}
+}
+
+var exprsToAdd = commonExprs.toSeq
+val newChildren = p.children.map { child =>
+  val (newExprs, others) = 
exprsToAdd.partition(_.references.subsetOf(child.outputSet))
+  exprsToAdd = others
+  if (newExprs.nonEmpty) {
+Project(child.output ++ newExprs, child)
+  } else {
+child
+  }
+}
+
+if (exprsToAdd.nonEmpty) {
+  // If we cannot rewrite the common expressions, force to inline them 
so that the query

Review Comment:
   When can we not rewrite them?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45760][SQL] Add With expression to avoid duplicating expressions [spark]

2023-11-01 Thread via GitHub


viirya commented on code in PR #43623:
URL: https://github.com/apache/spark/pull/43623#discussion_r1379248105


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpression.scala:
##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeMap, CommonExpressionRef, Expression, With}
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{COMMON_EXPR_REF, 
WITH_EXPRESSION}
+
+/**
+ * Rewrites the `With` expressions by adding a `Project` to pre-evaluate the 
common expressions, or
+ * just inline them if they are cheap.
+ *
+ * Note: For now we only use `With` in a few `RuntimeReplaceable` expressions. 
If we expand its
+ *   usage, we should support aggregate/window functions as well.
+ */
+object RewriteWithExpression extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = {
+plan.transformWithPruning(_.containsPattern(WITH_EXPRESSION)) {
+  case p if p.expressions.exists(_.containsPattern(WITH_EXPRESSION)) =>
+val commonExprs = mutable.ArrayBuffer.empty[Alias]
+// `With` can be nested, we should only rewrite the leaf `With` 
expression, as the outer
+// `With` needs to add its own Project, in the next iteration when it becomes a leaf.
+// This is done via "transform down", checking that the common expression definitions do not
+// contain nested `With`.
+var newPlan: LogicalPlan = p.transformExpressionsDown {
+  case With(child, defs) if 
defs.forall(!_.containsPattern(WITH_EXPRESSION)) =>
+val idToCheapExpr = mutable.HashMap.empty[Long, Expression]
+val idToNonCheapExpr = mutable.HashMap.empty[Long, Alias]
+defs.foreach { commonExprDef =>
+  if (CollapseProject.isCheap(commonExprDef.child)) {
+idToCheapExpr(commonExprDef.id) = commonExprDef.child
+  } else {
+// TODO: we should calculate the ref count and also inline the 
common expression
+//   if its ref count is 1.
+val alias = Alias(commonExprDef.child, 
s"_common_expr_${commonExprDef.id}")()
+commonExprs += alias
+idToNonCheapExpr(commonExprDef.id) = alias
+  }
+}
+
+child.transformWithPruning(_.containsPattern(COMMON_EXPR_REF)) {
+  case ref: CommonExpressionRef =>
+idToCheapExpr.getOrElse(ref.id, 
idToNonCheapExpr(ref.id).toAttribute)
+}
+}
+
+var exprsToAdd = commonExprs.toSeq
+val newChildren = p.children.map { child =>

Review Comment:
   Shouldn't be?
   
   ```suggestion
   val newChildren = newPlan.children.map { child =>
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45761][K8S][INFRA][DOCS] Upgrade `Volcano` to 1.8.1 [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun commented on PR #43624:
URL: https://github.com/apache/spark/pull/43624#issuecomment-1789556244

   K8s integration test passed.
   
   https://github.com/apache/spark/assets/9700541/b116dbbf-ee28-40a8-964b-e4a71fd150b0
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45762][CORE]: Support shuffle managers defined in user jars by changing startup order [spark]

2023-11-01 Thread via GitHub


abellina commented on PR #43627:
URL: https://github.com/apache/spark/pull/43627#issuecomment-1789508860

   OK, I believe, given the MiMa code, that I need to add a temporary restriction here: https://github.com/apache/spark/blob/master/project/MimaExcludes.scala#L37.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45754][CORE] Support `spark.deploy.appIdPattern` [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun closed pull request #43616: [SPARK-45754][CORE] Support 
`spark.deploy.appIdPattern`
URL: https://github.com/apache/spark/pull/43616


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45754][CORE] Support `spark.deploy.appIdPattern` [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun commented on PR #43616:
URL: https://github.com/apache/spark/pull/43616#issuecomment-1789483812

   Merged to master. Thank you again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

2023-11-01 Thread via GitHub


peter-toth commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1379176991


##
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##
@@ -73,4 +76,41 @@ package object sql {
* with rebasing.
*/
   private[sql] val SPARK_LEGACY_INT96_METADATA_KEY = 
"org.apache.spark.legacyINT96"
+
+  /**
+   * This helper function captures the Spark API and its call site in the user 
code from the current
+   * stacktrace.
+   *
+   * As adding `withOrigin` explicitly to all Spark API definitions would be a huge change,
+   * `withOrigin` is used only at certain places where all API implementations surely pass
+   * through, and the current stacktrace is filtered to the point where the first Spark API
+   * code is invoked from the user code.
+   *
+   * As there might be multiple nested `withOrigin` calls (e.g. any Spark API 
implementations can
+   * invoke other APIs) only the first `withOrigin` is captured because that 
is closer to the user
+   * code.
+   *
+   * @param f The function that can use the origin.
+   * @return The result of `f`.
+   */
+  private[sql] def withOrigin[T](f: => T): T = {
+if (CurrentOrigin.get.stackTrace.isDefined) {
+  f
+} else {
+  val st = Thread.currentThread().getStackTrace
+  var i = 3

Review Comment:
   This has been discussed at 
https://github.com/apache/spark/pull/42740#discussion_r1337356903.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

2023-11-01 Thread via GitHub


MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1379169569


##
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##
@@ -73,4 +76,41 @@ package object sql {
* with rebasing.
*/
   private[sql] val SPARK_LEGACY_INT96_METADATA_KEY = 
"org.apache.spark.legacyINT96"
+
+  /**
+   * This helper function captures the Spark API and its call site in the user 
code from the current
+   * stacktrace.
+   *
+   * As adding `withOrigin` explicitly to all Spark API definitions would be a huge change,
+   * `withOrigin` is used only at certain places where all API implementations surely pass
+   * through, and the current stacktrace is filtered to the point where the first Spark API
+   * code is invoked from the user code.
+   *
+   * As there might be multiple nested `withOrigin` calls (e.g. any Spark API 
implementations can
+   * invoke other APIs) only the first `withOrigin` is captured because that 
is closer to the user
+   * code.
+   *
+   * @param f The function that can use the origin.
+   * @return The result of `f`.
+   */
+  private[sql] def withOrigin[T](f: => T): T = {
+if (CurrentOrigin.get.stackTrace.isDefined) {
+  f
+} else {
+  val st = Thread.currentThread().getStackTrace
+  var i = 3

Review Comment:
   Regarding the magic number: we always have these 3 elements at the beginning of the input array:
   https://github.com/apache/spark/assets/1580697/bff163eb-8cba-40bd-bdf6-8f7094abe80d
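   
   For readers without the screenshot, a rough sketch of the typical frame layout (the exact frames depend on the JVM and call site; this is an assumption, not output from the PR):
   
   ```scala
   val st = Thread.currentThread().getStackTrace
   // st(0) -> java.lang.Thread.getStackTrace
   // st(1) -> the method that called getStackTrace (withOrigin itself)
   // st(2) -> the Spark API method wrapped by withOrigin
   // st(3) -> the first frame that can belong to user code, hence `var i = 3`
   st.take(4).foreach(f => println(s"${f.getClassName}.${f.getMethodName}"))
   ```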
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

2023-11-01 Thread via GitHub


MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1379164499


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -508,9 +508,11 @@ class Dataset[T] private[sql](
* @group basic
* @since 3.4.0
*/
-  def to(schema: StructType): DataFrame = withPlan {
-val replaced = 
CharVarcharUtils.failIfHasCharVarchar(schema).asInstanceOf[StructType]
-Project.matchSchema(logicalPlan, replaced, sparkSession.sessionState.conf)
+  def to(schema: StructType): DataFrame = withOrigin() {

Review Comment:
   @cloud-fan For example, **with** `withOrigin` in `select`, we stop at index 3:
   https://github.com/apache/spark/assets/1580697/7080a1bb-7128-4c23-baaf-64aab6f7cd0a
   
   Without `withOrigin` in `select`, the picture is different. We are in `withOrigin` in `Column`'s constructor, and stopped at index 5:
   https://github.com/apache/spark/assets/1580697/c6747280-fc30-4500-a26c-e0a063544966
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-45763][CORE][UI] Improve `MasterPage` to show `Resource` column only when it exists [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun opened a new pull request, #43628:
URL: https://github.com/apache/spark/pull/43628

   …
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45748][SQL] Add a .fromSQL helper function for Literals [spark]

2023-11-01 Thread via GitHub


anchovYu commented on code in PR #43612:
URL: https://github.com/apache/spark/pull/43612#discussion_r1379146887


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##
@@ -247,6 +251,67 @@ object Literal {
   s"Literal must have a corresponding value to ${dataType.catalogString}, 
" +
   s"but class ${Utils.getSimpleName(value.getClass)} found.")
   }
+
+  /**
+   * Parse and analyze the .sql string of a Literal, construct the Literal 
given the data type json.
+   * Example usage:
+   *   val serializedValue = lit.sql
+   *   val dataTypeJson = lit.dataType.json
+   *   val deserializedLit: Literal = Literal.fromSQL(serializedValue, 
dataTypeJson)
+   *
+   * @param valueSqlStr the .sql string of the Literal
+   * @param dataTypeJson the json format data type of the Literal
+   * @throws AnalysisException 
INVALID_LITERAL_VALUE_SQL_STRING_FOR_DESERIALIZATION
+   */
+  private[sql] def fromSQL(valueSqlStr: String, dataTypeJson: String): Literal 
= {
+def parseAndAnalyze(valueSqlStr: String, dataType: DataType): Expression = 
{

Review Comment:
   I was thinking about that too, but I found that `ResolveDefaultColumns.analyze` 1) has more column-default-specific errors and error messages, and 2) has more limitations on the supported cases. I also found it can return a non-Literal value, e.g. a Cast.
   Instead, I think in the future `ResolveDefaultColumns` could be refactored to use this general helper function.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45762][CORE]: Support shuffle managers defined in user jars by changing startup order [spark]

2023-11-01 Thread via GitHub


abellina commented on PR #43627:
URL: https://github.com/apache/spark/pull/43627#issuecomment-1789408786

   The MIMA tests are failing due to:
   ```
   [error] spark-core: Failed binary compatibility check against 
org.apache.spark:spark-core_2.13:3.5.0! Found 1 potential problems (filtered 
3908)
   [error]  * method 
this(java.lang.String,org.apache.spark.rpc.RpcEnv,org.apache.spark.serializer.Serializer,org.apache.spark.serializer.Serializer,org.apache.spark.serializer.SerializerManager,org.apache.spark.MapOutputTracker,org.apache.spark.shuffle.ShuffleManager,org.apache.spark.broadcast.BroadcastManager,org.apache.spark.storage.BlockManager,org.apache.spark.SecurityManager,org.apache.spark.metrics.MetricsSystem,org.apache.spark.memory.MemoryManager,org.apache.spark.scheduler.OutputCommitCoordinator,org.apache.spark.SparkConf)Unit
 in class org.apache.spark.SparkEnv does not have a correspondent in current 
version
   ```
   Which makes sense, since I changed `SparkEnv`.
   
   I am not entirely sure if adding this to `MimaExcludes` is the right 
approach here, and I think I need some help.
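   
   For what it's worth, a sketch of the kind of entry that usually goes into `project/MimaExcludes.scala` for a constructor change (the exact filter below is an assumption inferred from the error message, not a confirmed fix):
   
   ```scala
   import com.typesafe.tools.mima.core._
   
   // Hypothetical exclusion: "SparkEnv.this" is how MiMa names constructors.
   ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SparkEnv.this")
   ```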
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45748][SQL] Add a .fromSQL helper function for Literals [spark]

2023-11-01 Thread via GitHub


anchovYu commented on code in PR #43612:
URL: https://github.com/apache/spark/pull/43612#discussion_r1379100451


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralExpressionSuite.scala:
##
@@ -477,4 +477,299 @@ class LiteralExpressionSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   Literal.create(UTF8String.fromString("Spark SQL"), 
ObjectType(classOf[UTF8String])),
   UTF8String.fromString("Spark SQL"))
   }
+
+  private def serializationRoundTripInfo(
+  before: Literal,
+  serializedValue: String,
+  serializedDataType: String,
+  after: Literal): String = {
+s"""- Initial Literal: $before with type ${before.dataType}
+   |- String after serialization: $serializedValue with type 
$serializedDataType
+   |- Deserialized Literal: $after with type ${after.dataType}
+   |""".stripMargin
+  }
+
+  test(".sql and fromSQL round trip: basic Literals") {
+Seq(
+  // null
+  Literal.create(null, BooleanType),
+  Literal.create(null, ByteType),
+  Literal.create(null, ShortType),
+  Literal.create(null, IntegerType),
+  Literal.create(null, LongType),
+  Literal.create(null, FloatType),
+  Literal.create(null, StringType),
+  Literal.create(null, BinaryType),
+  Literal.create(null, DecimalType.USER_DEFAULT),
+  Literal.create(null, DecimalType(20, 2)),
+  Literal.create(null, DateType),
+  Literal.create(null, TimestampType),
+  Literal.create(null, CalendarIntervalType),
+  Literal.create(null, YearMonthIntervalType()),
+  Literal.create(null, DayTimeIntervalType(1, 2)),
+  Literal.create(null, ArrayType(ByteType, true)),
+  Literal.create(null, MapType(StringType, IntegerType)),
+  Literal.create(null, StructType(Seq.empty)),
+
+  // boolean
+  Literal(true),
+  Literal(false),
+
+  // int, long, short, byte
+  Literal(0),
+  Literal(1.toLong),
+  Literal(0.toShort),
+  Literal(1.toByte),
+  Literal(Int.MinValue),
+  Literal(Int.MinValue.toShort),
+  Literal(Int.MaxValue),
+  Literal(Int.MaxValue.toByte),
+  Literal(Short.MinValue),
+  Literal(Byte.MaxValue),
+  Literal(Long.MinValue),
+  Literal(Long.MaxValue),
+
+  // float
+  Literal(0.0.toFloat),
+  Literal(-0.0.toFloat),
+  Literal(Float.PositiveInfinity),
+  Literal(Float.NegativeInfinity),
+  Literal(Float.MinPositiveValue),
+  Literal(Float.MaxValue),
+  Literal(Float.MinValue),
+  Literal(Float.NaN),
+
+  // double
+  Literal(0.0),
+  Literal(-0.0),
+  Literal(Double.NegativeInfinity),
+  Literal(Double.PositiveInfinity),
+  Literal(Double.MinValue),
+  Literal(Double.MaxValue),
+  Literal(Double.NaN),
+  Literal(Double.MinPositiveValue),
+
+  // Decimal -> without type it's problematic?
+  Literal(Decimal(-0.0001)),
+  Literal(Decimal(0.0)),
+  Literal(Decimal(0.001)),
+  Literal(Decimal(1.)),
+  Literal(Decimal(5.toLong, 10, 3)),
+  Literal(BigDecimal((-0.0001).toString)),
+  Literal(new java.math.BigDecimal(0.0.toString)),
+
+  // binary
+  Literal.create(new Array[Byte](0), BinaryType),
+  Literal.create(new Array[Byte](2), BinaryType),
+
+  // string
+  Literal(""),
+  Literal("a"),
+  Literal("\u"),
+  Literal("a b"),
+
+  // DayTimeInterval (Duration)
+  Literal(Duration.ofNanos(0)),
+  Literal(Duration.ofSeconds(-1)),
+  Literal(Duration.ofMinutes(62)),
+  Literal.create(Duration.ofMinutes(62), DayTimeIntervalType(2, 3)),
+  Literal(Duration.ofHours(10)),
+  Literal.create(Duration.ofDays(12345600), DayTimeIntervalType(0, 1)),
+
+  // YearMonthInterval (Period)
+  Literal(Period.ofYears(0)),
+  Literal.create(Period.ofYears(1), YearMonthIntervalType(0, 0)),
+  Literal.create(Period.ofYears(1), YearMonthIntervalType(1, 1)),
+  Literal(Period.of(-1, 11, 0)),
+  Literal(Period.ofMonths(Int.MinValue)),
+
+
+
+  // array
+  Literal(Array(1, 2, 3)),
+  Literal.create(Array(1.0, 2.0), ArrayType(DoubleType, false)),
+  Literal.create(Array(1.0, 2.0), ArrayType(DoubleType, true)),
+  Literal.create(Array(1.0, 2.0, null), ArrayType(DoubleType, true)),
+  Literal(Array("a")),
+
+  // array of struct
+  Literal(
+new GenericArrayData(
+  Array(
+InternalRow(UTF8String.fromString("a"), 1),
+InternalRow(UTF8String.fromString("b"), 2)
+  )
+),
+ArrayType(
+  StructType(Seq(StructField("col1", StringType), StructField("col2", 
IntegerType))),
+  containsNull = false
+)
+  ),
+
+  // struct
+  Literal.create((1, 3.0, "abc")),
+  Literal(
+InternalRow(true, 1.toLong),
+StructType(
+  Seq(
+StructField("col1", BooleanType, nullable = 
false).withComment("some-com

Re: [PR] [SPARK-45762][CORE]: Support shuffle managers defined in user jars by changing startup order [spark]

2023-11-01 Thread via GitHub


abellina commented on PR #43627:
URL: https://github.com/apache/spark/pull/43627#issuecomment-1789318235

   I'll have to figure out the CI. It seems my fork is running things, but I am getting some failures on this page (`AppVeyor` and the `Notify test workflow`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] SPARK-45762: Support shuffle managers defined in user jars by changing startup order [spark]

2023-11-01 Thread via GitHub


abellina opened a new pull request, #43627:
URL: https://github.com/apache/spark/pull/43627

   ### What changes were proposed in this pull request?
   As reported here https://issues.apache.org/jira/browse/SPARK-45762, 
`ShuffleManager` instances defined in a user jar cannot be used in all cases unless specified in `extraClassPath`. We would like to avoid adding extra configuration if this instance is already included in a jar passed via `--jars`.
   
   Proposed changes:
   
   Refactor the code so we initialize the `ShuffleManager` later, after jars have been localized. This is especially necessary in the executor, where we need to delay this initialization until after the `replClassLoader` is updated with jars passed in `--jars`.
   
   Before this change, the `ShuffleManager` is instantiated at `SparkEnv` 
creation. Having to instantiate the `ShuffleManager` this early doesn't work, 
because user jars have not been localized in all scenarios, and we will fail to 
load the `ShuffleManager` defined in `--jars`. We propose moving the 
`ShuffleManager` instantiation to `SparkContext` on the driver, and `Executor`.
   
   ### Why are the changes needed?
   This is not a new API but a change of startup order. The changes are needed to improve the user experience by reducing extra configuration, regardless of how a Spark application is launched.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, but it's backwards compatible. Users no longer need to specify a 
`ShuffleManager` jar in `extraClassPath`, but they are able to if they desire.
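   
   As a hedged sketch of the intended user experience (the jar path and class name below are made up):
   
   ```scala
   import org.apache.spark.SparkConf
   
   // The shuffle manager ships inside a jar passed via --jars / spark.jars;
   // no extraClassPath entry should be needed once instantiation happens
   // after jar localization.
   val conf = new SparkConf()
     .set("spark.jars", "/path/to/my-shuffle-plugin.jar")
     .set("spark.shuffle.manager", "com.example.MyShuffleManager")
   ```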
   
   ### How was this patch tested?
   Added a unit test showing that a test `ShuffleManager` is available after 
`--jars` are passed, but not without (using local-cluster mode). 
   
   Tested manually with standalone mode, local-cluster mode, yarn client and 
cluster mode, k8s.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [hotfix] [docs] Fix typo [spark]

2023-11-01 Thread via GitHub


YuanHanzhong opened a new pull request, #43626:
URL: https://github.com/apache/spark/pull/43626

   [hotfix] [docs] Fix typo 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [hotfix] [docs] Fix typo [spark]

2023-11-01 Thread via GitHub


YuanHanzhong opened a new pull request, #43625:
URL: https://github.com/apache/spark/pull/43625

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [hotfix] [docs] Fix typo [spark]

2023-11-01 Thread via GitHub


YuanHanzhong closed pull request #43622: [hotfix] [docs] Fix typo
URL: https://github.com/apache/spark/pull/43622


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-45761][K8S][INFRA][DOCS] Upgrade `Volcano` to 1.8.1 [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun opened a new pull request, #43624:
URL: https://github.com/apache/spark/pull/43624

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45533][CORE] Use j.l.r.Cleaner instead of finalize for RocksDBIterator/LevelDBIterator [spark]

2023-11-01 Thread via GitHub


zhaomin1423 commented on code in PR #43502:
URL: https://github.com/apache/spark/pull/43502#discussion_r1379017731


##
common/kvstore/src/main/java/org/apache/spark/util/kvstore/RocksDBIterator.java:
##
@@ -272,4 +283,37 @@ static int compare(byte[] a, byte[] b) {
 return a.length - b.length;
   }
 
+  static class ResourceCleaner implements Runnable {
+
+private final RocksIterator rocksIterator;
+private final RocksDB rocksDB;
+private final AtomicBoolean status = new AtomicBoolean(true);
+
+public ResourceCleaner(RocksIterator rocksIterator, RocksDB rocksDB) {
+  this.rocksIterator = rocksIterator;
+  this.rocksDB = rocksDB;
+}
+
+@Override
+public void run() {
+  if (status.compareAndSet(true, false)) {
+rocksDB.getIteratorTracker().forEach(ref -> {

Review Comment:
   ```java
   void notifyIteratorClosed(RocksIterator rocksIterator) {
     iteratorTracker.removeIf(ref -> {
       RocksDBIterator rocksDBIterator = ref.get();
       return rocksDBIterator != null &&
         rocksIterator.equals(rocksDBIterator.getRocksIterator());
     });
   }
   ```
   
   How about this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-45760][SQL] Add With expression to avoid duplicating expressions [spark]

2023-11-01 Thread via GitHub


cloud-fan opened a new pull request, #43623:
URL: https://github.com/apache/spark/pull/43623

   
   
   ### What changes were proposed in this pull request?
   
   Sometimes we need to duplicate expressions when rewriting the plan. That's OK for a small query, as codegen has common-subexpression elimination (CSE) to avoid evaluating the same expression twice. However, when the query is big, duplicating expressions can lead to a very big expression tree and make Catalyst rules very slow, or even OOM when updating a leaf node (which needs to copy all tree nodes).
   
   This PR introduces a new expression to do expression-level CTE: it adds a 
Project to pre-evaluate the common expressions, so that they appear only once 
on the query plan tree, and are evaluated only once. `NullIf` now uses this new 
expression to avoid duplicating the `left` child expression.
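   
   A rough way to observe the effect (the query below is made up and the plan shape is indicative only, not taken from this PR):
   
   ```scala
   val df = spark.sql("SELECT nullif(sqrt(id), 0) FROM range(10)")
   // Before: the optimized plan duplicates sqrt(id) inside
   //   CASE WHEN sqrt(id) = 0 THEN NULL ELSE sqrt(id) END.
   // With the new With expression, sqrt(id) can be pre-evaluated once in a
   // Project and referenced twice.
   df.explain(true)
   ```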
   
   ### Why are the changes needed?
   
   Make Catalyst more efficient.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   new test suite
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45760][SQL] Add With expression to avoid duplicating expressions [spark]

2023-11-01 Thread via GitHub


cloud-fan commented on PR #43623:
URL: https://github.com/apache/spark/pull/43623#issuecomment-1789224288

   cc @viirya @wangyum 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [hotfix] [docs] Fix typo [spark]

2023-11-01 Thread via GitHub


YuanHanzhong opened a new pull request, #43622:
URL: https://github.com/apache/spark/pull/43622

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45327][BUILD] Upgrade zstd-jni to 1.5.5-7 [spark]

2023-11-01 Thread via GitHub


LuciferYang closed pull request #43113: [SPARK-45327][BUILD] Upgrade zstd-jni 
to 1.5.5-7
URL: https://github.com/apache/spark/pull/43113


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45533][CORE] Use j.l.r.Cleaner instead of finalize for RocksDBIterator/LevelDBIterator [spark]

2023-11-01 Thread via GitHub


LuciferYang commented on code in PR #43502:
URL: https://github.com/apache/spark/pull/43502#discussion_r1378970124


##
common/kvstore/src/main/java/org/apache/spark/util/kvstore/RocksDBIterator.java:
##
@@ -272,4 +283,37 @@ static int compare(byte[] a, byte[] b) {
 return a.length - b.length;
   }
 
+  static class ResourceCleaner implements Runnable {
+
+private final RocksIterator rocksIterator;
+private final RocksDB rocksDB;
+private final AtomicBoolean status = new AtomicBoolean(true);
+
+public ResourceCleaner(RocksIterator rocksIterator, RocksDB rocksDB) {
+  this.rocksIterator = rocksIterator;
+  this.rocksDB = rocksDB;
+}
+
+@Override
+public void run() {
+  if (status.compareAndSet(true, false)) {
+rocksDB.getIteratorTracker().forEach(ref -> {

Review Comment:
   hmm... Why not just change the input arg type of the 
`rocksDB.notifyIteratorClosed` method...
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45743][BUILD] Upgrade dropwizard metrics 4.2.21 [spark]

2023-11-01 Thread via GitHub


LuciferYang commented on PR #43608:
URL: https://github.com/apache/spark/pull/43608#issuecomment-1789177231

   Thanks @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45533][CORE] Use j.l.r.Cleaner instead of finalize for RocksDBIterator/LevelDBIterator [spark]

2023-11-01 Thread via GitHub


zhaomin1423 commented on code in PR #43502:
URL: https://github.com/apache/spark/pull/43502#discussion_r1378942557


##
common/kvstore/src/main/java/org/apache/spark/util/kvstore/RocksDBIterator.java:
##
@@ -270,20 +280,30 @@ static int compare(byte[] a, byte[] b) {
 return a.length - b.length;
   }
 
-  private record ResourceCleaner(RocksIterator rocksIterator, RocksDB rocksDB) 
implements Runnable {
+  private static class ResourceCleaner implements Runnable {
+
+private final RocksIterator rocksIterator;
+private final RocksDB rocksDB;
+private final AtomicBoolean status = new AtomicBoolean(true);
+
+public ResourceCleaner(RocksIterator rocksIterator, RocksDB rocksDB) {
+  this.rocksIterator = rocksIterator;
+  this.rocksDB = rocksDB;
+}
 
 @Override
 public void run() {
-  rocksDB.getIteratorTracker().removeIf(ref -> {
-RocksDBIterator rocksDBIterator = ref.get();
-return rocksDBIterator != null && 
rocksIterator.equals(rocksDBIterator.it);
-  });
-  synchronized (rocksDB.getRocksDB()) {
-org.rocksdb.RocksDB _db = rocksDB.getRocksDB().get();
-if (_db != null) {
-  rocksIterator.close();
-}
+  if (status.compareAndSet(true, false)) {
+rocksDB.getIteratorTracker().removeIf(ref -> {

Review Comment:
   Thanks, updated. LevelDB will be updated after the RocksDB part is complete.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45753][CORE] Support `spark.deploy.driverIdPattern` [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun commented on PR #43615:
URL: https://github.com/apache/spark/pull/43615#issuecomment-1789141904

   Thank you! Merged to master for Apache Spark 4.0.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45753][CORE] Support `spark.deploy.driverIdPattern` [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun closed pull request #43615: [SPARK-45753][CORE] Support 
`spark.deploy.driverIdPattern`
URL: https://github.com/apache/spark/pull/43615





Re: [PR] [SPARK-45743][BUILD] Upgrade dropwizard metrics 4.2.21 [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun closed pull request #43608: [SPARK-45743][BUILD] Upgrade 
dropwizard metrics 4.2.21
URL: https://github.com/apache/spark/pull/43608





Re: [PR] [SPARK-45688][SPARK-45693][CORE] Clean up the deprecated API usage related to `MapOps` & Fix `method += in trait Growable is deprecated` [spark]

2023-11-01 Thread via GitHub


dongjoon-hyun commented on code in PR #43578:
URL: https://github.com/apache/spark/pull/43578#discussion_r1378932756


##
core/src/test/scala/org/apache/spark/util/TimeStampedHashMapSuite.scala:
##
@@ -118,18 +118,18 @@ class TimeStampedHashMapSuite extends SparkFunSuite {
     assert(testMap2.iterator.toSeq.head === (("k1", "v1")))
 
     // +
-    val testMap3 = testMap2 + (("k0", "v0"))
+    val testMap3 = testMap2 ++ Map("k0" -> "v0")
     assert(testMap3.size === 2)
     assert(testMap3.get("k1").isDefined)
     assert(testMap3("k1") === "v1")
     assert(testMap3.get("k0").isDefined)
     assert(testMap3("k0") === "v0")
 
     // -
-    val testMap4 = testMap3 - "k0"
-    assert(testMap4.size === 1)
-    assert(testMap4.get("k1").isDefined)
-    assert(testMap4("k1") === "v1")
+    testMap3.remove("k0")

Review Comment:
   +1 for (1) `direct deletion`.
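
   For readers following the migration, here is a minimal standalone Scala 2.13 sketch of the two replacements applied in this diff; the map contents are made up for illustration:

   import scala.collection.mutable

   object MapOpsMigration {
     def main(args: Array[String]): Unit = {
       val m = mutable.Map("k1" -> "v1")

       // Before (deprecated on mutable.Map since Scala 2.13):
       //   val m2 = m + (("k0", "v0"))   // MapOps.+ emits a deprecation warning
       //   val m3 = m2 - "k0"            // so does MapOps.-

       // `++` builds a new map without the warning.
       val m2 = m ++ Map("k0" -> "v0")
       assert(m2.size == 2 && m2("k0") == "v0")

       // `remove` is the "direct deletion" endorsed above: it mutates in
       // place and returns the removed value if the key was present.
       val removed = m2.remove("k0")
       assert(removed.contains("v0") && m2.size == 1)
     }
   }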






Re: [PR] [SPARK-45527][core] Use fraction to do the resource calculation [spark]

2023-11-01 Thread via GitHub


tgravescs commented on code in PR #43494:
URL: https://github.com/apache/spark/pull/43494#discussion_r1378869370


##
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala:
##
@@ -273,7 +273,8 @@ private[spark] class CoarseGrainedExecutorBackend(
   override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer): Unit = {
     val resources = taskResources.getOrDefault(taskId, Map.empty[String, ResourceInformation])

Review Comment:
   No






Re: [PR] [SPARK-45527][core] Use fraction to do the resource calculation [spark]

2023-11-01 Thread via GitHub


tgravescs commented on code in PR #43494:
URL: https://github.com/apache/spark/pull/43494#discussion_r1378859293


##
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##
@@ -165,15 +165,17 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
     }
 
     override def receive: PartialFunction[Any, Unit] = {
-      case StatusUpdate(executorId, taskId, state, data, taskCpus, resources) =>
+      case StatusUpdate(executorId, taskId, state, data, taskCpus, resources, resourcesAmounts) =>

Review Comment:
   No, it doesn't make sense to leave something unused here; we should be as efficient as we can about storage and about what we pass around.






[PR] Keep the taskBinary in Stage to avoid deserializing it multiple times, and send binary to executor instead of task. [spark]

2023-11-01 Thread via GitHub


hui730 opened a new pull request, #43621:
URL: https://github.com/apache/spark/pull/43621

   
   
   ### What changes were proposed in this pull request?
   
   This change is a performance optimization: it keeps the taskBinary in the Stage to avoid deserializing it multiple times, and sends the binary to each executor once instead of attaching it to every task. The specific modifications are:
   a. Add stageID info to TaskDescription.
   b. Add taskBinary info to TaskSet, and remove it from ShuffleMapTask and ResultTask.
   c. Add a binary getter/setter to Task.
   d. Store the binary broadcast when submitTasks is called.
   e. Add a new Netty event message, LaunchBinary: before an executor runs a task for a stage, the binary is sent to that executor first.
   
   
   ### Why are the changes needed?
   
   The taskBinary is identical for every task in the same stage. Currently each task carries its own copy of the taskBinary broadcast to the executor; this change sends the taskBinary once to each executor that executes the corresponding stage, reducing duplicated broadcast traffic and avoiding deserializing the binary multiple times.
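
   As a rough illustration only, the executor-side bookkeeping could look like the Scala sketch below. Every name here (StageBinaryCache, LaunchBinary, binaryFor) is hypothetical and stands in for the actual classes and RPC messages in the patch:

   import java.nio.ByteBuffer
   import java.util.concurrent.ConcurrentHashMap

   // Hypothetical RPC message: the driver pushes a stage's task binary to an
   // executor once, before the first task of that stage is launched there.
   final case class LaunchBinary(stageId: Int, binary: ByteBuffer)

   final class StageBinaryCache {
     // One task binary per stage, shared by every task of that stage that
     // runs on this executor.
     private val byStage = new ConcurrentHashMap[Int, ByteBuffer]()

     // Store the binary the first time it arrives; later duplicates are no-ops.
     def onLaunchBinary(msg: LaunchBinary): Unit =
       byStage.putIfAbsent(msg.stageId, msg.binary)

     // Tasks look the binary up by stageId instead of each carrying a copy.
     def binaryFor(stageId: Int): Option[ByteBuffer] =
       Option(byStage.get(stageId))

     // Drop the entry when the stage completes so the cache does not leak.
     def stageCompleted(stageId: Int): Unit =
       byStage.remove(stageId)
   }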
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Because this change touches Spark Core and also involves some parameter changes in unit tests, I ran the full unit test suite across all modules, based on the master branch.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No




