Re: [PR] [SPARK-45292][SQL][HIVE] Remove Guava from shared classes from IsolatedClientLoader [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #42599:
URL: https://github.com/apache/spark/pull/42599#issuecomment-1947911644

   Is there any update for Apache Hive 2.3.10, @pan3793?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47072][SQL] Fix supported interval formats in error messages [spark]

2024-02-15 Thread via GitHub


MaxGekk commented on PR #45127:
URL: https://github.com/apache/spark/pull/45127#issuecomment-1947910365

   @AngersZh @cloud-fan @HyukjinKwon Could you review this bug fix, please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45720] Upgrade AWS SDK to v2 for Spark Kinesis connector module [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on code in PR #44211:
URL: https://github.com/apache/spark/pull/44211#discussion_r1492052014


##
connector/kinesis-asl-assembly/pom.xml:
##
@@ -62,12 +62,18 @@
     <dependency>
       <groupId>com.google.protobuf</groupId>
       <artifactId>protobuf-java</artifactId>
-      <version>2.6.1</version>
-    </dependency>
+      <scope>compile</scope>
+    </dependency>
+
+    <dependency>
+      <groupId>com.google.guava</groupId>
+      <artifactId>guava</artifactId>
+      <version>${aws.kinesis.client.guava.version}</version>
+      <scope>compile</scope>

Review Comment:
   Sorry for being late.
   
   1. Is this only for Kinesis or for all AWS SDK v2?
   2. Instead of the following, can we use the latest 33.0.0-jre like #44795?
   ```
   <aws.kinesis.client.guava.version>32.1.1-jre</aws.kinesis.client.guava.version>
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45720] Upgrade AWS SDK to v2 for Spark Kinesis connector module [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on code in PR #44211:
URL: https://github.com/apache/spark/pull/44211#discussion_r1492043709


##
connector/kinesis-asl-assembly/pom.xml:
##
@@ -62,12 +62,18 @@
     <dependency>
       <groupId>com.google.protobuf</groupId>
       <artifactId>protobuf-java</artifactId>
-      <version>2.6.1</version>
-    </dependency>
+      <scope>compile</scope>

Review Comment:
   For the record, I removed this shaded `protobuf-java` completely in Apache 
Spark 4.0.0 independently.
   -  #45096



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47040][CONNECT][FOLLOWUP] Improve usability of the start-connect-server-script.sh [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on code in PR #45117:
URL: https://github.com/apache/spark/pull/45117#discussion_r1492039371


##
sbin/start-connect-server.sh:
##
@@ -40,8 +40,9 @@ fi
 
 if [ "$1" == "--wait" ]; then
   shift
-  exec "${SPARK_HOME}"/bin/spark-submit --class $CLASS 1 --name "Spark Connect Server" "$@"
-else
-  exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark Connect server" "$@"
+  export SPARK_NO_DAEMONIZE=1
 fi
+  exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark Connect server" "$@"

Review Comment:
   While checking once more, I realized that we have extra leading spaces in this line. Could you double-check the indentation?
   
   ![Screenshot 2024-02-15 at 23 10 56](https://github.com/apache/spark/assets/9700541/538e7354-9b61-4f37-b04d-b88dbb0d7d1e)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

2024-02-15 Thread via GitHub


peter-toth commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1947853367

   > `SQLOpenHashSet` also handles null differently. Not sure if `OpenHashSet` 
already covers it.
   
   Yes, you are right. Probably the `NaN` special handling can be removed, though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

2024-02-15 Thread via GitHub


cloud-fan commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1947841579

   `SQLOpenHashSet` also handles null differently. Not sure if `OpenHashSet` 
already covers it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47071][SQL] Inline With expression if it contains special expression [spark]

2024-02-15 Thread via GitHub


cloud-fan commented on code in PR #45134:
URL: https://github.com/apache/spark/pull/45134#discussion_r1492014338


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpression.scala:
##
@@ -41,10 +41,15 @@ object RewriteWithExpression extends Rule[LogicalPlan] {
       rewriteWithExprAndInputPlans(expr, inputPlans)
     }
     newPlan = newPlan.withNewChildren(inputPlans.toIndexedSeq)
-    if (p.output == newPlan.output) {
-      newPlan
-    } else {
+    // Since we add extra Projects with extra columns to pre-evaluate the common expressions,
+    // the current operator may have extra columns if it inherits the output columns from its
+    // child, and we need to project away the extra columns to keep the plan schema unchanged.

Review Comment:
   Not related to this PR, but I'm refining the code here since I'm touching this rule.
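   
   The hunk above is cut off before the replacement code; here is a minimal, self-contained Catalyst sketch of what the comment describes (the column name `_common_expr_0` is assumed for illustration; this is not the rule's actual code):
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.AttributeReference
   import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan, Project}
   import org.apache.spark.sql.types.IntegerType
   
   object SchemaPreservingWrap {
     def main(args: Array[String]): Unit = {
       val a = AttributeReference("a", IntegerType)()
       val extra = AttributeReference("_common_expr_0", IntegerType)()
       val p: LogicalPlan = LocalRelation(a)              // original operator output: [a]
       val newPlan: LogicalPlan = LocalRelation(a, extra) // rewritten child exposes an extra column
       // Project away the extra column so the plan schema stays unchanged.
       val result = if (p.output == newPlan.output) newPlan else Project(p.output, newPlan)
       assert(result.output == p.output)
     }
   }
   ```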



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47071][SQL] Inline With expression if it contains special expression [spark]

2024-02-15 Thread via GitHub


cloud-fan commented on PR #45134:
URL: https://github.com/apache/spark/pull/45134#issuecomment-1947836202

   cc @viirya @ulysses-you 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

2024-02-15 Thread via GitHub


peter-toth commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1947835261

   If we go in this direction and change `OpenHashSet`, do we still need `SQLOpenHashSet`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47071][SQL] Inline With expression if it contains special expression [spark]

2024-02-15 Thread via GitHub


cloud-fan opened a new pull request, #45134:
URL: https://github.com/apache/spark/pull/45134

   
   
   ### What changes were proposed in this pull request?
   
   This is a bug fix for the With expression. If the common expression contains a special expression, such as an aggregate expression, we cannot pull it out and put it in a Project. We have to inline it.
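   
   As a hypothetical illustration (not taken from this PR or its tests), here is a minimal Scala query shape with a shared aggregate sub-expression; because the shared piece is an aggregate, it must stay inside the Aggregate operator rather than being pulled into a separate Project:
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.expr
   
   object WithInlineIllustration {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[1]").getOrCreate()
       import spark.implicits._
       // Both branches of the CASE reuse sum(a). The shared sub-expression is an
       // aggregate, so it cannot be pre-evaluated in a Project below the Aggregate.
       Seq((1, "x"), (2, "x")).toDF("a", "g")
         .groupBy("g")
         .agg(expr("CASE WHEN sum(a) > 2 THEN sum(a) ELSE -sum(a) END").as("s"))
         .show()
       spark.stop()
     }
   }
   ```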
   
   ### Why are the changes needed?
   
   bug fix.
   
   ### Does this PR introduce _any_ user-facing change?
   
   A query that previously failed can run after this fix, but the bug has not been released yet.
   
   ### How was this patch tested?
   
   new tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   no


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47054][PYTHON][TESTS] Remove pinned version of torch for Python 3.12 support [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun closed pull request #45113: [SPARK-47054][PYTHON][TESTS] Remove 
pinned version of torch for Python 3.12 support
URL: https://github.com/apache/spark/pull/45113


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47054][PYTHON][TESTS] Remove pinned version of torch for Python 3.12 support [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45113:
URL: https://github.com/apache/spark/pull/45113#issuecomment-1947828146

   The `yarn` and `connect` module failures are known flaky ones.
   https://github.com/apache/spark/assets/9700541/ef345b48-d02b-4310-80bd-aa0169e040db
   
   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

2024-02-15 Thread via GitHub


nchammas commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1947785597

   @revans2 - Just to confirm, have you rerun your tests against this branch 
(the ones that exposed this bug) and do they pass now?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

2024-02-15 Thread via GitHub


nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1491975003


##
core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala:
##
@@ -249,4 +249,34 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers {
     map(null) = null
     assert(map.get(null) === Some(null))
   }
+
+  test("SPARK-45599: 0.0 and -0.0 should count distinctly") {
+    // Exactly these elements provided in roughly this order trigger a condition where lookups of
+    // 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
+    // and inconsistently if `==` is used to check for key equality.

Review Comment:
   I tweaked the test name. Is that what you had in mind?
   
   This comment explains why we need exactly the following elements to trigger 
the 0.0/-0.0 miscount. It doesn't always happen (which is part of what kept 
this bug hidden for so long).
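   
   For readers following along, a minimal Scala sketch (not part of this PR) of the JVM semantics that make `==` and `equals` disagree on these values:
   
   ```scala
   object FloatEqualitySemantics {
     def main(args: Array[String]): Unit = {
       // Primitive comparison follows IEEE 754 semantics.
       println(0.0 == -0.0)              // true
       println(Double.NaN == Double.NaN) // false
       // Boxed object equality compares the underlying bit patterns instead.
       println(java.lang.Double.valueOf(0.0).equals(java.lang.Double.valueOf(-0.0)))              // false
       println(java.lang.Double.valueOf(Double.NaN).equals(java.lang.Double.valueOf(Double.NaN))) // true
     }
   }
   ```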



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46972][SQL] Fix asymmetrical replacement for char/varchar in V2SessionCatalog.createTable [spark]

2024-02-15 Thread via GitHub


cloud-fan commented on code in PR #45019:
URL: https://github.com/apache/spark/pull/45019#discussion_r1491966372


##
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##
@@ -1735,6 +1735,15 @@ class DataSourceV2SQLSuiteV1Filter
     }
   }
 
+  test("SPARK-46972: asymmetrical replacement for char/varchar in V2SessionCatalog.createTable") {
+    // unset this config to use the default v2 session catalog.
+    spark.conf.unset(V2_SESSION_CATALOG_IMPLEMENTATION.key)
+    withTable("t") {
+      sql(s"CREATE TABLE t(c char(1), v varchar(2)) USING $v2Source")
+      assert(!spark.table("t").isEmpty)

Review Comment:
   I think we should not trigger an actual scan. The data schema (int) is 
incompatible with char, and the scan behavior is kind of undefined. It should 
be good enough to run the CREATE TABLE command in this test case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

2024-02-15 Thread via GitHub


cloud-fan commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1491931994


##
core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala:
##
@@ -249,4 +249,34 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers {
     map(null) = null
     assert(map.get(null) === Some(null))
   }
+
+  test("SPARK-45599: 0.0 and -0.0 should count distinctly") {
+    // Exactly these elements provided in roughly this order trigger a condition where lookups of
+    // 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
+    // and inconsistently if `==` is used to check for key equality.

Review Comment:
   Shall we mention the NaN behavior as well? All NaN values are treated as the same value.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]

2024-02-15 Thread via GitHub


HyukjinKwon commented on code in PR #45129:
URL: https://github.com/apache/spark/pull/45129#discussion_r1491925413


##
python/pyspark/sql/profiler.py:
##
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
         with self._lock:
             value = self._accumulator.value
         return value if value is not None else {}
+
+
+class Profile:
+    """User-facing profile API. This instance can be accessed by
+    :attr:`spark.profile`.
+
+    .. versionadded: 4.0.0
+    """
+
+    def __init__(self, sparkSession: "SparkSession"):
+        self.sparkSession = sparkSession
+
+    def show(self, *, type: Optional[str] = None, id: Optional[int] = None) -> None:
+        """
+        Show the profile results.
+
+        .. versionadded:: 4.0.0
+
+        Parameters
+        ----------
+        type : str, optional
+            The profiler type, which can be either "perf" or "memory".
+        id : int, optional
+            A UDF ID to be shown. If not specified, all the results will be shown.
+        """
+        if type == "memory":
+            self.sparkSession.showMemoryProfiles(id)

Review Comment:
    



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]

2024-02-15 Thread via GitHub


HyukjinKwon commented on code in PR #45129:
URL: https://github.com/apache/spark/pull/45129#discussion_r1491924933


##
python/pyspark/sql/profiler.py:
##
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
         with self._lock:
             value = self._accumulator.value
         return value if value is not None else {}
+
+
+class Profile:
+    """User-facing profile API. This instance can be accessed by

Review Comment:
   We should probably also fix the docs at `python/docs/source/reference`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch [spark]

2024-02-15 Thread via GitHub


HyukjinKwon closed pull request #45132: [SPARK-47068][PYTHON][TESTS] Recover -1 
and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch
URL: https://github.com/apache/spark/pull/45132


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch [spark]

2024-02-15 Thread via GitHub


HyukjinKwon commented on PR #45132:
URL: https://github.com/apache/spark/pull/45132#issuecomment-1947709204

   Merged to master, branch-3.5 and branch-3.4.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46078][PYTHON][TESTS] Upgrade `pytorch` for Python 3.12 [spark]

2024-02-15 Thread via GitHub


HyukjinKwon closed pull request #45119: [SPARK-46078][PYTHON][TESTS] Upgrade 
`pytorch` for Python 3.12
URL: https://github.com/apache/spark/pull/45119


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46078][PYTHON][TESTS] Upgrade `pytorch` for Python 3.12 [spark]

2024-02-15 Thread via GitHub


HyukjinKwon commented on PR #45119:
URL: https://github.com/apache/spark/pull/45119#issuecomment-1947708773

   This won't affect the main build. I will check it through the scheduled 
build.
   
   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47055][PYTHON] Upgrade MyPy 1.8.0 [spark]

2024-02-15 Thread via GitHub


HyukjinKwon commented on PR #45115:
URL: https://github.com/apache/spark/pull/45115#issuecomment-1947706285

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47055][PYTHON] Upgrade MyPy 1.8.0 [spark]

2024-02-15 Thread via GitHub


HyukjinKwon closed pull request #45115: [SPARK-47055][PYTHON] Upgrade MyPy 1.8.0
URL: https://github.com/apache/spark/pull/45115


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-38098][PYTHON] Add support for ArrayType of nested StructType to arrow-based conversion [spark]

2024-02-15 Thread via GitHub


itholic commented on code in PR #35391:
URL: https://github.com/apache/spark/pull/35391#discussion_r1491894555


##
python/pyspark/sql/pandas/types.py:
##
@@ -86,8 +86,15 @@ def to_arrow_type(dt: DataType) -> "pa.DataType":
     elif type(dt) == DayTimeIntervalType:
         arrow_type = pa.duration("us")
     elif type(dt) == ArrayType:
-        if type(dt.elementType) in [StructType, TimestampType]:
+        if type(dt.elementType) == TimestampType:
             raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
+        elif type(dt.elementType) == StructType:
+            if LooseVersion(pa.__version__) < LooseVersion("2.0.0"):
+                raise TypeError(
+                    "Array of StructType is only supported with pyarrow 2.0.0 and above"

Review Comment:
   Hi, @LucaCanali I think I've found a case where Array of StructType doesn't 
work properly:
   
   **In:**
   ```python
   df = spark.createDataFrame(
       [
           ("a", [("b", False), ("c", True)]),
       ]
   ).toDF("c1", "c2")
   df.toPandas()
   ```
   
   **Out:**
   ```python
 c1 c2
   0  a  [{'_1': 'b', '_2': False}, {'_1': 'c', '_2': T...
   ```
   
   **Expected:**
   ```python
 c1   c2
   0  a  [(b, False), (c, True)]
   ```
   
   
   I roughly suspect that this is an internal table conversion issue on the PyArrow side, but I'm not very sure, so could you check this out when you find some time?
   
   Thanks in advance!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-XXX] Fix invalid aggregation after in-subquery rewrite [spark]

2024-02-15 Thread via GitHub


anton5798 opened a new pull request, #45133:
URL: https://github.com/apache/spark/pull/45133

   ### What changes were proposed in this pull request?
   
   todo
   
   ### Why are the changes needed?
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Query tests
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]

2024-02-15 Thread via GitHub


ueshin commented on code in PR #45129:
URL: https://github.com/apache/spark/pull/45129#discussion_r1491866174


##
python/pyspark/sql/profiler.py:
##
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
         with self._lock:
             value = self._accumulator.value
         return value if value is not None else {}
+
+
+class Profile:
+    """User-facing profile API. This instance can be accessed by
+    :attr:`spark.profile`.
+
+    .. versionadded: 4.0.0
+    """
+
+    def __init__(self, sparkSession: "SparkSession"):
+        self.sparkSession = sparkSession
+
+    def show(self, *, type: Optional[str] = None, id: Optional[int] = None) -> None:
+        """
+        Show the profile results.
+
+        .. versionadded:: 4.0.0
+
+        Parameters
+        ----------
+        type : str, optional
+            The profiler type, which can be either "perf" or "memory".
+        id : int, optional
+            A UDF ID to be shown. If not specified, all the results will be shown.
+        """
+        if type == "memory":
+            self.sparkSession.showMemoryProfiles(id)

Review Comment:
   Shall we remove the old APIs? I think the new APIs are enough to have.



##
python/pyspark/sql/profiler.py:
##
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
         with self._lock:
             value = self._accumulator.value
         return value if value is not None else {}
+
+
+class Profile:
+    """User-facing profile API. This instance can be accessed by
+    :attr:`spark.profile`.
+
+    .. versionadded: 4.0.0
+    """
+
+    def __init__(self, sparkSession: "SparkSession"):
+        self.sparkSession = sparkSession

Review Comment:
   It should take `ProfilerCollector`?



##
python/pyspark/sql/profiler.py:
##
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
         with self._lock:
             value = self._accumulator.value
         return value if value is not None else {}
+
+
+class Profile:
+    """User-facing profile API. This instance can be accessed by
+    :attr:`spark.profile`.
+
+    .. versionadded: 4.0.0
+    """
+
+    def __init__(self, sparkSession: "SparkSession"):
+        self.sparkSession = sparkSession
+
+    def show(self, *, type: Optional[str] = None, id: Optional[int] = None) -> None:
+        """
+        Show the profile results.
+
+        .. versionadded:: 4.0.0
+
+        Parameters
+        ----------
+        type : str, optional
+            The profiler type, which can be either "perf" or "memory".
+        id : int, optional
+            A UDF ID to be shown. If not specified, all the results will be shown.
+        """
+        if type == "memory":
+            self.sparkSession.showMemoryProfiles(id)
+        elif type == "perf" or type is None:
+            self.sparkSession.showPerfProfiles(id)
+            if type is None:  # Show both perf and memory profiles
+                self.sparkSession.showMemoryProfiles(id)
+        else:
+            raise PySparkValueError(
+                error_class="VALUE_NOT_ALLOWED",
+                message_parameters={
+                    "arg_name": "type",
+                    "allowed_values": str(["perf", "memory"]),
+                },
+            )
+
+    def dump(self, path: str, *, type: Optional[str] = None, id: Optional[int] = None) -> None:

Review Comment:
   ditto.
   ```suggestion
       def dump(self, path: str, id: Optional[int] = None, *, type: Optional[str] = None) -> None:
   ```



##
python/pyspark/sql/profiler.py:
##
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
         with self._lock:
             value = self._accumulator.value
         return value if value is not None else {}
+
+
+class Profile:
+    """User-facing profile API. This instance can be accessed by
+    :attr:`spark.profile`.
+
+    .. versionadded: 4.0.0
+    """
+
+    def __init__(self, sparkSession: "SparkSession"):
+        self.sparkSession = sparkSession
+
+    def show(self, *, type: Optional[str] = None, id: Optional[int] = None) -> None:

Review Comment:
   I prefer:
   ```suggestion
       def show(self, id: Optional[int] = None, *, type: Optional[str] = None) -> None:
   ```
   
   ```py
   spark.profile.show()  # show all profile results
   spark.profile.show(1)  # show the profile results for ID = 1
   spark.profile.show(1, type="memory")  # show the memory profile results for ID = 1
   spark.profile.show(type="memory")  # show all memory profile results
   ```



##
python/pyspark/sql/session.py:
##
@@ -906,6 +907,12 @@ def dataSource(self) -> "DataSourceRegistration":
 
         return DataSourceRegistration(self)
 
+    @property
+    def profile(self) -> "Profile":

Review Comment:
   Need this for connect, too?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

Re: [PR] [SPARK-46962][SS][PYTHON] Add interface for python streaming data source API and implement python worker to run python streaming data source [spark]

2024-02-15 Thread via GitHub


HeartSaVioR commented on code in PR #45023:
URL: https://github.com/apache/spark/pull/45023#discussion_r1491869371


##
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonStreamingSourceRunner.scala:
##
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.spark.sql.execution.python
+
+import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream, DataOutputStream}
+
+import scala.collection.mutable.ArrayBuffer
+import scala.jdk.CollectionConverters._
+
+import org.apache.spark.SparkEnv
+import org.apache.spark.api.python.{PythonFunction, PythonWorker, PythonWorkerFactory, PythonWorkerUtils, SpecialLengths}
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config.BUFFER_SIZE
+import org.apache.spark.internal.config.Python.{PYTHON_AUTH_SOCKET_TIMEOUT, PYTHON_USE_DAEMON}
+import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryExecutionErrors}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+
+object PythonStreamingSourceRunner {
+  val initialOffsetFuncId = 884
+  val latestOffsetFuncId = 885
+  val partitionsFuncId = 886
+  val commitFuncId = 887
+}
+
+/**
+ * This class is a proxy to invoke methods in a Python DataSourceStreamReader from the JVM.
+ * A runner spawns a Python worker process. In the main function, it sets up communication
+ * between the JVM and the Python process through a socket and creates a
+ * DataSourceStreamReader instance. In an infinite loop, the Python worker process polls
+ * information (function name and parameters) from the socket, invokes the corresponding
+ * method of the StreamReader, and sends the return value to the JVM.
+ */
+class PythonStreamingSourceRunner(
+    func: PythonFunction,
+    outputSchema: StructType) extends Logging {
+  val workerModule = "pyspark.sql.streaming.python_streaming_source_runner"
+
+  private val conf = SparkEnv.get.conf
+  protected val bufferSize: Int = conf.get(BUFFER_SIZE)
+  protected val authSocketTimeout = conf.get(PYTHON_AUTH_SOCKET_TIMEOUT)
+
+  private val envVars: java.util.Map[String, String] = func.envVars
+  private val pythonExec: String = func.pythonExec
+  private var pythonWorker: Option[PythonWorker] = None
+  private var pythonWorkerFactory: Option[PythonWorkerFactory] = None
+  protected val pythonVer: String = func.pythonVer
+
+  private var dataOut: DataOutputStream = null
+  private var dataIn: DataInputStream = null
+
+  import PythonStreamingSourceRunner._
+
+  /**
+   * Initializes the Python worker for running the streaming source.
+   */
+  def init(): Unit = {
+    logInfo(s"Initializing Python runner pythonExec: $pythonExec")
+    val env = SparkEnv.get
+
+    val localdir = env.blockManager.diskBlockManager.localDirs.map(f => f.getPath()).mkString(",")
+    envVars.put("SPARK_LOCAL_DIRS", localdir)
+
+    envVars.put("SPARK_AUTH_SOCKET_TIMEOUT", authSocketTimeout.toString)
+    envVars.put("SPARK_BUFFER_SIZE", bufferSize.toString)
+
+    val prevConf = conf.get(PYTHON_USE_DAEMON)
+    conf.set(PYTHON_USE_DAEMON, false)
+    try {
+      val workerFactory =
+        new PythonWorkerFactory(pythonExec, workerModule, envVars.asScala.toMap)
+      val (worker: PythonWorker, _) = workerFactory.createSimpleWorker(blockingMode = true)
+      pythonWorker = Some(worker)
+      pythonWorkerFactory = Some(workerFactory)
+    } finally {
+      conf.set(PYTHON_USE_DAEMON, prevConf)
+    }
+
+    val stream = new BufferedOutputStream(
+      pythonWorker.get.channel.socket().getOutputStream, bufferSize)
+    dataOut = new DataOutputStream(stream)
+
+    PythonWorkerUtils.writePythonVersion(pythonVer, dataOut)
+
+    val pythonIncludes = func.pythonIncludes.asScala.toSet
+    PythonWorkerUtils.writeSparkFiles(Some("streaming_job"), pythonIncludes, dataOut)
+
+    // Send the user function to the Python process
+    PythonWorkerUtils.writePythonFunction(func, dataOut)
+
+    // Send the output schema
+    PythonWorkerUtils.writeUTF(outputSchema.json, dataOut)
+
+    // Send configurations
+    dataOut.writeInt(SQLConf.get.arrowMaxRecordsPerBatch)
+    dataOut.flush()
+
+    dataIn = new DataInputStream(
+      new 

Re: [PR] [SPARK-47055][PYTHON] Upgrade MyPy 1.8.0 [spark]

2024-02-15 Thread via GitHub


HyukjinKwon commented on code in PR #45115:
URL: https://github.com/apache/spark/pull/45115#discussion_r1491862078


##
python/pyspark/streaming/dstream.py:
##
@@ -849,7 +847,7 @@ def reduceFunc(t: datetime, a: Any, b: Any) -> Any:
             if a is None:
                 g = b.groupByKey(numPartitions).mapValues(lambda vs: (list(vs), None))
             else:
-                g = a.cogroup(b.partitionBy(cast(int, numPartitions)), numPartitions)
+                g = a.cogroup(b.partitionBy(int), numPartitions)

Review Comment:
   ```suggestion
                   g = a.cogroup(b.partitionBy(numPartitions), numPartitions)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47067][INFRA] Add Daily Apple Silicon Github Action Job (Java/Scala) [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun closed pull request #45128: [SPARK-47067][INFRA] Add Daily Apple 
Silicon Github Action Job (Java/Scala)
URL: https://github.com/apache/spark/pull/45128


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47067][INFRA] Add Daily Apple Silicon Github Action Job (Java/Scala) [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45128:
URL: https://github.com/apache/spark/pull/45128#issuecomment-1947600534

   Thank you, @viirya ! Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47067][INFRA] Add Daily Apple Silicon Github Action Job (Java/Scala) [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on code in PR #45128:
URL: https://github.com/apache/spark/pull/45128#discussion_r1491857406


##
.github/workflows/build_apple_silicon.yml:
##
@@ -0,0 +1,47 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+name: "Build / Apple Silicon (master, JDK 21)"
+
+on:
+  schedule:
+- cron: '0 20 * * *'
+
+jobs:
+  run-build:
+permissions:
+  packages: write
+name: Run
+uses: ./.github/workflows/build_and_test.yml
+runs-on: macos-14
+if: github.repository == 'apache/spark'
+with:
+  java: 21
+  branch: master
+  hadoop: hadoop3
+  envs: >-
+{
+  "SKIP_MIMA": "true",
+  "SKIP_UNIDOC": "true",
+  "DEDICATED_JVM_SBT_TESTS": 
"org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatV1Suite,org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatV2Suite,org.apache.spark.sql.execution.datasources.orc.OrcSourceV1Suite,org.apache.spark.sql.execution.datasources.orc.OrcSourceV2Suite"
+}
+  jobs: >-
+{
+  "build": "true"

Review Comment:
   Yes, I excluded them intentionally at this stage. They need more care. Only 
Java/Scala is the target.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch [spark]

2024-02-15 Thread via GitHub


HyukjinKwon opened a new pull request, #45132:
URL: https://github.com/apache/spark/pull/45132

   ### What changes were proposed in this pull request?
   
   This PR fixes the regression introduced by 
https://github.com/apache/spark/pull/36683.
   
   ```python
   import pandas as pd
   spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
   spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 0)
   spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", False)
   spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas()
   
   spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", -1)
   spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas()
   ```
   
   **Before**
   
   ```
   /.../spark/python/pyspark/sql/pandas/conversion.py:371: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and will not continue because automatic fallback with 'spark.sql.execution.arrow.pyspark.fallback.enabled' has been set to false.
     range() arg 3 must not be zero
     warn(msg)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/.../spark/python/pyspark/sql/session.py", line 1483, in createDataFrame
       return super(SparkSession, self).createDataFrame(  # type: ignore[call-overload]
     File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 351, in createDataFrame
       return self._create_from_pandas_with_arrow(data, schema, timezone)
     File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 633, in _create_from_pandas_with_arrow
       pdf_slices = (pdf.iloc[start : start + step] for start in range(0, len(pdf), step))
   ValueError: range() arg 3 must not be zero
   ```
   ```
   Empty DataFrame
   Columns: [a]
   Index: []
   ```
   
   **After**
   
   ```
a
   0  123
   ```
   
   ```
a
   0  123
   ```
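   
   For context, here is a minimal Scala sketch (a hypothetical helper, not the actual PySpark code) of the guard this fix restores: `maxRecordsPerBatch` values of 0 or -1 are documented to mean "unlimited", so slicing must fall back to a single batch instead of passing a zero step to a range:
   
   ```scala
   object ArrowBatchSlicing {
     // Values <= 0 mean "no limit": emit everything as one batch.
     def sliceIndices(numRows: Int, maxRecordsPerBatch: Int): Seq[Range] = {
       val step = if (maxRecordsPerBatch > 0) maxRecordsPerBatch else math.max(numRows, 1)
       (0 until numRows by step).map(start => start until math.min(start + step, numRows))
     }
   
     def main(args: Array[String]): Unit = {
       assert(sliceIndices(3, 0).length == 1)   // 0 means unlimited: one batch
       assert(sliceIndices(3, -1).length == 1)  // -1 means unlimited: one batch
       assert(sliceIndices(3, 2).length == 2)   // batches of 2 and 1 rows
     }
   }
   ```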
   
   ### Why are the changes needed?
   
   It fixes a regression. This is documented behaviour. It should be backported to branch-3.4 and branch-3.5.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it fixes a regression as described above.
   
   ### How was this patch tested?
   
   Unittest was added.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP] TransformWithState Batch Support [spark]

2024-02-15 Thread via GitHub


ericm-db closed pull request #44963: [WIP] TransformWithState Batch Support
URL: https://github.com/apache/spark/pull/44963


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP] POC to add TTL for ValueState [spark]

2024-02-15 Thread via GitHub


ericm-db closed pull request #45072: [WIP] POC to add TTL for ValueState
URL: https://github.com/apache/spark/pull/45072


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [wip] value state ttl poc [spark]

2024-02-15 Thread via GitHub


ericm-db closed pull request #45108: [wip] value state ttl poc
URL: https://github.com/apache/spark/pull/45108


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [WIP] Add Variant type info to PySpark [spark]

2024-02-15 Thread via GitHub


desmondcheongzx opened a new pull request, #45131:
URL: https://github.com/apache/spark/pull/45131

   ### What changes were proposed in this pull request?
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47067][INFRA] Add Daily Apple Silicon Github Action Job (Java/Scala) [spark]

2024-02-15 Thread via GitHub


viirya commented on code in PR #45128:
URL: https://github.com/apache/spark/pull/45128#discussion_r1491824408


##
.github/workflows/build_apple_silicon.yml:
##
@@ -0,0 +1,47 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+name: "Build / Apple Silicon (master, JDK 21)"
+
+on:
+  schedule:
+- cron: '0 20 * * *'
+
+jobs:
+  run-build:
+permissions:
+  packages: write
+name: Run
+uses: ./.github/workflows/build_and_test.yml
+runs-on: macos-14
+if: github.repository == 'apache/spark'
+with:
+  java: 21
+  branch: master
+  hadoop: hadoop3
+  envs: >-
+{
+  "SKIP_MIMA": "true",
+  "SKIP_UNIDOC": "true",
+  "DEDICATED_JVM_SBT_TESTS": 
"org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatV1Suite,org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatV2Suite,org.apache.spark.sql.execution.datasources.orc.OrcSourceV1Suite,org.apache.spark.sql.execution.datasources.orc.OrcSourceV2Suite"
+}
+  jobs: >-
+{
+  "build": "true"

Review Comment:
   We don't do other jobs like `tpcds-1g`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47067][INFRA] Add Daily Apple Silicon Github Action Job (Java/Scala) [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45128:
URL: https://github.com/apache/spark/pull/45128#issuecomment-1947552568

   Could you review this Daily GitHub Action job, `build_apple_silicon.yml`, 
@viirya ?
   
   It will work like `build_java21.yml` and we need to see the result after 
merging.
   - https://github.com/apache/spark/actions/workflows/build_java21.yml


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45669][CORE] Ensure the continuity of rolling log index [spark]

2024-02-15 Thread via GitHub


github-actions[bot] closed pull request #43534: [SPARK-45669][CORE] Ensure the 
continuity of rolling log index
URL: https://github.com/apache/spark/pull/43534


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] Encoder ttl poc [spark]

2024-02-15 Thread via GitHub


ericm-db opened a new pull request, #45130:
URL: https://github.com/apache/spark/pull/45130

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47067][INFRA] Add Daily Apple Silicon Github Action Job (Java/Scala) [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun opened a new pull request, #45128:
URL: https://github.com/apache/spark/pull/45128

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47066][INFRA] Add `Apple Silicon` Maven build test to GitHub Action CI [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun closed pull request #45126: [SPARK-47066][INFRA] Add `Apple 
Silicon` Maven build test to GitHub Action CI
URL: https://github.com/apache/spark/pull/45126


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47066][INFRA] Add `Apple Silicon` Maven build test to GitHub Action CI [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45126:
URL: https://github.com/apache/spark/pull/45126#issuecomment-1947433794

   Thank you so much, @viirya .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47066][INFRA] Add `Apple Silicon` Maven build test to GitHub Action CI [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45126:
URL: https://github.com/apache/spark/pull/45126#issuecomment-1947434491

   Merged to master for Apache Spark 4.0.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47066][INFRA] Add `Apple Silicon` Maven build test to GitHub Action CI [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45126:
URL: https://github.com/apache/spark/pull/45126#issuecomment-1947425316

   Can I get your approval, @viirya ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47066][INFRA] Add `Apple Silicon` Maven build test to GitHub Action CI [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45126:
URL: https://github.com/apache/spark/pull/45126#issuecomment-1947425044

   Yes, it does~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-46947][CORE] Delay memory manager initialization until Driver plugin is loaded [spark]

2024-02-15 Thread via GitHub


sunchao commented on PR #45052:
URL: https://github.com/apache/spark/pull/45052#issuecomment-1947389764

   Sure, thanks @mridulm in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46743] Count bug after constant folding [spark]

2024-02-15 Thread via GitHub


agubichev commented on PR #45125:
URL: https://github.com/apache/spark/pull/45125#issuecomment-1947380207

   @jchen5 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [WIP][SQL] Fix supported interval formats in error messages [spark]

2024-02-15 Thread via GitHub


MaxGekk opened a new pull request, #45127:
URL: https://github.com/apache/spark/pull/45127

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   
   ### How was this patch tested?
   By running the existing test suite:
   ```
   $ build/sbt "test:testOnly *IntervalUtilsSuite"
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46962][SS][PYTHON] Add interface for python streaming data source API and implement python worker to run python streaming data source [spark]

2024-02-15 Thread via GitHub


chaoqin-li1123 commented on code in PR #45023:
URL: https://github.com/apache/spark/pull/45023#discussion_r1491663766


##
python/pyspark/sql/datasource.py:
##
@@ -298,6 +320,117 @@ def read(self, partition: InputPartition) -> Iterator[Union[Tuple, Row]]:
 ...
 
 
+class DataSourceStreamReader(ABC):
+    """
+    A base class for streaming data source readers. Data source stream readers are responsible
+    for outputting data from a streaming data source.
+
+    .. versionadded: 4.0.0
+    """
+
+    def initialOffset(self) -> dict:
+        """
+        Return the initial offset of the streaming data source.
+        A new streaming query starts reading data from the initial offset.
+        If Spark is restarting an existing query, it will restart from the check-pointed offset
+        rather than the initial one.
+
+        Returns
+        -------
+        dict
+            A dict whose key and values are str type.
+        """
+        ...
+        raise PySparkNotImplementedError(
+            error_class="NOT_IMPLEMENTED",
+            message_parameters={"feature": "initialOffset"},
+        )
+
+    def latestOffset(self) -> dict:
+        """
+        Returns the most recent offset available.
+
+        Returns
+        -------
+        dict
+            A dict whose key and values are str type.
+        """
+        ...
+        raise PySparkNotImplementedError(
+            error_class="NOT_IMPLEMENTED",
+            message_parameters={"feature": "latestOffset"},
+        )
+
+    def partitions(self, start: dict, end: dict) -> Sequence[InputPartition]:
+        """
+        Returns a list of InputPartition given the start and end offsets. Each InputPartition
+        represents a data split that can be processed by one Spark task.
+
+        Parameters
+        ----------
+        start : dict
+            The start offset of the microbatch to plan partitioning.
+        end : dict
+            The end offset of the microbatch to plan partitioning.
+
+        Returns
+        -------
+        Sequence[InputPartition]
+            A sequence of partitions for this data source. Each partition value
+            must be an instance of `InputPartition` or a subclass of it.
+        """
+        ...
+        raise PySparkNotImplementedError(
+            error_class="NOT_IMPLEMENTED",
+            message_parameters={"feature": "partitions"},
+        )
+
+    @abstractmethod
+    def read(self, partition) -> Iterator[Union[Tuple, Row]]:
+        """
+        Generates data for a given partition and returns an iterator of tuples or rows.
+
+        This method is invoked once per partition to read the data. Implementing
+        this method is required for stream reader. You can initialize any
+        non-serializable resources required for reading data from the data source
+        within this method.
+        This method is static and stateless. You shouldn't access mutable class member
+        or keep in memory state between different invocations of read().
+
+        Parameters
+        ----------
+        partition : object
+            The partition to read. It must be one of the partition values returned by
+            ``partitions()``.
+
+        Returns
+        -------
+        Iterator[Tuple] or Iterator[Row]
+            An iterator of tuples or rows. Each tuple or row will be converted to a row
+            in the final DataFrame.
+        """
+        ...
+
+    def commit(self, end: dict):

Review Comment:
   Return type added.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47066][INFRA] Add Apple Silicon Maven build test to GitHub Action CI [spark]

2024-02-15 Thread via GitHub


viirya commented on PR #45126:
URL: https://github.com/apache/spark/pull/45126#issuecomment-1947327388

   Yea, the new pipeline looks like passed already.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47053][INFRA] Bump python libraries (pandas, pyarrow) in Docker image for release script [spark]

2024-02-15 Thread via GitHub


xinrong-meng commented on PR #45110:
URL: https://github.com/apache/spark/pull/45110#issuecomment-1947099775

   Thank you @HeartSaVioR !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47054][PYTHON][TESTS] Remove pinned version of torch for Python 3.12 support [spark]

2024-02-15 Thread via GitHub


xinrong-meng commented on PR #45113:
URL: https://github.com/apache/spark/pull/45113#issuecomment-1947085069

   Oh I saw you upgrading mypy, great! Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47054][PYTHON][TESTS] Remove pinned version of torch for Python 3.12 support [spark]

2024-02-15 Thread via GitHub


xinrong-meng commented on PR #45113:
URL: https://github.com/apache/spark/pull/45113#issuecomment-1947070089

   MyPy seems to complain about an error code `method-assign` in torchdynamo:
   
   ```
   /usr/local/lib/python3.9/dist-packages/torch/_dynamo/eval_frame.py:1: error: disable_error_code: Invalid error code(s): method-assign  [misc]
   ```
   
   Not sure how to fix it, though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47066][INFRA] Add Apple Silicon Maven build test to GitHub Action CI [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun opened a new pull request, #45126:
URL: https://github.com/apache/spark/pull/45126

   …
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46962][SS][PYTHON] Add interface for python streaming data source API and implement python worker to run python streaming data source [spark]

2024-02-15 Thread via GitHub


chaoqin-li1123 commented on code in PR #45023:
URL: https://github.com/apache/spark/pull/45023#discussion_r1491464370


##
sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala:
##
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.python
+
+import org.apache.spark.SparkException
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.IntegratedUDFTestUtils.{createUserDefinedPythonDataSource, shouldTestPandasUDFs}
+import org.apache.spark.sql.execution.datasources.v2.python.{PythonDataSourceV2, PythonMicroBatchStream, PythonStreamingSourceOffset}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+class PythonStreamingDataSourceSuite extends PythonDataSourceSuiteBase {
+
+  protected def simpleDataStreamReaderScript: String =
+"""
  |from pyspark.sql.datasource import DataSourceStreamReader, InputPartition
+  |
+  |class SimpleDataStreamReader(DataSourceStreamReader):
+  |def initialOffset(self):
+  |return {"offset": "0"}
+  |def latestOffset(self):
+  |return {"offset": "2"}
+  |def partitions(self, start: dict, end: dict):
+  |return [InputPartition(i) for i in range(int(start["offset"]))]
+  |def commit(self, end: dict):
+  |1 + 2
+  |def read(self, partition):
+  |yield (0, partition.value)
+  |yield (1, partition.value)
+  |yield (2, partition.value)
+  |""".stripMargin
+
+  protected def errorDataStreamReaderScript: String =
+"""
  |from pyspark.sql.datasource import DataSourceStreamReader, InputPartition
+  |
+  |class ErrorDataStreamReader(DataSourceStreamReader):
+  |def initialOffset(self):
+  |raise Exception("error reading initial offset")
+  |def latestOffset(self):
+  |raise Exception("error reading latest offset")
+  |def partitions(self, start: dict, end: dict):
+  |raise Exception("error planning partitions")
+  |def commit(self, end: dict):
+  |raise Exception("error committing offset")
+  |def read(self, partition):
+  |yield (0, partition.value)
+  |yield (1, partition.value)
+  |yield (2, partition.value)
+  |""".stripMargin
+
+  private val errorDataSourceName = "ErrorDataSource"
+
+  test("simple data stream source") {
+assume(shouldTestPandasUDFs)
+val dataSourceScript =
+  s"""
+ |from pyspark.sql.datasource import DataSource
+ |$simpleDataStreamReaderScript
+ |
+ |class $dataSourceName(DataSource):
+ |def streamReader(self, schema):
+ |return SimpleDataStreamReader()
+ |""".stripMargin
+val inputSchema = StructType.fromDDL("input BINARY")
+
+val dataSource = createUserDefinedPythonDataSource(dataSourceName, dataSourceScript)
+spark.dataSource.registerPython(dataSourceName, dataSource)
+val pythonDs = new PythonDataSourceV2
+pythonDs.setShortName("SimpleDataSource")
+val stream = new PythonMicroBatchStream(
+  pythonDs, dataSourceName, inputSchema, CaseInsensitiveStringMap.empty())
+
+val initialOffset = stream.initialOffset()
+assert(initialOffset.json == "{\"offset\": \"0\"}")
+for (_ <- 1 to 50) {
+  val offset = stream.latestOffset()
+  assert(offset.json == "{\"offset\": \"2\"}")
+  assert(stream.planInputPartitions(offset, offset).size == 2)
+  stream.commit(offset)
+}
+stream.stop()

Review Comment:
   `read` will be implemented in the next PR; it is executed in the executor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46962][SS][PYTHON] Add interface for python streaming data source API and implement python worker to run python streaming data source [spark]

2024-02-15 Thread via GitHub


chaoqin-li1123 commented on code in PR #45023:
URL: https://github.com/apache/spark/pull/45023#discussion_r1491464124


##
python/pyspark/sql/streaming/python_streaming_source_runner.py:
##
@@ -0,0 +1,159 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import sys
+import json
+from typing import IO
+
+from pyspark.accumulators import _accumulatorRegistry
+from pyspark.errors import PySparkAssertionError, PySparkRuntimeError
+from pyspark.java_gateway import local_connect_and_auth
+from pyspark.serializers import (
+read_int,
+write_int,
+write_with_length,
+SpecialLengths,
+)
+from pyspark.sql.datasource import DataSource
+from pyspark.sql.types import (
+_parse_datatype_json_string,
+StructType,
+)
+from pyspark.util import handle_worker_exception
+from pyspark.worker_util import (
+check_python_version,
+read_command,
+pickleSer,
+send_accumulator_updates,
+setup_memory_limits,
+setup_spark_files,
+utf8_deserializer,
+)
+
+initial_offset_func_id = 884
+latest_offset_func_id = 885
+partitions_func_id = 886
+commit_func_id = 887
+
+
+def initial_offset_func(reader, outfile):
+offset = reader.initialOffset()

Review Comment:
   We throw an unimplemented error.
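   
   For reference, a plausible worker-side sketch (an assumption, not necessarily the PR's exact code): the offset dict could be JSON-encoded and written back to the JVM with `write_with_length`, which this file already imports.
   
   ```python
   import json
   from pyspark.serializers import write_with_length
   
   def initial_offset_func(reader, outfile):
       # Raises PySparkNotImplementedError when the user has not overridden it.
       offset = reader.initialOffset()
       # Hypothetical: serialize the offset dict as JSON for the JVM side.
       write_with_length(json.dumps(offset).encode("utf-8"), outfile)
   ```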



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46962][SS][PYTHON] Add interface for python streaming data source API and implement python worker to run python streaming data source [spark]

2024-02-15 Thread via GitHub


chaoqin-li1123 commented on code in PR #45023:
URL: https://github.com/apache/spark/pull/45023#discussion_r1491463836


##
python/pyspark/sql/datasource.py:
##
@@ -298,6 +320,117 @@ def read(self, partition: InputPartition) -> Iterator[Union[Tuple, Row]]:
 ...
 
 
+class DataSourceStreamReader(ABC):
+    """
+    A base class for streaming data source readers. Data source stream readers are responsible
+    for outputting data from a streaming data source.
+
+    .. versionadded: 4.0.0
+    """
+
+    def initialOffset(self) -> dict:
+        """
+        Return the initial offset of the streaming data source.
+        A new streaming query starts reading data from the initial offset.
+        If Spark is restarting an existing query, it will restart from the check-pointed offset
+        rather than the initial one.
+
+        Returns
+        -------
+        dict
+            A dict whose key and values are str type.
+        """
+        ...
+        raise PySparkNotImplementedError(
+            error_class="NOT_IMPLEMENTED",
+            message_parameters={"feature": "initialOffset"},
+        )
+
+    def latestOffset(self) -> dict:
+        """
+        Returns the most recent offset available.
+
+        Returns
+        -------
+        dict
+            A dict whose key and values are str type.
+        """
+        ...
+        raise PySparkNotImplementedError(
+            error_class="NOT_IMPLEMENTED",
+            message_parameters={"feature": "latestOffset"},
+        )
+
+    def partitions(self, start: dict, end: dict) -> Sequence[InputPartition]:
+        """
+        Returns a list of InputPartition given the start and end offsets. Each InputPartition
+        represents a data split that can be processed by one Spark task.
+
+        Parameters
+        ----------
+        start : dict
+            The start offset of the microbatch to plan partitioning.
+        end : dict
+            The end offset of the microbatch to plan partitioning.
+
+        Returns
+        -------
+        Sequence[InputPartition]
+            A sequence of partitions for this data source. Each partition value
+            must be an instance of `InputPartition` or a subclass of it.
+        """
+        ...
+        raise PySparkNotImplementedError(
+            error_class="NOT_IMPLEMENTED",
+            message_parameters={"feature": "partitions"},
+        )
+
+    @abstractmethod
+    def read(self, partition) -> Iterator[Union[Tuple, Row]]:
+        """
+        Generates data for a given partition and returns an iterator of tuples or rows.
+
+        This method is invoked once per partition to read the data. Implementing
+        this method is required for stream reader. You can initialize any
+        non-serializable resources required for reading data from the data source
+        within this method.
+        This method is static and stateless. You shouldn't access mutable class member
+        or keep in memory state between different invocations of read().
+
+        Parameters
+        ----------
+        partition : object
+            The partition to read. It must be one of the partition values returned by
+            ``partitions()``.
+
+        Returns
+        -------
+        Iterator[Tuple] or Iterator[Row]
+            An iterator of tuples or rows. Each tuple or row will be converted to a row
+            in the final DataFrame.
+        """
+        ...
+
+    def commit(self, end: dict):

Review Comment:
   This function is not supposed to return anything, how do we express that in Python?
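   
   In plain Python the convention is an explicit `-> None` return annotation. A minimal sketch (the docstring here is illustrative, not the PR's text):
   
   ```python
   def commit(self, end: dict) -> None:
       """
       Informs the source that Spark has finished processing data up to `end`,
       so any resources tied to earlier offsets can be released.
       """
       ...
   ```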



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47058][TESTS] Add `scalastyle` and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering` [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun closed pull request #45120: [SPARK-47058][TESTS] Add `scalastyle` 
and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering`
URL: https://github.com/apache/spark/pull/45120


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47058][TESTS] Add `scalastyle` and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering` [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45120:
URL: https://github.com/apache/spark/pull/45120#issuecomment-1946929639

   Thank you so much, @huaxingao !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47058][TESTS] Add `scalastyle` and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering` [spark]

2024-02-15 Thread via GitHub


huaxingao commented on PR #45120:
URL: https://github.com/apache/spark/pull/45120#issuecomment-1946858479

   LGTM. Thanks for the PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] Count bug in temp views [spark]

2024-02-15 Thread via GitHub


agubichev opened a new pull request, #45125:
URL: https://github.com/apache/spark/pull/45125

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46400][CORE][SQL] When there are corrupted files in the local maven repo, skip this cache and try again [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #44343:
URL: https://github.com/apache/spark/pull/44343#issuecomment-1946772230

   Thank you, @panbingkun and all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46400][CORE][SQL][3.5] When there are corrupted files in the local maven repo, skip this cache and try again [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45017:
URL: https://github.com/apache/spark/pull/45017#issuecomment-1946768228

   Please update the JIRA for the backporting PRs, @LuciferYang .  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46400][CORE][SQL][3.4] When there are corrupted files in the local maven repo, skip this cache and try again [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun closed pull request #45018: [SPARK-46400][CORE][SQL][3.4] When 
there are corrupted files in the local maven repo, skip this cache and try again
URL: https://github.com/apache/spark/pull/45018


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47064][SQL][TESTS] Use Scala 2.13 Spark distribution in `HiveExternalCatalogVersionsSuite` [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun closed pull request #45124: [SPARK-47064][SQL][TESTS] Use Scala 
2.13 Spark distribution in `HiveExternalCatalogVersionsSuite`
URL: https://github.com/apache/spark/pull/45124


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47064][SQL][TESTS] Use Scala 2.13 Spark distribution in `HiveExternalCatalogVersionsSuite` [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45124:
URL: https://github.com/apache/spark/pull/45124#issuecomment-1946730394

   Let me merge this because this affects only 
`HiveExternalCatalogVersionsSuite` and I verified it manually.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47064][SQL][TESTS] Use Scala 2.13 Spark distribution in `HiveExternalCatalogVersionsSuite` [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45124:
URL: https://github.com/apache/spark/pull/45124#issuecomment-1946726359

   Thank you so much, @viirya .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47064][SQL][TESTS] Use Scala 2.13 Spark distribution in `HiveExternalCatalogVersionsSuite` [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on code in PR #45124:
URL: https://github.com/apache/spark/pull/45124#discussion_r1491387564


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala:
##
@@ -95,7 +95,7 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils {
   mirrors.distinct :+ "https://archive.apache.org/dist" :+ PROCESS_TABLES.releaseMirror
 logInfo(s"Trying to download Spark $version from $sites")
 for (site <- sites) {
-  val filename = s"spark-$version-bin-hadoop3.tgz"
+  val filename = s"spark-$version-bin-hadoop3-scala2.13.tgz"

Review Comment:
   We can change it back to `*-hadoop3.tgz` at Apache Spark 4.1 after the 
official Apache Spark 4.0 release.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47064][SQL][TESTS] Use Scala 2.13 Spark distribution in `HiveExternalCatalogVersionsSuite` [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45124:
URL: https://github.com/apache/spark/pull/45124#issuecomment-1946711084

   Could you review this test PR, @viirya ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47064][SQL][TESTS] Use Scala 2.13 Spark distribution in `HiveExternalCatalogVersionsSuite` [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun opened a new pull request, #45124:
URL: https://github.com/apache/spark/pull/45124

   ### What changes were proposed in this pull request?
   
   This PR aims to use `Scala 2.13` Spark binary in 
`HiveExternalCatalogVersionsSuite`.
   
   ### Why are the changes needed?
   
   SPARK-45314 made Scala 2.13 the default Scala version.
   As one of the migration paths, users may choose Apache Spark 3.5.0 (Scala 2.13) and Apache Spark 3.4.2 (Scala 2.13).
   We had better focus on Scala 2.13 testing.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Pass the CIs.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47058][TESTS] Add `scalastyle` and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering` [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45120:
URL: https://github.com/apache/spark/pull/45120#issuecomment-1946669694

   All tests passed.
   
   ![Screenshot 2024-02-15 at 09 23 38](https://github.com/apache/spark/assets/9700541/5ad55819-8c29-49f4-a1c8-6464f8a2a84c)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47058][TESTS] Add `scalastyle` and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering` [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45120:
URL: https://github.com/apache/spark/pull/45120#issuecomment-1946623436

   Could you review this PR, please, @huaxingao ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47058][TESTS] Add `scalastyle` and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering` [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45120:
URL: https://github.com/apache/spark/pull/45120#issuecomment-1946622765

   All tests passed except one irrelevant pyspark test pipeline.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47055][PYTHON] Upgrade MyPy 1.8.0 [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45115:
URL: https://github.com/apache/spark/pull/45115#issuecomment-1946533127

   Python linter seems to complain still.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47040][CONNECT][FOLLOWUP] Improve usability of the start-conect-server-script.sh [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45117:
URL: https://github.com/apache/spark/pull/45117#issuecomment-1946527338

   Thank you for the update, @grundprinzip .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47050][SQL] Collect and publish partition level metrics [spark]

2024-02-15 Thread via GitHub


snmvaughan opened a new pull request, #45123:
URL: https://github.com/apache/spark/pull/45123

   We currently capture metrics which include the number of files, bytes and 
rows for a task along with the updated partitions. 
This change captures metrics for each updated partition, reporting the 
partition sub-paths along with the number of files, bytes, and rows per 
partition for each task.
   
   ### What changes were proposed in this pull request?
   
   - Update the `WriteTaskStatsTracker` implementation to associate a partition 
with new files during writing, and to track the number of rows written to each 
file. The final stats now include a map of partitions and the associated stats 
(number of committed files, bytes, and rows)
   - Update the `WriteJobStatsTracker` implementation to capture the partition 
subpaths and to publish a new Event to the listener bus. The processed stats 
aggregate the statistics for each partition which are reported by the executors
   - Add a new `SparkListenerEvent` used to publish the task's collected 
partition metrics
   
   ### Why are the changes needed?
   This increases our understanding of written data by tracking the impact for 
each task on our datasets
   
   ### Does this PR introduce _any_ user-facing change?
   This adds an additional event which provides partition-level data to 
listeners.
   
   ### How was this patch tested?
   In addition to the new unit tests, this was run in a Kubernetes environment 
writing tables with differing partitioning strategies and validating the 
reported stats.  In all cases where partitioning was enabled we also verified 
that the aggregated partition metrics matched the existing metrics for number 
of files, bytes, and rows.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
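   
   For illustration, a listener could consume such an event through `SparkListener.onOtherEvent`. The event class name and fields below are assumptions for the sketch, not this PR's actual API:
   
   ```scala
   import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
   
   // Hypothetical event shape; the real class and fields are defined by the PR.
   case class PartitionMetricsEvent(
       partition: String, numFiles: Long, numBytes: Long, numRows: Long)
     extends SparkListenerEvent
   
   class PartitionMetricsListener extends SparkListener {
     override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
       case e: PartitionMetricsEvent =>
         // One line per updated partition of a completed write task.
         println(s"${e.partition}: files=${e.numFiles}, bytes=${e.numBytes}, rows=${e.numRows}")
       case _ => // ignore unrelated events
     }
   }
   ```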


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47059][SQL] Attach error context for ALTER COLUMN v1 command [spark]

2024-02-15 Thread via GitHub


MaxGekk closed pull request #45121: [SPARK-47059][SQL] Attach error context for 
ALTER COLUMN v1 command
URL: https://github.com/apache/spark/pull/45121


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47059][SQL] Attach error context for ALTER COLUMN v1 command [spark]

2024-02-15 Thread via GitHub


MaxGekk commented on PR #45121:
URL: https://github.com/apache/spark/pull/45121#issuecomment-1946340230

   +1, LGTM. Merging to master.
   Thank you, @cloud-fan.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

2024-02-15 Thread via GitHub


nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1491176098


##
core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala:
##
@@ -249,4 +249,32 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers {
 map(null) = null
 assert(map.get(null) === Some(null))
   }
+
+  test("SPARK-45599: 0.0 and -0.0 should count distinctly") {
+// Exactly these elements provided in roughly this order trigger a 
condition where lookups of
+// 0.0 and -0.0 in the bitset happen to collide, causing their counts to 
be merged incorrectly
+// and inconsistently if `==` is used to check for key equality.
+val spark45599Repro = Seq(
+  Double.NaN,
+  2.0,
+  168.0,
+  Double.NaN,
+  Double.NaN,
+  -0.0,
+  153.0,
+  0.0
+)
+
+val map1 = new OpenHashMap[Double, Int]()
+spark45599Repro.foreach(map1.changeValue(_, 1, {_ + 1}))
+assert(map1(0.0) == 1)
+assert(map1(-0.0) == 1)
+
+val map2 = new OpenHashMap[Double, Int]()
+// Simply changing the order in which the elements are added to the map should not change the
+// counts for 0.0 and -0.0.
+spark45599Repro.reverse.foreach(map2.changeValue(_, 1, {_ + 1}))
+assert(map2(0.0) == 1)
+assert(map2(-0.0) == 1)

Review Comment:
   Added just below.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47062][CONNECT] Move Connect Plugins to Java for Compatibility [spark]

2024-02-15 Thread via GitHub


grundprinzip commented on PR #45114:
URL: https://github.com/apache/spark/pull/45114#issuecomment-1946297630

   Will update the PR description.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

2024-02-15 Thread via GitHub


nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1491158277


##
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:
##
@@ -110,6 +110,18 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
 this
   }
 
+  /**
+   * Check if a key exists at the provided position using object equality rather than
+   * cooperative equality. Otherwise, hash sets will mishandle values for which `==`
+   * and `equals` return different results, like 0.0/-0.0 and NaN/NaN.

Review Comment:
   Yes, the differences are subtle:
   
   ```scala
   scala> 0.0 == -0.0
   val res0: Boolean = true
   
   scala> 0.0 equals -0.0
   val res1: Boolean = false
   
   scala> Double.NaN == Double.NaN
   val res2: Boolean = false
   
   scala> Double.NaN equals Double.NaN
   val res3: Boolean = true
   ```
   
   There is a long discussion on the Scala forums from 2017 about this 
difference and some of the problems it causes:
   
   [Can we get rid of cooperative equality?](https://contributors.scala-lang.org/t/can-we-get-rid-of-cooperative-equality/1131)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

2024-02-15 Thread via GitHub


nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1491125038


##
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##
@@ -269,4 +269,43 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
   assert(pos1 == pos2)
 }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {
+// Therefore, 0.0 and -0.0 should get separate entries in the hash set.
+//
+// Exactly these elements provided in roughly this order will trigger the following scenario:
+// When probing the bitset in `getPos(-0.0)`, the loop will happen upon the entry for 0.0.
+// In the old logic pre-SPARK-45599, the loop will find that the bit is set and, because
+// -0.0 == 0.0, it will think that's the position of -0.0. But in reality this is the position
+// of 0.0. So -0.0 and 0.0 will be stored at different positions, but `getPos()` will return
+// the same position for them. This can cause users of OpenHashSet, like OpenHashMap, to
+// return the wrong value for a key based on whether or not this bitset lookup collision
+// happens.
+val spark45599Repro = Seq(
+  Double.NaN,
+  2.0,
+  168.0,
+  Double.NaN,
+  Double.NaN,
+  -0.0,
+  153.0,
+  0.0
+)
+val set = new OpenHashSet[Double]()
+spark45599Repro.foreach(set.add)
+assert(set.size == 6)
+val zeroPos = set.getPos(0.0)
+val negZeroPos = set.getPos(-0.0)
+assert(zeroPos != negZeroPos)
+  }
+
+  test("SPARK-45599: NaN and NaN are the same but not equal") {
+// Any mathematical comparison to NaN will return false, but when we place it in
+// a hash set we want the lookup to work like a "normal" value.
+val set = new OpenHashSet[Double]()
+set.add(Double.NaN)
+set.add(Double.NaN)
+assert(set.contains(Double.NaN))
+assert(set.size == 1)

Review Comment:
   Yes sir. On `master`, this is the actual behavior of `OpenHashSet`:
   
   ```scala
   // ...OpenHashSet$mcD$sp@21b327e6 did not contain NaN
   assert(set.contains(Double.NaN))
   
   // ...OpenHashSet$mcD$sp@1f09db1e had size 2 instead of expected size 1
   assert(set.size == 1)
   ```
   
   Every NaN will get its own entry in `OpenHashSet` on `master`. So if we add 
1,000,000 NaNs to the set, NaN will have 1,000,000 entries in there. And 
`.contains()` will _still_ return `false`. :D



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45789][SQL] Support DESCRIBE TABLE for clustering columns [spark]

2024-02-15 Thread via GitHub


cloud-fan commented on code in PR #45077:
URL: https://github.com/apache/spark/pull/45077#discussion_r1490980421


##
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DescribeTableSuiteBase.scala:
##
@@ -178,4 +178,43 @@ trait DescribeTableSuiteBase extends QueryTest with DDLCommandTestUtils {
   assert(errMsg === "DESC TABLE COLUMN does not support nested column: col.x.")
 }
   }
+
+  test("describe a clustered table") {
+withNamespaceAndTable("ns", "tbl") { tbl =>
+  sql(s"CREATE TABLE $tbl (col1 STRING, col2 INT) $defaultUsing CLUSTER BY (col1, col2)")
+  val descriptionDf = sql(s"DESC $tbl")
+  assert(descriptionDf.schema.map(field => (field.name, field.dataType)) === Seq(
+("col_name", StringType),
+("data_type", StringType),
+("comment", StringType)))
+  QueryTest.checkAnswer(
+descriptionDf,
+Seq(
+  Row("col1", "string", null),
+  Row("col2", "int", null),
+  Row("# Clustering Information", "", ""),
+  Row("# col_name", "data_type", "comment"),
+  Row("col1", "string", null),
+  Row("col2", "int", null)))
+}
+  }
+
+  test("describe a clustered table with comments on clustering columns") {
+withNamespaceAndTable("ns", "tbl") { tbl =>
+  sql(s"CREATE TABLE $tbl (col1 STRING) $defaultUsing CLUSTER BY (col1)")
+  sql(s"ALTER TABLE $tbl ALTER COLUMN col1 COMMENT 'this is comment';")

Review Comment:
   Since we only want to test DESCRIBE TABLE, not ALTER COLUMN, shall we specify the column comment in CREATE TABLE? Then we don't need to exclude this test case in the Hive test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

2024-02-15 Thread via GitHub


cloud-fan commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1490973900


##
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##
@@ -269,4 +269,43 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
   assert(pos1 == pos2)
 }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {
+// Therefore, 0.0 and -0.0 should get separate entries in the hash set.
+//
+// Exactly these elements provided in roughly this order will trigger the following scenario:
+// When probing the bitset in `getPos(-0.0)`, the loop will happen upon the entry for 0.0.
+// In the old logic pre-SPARK-45599, the loop will find that the bit is set and, because
+// -0.0 == 0.0, it will think that's the position of -0.0. But in reality this is the position
+// of 0.0. So -0.0 and 0.0 will be stored at different positions, but `getPos()` will return
+// the same position for them. This can cause users of OpenHashSet, like OpenHashMap, to
+// return the wrong value for a key based on whether or not this bitset lookup collision
+// happens.
+val spark45599Repro = Seq(
+  Double.NaN,
+  2.0,
+  168.0,
+  Double.NaN,
+  Double.NaN,
+  -0.0,
+  153.0,
+  0.0
+)
+val set = new OpenHashSet[Double]()
+spark45599Repro.foreach(set.add)
+assert(set.size == 6)
+val zeroPos = set.getPos(0.0)
+val negZeroPos = set.getPos(-0.0)
+assert(zeroPos != negZeroPos)
+  }
+
+  test("SPARK-45599: NaN and NaN are the same but not equal") {
+// Any mathematical comparison to NaN will return false, but when we place it in
+// a hash set we want the lookup to work like a "normal" value.
+val set = new OpenHashSet[Double]()
+set.add(Double.NaN)
+set.add(Double.NaN)
+assert(set.contains(Double.NaN))
+assert(set.size == 1)

Review Comment:
   do we actually waste space in `OpenHashSet` to store all the NaN values?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

2024-02-15 Thread via GitHub


cloud-fan commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1490972974


##
core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala:
##
@@ -249,4 +249,32 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers {
 map(null) = null
 assert(map.get(null) === Some(null))
   }
+
+  test("SPARK-45599: 0.0 and -0.0 should count distinctly") {
+// Exactly these elements provided in roughly this order trigger a condition where lookups of
+// 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
+// and inconsistently if `==` is used to check for key equality.
+val spark45599Repro = Seq(
+  Double.NaN,
+  2.0,
+  168.0,
+  Double.NaN,
+  Double.NaN,
+  -0.0,
+  153.0,
+  0.0
+)
+
+val map1 = new OpenHashMap[Double, Int]()
+spark45599Repro.foreach(map1.changeValue(_, 1, {_ + 1}))
+assert(map1(0.0) == 1)
+assert(map1(-0.0) == 1)
+
+val map2 = new OpenHashMap[Double, Int]()
+// Simply changing the order in which the elements are added to the map should not change the
+// counts for 0.0 and -0.0.
+spark45599Repro.reverse.foreach(map2.changeValue(_, 1, {_ + 1}))
+assert(map2(0.0) == 1)
+assert(map2(-0.0) == 1)

Review Comment:
   shall we test NaN  as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

2024-02-15 Thread via GitHub


cloud-fan commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1490967547


##
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:
##
@@ -110,6 +110,18 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
 this
   }
 
+  /**
+   * Check if a key exists at the provided position using object equality rather than
+   * cooperative equality. Otherwise, hash sets will mishandle values for which `==`
+   * and `equals` return different results, like 0.0/-0.0 and NaN/NaN.

Review Comment:
   I was told that in scala `==` is the same as `equals`, but `eq` is a 
different operator. I need to refresh my knowledge now :)
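   
   For reference, a quick REPL sketch of the three operators on boxed doubles (language semantics, not PR code):
   
   ```scala
   scala> val a: java.lang.Double = 0.0; val b: java.lang.Double = -0.0
   scala> a == b      // true: cooperative equality compares the numeric values
   scala> a equals b  // false: java.lang.Double.equals compares bit patterns
   scala> a eq b      // false: reference equality, two distinct boxed instances
   ```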



##
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:
##
@@ -110,6 +110,18 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
 this
   }
 
+  /**
+   * Check if a key exists at the provided position using object equality rather than
+   * cooperative equality. Otherwise, hash sets will mishandle values for which `==`
+   * and `equals` return different results, like 0.0/-0.0 and NaN/NaN.
+   *
+   * See: https://issues.apache.org/jira/browse/SPARK-45599
+   */
+  @annotation.nowarn("cat=other-non-cooperative-equals")
+  private def keyExistsAtPos(k: T, pos: Int) =
+// _data(pos) == k

Review Comment:
   we should remove it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [MINOR][SQL] Show only number of test blocks when there is a mismatch [spark]

2024-02-15 Thread via GitHub


cloud-fan commented on PR #45083:
URL: https://github.com/apache/spark/pull/45083#issuecomment-1946030494

   late LGTM





Re: [PR] [SPARK-47026][SQL][TESTS] Enable JSON sources in default value nested type tests [spark]

2024-02-15 Thread via GitHub


cloud-fan commented on PR #45086:
URL: https://github.com/apache/spark/pull/45086#issuecomment-1946029398

   late LGTM





[PR] Update pom.xml [spark]

2024-02-15 Thread via GitHub


ArunDhamotharan opened a new pull request, #45122:
URL: https://github.com/apache/spark/pull/45122

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47059][SQL] Attach error context for ALTER COLUMN v1 command [spark]

2024-02-15 Thread via GitHub


cloud-fan commented on PR #45121:
URL: https://github.com/apache/spark/pull/45121#issuecomment-1945785935

   cc @MaxGekk 





[PR] [SPARK-47059][SQL] Attach error context for ALTER COLUMN v1 command [spark]

2024-02-15 Thread via GitHub


cloud-fan opened a new pull request, #45121:
URL: https://github.com/apache/spark/pull/45121

   
   
   ### What changes were proposed in this pull request?
   
   This is a small fix to improve the error message for ALTER COLUMN. We attach 
the error context for v1 command as well, making it consistent with v2 command.
   
   ### Why are the changes needed?
   
   better error message
   
   ### Does this PR introduce _any_ user-facing change?
   
   no
   
   ### How was this patch tested?
   
   updated tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   no
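
   For readers unfamiliar with the term, a hypothetical sketch of the kind of
   query-fragment context this refers to (the table/column names and the error
   shape below are illustrative, not taken from this PR):

   ```scala
   // Hypothetical illustration: with error context attached, a failing v1
   // ALTER COLUMN should report the offending SQL fragment, in the same
   // general style that other commands use when printing query context.
   spark.sql("ALTER TABLE t ALTER COLUMN i TYPE DOUBLE")
   // Expected general shape of the enriched error (wording illustrative):
   //   == SQL ==
   //   ALTER TABLE t ALTER COLUMN i TYPE DOUBLE
   //                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
   ```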





Re: [PR] [SPARK-36964][CORE][YARN] Share cached dnsToSwitchMapping for yarn locality container requests [spark]

2024-02-15 Thread via GitHub


vbmacher commented on PR #34231:
URL: https://github.com/apache/spark/pull/34231#issuecomment-1945751728

   Hey! I'm receiving this from time to time:
   
   ```
   24/02/15 09:53:52 ERROR AsyncEventQueue: Dropping event from queue executorManagement. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
   ```
   
   So can this be "revived"? cc @gaoyajun02 
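
   Not a fix for the root cause this PR targets, but a common stopgap for that
   log line is raising the listener-bus queue capacity (default 10000; property
   name per Spark's configuration docs, worth verifying against your version):

   ```scala
   // Minimal sketch: enlarge the listener event queues so bursts of task-start
   // events are less likely to overflow the "executorManagement" queue.
   val spark = org.apache.spark.sql.SparkSession.builder()
     .appName("listener-queue-capacity-demo")
     .config("spark.scheduler.listenerbus.eventqueue.capacity", "30000")
     .getOrCreate()
   ```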





[PR] [SPARK-47058][TESTS] Add `scalastyle` and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering` [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun opened a new pull request, #45120:
URL: https://github.com/apache/spark/pull/45120

   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47056][TESTS] Add `scalastyle` and `checkstyle` rules to ban `FileBackedOutputStream` [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun closed pull request #45116: [SPARK-47056][TESTS] Add `scalastyle` 
and `checkstyle` rules to ban `FileBackedOutputStream`
URL: https://github.com/apache/spark/pull/45116





Re: [PR] [SPARK-47056][TESTS] Add `scalastyle` and `checkstyle` rules to ban `FileBackedOutputStream` [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45116:
URL: https://github.com/apache/spark/pull/45116#issuecomment-1945684945

   Scala and Java linters passed. Merged to master.
   
   ![Screenshot 2024-02-15 at 01 26 10](https://github.com/apache/spark/assets/9700541/a55f8fbd-c4ab-4c77-afc5-d2e6f363af5f)
   





Re: [PR] [SPARK-47056][TESTS] Add `scalastyle` and `checkstyle` rules to ban `FileBackedOutputStream` [spark]

2024-02-15 Thread via GitHub


dongjoon-hyun commented on PR #45116:
URL: https://github.com/apache/spark/pull/45116#issuecomment-1945680565

   Thank you, @HyukjinKwon !




