[PR] [SPARK-41762][PYTHON][CONNECT][TESTS] Enable column name comparison in `test_column_arithmetic_ops` [spark]

2024-03-13 Thread via GitHub


zhengruifeng opened a new pull request, #45493:
URL: https://github.com/apache/spark/pull/45493

   ### What changes were proposed in this pull request?
   Enable column name comparison in `test_column_arithmetic_ops`
   
   ### Why are the changes needed?
   The default column names should have already been fixed.
   
   
   ### Does this PR introduce _any_ user-facing change?
   no, test-only
   
   
   ### How was this patch tested?
   ci
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
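
   For illustration only - a minimal Scala sketch (the actual change is in the
   Python test suite and is not reproduced here; a local SparkSession is
   assumed) of the property the test now checks, namely that arithmetic ops
   produce the expected default column name:

    import org.apache.spark.sql.SparkSession

    object ColumnNameCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[1]").getOrCreate()
        val df = spark.range(3).selectExpr("id AS a", "id AS b")
        // The default name of an arithmetic result is derived from its
        // operands, e.g. "(a + b)"; the test asserts Connect matches this.
        assert(df.select(df("a") + df("b")).columns.sameElements(Array("(a + b)")))
        spark.stop()
      }
    }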


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47374][CONNECT][DOCS] Fix connect-repl `usage prompt` & `docs link` [spark]

2024-03-13 Thread via GitHub


panbingkun opened a new pull request, #45494:
URL: https://github.com/apache/spark/pull/45494

   ### What changes were proposed in this pull request?
   The PR aims to:
   - fix the connect-repl `usage prompt`.
   - fix the docs link.
   
   
   ### Why are the changes needed?
   Bug fixes only.
   
   - Usage prompt
   1. Update `enable_ssl` to `use_ssl`.
   2. Add `user_agent` and `session_id`.
 Before:
 https://github.com/apache/spark/assets/15246973/db401e7f-9569-469c-90eb-17c277472b4d (screenshot)
   
 After:
 https://github.com/apache/spark/assets/15246973/de7b1f8c-1d0d-4696-8f93-f72636b0d1a0 (screenshot)
   
   - Docs link
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, the `connect-repl` `usage prompt` and `docs link` have been corrected.
   
   
   ### How was this patch tested?
   - Manually test.
   - Pass GA.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





Re: [PR] [SPARK-46761][SQL] Quoted strings in a JSON path should support ? characters [spark]

2024-03-13 Thread via GitHub


HyukjinKwon commented on PR #45420:
URL: https://github.com/apache/spark/pull/45420#issuecomment-1993710219

   Merged to master.





Re: [PR] [SPARK-46761][SQL] Quoted strings in a JSON path should support ? characters [spark]

2024-03-13 Thread via GitHub


HyukjinKwon closed pull request #45420: [SPARK-46761][SQL] Quoted strings in a 
JSON path should support ? characters
URL: https://github.com/apache/spark/pull/45420





Re: [PR] [SPARK-47368][SQL][3.5] Remove inferTimestampNTZ config check in ParquetRo… [spark]

2024-03-13 Thread via GitHub


gengliangwang closed pull request #45492: [SPARK-47368][SQL][3.5] Remove 
inferTimestampNTZ config check in ParquetRo…
URL: https://github.com/apache/spark/pull/45492





Re: [PR] [SPARK-47374][CONNECT][DOCS] Fix connect-repl `usage prompt` & `docs link` [spark]

2024-03-13 Thread via GitHub


panbingkun commented on code in PR #45494:
URL: https://github.com/apache/spark/pull/45494#discussion_r1522651676


##
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClientParser.scala:
##
@@ -29,17 +29,21 @@ private[sql] object SparkConnectClientParser {
*   usage string.
*/
   def usage(): String =
+// scalastyle:off line.size.limit
 s"""
|Options:
|   --remote REMOTE  URI of the Spark Connect Server to connect to.
|   --host HOST  Host where the Spark Connect Server is running.
|   --port PORT  Port where the Spark Connect Server is running.
-   |   --enable-ssl Connect to the server using SSL.
+   |   --use_ssl    Connect to the server using SSL.

Review Comment:
   Obviously, it is `use_ssl` here
   
https://github.com/apache/spark/blob/master/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClientParser.scala#L63
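
   For reference, a hedged example of invoking the REPL with the corrected 
flag name (host, port, and token values are illustrative):

    spark-connect-repl --host myhost.com --port 443 --use_ssl --token ABCDEFG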






Re: [PR] [SPARK-47374][CONNECT][DOCS] Fix connect-repl `usage prompt` & `docs link` [spark]

2024-03-13 Thread via GitHub


panbingkun commented on code in PR #45494:
URL: https://github.com/apache/spark/pull/45494#discussion_r1522653119


##
docs/spark-connect-overview.md:
##
@@ -279,7 +279,7 @@ The customizations may also be passed in through CLI 
arguments as shown below:
 spark-connect-repl --host myhost.com --port 443 --token ABCDEFG
 {% endhighlight %}
 
-The supported list of CLI arguments may be found 
[here](https://github.com/apache/spark/blob/master/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClientParser.scala#L48).

Review Comment:
   The link is outdated.






Re: [PR] [SPARK-47374][CONNECT][DOCS] Fix connect-repl `usage prompt` & `docs link` [spark]

2024-03-13 Thread via GitHub


panbingkun commented on code in PR #45494:
URL: https://github.com/apache/spark/pull/45494#discussion_r1522658327


##
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClientParser.scala:
##
@@ -29,17 +29,21 @@ private[sql] object SparkConnectClientParser {
*   usage string.
*/
   def usage(): String =
+// scalastyle:off line.size.limit
 s"""
|Options:
|   --remote REMOTE  URI of the Spark Connect Server to connect to.
|   --host HOST  Host where the Spark Connect Server is running.
|   --port PORT  Port where the Spark Connect Server is running.
-   |   --enable-ssl Connect to the server using SSL.
+   |   --use_ssl    Connect to the server using SSL.
|   --token TOKEN    Token to use for authentication.
|   --user_id USER_ID    Id of the user connecting.
|   --user_name USER_NAME    Name of the user connecting.
+   |   --user_agent USER_AGENT  The User-Agent Client information (only intended for logging purposes by the server).

Review Comment:
   An explanation of `user_agent`
   
https://github.com/apache/spark/blob/c7795bb8cc82073ba555aaa233e3b2586ae6d1eb/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala#L676
   
   
https://github.com/apache/spark/blob/c7795bb8cc82073ba555aaa233e3b2586ae6d1eb/connector/connect/common/src/main/protobuf/spark/connect/base.proto#L72-L75






Re: [PR] [SPARK-47374][CONNECT][DOCS] Fix connect-repl `usage prompt` & `docs link` [spark]

2024-03-13 Thread via GitHub


panbingkun commented on PR #45494:
URL: https://github.com/apache/spark/pull/45494#issuecomment-1993719969

   cc @hvanhovell @HyukjinKwon 





Re: [PR] [SPARK-47373][SQL] Match FileSourceScanLike to get metadata instead of FileSourceScanExec [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun closed pull request #45491: [SPARK-47373][SQL] Match 
FileSourceScanLike to get metadata instead of FileSourceScanExec
URL: https://github.com/apache/spark/pull/45491





Re: [PR] [SPARK-47373][SQL] Match FileSourceScanLike to get metadata instead of FileSourceScanExec [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun commented on PR #45491:
URL: https://github.com/apache/spark/pull/45491#issuecomment-1993726765

   Thank you, @zwangsheng and all!





[PR] [WIP][MINOR][INFRA] Update the link text and ref in `notify_test_workflow.yml` [spark]

2024-03-13 Thread via GitHub


panbingkun opened a new pull request, #45495:
URL: https://github.com/apache/spark/pull/45495

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, only for dev.
   
   
   ### How was this patch tested?
   Manually test.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





Re: [PR] [WIP][MINOR][INFRA] Update the link text and ref in `notify_test_workflow.yml` [spark]

2024-03-13 Thread via GitHub


panbingkun commented on PR #45495:
URL: https://github.com/apache/spark/pull/45495#issuecomment-1993829413

   cc `the original author of this file` @HyukjinKwon 





[PR] [SPARK-47375][SQL] Add guidelines for timestamp mapping in `JdbcDialect#getCatalystType` [spark]

2024-03-13 Thread via GitHub


yaooqinn opened a new pull request, #45496:
URL: https://github.com/apache/spark/pull/45496

   
   
   ### What changes were proposed in this pull request?
   
   
   This PR adds guidelines for mapping database timestamps to Spark SQL 
timestamps through the standard JDBC API and the Spark JdbcDialect trait. The 
details can be viewed directly in the method descriptions and are not copied 
here.
   
   ### Why are the changes needed?
   
   These guidelines will help us revise the built-in JDBC data sources later 
without controversy. They also encourage custom dialects to follow the same 
mapping.
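
   As a hedged illustration (not the actual guideline text from this PR), a 
custom dialect following the general idea might map tz-aware database 
timestamps to `TimestampType` and zoneless ones to `TimestampNTZType`:

    import java.sql.Types
    import org.apache.spark.sql.jdbc.JdbcDialect
    import org.apache.spark.sql.types._

    // Illustrative dialect; the URL prefix and the exact mapping policy
    // are assumptions, not the built-in Spark dialect code.
    object MyTimestampAwareDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb")

      override def getCatalystType(
          sqlType: Int, typeName: String, size: Int,
          md: MetadataBuilder): Option[DataType] = sqlType match {
        case Types.TIMESTAMP_WITH_TIMEZONE => Some(TimestampType) // tz-aware
        case Types.TIMESTAMP => Some(TimestampNTZType)            // no time zone
        case _ => None // fall back to the default mapping
      }
    }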
   
   ### Does this PR introduce _any_ user-facing change?
   
   no, developer API doc changes only
   
   ### How was this patch tested?
   
   doc build
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   





Re: [PR] [SPARK-47375][SQL] Add guidelines for timestamp mapping in `JdbcDialect#getCatalystType` [spark]

2024-03-13 Thread via GitHub


yaooqinn commented on PR #45496:
URL: https://github.com/apache/spark/pull/45496#issuecomment-1993857925

   Some built-in JDBC data sources need to distinguish `timestamp tz` from 
`timestamp ntz` as much as possible, instead of falling back to the default 
implementation.





Re: [PR] [SPARK-47342][SQL] Support TimestampNTZ for DB2 TIMESTAMP WITH TIME ZONE [spark]

2024-03-13 Thread via GitHub


yaooqinn commented on PR #45471:
URL: https://github.com/apache/spark/pull/45471#issuecomment-1993888126

   This was reverted by 
[e170252](https://github.com/apache/spark/commit/e170252714e3662c7354321f78a3250114ea7e9e).
   After an offline discussion with @cloud-fan, we have reached a consensus on 
timestamp mapping between databases and Spark SQL through the JDBC APIs.
   
   The decisions can be found at https://github.com/apache/spark/pull/45496.
   
   A revision for DB2 may be raised in follow-ups if necessary.





Re: [PR] [SPARK-47141] [Core]: Support shuffle migration to external storage. [spark]

2024-03-13 Thread via GitHub


abhishekd0907 commented on PR #45228:
URL: https://github.com/apache/spark/pull/45228#issuecomment-1993902079

   Hi @mridulm  @attilapiros , can you please help review this PR?





Re: [PR] [SPARK-47208][CORE] Allow overriding base overhead memory [spark]

2024-03-13 Thread via GitHub


jpcorreia99 commented on code in PR #45240:
URL: https://github.com/apache/spark/pull/45240#discussion_r1522859546


##
core/src/main/scala/org/apache/spark/resource/ResourceProfile.scala:
##
@@ -489,10 +489,11 @@ object ResourceProfile extends Logging {
 
   private[spark] def calculateOverHeadMemory(
   overHeadMemFromConf: Option[Long],
+  minimumOverHeadMemoryFromConf: Long,
   executorMemoryMiB: Long,
   overheadFactor: Double): Long = {
 overHeadMemFromConf.getOrElse(math.max((overheadFactor * 
executorMemoryMiB).toInt,
-ResourceProfile.MEMORY_OVERHEAD_MIN_MIB))
+  minimumOverHeadMemoryFromConf))

Review Comment:
   I do agree. At this point, outside of tests, 
`ResourceProfile.MEMORY_OVERHEAD_MIN_MIB` has only one usage, in the client AM 
overhead, so I will make that change; I can revert if @tgravescs disagrees.
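
   For context, a hedged sketch of the precedence being discussed (names are 
taken from the diff above; the worked numbers are illustrative): an explicitly 
configured overhead wins, otherwise the overhead is the larger of 
factor * executorMemory and the configurable minimum.

    // e.g. executorMemoryMiB = 4096, overheadFactor = 0.10, minimum = 500
    // => with no explicit overhead set: max(409, 500) = 500 MiB.
    def overhead(explicit: Option[Long], minimum: Long,
        executorMemoryMiB: Long, factor: Double): Long =
      explicit.getOrElse(math.max((factor * executorMemoryMiB).toLong, minimum))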






Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

2024-03-13 Thread via GitHub


cloud-fan commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1522868178


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with 
DataTypeErrorsBase {
 assert(q.children.length == 1)
 q.children.head.output
   },
+
+  resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   the purpose of https://github.com/apache/spark/pull/43115 is to use plan for 
both Spark Connect and classic Spark.









Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

2024-03-13 Thread via GitHub


cloud-fan commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1522868178


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with 
DataTypeErrorsBase {
 assert(q.children.length == 1)
 q.children.head.output
   },
+
+  resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   the purpose of https://github.com/apache/spark/pull/43115 is to use plan id 
for both Spark Connect and classic Spark. That said, there should be no more 
already-resolved `AttributeReference`s.






Re: [PR] [SPARK-47208][CORE] Allow overriding base overhead memory [spark]

2024-03-13 Thread via GitHub


jpcorreia99 commented on code in PR #45240:
URL: https://github.com/apache/spark/pull/45240#discussion_r1522896795


##
core/src/main/scala/org/apache/spark/resource/ResourceProfile.scala:
##
@@ -489,10 +489,11 @@ object ResourceProfile extends Logging {
 
   private[spark] def calculateOverHeadMemory(
   overHeadMemFromConf: Option[Long],
+  minimumOverHeadMemoryFromConf: Long,
   executorMemoryMiB: Long,
   overheadFactor: Double): Long = {
 overHeadMemFromConf.getOrElse(math.max((overheadFactor * 
executorMemoryMiB).toInt,
-ResourceProfile.MEMORY_OVERHEAD_MIN_MIB))
+  minimumOverHeadMemoryFromConf))

Review Comment:
   Addressed in 3abd57efffcc3df2e2f4225b45f00f5e8af01724






Re: [PR] [SPARK-47329][SS] Persist dataframe while using foreachbatch and stateful streaming query to prevent state from being re-loaded in each batch [spark]

2024-03-13 Thread via GitHub


Bobstar55 commented on code in PR #45432:
URL: https://github.com/apache/spark/pull/45432#discussion_r1522898039


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ForeachBatchSink.scala:
##
@@ -22,19 +22,41 @@ import scala.util.control.NonFatal
 import org.apache.spark.{SparkException, SparkThrowable}
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Deduplicate, 
DeduplicateWithinWatermark, Distinct, FlatMapGroupsInPandasWithState, 
FlatMapGroupsWithState, GlobalLimit, Join, LogicalPlan, TransformWithState}
 import org.apache.spark.sql.execution.LogicalRDD
 import org.apache.spark.sql.execution.streaming.Sink
 import org.apache.spark.sql.streaming.DataStreamWriter
 
 class ForeachBatchSink[T](batchWriter: (Dataset[T], Long) => Unit, encoder: 
ExpressionEncoder[T])
   extends Sink {
 
+  private def isQueryStateful(logicalPlan: LogicalPlan): Boolean = {
+logicalPlan.collect {
+  case node @ (_: Aggregate | _: Distinct | _: FlatMapGroupsWithState

Review Comment:
   join 



##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ForeachBatchSink.scala:
##
@@ -22,19 +22,41 @@ import scala.util.control.NonFatal
 import org.apache.spark.{SparkException, SparkThrowable}
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Deduplicate, 
DeduplicateWithinWatermark, Distinct, FlatMapGroupsInPandasWithState, 
FlatMapGroupsWithState, GlobalLimit, Join, LogicalPlan, TransformWithState}

Review Comment:
   join
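
   A hedged sketch of the detection pattern under review (whether `Join` 
belongs in the list depends on it being a stream-stream join, which is what 
the "join" comments above flag; the operator set here is abbreviated):

    import org.apache.spark.sql.catalyst.plans.logical._

    def isQueryStateful(plan: LogicalPlan): Boolean =
      plan.collect {
        // Operators that keep state across micro-batches.
        case n @ (_: Aggregate | _: Distinct | _: Deduplicate |
            _: FlatMapGroupsWithState | _: Join) => n
      }.nonEmpty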






Re: [PR] [WIP][SPARK-47338][SQL] Introduce `_LEGACY_ERROR_UNKNOWN` for default error class [spark]

2024-03-13 Thread via GitHub


MaxGekk commented on code in PR #45457:
URL: https://github.com/apache/spark/pull/45457#discussion_r1522991723


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -7902,6 +7902,11 @@
   "Doesn't support month or year interval: "
 ]
   },
+  "_LEGACY_ERROR_UNKNOWN" : {

Review Comment:
   > Because I thought that not having an error class assigned basically meant 
it was a LEGACY error
   
   I would say it is true. `SparkException` can still be raised w/ just a 
message since it is not fully ported to error classes. For instance:
   
https://github.com/apache/spark/blob/e170252714e3662c7354321f78a3250114ea7e9e/common/utils/src/main/scala/org/apache/spark/util/SparkThreadUtils.scala#L51-L53






Re: [PR] [WIP][SPARK-47338][SQL] Introduce `_LEGACY_ERROR_UNKNOWN` for default error class [spark]

2024-03-13 Thread via GitHub


MaxGekk commented on code in PR #45457:
URL: https://github.com/apache/spark/pull/45457#discussion_r1523001153


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -7902,6 +7902,11 @@
   "Doesn't support month or year interval: "
 ]
   },
+  "_LEGACY_ERROR_UNKNOWN" : {

Review Comment:
   Since we know the cases when the error class is not set, how about just 
naming the error class something like `UNCLASSIFIED`?






[PR] [SPARK-47377][PYTHON][CONNECT][TESTS] Factor out tests from `SparkConnectSQLTestCase` [spark]

2024-03-13 Thread via GitHub


zhengruifeng opened a new pull request, #45497:
URL: https://github.com/apache/spark/pull/45497

   ### What changes were proposed in this pull request?
   Factor out tests from `SparkConnectSQLTestCase`
   
   ### Why are the changes needed?
   for testing parallelism
   
   
   ### Does this PR introduce _any_ user-facing change?
   no, test only
   
   ### How was this patch tested?
   ci
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no





[PR] [SPARK-47378][PROTOBUF][TESTS] Make related Protobuf UT run well in IDE [spark]

2024-03-13 Thread via GitHub


panbingkun opened a new pull request, #45498:
URL: https://github.com/apache/spark/pull/45498

   ### What changes were proposed in this pull request?
   The PR aims to make the related Protobuf `UT`s run well in the IDE.
   
   ### Why are the changes needed?
   Before:
   https://github.com/apache/spark/assets/15246973/c00781b2-3477-4b2c-b871-ead997fda697 (screenshot)
   
   After:
   https://github.com/apache/spark/assets/15246973/665fc67d-c69e-45c7-b37d-bb4ef8e72930 (screenshot)
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   - Manually test.
   - Pass GA.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





Re: [PR] [SPARK-47378][PROTOBUF][TESTS] Make the related Protobuf UT run well in IDE [spark]

2024-03-13 Thread via GitHub


panbingkun commented on PR #45498:
URL: https://github.com/apache/spark/pull/45498#issuecomment-1994200416

   I also verified using `./build/sbt` to run the related Protobuf `UT` 
locally, and it also works well, e.g.:
   https://github.com/apache/spark/assets/15246973/2e40ef68-78b2-482c-8ec1-663d0e819cd5 (screenshot)
   





Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

2024-03-13 Thread via GitHub


MaxGekk commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1994249547

   +1, LGTM. Merging to master.
   Thank you, @panbingkun and @HyukjinKwon @LuciferYang @srowen for review.





Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

2024-03-13 Thread via GitHub


MaxGekk closed pull request #44665: [SPARK-46654][SQL][PYTHON] Make `to_csv` 
explicitly indicate that it does not support some types of data
URL: https://github.com/apache/spark/pull/44665





Re: [PR] [SPARK-47375][SQL] Add guidelines for timestamp mapping in `JdbcDialect#getCatalystType` [spark]

2024-03-13 Thread via GitHub


yaooqinn closed pull request #45496: [SPARK-47375][SQL] Add guidelines for 
timestamp mapping in `JdbcDialect#getCatalystType`
URL: https://github.com/apache/spark/pull/45496





Re: [PR] [SPARK-47375][SQL] Add guidelines for timestamp mapping in `JdbcDialect#getCatalystType` [spark]

2024-03-13 Thread via GitHub


yaooqinn commented on PR #45496:
URL: https://github.com/apache/spark/pull/45496#issuecomment-1994278172

   Merged to master.
   
   Thank you @MaxGekk @cloud-fan for the review





[PR] [SPARK-47380][CONNECT] Ensure on the server side that the SparkSession is the same [spark]

2024-03-13 Thread via GitHub


nemanja-boric-databricks opened a new pull request, #45499:
URL: https://github.com/apache/spark/pull/45499

   
   
   
   
   ### What changes were proposed in this pull request?
   
   In this PR we change the client behaviour to send the previously observed 
server session id, so that the server can validate that the client has been 
talking to this specific session. Previously this was only validated on the 
client side, which meant the server would actually execute the request against 
the wrong session before the client threw (once the response from the server 
was obtained).
   
   
   ### Why are the changes needed?
   The server can execute the client command on the wrong spark session before 
client figuring out it's the different session.
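
   A hedged sketch of the server-side check (all names below are hypothetical, 
not the actual Spark Connect service code):

    final case class RequestHeader(clientObservedServerSideSessionId: Option[String])

    def validateSession(header: RequestHeader, serverSideSessionId: String): Unit =
      header.clientObservedServerSideSessionId.foreach { observed =>
        if (observed != serverSideSessionId) {
          // Fail fast before executing anything against the wrong session.
          throw new IllegalStateException(
            s"Session mismatch: client observed $observed, server has $serverSideSessionId")
        }
      }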
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Existing unit tests, a new unit test, an added e2e test, and manual testing.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No





Re: [PR] [SPARK-47296][SQL][COLLATION] Fail unsupported functions for non-binary collations [spark]

2024-03-13 Thread via GitHub


uros-db commented on PR #45422:
URL: https://github.com/apache/spark/pull/45422#issuecomment-1994394143

   @cloud-fan that makes a lot of sense. To combat this, new case classes now 
handle it. Essentially:
   - `StringType` no longer accepts all collationIds, but only the default 
collationId (0) - i.e. UTF8_BINARY
   - `StringTypeBinary` is added to allow binary collations (for now: 
UTF8_BINARY and UNICODE). At this time we need this for a full lockdown 
because casting is not ready (casting is a separate effort; when it's done, 
we can have `StringType` accept all binary collations directly - for now, that 
would be incorrect)
   - `StringTypeBinaryLcase` is added to allow binary & lowercase collations 
(UTF8_BINARY_LCASE, UTF8_BINARY, UNICODE) - this class is important because 
some expressions will support binary & lowercase, but not other collations, at 
a given time
   - `StringTypeAllCollations` is added to allow all collations (for now this 
is supported only in StringPredicate expressions: Contains, StartsWith, 
EndsWith) - note that these expressions handle all collations but can't 
guarantee that all string arguments have exactly the same collation type, so 
we still need `checkCollationCompatibility` in CollationTypeConstraints; once 
casting is ready, we will delete this
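
   A hedged, self-contained sketch of the bucketing idea (collation names are 
from the comment above; the trait and its wiring into Spark's type system are 
illustrative, not the real classes):

    sealed trait StringTypeConstraint { def accepts(collation: String): Boolean }

    object StringTypeBinary extends StringTypeConstraint {
      def accepts(c: String): Boolean = Set("UTF8_BINARY", "UNICODE").contains(c)
    }
    object StringTypeBinaryLcase extends StringTypeConstraint {
      def accepts(c: String): Boolean =
        StringTypeBinary.accepts(c) || c == "UTF8_BINARY_LCASE"
    }
    object StringTypeAllCollations extends StringTypeConstraint {
      def accepts(c: String): Boolean = true
    }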





Re: [PR] [SPARK-47344] Extend INVALID_IDENTIFIER error beyond catching '-' in an unquoted identifier and fix "IS ! NULL" et al. [spark]

2024-03-13 Thread via GitHub


MaxGekk commented on code in PR #45470:
URL: https://github.com/apache/spark/pull/45470#discussion_r1523276788


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ErrorParserSuite.scala:
##
@@ -39,6 +39,23 @@ class ErrorParserSuite extends AnalysisTest {
   context = ExpectedContext(fragment = "order by q\ncluster by q", start = 
16, stop = 38))
   }
 
+  test("Illegal characters in unquoted identifier") {
+// scalastyle:off

Review Comment:
   Why do you need this?



##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ErrorParserSuite.scala:
##
@@ -39,6 +39,23 @@ class ErrorParserSuite extends AnalysisTest {
   context = ExpectedContext(fragment = "order by q\ncluster by q", start = 
16, stop = 38))
   }
 
+  test("Illegal characters in unquoted identifier") {
+// scalastyle:off
+checkError(

Review Comment:
   Do you know why we don't have any query context here?






[PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-13 Thread via GitHub


HiuKwok opened a new pull request, #45500:
URL: https://github.com/apache/spark/pull/45500

   
   
   ### What changes were proposed in this pull request?
   This is a draft MR to upgrade Jetty deps from 11 to 12. 
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-13 Thread via GitHub


HiuKwok commented on PR #45500:
URL: https://github.com/apache/spark/pull/45500#issuecomment-1994466598

   This is a draft MR and I'm still working on it. 





Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-13 Thread via GitHub


HiuKwok commented on PR #45500:
URL: https://github.com/apache/spark/pull/45500#issuecomment-1994496507

   @dongjoon-hyun @LuciferYang 
   
   During the past few weeks, I managed to rewrite/update all Jetty-related 
classes, and things look fine in most of the Java/Scala classes. 
   However, due to the new handler class structure that Jetty 12 introduced, 
I'm not sure it is feasible to replicate what `ProxyRedirectHandler` is 
doing now.
   
   If I understand correctly, the initial intent of `ProxyRedirectHandler` is 
to override the redirect behaviour, in the case that Jetty decides to redirect 
the given request, BEFORE the request reaches any of the servlets. 
   
   However, in Jetty 12 all Jetty handlers are switched to use the Jetty 
Request and Response wrapper objects, hence it's no longer possible to 
override the redirect behaviour via the `sendRedirect` method call.
   
   I have checked the Jetty upgrade guide, which suggests that all 
`sendRedirect()` calls should be rewritten as `Response.sendRedirect(request, 
response, callback, location)`.
   However, in this case we no longer have control over the URL rewrite, 
because this is a static method from the Jetty lib.
   
   I wonder if you have any ideas on this, or whether a viable alternative 
can be implemented instead?
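
   For reference, a hedged sketch of the Jetty 12 call in question (assuming 
the Jetty 12 core API, where `sendRedirect` is a static utility on 
`org.eclipse.jetty.server.Response` rather than an overridable instance 
method):

    import org.eclipse.jetty.server.{Request, Response}
    import org.eclipse.jetty.util.Callback

    // The redirect location is computed by the caller; there is no hook
    // comparable to overriding HttpServletResponse#sendRedirect here.
    def redirect(req: Request, res: Response, cb: Callback, location: String): Unit =
      Response.sendRedirect(req, res, cb, location)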
   
   
   
   





Re: [PR] [SPARK-47208][CORE] Allow overriding base overhead memory [spark]

2024-03-13 Thread via GitHub


tgravescs commented on code in PR #45240:
URL: https://github.com/apache/spark/pull/45240#discussion_r1523328549


##
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:
##
@@ -706,6 +706,39 @@ class ClientSuite extends SparkFunSuite
 assert(client.getPreloadedStatCache(sparkConf.get(JARS_TO_DISTRIBUTE), 
mockFsLookup).size === 2)
   }
 
+  Seq(
+  "client",
+  "cluster"
+).foreach { case (deployMode) =>
+  test(s"SPARK-47208: minimum memory overhead is correctly set in 
($deployMode mode)") {
+val sparkConf = new SparkConf()
+  .set("spark.app.name", "foo-test-app")
+  .set(SUBMIT_DEPLOY_MODE, deployMode)
+  .set(DRIVER_MIN_MEMORY_OVERHEAD, 500L)
+val args = new ClientArguments(Array())
+
+val appContext = 
Records.newRecord(classOf[ApplicationSubmissionContext])
+val getNewApplicationResponse = 
Records.newRecord(classOf[GetNewApplicationResponse])
+val containerLaunchContext = 
Records.newRecord(classOf[ContainerLaunchContext])
+
+val client = new Client(args, sparkConf, null)
+client.createApplicationSubmissionContext(
+  new YarnClientApplication(getNewApplicationResponse, appContext),
+  containerLaunchContext)
+
+appContext.getApplicationName should be("foo-test-app")
+// flag should only work for cluster mode
+if (deployMode=="cluster") {

Review Comment:
   nit spaces around " == "



##
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:
##
@@ -772,6 +773,42 @@ class YarnAllocatorSuite extends SparkFunSuite
   assert(memory == (executorMemory * 1.4).toLong)
 } finally {
   sparkConf.set(EXECUTOR_MEMORY_OVERHEAD_FACTOR, 0.1)
+  sparkConf.remove(EXECUTOR_MEMORY_OVERHEAD)
+}
+  }
+
+  test("SPARK-47208: User can override the minimum memory overhead of the 
executor") {
+val executorMemory = sparkConf.get(EXECUTOR_MEMORY)
+try {
+  sparkConf
+.set(EXECUTOR_MIN_MEMORY_OVERHEAD, 500L)
+  val (handler, _) = createAllocator(maxExecutors = 1,
+additionalConfigs = Map(EXECUTOR_MEMORY.key -> 
executorMemory.toString))
+  val defaultResource = handler.rpIdToYarnResource.get(defaultRPId)
+  val memory = defaultResource.getMemorySize
+  assert(memory == (executorMemory + 500))
+} finally {
+  sparkConf
+.remove(EXECUTOR_MIN_MEMORY_OVERHEAD)
+}
+  }
+
+  test("SPARK-47208: Explicit overhead takes precedence over minimum 
overhead") {
+val executorMemory = sparkConf.get(EXECUTOR_MEMORY)
+try {
+  sparkConf
+.set(EXECUTOR_MIN_MEMORY_OVERHEAD, 500L)
+.set(EXECUTOR_MEMORY_OVERHEAD, 100L)
+  val (handler, _) = createAllocator(maxExecutors = 1,
+additionalConfigs = Map(EXECUTOR_MEMORY.key -> 
executorMemory.toString))
+  val defaultResource = handler.rpIdToYarnResource.get(defaultRPId)
+  val memory = defaultResource.getMemorySize
+  assert(memory == (executorMemory + 100))
+} finally {
+  sparkConf
+.remove(EXECUTOR_MIN_MEMORY_OVERHEAD)
+  sparkConf
+.remove(EXECUTOR_MIN_MEMORY_OVERHEAD)

Review Comment:
   you are removing EXECUTOR_MIN_MEMORY_OVERHEAD twice, one of these should be 
EXECUTOR_MEMORY_OVERHEAD right?



##
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:
##
@@ -706,6 +706,39 @@ class ClientSuite extends SparkFunSuite
 assert(client.getPreloadedStatCache(sparkConf.get(JARS_TO_DISTRIBUTE), 
mockFsLookup).size === 2)
   }
 
+  Seq(
+  "client",
+  "cluster"
+).foreach { case (deployMode) =>
+  test(s"SPARK-47208: minimum memory overhead is correctly set in 
($deployMode mode)") {
+val sparkConf = new SparkConf()
+  .set("spark.app.name", "foo-test-app")
+  .set(SUBMIT_DEPLOY_MODE, deployMode)
+  .set(DRIVER_MIN_MEMORY_OVERHEAD, 500L)
+val args = new ClientArguments(Array())
+
+val appContext = 
Records.newRecord(classOf[ApplicationSubmissionContext])
+val getNewApplicationResponse = 
Records.newRecord(classOf[GetNewApplicationResponse])
+val containerLaunchContext = 
Records.newRecord(classOf[ContainerLaunchContext])
+
+val client = new Client(args, sparkConf, null)
+client.createApplicationSubmissionContext(
+  new YarnClientApplication(getNewApplicationResponse, appContext),
+  containerLaunchContext)
+
+appContext.getApplicationName should be("foo-test-app")
+// flag should only work for cluster mode
+if (deployMode=="cluster") {
+  // 1Gb driver default + 500 overridden minimum default overhead
+  appContext.getResource should be(Resource.newInstance(1524L, 1))

Review Comment:
   nit space after "should be "  for readability



##
resource-managers/yarn/src/test/scala/org/apache/spa

Re: [PR] [SPARK-47208][CORE] Allow overriding base overhead memory [spark]

2024-03-13 Thread via GitHub


tgravescs commented on code in PR #45240:
URL: https://github.com/apache/spark/pull/45240#discussion_r1523337134


##
core/src/main/scala/org/apache/spark/resource/ResourceProfile.scala:
##
@@ -489,10 +489,11 @@ object ResourceProfile extends Logging {
 
   private[spark] def calculateOverHeadMemory(
   overHeadMemFromConf: Option[Long],
+  minimumOverHeadMemoryFromConf: Long,
   executorMemoryMiB: Long,
   overheadFactor: Double): Long = {
 overHeadMemFromConf.getOrElse(math.max((overheadFactor * 
executorMemoryMiB).toInt,
-ResourceProfile.MEMORY_OVERHEAD_MIN_MIB))
+  minimumOverHeadMemoryFromConf))

Review Comment:
   I agree with removing it.






Re: [PR] [SPARK-47344] Extend INVALID_IDENTIFIER error beyond catching '-' in an unquoted identifier and fix "IS ! NULL" et al. [spark]

2024-03-13 Thread via GitHub


srielau commented on code in PR #45470:
URL: https://github.com/apache/spark/pull/45470#discussion_r1523389357


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ErrorParserSuite.scala:
##
@@ -39,6 +39,23 @@ class ErrorParserSuite extends AnalysisTest {
   context = ExpectedContext(fragment = "order by q\ncluster by q", start = 
16, stop = 38))
   }
 
+  test("Illegal characters in unquoted identifier") {
+// scalastyle:off

Review Comment:
   Scalastyle complains about non-ASCII characters.
   






Re: [PR] [SPARK-47208][CORE] Allow overriding base overhead memory [spark]

2024-03-13 Thread via GitHub


jpcorreia99 commented on PR #45240:
URL: https://github.com/apache/spark/pull/45240#issuecomment-1994561966

   @tgravescs thank you for the comments, I've addressed them in 
cd03ec88bf965622c4fad3e60dc76b5a6bd78e5d





Re: [PR] [SPARK-47210][SQL][COLLATION] Implicit casting on collated expressions [spark]

2024-03-13 Thread via GitHub


dbatomic commented on PR #45383:
URL: https://github.com/apache/spark/pull/45383#issuecomment-1994569802

   > Some high-level questions:
   > 
   > 1. If a function requires certain collations but the input uses a 
different collation, shall we implicitly cast or fail?
   > 2. If a function's inputs do not use the same collation, shall we implicit 
cast or fail?
   > 3. If we cast a string with collation to integer or datetime, do we need 
to consider the collation?
   
   @cloud-fan - these are great questions and I think that they should be part 
of the spec.
   Rough answers from my side are:
   1) I think we should fail, but there are some subtle caveats here that 
should be covered in the design spec.
   2) Depends on the function: e.g., for `contains` we should fail; for 
`concat` we should succeed.
   3) No. Decimal/datetime formatting should be part of "language settings", 
which are not part of the collation track.
   
   @mihailom-db will extend the casting section of the doc. 





Re: [PR] [SPARK-47208][CORE] Allow overriding base overhead memory [spark]

2024-03-13 Thread via GitHub


tgravescs commented on PR #45240:
URL: https://github.com/apache/spark/pull/45240#issuecomment-1994608914

   Looks good, thanks @jpcorreia99 





Re: [PR] [SPARK-47344] Extend INVALID_IDENTIFIER error beyond catching '-' in an unquoted identifier and fix "IS ! NULL" et al. [spark]

2024-03-13 Thread via GitHub


MaxGekk commented on code in PR #45470:
URL: https://github.com/apache/spark/pull/45470#discussion_r1523544839


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ErrorParserSuite.scala:
##
@@ -39,6 +39,23 @@ class ErrorParserSuite extends AnalysisTest {
   context = ExpectedContext(fragment = "order by q\ncluster by q", start = 
16, stop = 38))
   }
 
+  test("Illegal characters in unquoted identifier") {
+// scalastyle:off

Review Comment:
   which one? seems all chars are ASCII.






Re: [PR] [SPARK-47344] Extend INVALID_IDENTIFIER error beyond catching '-' in an unquoted identifier and fix "IS ! NULL" et al. [spark]

2024-03-13 Thread via GitHub


MaxGekk commented on code in PR #45470:
URL: https://github.com/apache/spark/pull/45470#discussion_r1523544102


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ErrorParserSuite.scala:
##
@@ -39,6 +39,23 @@ class ErrorParserSuite extends AnalysisTest {
   context = ExpectedContext(fragment = "order by q\ncluster by q", start = 
16, stop = 38))
   }
 
+  test("Illegal characters in unquoted identifier") {
+// scalastyle:off
+checkError(

Review Comment:
   Just wondering, because you set the parser context.






Re: [PR] [SPARK-47344] Extend INVALID_IDENTIFIER error beyond catching '-' in an unquoted identifier and fix "IS ! NULL" et al. [spark]

2024-03-13 Thread via GitHub


srielau commented on code in PR #45470:
URL: https://github.com/apache/spark/pull/45470#discussion_r1523561534


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ErrorParserSuite.scala:
##
@@ -39,6 +39,23 @@ class ErrorParserSuite extends AnalysisTest {
   context = ExpectedContext(fragment = "order by q\ncluster by q", start = 
16, stop = 38))
   }
 
+  test("Illegal characters in unquoted identifier") {
+// scalastyle:off
+checkError(

Review Comment:
   I have no idea!
   When I run from the command line I get the full info:
   
   spark-sql (default)> use 表1;
   
   [INVALID_IDENTIFIER] The unquoted identifier 表1 is invalid and must be back 
quoted as: `表1`.
   Unquoted identifiers can only contain ASCII letters ('a' - 'z', 'A' - 'Z'), 
digits ('0' - '9'), and underbar ('_').
   Unquoted identifiers must also not start with a digit.
   Different data sources and meta stores may impose additional restrictions on 
valid identifiers. SQLSTATE: 42602 (line 1, pos 4)
   
   == SQL ==
   use 表1
   ^^^
   
   But when I add a context to checkError it complains:
   
   [info] - Illegal characters in unquoted identifier *** FAILED *** (12 
milliseconds)
   [info]   0 did not equal 1 Invalid length of the query context 
(SparkFunSuite.scala:365)
   [info]   org.scalatest.exceptions.TestFailedException:
   
   An investigation for another PR.






[PR] [SPARK-47327][SQL] Move sort keys concurrency test to CollationFactorySuite [spark]

2024-03-13 Thread via GitHub


stefankandic opened a new pull request, #45501:
URL: https://github.com/apache/spark/pull/45501

   
   
   ### What changes were proposed in this pull request?
   
   Move the concurrency test to the `CollationFactorySuite`
   
   ### Why are the changes needed?
   
   This is a more appropriate location for the test, as it directly uses the
`CollationFactory`.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   With existing UTs
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No





Re: [PR] [SPARK-47342][SQL] Support TimestampNTZ for DB2 TIMESTAMP WITH TIME ZONE [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun commented on PR #45471:
URL: https://github.com/apache/spark/pull/45471#issuecomment-1994934200

   Thank you for reverting, @yaooqinn and all.





Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #45500:
URL: https://github.com/apache/spark/pull/45500#discussion_r1523601523


##
core/src/main/scala/org/apache/spark/TestUtils.scala:
##
@@ -46,15 +46,16 @@ import 
org.apache.logging.log4j.core.config.builder.api.ConfigurationBuilderFact
 import org.eclipse.jetty.server.Handler
 import org.eclipse.jetty.server.Server
 import org.eclipse.jetty.server.handler.DefaultHandler
-import org.eclipse.jetty.server.handler.HandlerList
 import org.eclipse.jetty.server.handler.ResourceHandler
+import org.eclipse.jetty.util.resource.ResourceFactory
 import org.json4s.JsonAST.JValue
 import org.json4s.jackson.JsonMethods.{compact, render}
 
 import org.apache.spark.executor.TaskMetrics
 import org.apache.spark.scheduler._
 import org.apache.spark.util.{SparkTestUtils, Utils}
 
+

Review Comment:
   Please remove empty lines like this one to minimize your PR.






Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #45500:
URL: https://github.com/apache/spark/pull/45500#discussion_r1523603371


##
core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:
##
@@ -149,6 +152,7 @@ private[spark] object JettyUtils extends Logging {
 // Make sure we don't end up with "//" in the middle
 val newUrl = new URL(new URL(request.getRequestURL.toString), 
prefixedDestPath).toString
 response.sendRedirect(newUrl)
+//Response.sendRedirect(request, response, callback, location)

Review Comment:
   Please clean up this leftover.






Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #45500:
URL: https://github.com/apache/spark/pull/45500#discussion_r1523602207


##
core/src/main/scala/org/apache/spark/TestUtils.scala:
##
@@ -335,9 +336,9 @@ private[spark] object TestUtils extends SparkTestUtils {
 // 0 as port means choosing randomly from the available ports
 val server = new Server(new 
InetSocketAddress(Utils.localCanonicalHostName(), 0))
 val resHandler = new ResourceHandler()
-resHandler.setResourceBase(resBaseDir)
-val handlers = new HandlerList()
-handlers.setHandlers(Array[Handler](resHandler, new DefaultHandler()))
+
resHandler.setBaseResource(ResourceFactory.of(resHandler).newResource(resBaseDir))
+val handlers = new Handler.Sequence;

Review Comment:
   Please follow the Scala style. We don't need `;` at the end.
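   
   For reference, the cleaned-up lines might look like the following sketch; it
reuses only the Jetty 12 APIs already shown in this diff, plus the `server`
defined earlier in the method (the varargs `Handler.Sequence` constructor is an
assumption worth double-checking):
   
   ```scala
   // Sketch: same behavior as the diff above, without the trailing semicolon.
   val resHandler = new ResourceHandler()
   resHandler.setBaseResource(ResourceFactory.of(resHandler).newResource(resBaseDir))
   val handlers = new Handler.Sequence(resHandler, new DefaultHandler())
   server.setHandler(handlers)
   ```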






Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #45500:
URL: https://github.com/apache/spark/pull/45500#discussion_r1523605466


##
core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:
##
@@ -209,12 +213,12 @@ private[spark] object JettyUtils extends Logging {
 
   override def filterServerResponseHeader(
   clientRequest: HttpServletRequest,
-  serverResponse: Response,
+  serverResponse: org.eclipse.jetty.client.Response,
   headerName: String,
   headerValue: String): String = {
 if (headerName.equalsIgnoreCase("location")) {
   val newHeader = createProxyLocationHeader(headerValue, clientRequest,
-serverResponse.getRequest().getURI())
+serverResponse.getRequest.getURI)

Review Comment:
   Do we really need this change? If not, please revert all style changes of
this kind.






Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #45500:
URL: https://github.com/apache/spark/pull/45500#discussion_r1523604616


##
core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:
##
@@ -209,12 +213,12 @@ private[spark] object JettyUtils extends Logging {
 
   override def filterServerResponseHeader(
   clientRequest: HttpServletRequest,
-  serverResponse: Response,
+  serverResponse: org.eclipse.jetty.client.Response,

Review Comment:
   Please use `import` statement. If there is a conflict, you can use Scala's 
import renaming feature like `JMap` usage.






Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #45500:
URL: https://github.com/apache/spark/pull/45500#discussion_r1523607249


##
core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:
##
@@ -579,6 +575,12 @@ private[spark] case class ServerInfo(
 
 }
 
+//  private def getRedirectUrl(location: String): Unit = {
+//
+//val proxyUri = _proxyUri.stripSuffix("/")
+//
+//  }
+

Review Comment:
   Please clean up this leftover.






Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #45500:
URL: https://github.com/apache/spark/pull/45500#discussion_r1523607793


##
core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:
##
@@ -588,36 +590,37 @@ private[spark] case class ServerInfo(
  * a servlet context without the trailing slash (e.g. "/jobs") - Jetty will 
send a redirect to the
  * same URL, but with a trailing slash.
  */
-private class ProxyRedirectHandler(_proxyUri: String) extends HandlerWrapper {
+private class ProxyRedirectHandler(_proxyUri: String) extends Handler.Wrapper {
 
   private val proxyUri = _proxyUri.stripSuffix("/")
 
   override def handle(
-  target: String,
-  baseRequest: Request,
-  request: HttpServletRequest,
-  response: HttpServletResponse): Unit = {
-super.handle(target, baseRequest, request, new ResponseWrapper(request, 
response))
-  }
-
-  private class ResponseWrapper(
-  req: HttpServletRequest,
-  res: HttpServletResponse)
-extends HttpServletResponseWrapper(res) {
-
-override def sendRedirect(location: String): Unit = {
-  val newTarget = if (location != null) {
-val target = new URI(location)
-// The target path should already be encoded, so don't re-encode it, 
just the
-// proxy address part.
-val proxyBase = UIUtils.uiRoot(req)
-val proxyPrefix = if (proxyBase.nonEmpty) s"$proxyUri$proxyBase" else 
proxyUri
-s"${res.encodeURL(proxyPrefix)}${target.getPath()}"
-  } else {
-null
-  }
-  super.sendRedirect(newTarget)
-}
+  request: Request,
+  response: org.eclipse.jetty.server.Response,
+  callback: Callback): Boolean = {
+// Todo: Fix the proxy redirect behaviour.
+//super.handle(request, new ResponseWrapper(request, response), callback)
+super.handle(request, response, callback)
   }
+//
+//  private class ResponseWrapper(
+//  req: Request,
+//  res: Response)
+//extends Response.Wrapper(req, res) {
+//
+//override def sendRedirect(location: String): Unit = {
+//  val newTarget = if (location != null) {
+//val target = new URI(location)
+//// The target path should already be encoded, so don't re-encode it, 
just the
+//// proxy address part.
+//val proxyBase = UIUtils.uiRoot(req)
+//val proxyPrefix = if (proxyBase.nonEmpty) s"$proxyUri$proxyBase" 
else proxyUri
+//s"${res.encodeURL(proxyPrefix)}${target.getPath()}"
+//  } else {
+//null
+//  }
+//  super.sendRedirect(newTarget)
+//}
+//  }

Review Comment:
   Please remove this leftover.
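   
   Once the wrapper is reinstated, note that the location rewrite itself is a
pure string transformation, so a hypothetical helper (name and shape assumed,
not from the PR) could be unit-tested independently of the Jetty 12
request/response/callback plumbing:
   
   ```scala
   // Hypothetical helper mirroring the commented-out logic above: prepend the
   // proxy prefix and keep the already-encoded target path untouched.
   def rewriteRedirect(location: String, proxyUri: String, proxyBase: String): String = {
     val target = new java.net.URI(location)
     val prefix = if (proxyBase.nonEmpty) s"$proxyUri$proxyBase" else proxyUri
     s"$prefix${target.getPath}"
   }
   ```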






Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #45500:
URL: https://github.com/apache/spark/pull/45500#discussion_r1523608885


##
core/src/main/scala/org/apache/spark/util/Utils.scala:
##
@@ -2198,8 +2197,6 @@ private[spark] object Utils
   return true
 }
 isBindCollision(e.getCause)
-  case e: MultiException =>
-e.getThrowables.asScala.exists(isBindCollision)

Review Comment:
   This looks like a nice removal.






Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #45500:
URL: https://github.com/apache/spark/pull/45500#discussion_r1523611719


##
core/src/test/scala/org/apache/spark/ui/UISuite.scala:
##
@@ -398,8 +386,8 @@ class UISuite extends SparkFunSuite {
 }
   }
 
-  test("SPARK-45522: Jetty 10 and above shouuld return status code 302 with 
correct redirect url" +
-" when request URL ends with a context path without trailing '/'") {
+  test("SPARK-34449: Jetty 9.4.35.v20201120 and later no longer return status 
code 302 " +
+" and handle internally when request URL ends with a context path without 
trailing '/'") {

Review Comment:
   I'm not sure why we change this in this PR. Could you spin this off so it
can be merged before this PR?






Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #45500:
URL: https://github.com/apache/spark/pull/45500#discussion_r1523613261


##
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftBinaryCLIService.java:
##
@@ -92,9 +92,9 @@ protected void initializeServer() {
   // Server args
   int maxMessageSize = 
hiveConf.getIntVar(HiveConf.ConfVars.HIVE_SERVER2_THRIFT_MAX_MESSAGE_SIZE);
   int requestTimeout = (int) hiveConf.getTimeVar(
-  HiveConf.ConfVars.HIVE_SERVER2_THRIFT_LOGIN_TIMEOUT, 
TimeUnit.SECONDS);
+HiveConf.ConfVars.HIVE_SERVER2_THRIFT_LOGIN_TIMEOUT, 
TimeUnit.SECONDS);

Review Comment:
   This looks like a mistake. Please revert this file's changes.






Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun commented on PR #45500:
URL: https://github.com/apache/spark/pull/45500#issuecomment-1994982261

   Thank you for sharing your AS-IS status, @HiuKwok.
   
   For the following, it's a little surprising to me.
   > However in this case we no longer have to control URL rewrite, because 
this is a static method from Jetty lib.





Re: [PR] [SPARK-47378][PROTOBUF][TESTS] Make the related Protobuf UT run well in IDE [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun closed pull request #45498: [SPARK-47378][PROTOBUF][TESTS] Make 
the related Protobuf UT run well in IDE
URL: https://github.com/apache/spark/pull/45498





Re: [PR] [SPARK-47378][PROTOBUF][TESTS] Make the related Protobuf UT run well in IDE [spark]

2024-03-13 Thread via GitHub


dongjoon-hyun commented on PR #45498:
URL: https://github.com/apache/spark/pull/45498#issuecomment-1994988778

   Thank you, @panbingkun . Merged to master.





Re: [PR] [SPARK-47289][SQL] Allow extensions to log extended information in explain plan [spark]

2024-03-13 Thread via GitHub


parthchandra commented on PR #45488:
URL: https://github.com/apache/spark/pull/45488#issuecomment-1995049132

   @dongjoon-hyun GitHub Actions are enabled in my repository and the branch is
based on the latest commit in master. In my repo the CI checks are shown as
passing:
   https://github.com/apache/spark/assets/6529136/3e177782-1700-4262-a330-6def7533837d
   I've never encountered this situation in GitHub before. Any suggestions?
   





Re: [PR] [SS][SPARK-47363] Initial State without state reader implementation for State API v2. [spark]

2024-03-13 Thread via GitHub


anishshri-db commented on code in PR #45467:
URL: https://github.com/apache/spark/pull/45467#discussion_r1523663013


##
sql/api/src/main/scala/org/apache/spark/sql/streaming/StatefulProcessor.scala:
##
@@ -85,3 +85,21 @@ private[sql] trait StatefulProcessor[K, I, O] extends 
Serializable {
 statefulProcessorHandle
   }
 }
+
+/**
+ * Similar usage as StatefulProcessor. Represents the arbitrary stateful logic 
that needs to

Review Comment:
   Maybe reword this - `Stateful processor with support for specifying initial 
state. Accepts a user-defined type as initial state to be initialized in the 
first batch. This can be used for starting a new streaming query with existing 
state from a previous streaming query` ?






Re: [PR] [SS][SPARK-47363] Initial State without state reader implementation for State API v2. [spark]

2024-03-13 Thread via GitHub


anishshri-db commented on code in PR #45467:
URL: https://github.com/apache/spark/pull/45467#discussion_r1523668147


##
sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala:
##
@@ -665,7 +665,8 @@ class KeyValueGroupedDataset[K, V] private[sql](
   outputMode: OutputMode = OutputMode.Append()): Dataset[U] = {
 Dataset[U](
   sparkSession,
-  TransformWithState[K, V, U](
+  // The last K type is only to silence compiler error

Review Comment:
   Any way to avoid this?






Re: [PR] [SS][SPARK-47363] Initial State without state reader implementation for State API v2. [spark]

2024-03-13 Thread via GitHub


anishshri-db commented on code in PR #45467:
URL: https://github.com/apache/spark/pull/45467#discussion_r1523670747


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala:
##
@@ -64,31 +68,52 @@ case class TransformWithStateExec(
 eventTimeWatermarkForLateEvents: Option[Long],
 eventTimeWatermarkForEviction: Option[Long],
 child: SparkPlan,
-isStreaming: Boolean = true)
-  extends UnaryExecNode with StateStoreWriter with WatermarkSupport with 
ObjectProducerExec {
+isStreaming: Boolean = true,
+hasInitialState: Boolean = false,
+initialStateGroupingAttrs: Seq[Attribute],
+initialStateDataAttrs: Seq[Attribute],
+initialStateDeserializer: Expression,
+initialState: SparkPlan)
+  extends BinaryExecNode with StateStoreWriter with WatermarkSupport with 
ObjectProducerExec {
 
   override def shortName: String = "transformWithStateExec"
 
   // TODO: update this to run no-data batches when timer support is added
   override def shouldRunAnotherBatch(newInputWatermark: Long): Boolean = false
 
-  override protected def withNewChildInternal(
-newChild: SparkPlan): TransformWithStateExec = copy(child = newChild)
+  override def left: SparkPlan = child
+
+  override def right: SparkPlan = initialState
+
+  override protected def withNewChildrenInternal(
+  newLeft: SparkPlan, newRight: SparkPlan): TransformWithStateExec =
+copy(child = newLeft, initialState = newRight)
 
   override def keyExpressions: Seq[Attribute] = groupingAttributes
 
   protected val schemaForKeyRow: StructType = new StructType().add("key", 
BinaryType)
 
   protected val schemaForValueRow: StructType = new StructType().add("value", 
BinaryType)
 
+  /**
+   * Distribute by grouping attributes - We need the underlying data and the 
initial state data
+   * to have the same grouping so that the data are co-located on the same 
task.
+   */
   override def requiredChildDistribution: Seq[Distribution] = {
-StatefulOperatorPartitioning.getCompatibleDistribution(groupingAttributes,
-  getStateInfo, conf) ::
+StatefulOperatorPartitioning.getCompatibleDistribution(
+  groupingAttributes, getStateInfo, conf) ::
+  StatefulOperatorPartitioning.getCompatibleDistribution(

Review Comment:
   nit: indent?



##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala:
##
@@ -111,6 +136,26 @@ case class TransformWithStateExec(
 mappedIterator
   }
 
+  private def processInitialStateRows(keyRow: UnsafeRow, initStateIter: 
Iterator[InternalRow]):
+  Unit = {

Review Comment:
   indent?






Re: [PR] [SS][SPARK-47363] Initial State without state reader implementation for State API v2. [spark]

2024-03-13 Thread via GitHub


anishshri-db commented on PR #45467:
URL: https://github.com/apache/spark/pull/45467#issuecomment-1995127603

   The error on the MIMA checks seems relevant:
   ```
   problems with Sql module: 
   method 
transformWithState(org.apache.spark.sql.streaming.StatefulProcessorWithInitialState,org.apache.spark.sql.streaming.TimeoutMode,org.apache.spark.sql.streaming.OutputMode,org.apache.spark.sql.KeyValueGroupedDataset,org.apache.spark.sql.Encoder,org.apache.spark.sql.Encoder)org.apache.spark.sql.Dataset
 in class org.apache.spark.sql.KeyValueGroupedDataset does not have a 
correspondent in client version
   ```
   
   We probably need to update the Connect variants as well.








Re: [PR] [SS][SPARK-47363] Initial State without state reader implementation for State API v2. [spark]

2024-03-13 Thread via GitHub


anishshri-db commented on code in PR #45467:
URL: https://github.com/apache/spark/pull/45467#discussion_r1523675926


##
sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithStateInitialStateSuite.scala:
##
@@ -0,0 +1,268 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import org.apache.spark.sql.{Encoders, KeyValueGroupedDataset}
+import org.apache.spark.sql.execution.streaming.MemoryStream
+import 
org.apache.spark.sql.execution.streaming.state.{AlsoTestWithChangelogCheckpointingEnabled,
 RocksDBStateStoreProvider, 
StateStoreMultipleColumnFamiliesNotSupportedException}
+import org.apache.spark.sql.internal.SQLConf
+
+case class InitInputRow(key: String, action: String, value: Double)
+
+class StatefulProcessorWithInitialStateTestClass extends 
StatefulProcessorWithInitialState[
+String, InitInputRow, (String, String, Double), (String, Double)] {
+  @transient var _valState: ValueState[Double] = _
+  @transient var _listState: ListState[Double] = _
+  @transient var _mapState: MapState[Double, Int] = _
+
+  override def handleInitialState(
+  key: String,
+  initialState: (String, Double)): Unit = {
+val initStateVal = initialState._2
+_valState.update(initStateVal)

Review Comment:
   Can we simulate an actual case class for initial state that stores list/map
and/or iterator for list values/iterator for map key-values?
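   
   For illustration, a sketch of that shape (hypothetical case class; it reuses
the `_valState`/`_listState`/`_mapState` handles from the test above, and the
`appendValue`/`updateValue` method names are assumptions about the state API):
   
   ```scala
   case class InitState(value: Double, history: List[Double], counts: Map[Double, Int])
   
   def handleInitialState(key: String, init: InitState): Unit = {
     _valState.update(init.value)                  // seed the value state
     init.history.foreach(_listState.appendValue)  // seed the list state
     init.counts.foreach { case (k, v) => _mapState.updateValue(k, v) }
   }
   ```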






Re: [PR] [SPARK-47343][SQL] Fix NPE when `sqlString` variable value is null string in execute immediate [spark]

2024-03-13 Thread via GitHub


srielau commented on code in PR #45462:
URL: https://github.com/apache/spark/pull/45462#discussion_r1523678372


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -3004,6 +3004,12 @@
 ],
 "sqlState" : "2200E"
   },
+  "NULL_QUERY_STRING_EXECUTE_IMMEDIATE" : {
+"message" : [
+  "Execute immediate requires a non-null variable as the query string, but 
the provided variable  is null."
+],
+"sqlState" : "42K09"

Review Comment:
   22004 it is
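   
   For illustration, the case this error class guards against (a sketch; the
session-variable syntax is assumed from Spark's SQL variable support):
   
   ```scala
   spark.sql("DECLARE sqlString STRING")     // an unset variable defaults to NULL
   spark.sql("EXECUTE IMMEDIATE sqlString")  // should raise NULL_QUERY_STRING_EXECUTE_IMMEDIATE
   ```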






Re: [PR] [SPARK-47094][SQL] SPJ : Dynamically rebalance number of buckets when they are not equal [spark]

2024-03-13 Thread via GitHub


szehon-ho commented on code in PR #45267:
URL: https://github.com/apache/spark/pull/45267#discussion_r1523706702


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ReducibleFunction.java:
##
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.annotation.Evolving;
+import scala.Option;
+
+/**
+ * Base class for user-defined functions that can be 'reduced' on another 
function.
+ *
+ * A function f_source(x) is 'reducible' on another function f_target(x) if
+ * there exists a reducer function r(x) such that r(f_source(x)) = f_target(x) 
for all input x.
+ *
+ * @since 4.0.0
+ */
+@Evolving
+public interface ReducibleFunction extends ScalarFunction {
+
+/**
+ * If this function is 'reducible' on another function, return the {@link 
Reducer} function.
+ * @param other other function
+ * @param thisArgument argument for this function instance

Review Comment:
   Added javadocs
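   
   To make the contract concrete, a tiny self-contained sketch (illustration
only, not code from the PR): a source function x % 8 is reducible on a target
function x % 4 via the reducer v => v % 4, since 4 divides 8.
   
   ```scala
   // r(f_source(x)) == f_target(x) for all x, using modular bucketing.
   val fSource = (x: Int) => x % 8   // e.g. bucket(8, x)
   val fTarget = (x: Int) => x % 4   // e.g. bucket(4, x)
   val reducer = (v: Int) => v % 4   // the Reducer r
   assert((0 until 1000).forall(x => reducer(fSource(x)) == fTarget(x)))
   ```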



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala:
##
@@ -635,6 +636,22 @@ trait ShuffleSpec {
*/
   def createPartitioning(clustering: Seq[Expression]): Partitioning =
 throw SparkUnsupportedOperationException()
+
+  /**
+   * Return a set of [[Reducer]] for the partition expressions of this shuffle 
spec,
+   * on the partition expressions of another shuffle spec.
+   * 
+   * A [[Reducer]] exists for a partition expression function of this shuffle 
spec if it is
+   * 'reducible' on the corresponding partition expression function of the 
other shuffle spec.
+   * 
+   * If a value is returned, there must be one Option[[Reducer]] per partition 
expression.
+   * A None value in the set indicates that the particular partition 
expression is not reducible
+   * on the corresponding expression on the other shuffle spec.
+   * 
+   * Returning none also indicates that none of the partition expressions can 
be reduced on the
+   * corresponding expression on the other shuffle spec.
+   */
+  def reducers(spec: ShuffleSpec): Option[Seq[Option[Reducer[_, _]]]] = None

Review Comment:
   done



##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ReducibleFunction.java:
##
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.annotation.Evolving;
+import scala.Option;
+
+/**
+ * Base class for user-defined functions that can be 'reduced' on another 
function.
+ *
+ * A function f_source(x) is 'reducible' on another function f_target(x) if
+ * there exists a reducer function r(x) such that r(f_source(x)) = f_target(x) 
for all input x.
+ *
+ * @since 4.0.0
+ */
+@Evolving
+public interface ReducibleFunction extends ScalarFunction {

Review Comment:
   added javadocs




Re: [PR] [SPARK-47094][SQL] SPJ : Dynamically rebalance number of buckets when they are not equal [spark]

2024-03-13 Thread via GitHub


szehon-ho commented on code in PR #45267:
URL: https://github.com/apache/spark/pull/45267#discussion_r1523707057


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala:
##
@@ -829,20 +846,59 @@ case class KeyGroupedShuffleSpec(
 }
   }
 
+  override def canCreatePartitioning: Boolean = 
SQLConf.get.v2BucketingShuffleEnabled &&
+// Only support partition expressions are AttributeReference for now
+partitioning.expressions.forall(_.isInstanceOf[AttributeReference])
+
+  override def createPartitioning(clustering: Seq[Expression]): Partitioning = 
{
+KeyGroupedPartitioning(clustering, partitioning.numPartitions, 
partitioning.partitionValues)
+  }
+
+  override def reducers(other: ShuffleSpec): Option[Seq[Option[Reducer[_, _]]]] = {
+other match {
+  case otherSpec: KeyGroupedShuffleSpec =>
+val results = 
partitioning.expressions.zip(otherSpec.partitioning.expressions).map {
+  case (e1: TransformExpression, e2: TransformExpression)

Review Comment:
   done






Re: [PR] [SPARK-47094][SQL] SPJ : Dynamically rebalance number of buckets when they are not equal [spark]

2024-03-13 Thread via GitHub


szehon-ho commented on code in PR #45267:
URL: https://github.com/apache/spark/pull/45267#discussion_r1523710658


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala:
##
@@ -829,20 +846,59 @@ case class KeyGroupedShuffleSpec(
 }
   }
 
+  override def canCreatePartitioning: Boolean = 
SQLConf.get.v2BucketingShuffleEnabled &&

Review Comment:
   Done, I think I was trying to move the private method to the bottom.






Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

2024-03-13 Thread via GitHub


ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1523727676


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with 
DataTypeErrorsBase {
 assert(q.children.length == 1)
 q.children.head.output
   },
+
+  resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   I see. Let me modify the code to use #43115 and some changes of mine to see
if tryResolveDataFrameColumns can be used.






[PR] [SPARK-47375][Doc][FollowUp] Correct the preferTimestampNTZ option description in JDBC doc [spark]

2024-03-13 Thread via GitHub


gengliangwang opened a new pull request, #45502:
URL: https://github.com/apache/spark/pull/45502

   
   
   ### What changes were proposed in this pull request?
   
   Correct the preferTimestampNTZ option description in JDBC doc as per 
https://github.com/apache/spark/pull/45496
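   
   For context, the option under discussion is set on a JDBC read, e.g. (a
sketch; the connection string and table name are placeholders, and the
corrected doc text is the authority on the option's exact semantics):
   
   ```scala
   val df = spark.read.format("jdbc")
     .option("url", "jdbc:postgresql://host:5432/db")  // placeholder URL
     .option("dbtable", "events")                      // placeholder table
     .option("preferTimestampNTZ", "true")
     .load()
   ```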
   
   ### Why are the changes needed?
   
   The current doc is wrong about the JDBC option preferTimestampNTZ.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   
   Just doc change
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No





Re: [PR] [SPARK-47375][Doc][FollowUp] Correct the preferTimestampNTZ option description in JDBC doc [spark]

2024-03-13 Thread via GitHub


gengliangwang commented on PR #45502:
URL: https://github.com/apache/spark/pull/45502#issuecomment-1995432119

   cc @sadikovi 





Re: [PR] [SS][SPARK-47363] Initial State without state reader implementation for State API v2. [spark]

2024-03-13 Thread via GitHub


jingz-db commented on code in PR #45467:
URL: https://github.com/apache/spark/pull/45467#discussion_r1523797687


##
sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithStateInitialStateSuite.scala:
##
@@ -0,0 +1,268 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import org.apache.spark.sql.{Encoders, KeyValueGroupedDataset}
+import org.apache.spark.sql.execution.streaming.MemoryStream
+import 
org.apache.spark.sql.execution.streaming.state.{AlsoTestWithChangelogCheckpointingEnabled,
 RocksDBStateStoreProvider, 
StateStoreMultipleColumnFamiliesNotSupportedException}
+import org.apache.spark.sql.internal.SQLConf
+
+case class InitInputRow(key: String, action: String, value: Double)
+
+class StatefulProcessorWithInitialStateTestClass extends 
StatefulProcessorWithInitialState[
+String, InitInputRow, (String, String, Double), (String, Double)] {
+  @transient var _valState: ValueState[Double] = _
+  @transient var _listState: ListState[Double] = _
+  @transient var _mapState: MapState[Double, Int] = _
+
+  override def handleInitialState(
+  key: String,
+  initialState: (String, Double)): Unit = {
+val initStateVal = initialState._2
+_valState.update(initStateVal)

Review Comment:
   Not sure if I understand you correctly, do you mean we should have a test
case where `initialState` is a case class, and inside `handleInitialState`, we
update the values of the listState/mapState variables?






Re: [PR] [SPARK-47367][PYTHON][CONNECT] Support Python data sources with Spark Connect [spark]

2024-03-13 Thread via GitHub


ueshin commented on code in PR #45486:
URL: https://github.com/apache/spark/pull/45486#discussion_r1523802659


##
python/pyspark/sql/tests/connect/test_parity_python_datasource.py:
##


Review Comment:
   This module needs to be listed in `dev/sparktestsupport/modules.py`.



##
python/pyspark/sql/connect/plan.py:
##
@@ -2408,6 +2408,46 @@ def __repr__(self) -> str:
 return f"{self._function_name}({', '.join([str(arg) for arg in 
self._arguments])})"
 
 
+class PythonDataSource:
+    """Represents a user-defined Python data source."""
+    def __init__(self, data_source: Type, python_ver: str):
+        self._data_source = data_source
+        self._python_ver = python_ver
+
+    def to_plan(self, session: "SparkConnectClient") -> proto.PythonDataSource:
+        ds = proto.PythonDataSource()
+        ds.command = CloudPickleSerializer().dumps(self._data_source)
+        ds.python_ver = self._python_ver
+        return ds
+
+
+class CommonInlineUserDefinedDataSource(LogicalPlan):
+    """Logical plan object for a user-defined data source"""
+
+    def __init__(self, name: str, data_source: PythonDataSource) -> None:
+        super().__init__(None)
+        self._name = name
+        self._data_source = data_source
+
+    def plan(self, session: "SparkConnectClient") -> proto.Relation:
+        plan = self._create_proto_relation()
+        plan.common_inline_user_defined_data_source.name = self._name
+        plan.common_inline_user_defined_data_source.python_data_source.CopyFrom(
+            self._data_source.to_plan(session)
+        )
+        return plan
+
+    def to_data_source_proto(
+        self, session: "SparkConnectClient"
+    ) -> "proto.CommonInlineUserDefinedDataSource":
+        plan = proto.CommonInlineUserDefinedDataSource()
+        plan.name = self._name
+        plan.python_data_source.CopyFrom(
+            cast(proto.PythonDataSource, self._data_source.to_plan(session))  # type: ignore[arg-type]

Review Comment:
   Do we need `cast` here?






Re: [PR] [SPARK-47273][SS][PYTHON] implement Python data stream writer interface. [spark]

2024-03-13 Thread via GitHub


chaoqin-li1123 commented on PR #45305:
URL: https://github.com/apache/spark/pull/45305#issuecomment-1995481769

   PTAL, thanks! @HeartSaVioR @HyukjinKwon @allisonwang-db @sahnib





Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

2024-03-13 Thread via GitHub


ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1523844585


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with 
DataTypeErrorsBase {
 assert(q.children.length == 1)
 q.children.head.output
   },
+
+  resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   @cloud-fan @peter-toth I applied the patch #43115 and verified that in
df.select, the projection is added with an Alias having UnresolvedAttribute.
There are many test failures in DataFrameSelfJoinSuite with that change. So at
least with the current Spark code base, patch #43115 is not sufficient to fix
the newly encountered bugs, nor do the existing tests pass. If planId has to be
used so that the current code of tryResolveDataFrameColumns can fix the
problem, then at least it will need modification & debugging.






[PR] [SPARK-47372] Add support for range scan based key encoder for stateful streaming using state provider [spark]

2024-03-13 Thread via GitHub


anishshri-db opened a new pull request, #45503:
URL: https://github.com/apache/spark/pull/45503

   ### What changes were proposed in this pull request?
   Add support for range scan based key encoder for stateful streaming using 
state provider
   
   
   ### Why are the changes needed?
   Changes are needed to allow range scans over fixed-size initial columns,
especially with the RocksDB state store provider.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Added unit tests
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   





Re: [PR] [SPARK-47344] Extend INVALID_IDENTIFIER error beyond catching '-' in an unquoted identifier and fix "IS ! NULL" et al. [spark]

2024-03-13 Thread via GitHub


gengliangwang commented on PR #45470:
URL: https://github.com/apache/spark/pull/45470#issuecomment-1995659868

   Thanks, merging to master





Re: [PR] [SPARK-47344] Extend INVALID_IDENTIFIER error beyond catching '-' in an unquoted identifier and fix "IS ! NULL" et al. [spark]

2024-03-13 Thread via GitHub


gengliangwang closed pull request #45470: [SPARK-47344] Extend 
INVALID_IDENTIFIER error beyond catching '-' in an unquoted identifier and fix 
"IS ! NULL" et al.
URL: https://github.com/apache/spark/pull/45470





Re: [PR] [SPARK-46913][SS] Add support for processing/event time based timers with transformWithState operator [spark]

2024-03-13 Thread via GitHub


HeartSaVioR commented on PR #45051:
URL: https://github.com/apache/spark/pull/45051#issuecomment-1995685935

   Thanks! Merging to master.





Re: [PR] [SPARK-46913][SS] Add support for processing/event time based timers with transformWithState operator [spark]

2024-03-13 Thread via GitHub


HeartSaVioR closed pull request #45051: [SPARK-46913][SS] Add support for 
processing/event time based timers with transformWithState operator
URL: https://github.com/apache/spark/pull/45051





[PR] [SPARK-47383][CORE] Make the shutdown hook timeout configurable [spark]

2024-03-13 Thread via GitHub


robreeves opened a new pull request, #45504:
URL: https://github.com/apache/spark/pull/45504

   
   
   ### What changes were proposed in this pull request?
   Make the shutdown hook timeout configurable. If this is not defined, it falls
back to the existing behavior, which uses a default timeout of 30 seconds, or
whatever is defined in core-site.xml for the hadoop.service.shutdown.timeout
property.
   
   ### Why are the changes needed?
   Spark sometimes times out during the shutdown process. This results in any
data left in the queues being dropped and causes metadata loss (e.g. event
logs, anything written by custom listeners).
   
   This is not easily configurable before this change. The underlying
`org.apache.hadoop.util.ShutdownHookManager` has a default timeout of 30
seconds. It can be configured by setting hadoop.service.shutdown.timeout, but
this must be done in core-site.xml/core-default.xml because a new Hadoop conf
object is created and there is no opportunity to modify it.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, a new config `spark.shutdown.timeout` is added.
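   
   For illustration, the new knob would be set like any other Spark conf (a
sketch; the accepted value format is an assumption):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder()
     .config("spark.shutdown.timeout", "60s")  // assumed time-string format
     .getOrCreate()
   ```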
   
   
   ### How was this patch tested?
   Manual testing in spark-shell. This is not easily unit testable.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   





Re: [PR] [SPARK-47346][PYTHON] Make daemon mode configurable when creating Python planner workers [spark]

2024-03-13 Thread via GitHub


ueshin commented on PR #45468:
URL: https://github.com/apache/spark/pull/45468#issuecomment-1995996043

   @allisonwang-db Seems like the new test is still failing:
   
   ```
   [info] - SPARK-47346: cannot create Python worker with different useDaemon 
flag *** FAILED *** (34 milliseconds)
   [info]   org.apache.spark.SparkException: Error from python worker:
   [info]   /usr/bin/python3: Error while finding module specification for 
'pyspark.daemon' (ModuleNotFoundError: No module named 'pyspark')
   [info] PYTHONPATH was:
   [info]   
/home/runner/work/spark/spark/core/target/scala-2.13/spark-core_2.13-4.0.0-SNAPSHOT.jar
   [info] org.apache.spark.SparkException: EOFException occurred while reading 
the port number from pyspark.daemon's stdout.
   [info]   at 
org.apache.spark.errors.SparkCoreErrors$.eofExceptionWhileReadPortNumberError(SparkCoreErrors.scala:55)
   [info]   at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:264)
   [info]   at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:138)
   [info]   at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:104)
   [info]   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:156)
   [info]   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:166)
   [info]   at 
org.apache.spark.api.python.PythonWorkerFactorySuite.$anonfun$new$4(PythonWorkerFactorySuite.scala:64)
   ```
   
   We may need to set up a working Python env for this test somehow?
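   
   One possible fix, just a sketch assuming this suite can mirror what other 
Python-dependent suites do: extend PYTHONPATH with the in-repo PySpark sources 
before the worker is spawned (the py4j zip version below is a placeholder):
   
   ```scala
   import java.io.File
   
   import org.apache.spark.SparkConf
   
   // Sketch: make `pyspark.daemon` importable by pointing PYTHONPATH at the
   // in-repo PySpark sources; the py4j zip version is illustrative only.
   def withPySparkPath(conf: SparkConf): SparkConf = {
     val sparkHome = sys.props.getOrElse("spark.test.home", sys.env("SPARK_HOME"))
     val pythonPath = Seq(
       s"$sparkHome/python",
       s"$sparkHome/python/lib/py4j-0.10.9.7-src.zip",
       sys.env.getOrElse("PYTHONPATH", "")
     ).filter(_.nonEmpty).mkString(File.pathSeparator)
     conf.setExecutorEnv("PYTHONPATH", pythonPath)
   }
   ```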


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47290][SQL] Extend CustomTaskMetric to allow metric values from multiple sources [spark]

2024-03-13 Thread via GitHub


parthchandra opened a new pull request, #45505:
URL: https://github.com/apache/spark/pull/45505

   ### What changes were proposed in this pull request?
   Provides a new interface `CustomFileTaskMetric` that extends 
`CustomTaskMetric` and allows updating of metric values.
   
   ### Why are the changes needed?
   The current interface for providing custom metrics does not work for adding 
file-based metrics in the Parquet reader, where a single `FilePartitionReader` 
may need to collect metrics from multiple Parquet file readers.
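   
   A rough sketch of what such an interface could look like (the name 
`CustomFileTaskMetric` comes from this PR's description; the accumulation shape 
below is an assumption, not the final API):
   
   ```scala
   import org.apache.spark.sql.connector.metric.CustomTaskMetric
   
   // Assumed shape: a task metric whose value can be accumulated across the
   // multiple file readers that a single FilePartitionReader drives.
   trait CustomFileTaskMetric extends CustomTaskMetric {
     private var accumulated: Long = 0L
   
     // Called once per underlying file reader with that reader's contribution.
     def update(delta: Long): Unit = accumulated += delta
   
     override def value(): Long = accumulated
   }
   ```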
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   This just adds the interface. The implementation and tests will be done 
in a follow-up PR that addresses 
https://issues.apache.org/jira/browse/SPARK-47291
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47217][SQL] Fix ambiguity check in self joins [spark]

2024-03-13 Thread via GitHub


ahshahid closed pull request #45343: [SPARK-47217][SQL] Fix ambiguity check in 
self joins
URL: https://github.com/apache/spark/pull/45343


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47217][SQL] Fix ambiguity check in self joins [spark]

2024-03-13 Thread via GitHub


ahshahid commented on PR #45343:
URL: https://github.com/apache/spark/pull/45343#issuecomment-1996044845

   PR #45446 handles the issue comprehensively, and this PR was a subset of 
the changes contained in it, so I am closing this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-13 Thread via GitHub


wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1996052736

   Hi @grundprinzip, I would be grateful if you could kindly take another look 
at this PR. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SS] Allow chaining other stateful operators after transformWithState operator. [spark]

2024-03-13 Thread via GitHub


sahnib commented on code in PR #45376:
URL: https://github.com/apache/spark/pull/45376#discussion_r1524024974


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/EventTimeWatermark.scala:
##
@@ -72,3 +74,32 @@ case class EventTimeWatermark(
   override protected def withNewChildInternal(newChild: LogicalPlan): 
EventTimeWatermark =
 copy(child = newChild)
 }
+
+case class UpdateEventTimeWatermarkColumn(
+eventTime: Attribute,
+delay: CalendarInterval,
+child: LogicalPlan) extends UnaryNode {
+  override def output: Seq[Attribute] = child.output.map { a =>

Review Comment:
   Yep, done.
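   
   For context, the metadata update in the existing `EventTimeWatermark.output` 
follows this shape, which the new `UpdateEventTimeWatermarkColumn` node 
presumably mirrors (a sketch of the method body within the operator's file, 
based on the current operator rather than this PR's exact diff):
   
   ```scala
   override def output: Seq[Attribute] = child.output.map { a =>
     if (a semanticEquals eventTime) {
       // Stamp the event-time column with the watermark delay in milliseconds.
       val updatedMetadata = new MetadataBuilder()
         .withMetadata(a.metadata)
         .putLong(EventTimeWatermark.delayKey, EventTimeWatermark.getDelayMs(delay))
         .build()
       a.withMetadata(updatedMetadata)
     } else {
       a
     }
   }
   ```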



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

2024-03-13 Thread via GitHub


ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1524029007


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with 
DataTypeErrorsBase {
 assert(q.children.length == 1)
 q.children.head.output
   },
+
+  resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   > the purpose of #43115 is to use plan id for both Spark Connect and classic 
Spark. That said, no more already-resolved `AttributeReference`
   
   Further checking PR #43115, I see that it substitutes the Dataset ID 
for the plan ID on the attributes. That is, I suppose, fine, though redundant, 
because the metadata already contains the dataset ID. However, the main issue 
with the current PR #43115, which I think makes it untestable, is that for 
resolution using the LogicalPlan ID, the LogicalPlan should contain the PLAN_ID 
tag, which is not being set because it is not a Spark Connect plan. 
   And I am not well versed enough in setting the plan ID on usual Spark 
LogicalPlans for testing purposes.
   But I will see if I can reconcile my code with the logic used in 
tryResolveDataFrameColumns.
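   
   (For what it's worth, a sketch of how a plan id might be attached to a 
classic plan in a test, assuming `LogicalPlan.PLAN_ID_TAG` is the same 
tree-node tag that Spark Connect sets:)
   
   ```scala
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
   
   // `df` stands for some DataFrame under test (hypothetical).
   val plan = df.queryExecution.logical
   // Tag the plan the way Spark Connect would, so plan-id-based column
   // resolution can be exercised without a Connect session.
   plan.setTagValue(LogicalPlan.PLAN_ID_TAG, 1L)
   ```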



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-41762][PYTHON][CONNECT][TESTS] Enable column name comparsion in `test_column_arithmetic_ops` [spark]

2024-03-13 Thread via GitHub


zhengruifeng closed pull request #45493: [SPARK-41762][PYTHON][CONNECT][TESTS] 
Enable column name comparsion in `test_column_arithmetic_ops`
URL: https://github.com/apache/spark/pull/45493


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-41762][PYTHON][CONNECT][TESTS] Enable column name comparsion in `test_column_arithmetic_ops` [spark]

2024-03-13 Thread via GitHub


zhengruifeng commented on PR #45493:
URL: https://github.com/apache/spark/pull/45493#issuecomment-1996169050

   merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-45827] Refactor supportsDataType [spark]

2024-03-13 Thread via GitHub


cashmand opened a new pull request, #45506:
URL: https://github.com/apache/spark/pull/45506

   
   
   ### What changes were proposed in this pull request?
   
   This is a follow-up to https://github.com/apache/spark/pull/45409. It 
refactors `supportsDataType` to use a non-recursive `supportsScalarDataType` 
method. This allows subclasses that just want to include or exclude one or two 
types to override with something like the following, rather than having to make 
a full copy of the recursive `supportsDataType` function, and risk getting out 
of sync as other types are changed in the base trait.
   
   ```
   override def supportsScalarDataType(dt: DataType): Boolean = {
     dt.isInstanceOf[VariantType] || super.supportsScalarDataType(dt)
   }
   ```
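   
   The base trait's recursive method can then be written once in terms of the 
scalar hook, roughly like this sketch (the final code in this PR may differ):
   
   ```scala
   def supportsDataType(dt: DataType): Boolean = dt match {
     case ArrayType(elementType, _) => supportsDataType(elementType)
     case MapType(keyType, valueType, _) =>
       supportsDataType(keyType) && supportsDataType(valueType)
     case StructType(fields) => fields.forall(f => supportsDataType(f.dataType))
     case udt: UserDefinedType[_] => supportsDataType(udt.sqlType)
     case other => supportsScalarDataType(other)
   }
   ```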
   
   ### Why are the changes needed?
   Allows subclasses to override the list of supported types with less 
duplicated code, and less risk of bugs.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-23015][WINDOWS] Mitigate bug in Windows where starting multiple Spark instances within the same second causes a failure [spark]

2024-03-13 Thread via GitHub


github-actions[bot] commented on PR #43706:
URL: https://github.com/apache/spark/pull/43706#issuecomment-1996172425

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45827] Refactor supportsDataType [spark]

2024-03-13 Thread via GitHub


cloud-fan commented on code in PR #45506:
URL: https://github.com/apache/spark/pull/45506#discussion_r1524065273


##
sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala:
##
@@ -190,10 +210,7 @@ trait CreatableRelationProvider {
   case MapType(k, v, _) => supportsDataType(k) && supportsDataType(v)
   case StructType(fields) => fields.forall(f => 
supportsDataType(f.dataType))
   case udt: UserDefinedType[_] => supportsDataType(udt.sqlType)

Review Comment:
   This case match can be removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-23015][WINDOWS] Mitigate bug in Windows where starting multiple Spark instances within the same second causes a failure [spark]

2024-03-13 Thread via GitHub


panbingkun commented on PR #43706:
URL: https://github.com/apache/spark/pull/43706#issuecomment-1996194002

   I'm very sorry, I lost track of it. I will verify it on a Windows machine 
this weekend.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47384][BUILD] Upgrade RoaringBitmap to 1.0.5 [spark]

2024-03-13 Thread via GitHub


panbingkun opened a new pull request, #45507:
URL: https://github.com/apache/spark/pull/45507

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47338][SQL] Introduce `_LEGACY_ERROR_UNKNOWN` for default error class [spark]

2024-03-13 Thread via GitHub


itholic commented on code in PR #45457:
URL: https://github.com/apache/spark/pull/45457#discussion_r1524079365


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -7902,6 +7902,11 @@
   "Doesn't support month or year interval: "
 ]
   },
+  "_LEGACY_ERROR_UNKNOWN" : {

Review Comment:
   Sounds reasonable to me. Let me address it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


