Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]

2024-05-20 Thread via GitHub


zhengruifeng commented on code in PR #46529:
URL: https://github.com/apache/spark/pull/46529#discussion_r1607672232


##
python/pyspark/sql/context.py:
##
@@ -46,6 +46,7 @@
 
 if TYPE_CHECKING:
     from py4j.java_gateway import JavaObject
+    import pyarrow as pa

Review Comment:
   I think we can not assume `pyarrow` is installed by default in Spark Classic
   
   
https://spark.apache.org/docs/latest/api/python/getting_started/install.html#dependencies
   
   @HyukjinKwon 
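   
   A minimal sketch (ours, not the PR's) of how an optional dependency is usually kept off the hard import path: type-only imports stay under `TYPE_CHECKING`, and runtime use goes through a guarded check. The helper names below are illustrative.
   
   ```python
   from typing import TYPE_CHECKING
   
   if TYPE_CHECKING:
       # only evaluated by type checkers, so pyarrow need not be installed at runtime
       import pyarrow as pa
   
   
   def _require_pyarrow() -> None:
       # hypothetical guard: fail with a clear error only when the pyarrow path is used
       try:
           import pyarrow  # noqa: F401
       except ImportError as e:
           raise ImportError(
               "createDataFrame with a pyarrow Table requires the optional "
               "pyarrow dependency to be installed"
           ) from e
   
   
   def create_from_arrow(table: "pa.Table") -> None:
       _require_pyarrow()
       ...  # conversion logic would go here
   ```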



##
python/pyspark/sql/pandas/types.py:
##
@@ -162,6 +243,8 @@ def from_arrow_type(at: "pa.DataType", prefer_timestamp_ntz: bool = False) -> Da
         spark_type = StringType()
     elif types.is_binary(at):
         spark_type = BinaryType()
+    elif types.is_fixed_size_binary(at):
+        spark_type = BinaryType()

Review Comment:
   is it related?
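   
   For context on why the new branch could matter when accepting `pa.Table` input: in PyArrow, a fixed-size binary type is distinct from the variable-length binary type, so `types.is_binary` does not match it. A quick check, assuming `pyarrow` is installed:
   
   ```python
   import pyarrow as pa
   import pyarrow.types as types
   
   t = pa.binary(16)                      # fixed_size_binary(16), e.g. UUID-like columns
   print(types.is_binary(t))              # False: not the variable-length binary type
   print(types.is_fixed_size_binary(t))   # True: only the new branch would map it to BinaryType
   ```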



##
python/pyspark/sql/context.py:
##
@@ -343,28 +344,29 @@ def createDataFrame(
 
     @overload
     def createDataFrame(
-        self, data: "PandasDataFrameLike", samplingRatio: Optional[float] = ...
+        self, data: Union["PandasDataFrameLike", "pa.Table"], samplingRatio: Optional[float] = ...

Review Comment:
   can we include `pa.Table` in `PandasDataFrameLike`?
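   
   A hypothetical sketch of what that suggestion could look like: widen a single type alias once instead of repeating the `Union` in every overload. The alias name below is illustrative, not PySpark's actual definition.
   
   ```python
   from typing import TYPE_CHECKING, Union
   
   if TYPE_CHECKING:
       import pandas as pd
       import pyarrow as pa
   
       # one alias used by all createDataFrame overloads (name is illustrative)
       DataFrameLike = Union["pd.DataFrame", "pa.Table"]
   ```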



##
python/pyspark/sql/connect/session.py:
##
@@ -555,11 +575,21 @@ def createDataFrame(
 ]
 )
 
-if isinstance(schema, StructType):
-assert arrow_schema is not None
-_table = _table.rename_columns(
-cast(StructType, _deduplicate_field_names(schema)).names
-).cast(arrow_schema)
+elif isinstance(data, pa.Table):

Review Comment:
   Are above changes in this file related to this PR?



##
python/pyspark/sql/pandas/types.py:
##
@@ -174,6 +257,18 @@ def from_arrow_type(at: "pa.DataType", prefer_timestamp_ntz: bool = False) -> Da
         spark_type = DayTimeIntervalType()
     elif types.is_list(at):
         spark_type = ArrayType(from_arrow_type(at.value_type, prefer_timestamp_ntz))
+    elif types.is_fixed_size_list(at):
+        import pyarrow as pa
+
+        if LooseVersion(pa.__version__) < LooseVersion("14.0.0"):

Review Comment:
   we'd better support version >= 4.0.0
   
   
https://spark.apache.org/docs/latest/api/python/getting_started/install.html#dependencies
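   
   A small sketch of the kind of version gate being discussed, using the same `LooseVersion` pattern as the hunk above; the `4.0.0` floor comes from the reviewer's comment, not from the final code.
   
   ```python
   from distutils.version import LooseVersion  # the pattern the surrounding code already uses
   
   import pyarrow as pa
   
   MIN_PYARROW = "4.0.0"  # assumed documented minimum, per the install guide linked above
   
   if LooseVersion(pa.__version__) < LooseVersion(MIN_PYARROW):
       raise ImportError(f"PyArrow >= {MIN_PYARROW} is required, found {pa.__version__}")
   ```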






Re: [PR] [SPARK-48329][SQL] SPJ: Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true [spark]

2024-05-20 Thread via GitHub


szehon-ho commented on PR #46673:
URL: https://github.com/apache/spark/pull/46673#issuecomment-2121805760

   Thanks!  @dongjoon-hyun @sunchao  can you take another look?





Re: [PR] [SPARK-19426][SQL] Custom coalescer for Dataset [spark]

2024-05-20 Thread via GitHub


SubhamSinghal commented on PR #46541:
URL: https://github.com/apache/spark/pull/46541#issuecomment-2121803074

   @hvanhovell will you be able to add a review here or tag relevant folks?





Re: [PR] [SPARK-48300][SQL] Codegen Support for `from_xml` [spark]

2024-05-20 Thread via GitHub


yaooqinn commented on PR #46609:
URL: https://github.com/apache/spark/pull/46609#issuecomment-2121794324

   Merged to master.
   
   Thank you @panbingkun 





Re: [PR] [SPARK-48300][SQL] Codegen Support for `from_xml` [spark]

2024-05-20 Thread via GitHub


yaooqinn closed pull request #46609: [SPARK-48300][SQL] Codegen Support for 
`from_xml`
URL: https://github.com/apache/spark/pull/46609





Re: [PR] [MINOR][DOCS] correct the doc error in configuration page (fix rest to reset) [spark]

2024-05-20 Thread via GitHub


yaooqinn closed pull request #46663: [MINOR][DOCS] correct the doc error in 
configuration page (fix rest to reset)
URL: https://github.com/apache/spark/pull/46663





Re: [PR] [MINOR][DOCS] correct the doc error in configuration page (fix rest to reset) [spark]

2024-05-20 Thread via GitHub


yaooqinn commented on PR #46663:
URL: https://github.com/apache/spark/pull/46663#issuecomment-2121790028

   Thank you @Justontheway @HyukjinKwon, merged to master





Re: [PR] [MINOR][TESTS] Rename test_union to test_eqnullsafe at ColumnTestsMixin [spark]

2024-05-20 Thread via GitHub


HyukjinKwon commented on PR #46675:
URL: https://github.com/apache/spark/pull/46675#issuecomment-2121771596

   Merged to master.





Re: [PR] [MINOR][TESTS] Rename test_union to test_eqnullsafe at ColumnTestsMixin [spark]

2024-05-20 Thread via GitHub


HyukjinKwon closed pull request #46675: [MINOR][TESTS] Rename test_union to 
test_eqnullsafe at ColumnTestsMixin
URL: https://github.com/apache/spark/pull/46675





Re: [PR] [SPARK-48359][SQL] Built-in functions for Zstd compression and decompression [spark]

2024-05-20 Thread via GitHub


yaooqinn commented on PR #46672:
URL: https://github.com/apache/spark/pull/46672#issuecomment-2121735788

   Instead of adding (de)compression functions for each codec, how about adding generic `compress` and `decompress` functions directly, like:
   - https://dev.mysql.com/doc/refman/8.0/en/encryption-functions.html#function_compress
   - https://learn.microsoft.com/en-us/sql/t-sql/functions/compress-transact-sql?view=sql-server-ver16
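   
   To illustrate the shape of that suggestion (one generic pair of functions with the codec as an argument, rather than one function per codec), here is a small local Python sketch using only the standard library; it is not Spark's API.
   
   ```python
   import bz2
   import lzma
   import zlib
   
   # codec name -> (compress, decompress); the set of codecs is illustrative
   CODECS = {
       "zlib": (zlib.compress, zlib.decompress),
       "bz2": (bz2.compress, bz2.decompress),
       "lzma": (lzma.compress, lzma.decompress),
   }
   
   
   def compress(data: bytes, codec: str = "zlib") -> bytes:
       return CODECS[codec][0](data)
   
   
   def decompress(data: bytes, codec: str = "zlib") -> bytes:
       return CODECS[codec][1](data)
   
   
   assert decompress(compress(b"hello", "bz2"), "bz2") == b"hello"
   ```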





Re: [PR] [CONNECT] Allow plugins to use QueryTest in their tests [spark]

2024-05-20 Thread via GitHub


zhengruifeng commented on PR #46667:
URL: https://github.com/apache/spark/pull/46667#issuecomment-2121734784

   the failed tests seem related:
   ```
   [error] 
/home/runner/work/spark/spark/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/CatalogSuite.scala:29:28:
 illegal inheritance;
   [error]  self-type org.apache.spark.sql.CatalogSuite does not conform to 
org.apache.spark.sql.test.RemoteSparkSession's selftype 
org.apache.spark.sql.test.RemoteSparkSession with org.scalatest.Suite
   [error] class CatalogSuite extends RemoteSparkSession with SQLHelper {
   [error]^
   [error] 
/home/runner/work/spark/spark/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/CatalogSuite.scala:31:3:
 package org.apache.spark.sql.test is not a value
   ```





[PR] [MINOR][TESTS] Rename test_union to test_eqnullsafe at ColumnTestsMixin [spark]

2024-05-20 Thread via GitHub


HyukjinKwon opened a new pull request, #46675:
URL: https://github.com/apache/spark/pull/46675

   ### What changes were proposed in this pull request?
   
   This PR proposes to rename `test_union` to `test_eqnullsafe` at 
`ColumnTestsMixin`.
   
   ### Why are the changes needed?
   
   To avoid confusion from the test name.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only
   
   ### How was this patch tested?
   
   CI in this PR.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





Re: [PR] [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status [spark]

2024-05-20 Thread via GitHub


AngersZh commented on code in PR #36564:
URL: https://github.com/apache/spark/pull/36564#discussion_r1607575927


##
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala:
##
@@ -155,9 +158,9 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
     val taskId = TaskIdentifier(stageAttempt, attemptNumber)
     stageState.failures.getOrElseUpdate(partition, mutable.Set()) += taskId
     if (stageState.authorizedCommitters(partition) == taskId) {
-      logDebug(s"Authorized committer (attemptNumber=$attemptNumber, stage=$stage, " +
-        s"partition=$partition) failed; clearing lock")
-      stageState.authorizedCommitters(partition) = null
+      sc.foreach(_.dagScheduler.stageFailed(stage, s"Authorized committer " +
+        s"(attemptNumber=$attemptNumber, stage=$stage, partition=$partition) failed; " +
+        s"but task commit success, data duplication may happen."))
     }

Review Comment:
   With https://github.com/apache/spark/pull/38980, it seems we don't need this patch anymore.






Re: [PR] [SPARK-48329][SQL] SPJ: Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true [spark]

2024-05-20 Thread via GitHub


superdiaodiao commented on PR #46673:
URL: https://github.com/apache/spark/pull/46673#issuecomment-2121646393

   > Thanks @dongjoon-hyun, sorry the PR was not ready. I was trying to integrate the changes from @superdiaodiao, who I saw also made a PR for the same, so we can be co-authors. Reverted the additional config change and test case; will check the test result.
   
   I will close my PR and you can continue; let's co-author this time.





Re: [PR] [SPARK-48337][SQL] Fix precision loss for JDBC TIME values [spark]

2024-05-20 Thread via GitHub


yaooqinn commented on PR #46662:
URL: https://github.com/apache/spark/pull/46662#issuecomment-2121637292

   Merged to master. Thank you @dongjoon-hyun @LuciferYang 





Re: [PR] [SPARK-48337][SQL] Fix precision loss for JDBC TIME values [spark]

2024-05-20 Thread via GitHub


yaooqinn closed pull request #46662: [SPARK-48337][SQL] Fix precision loss for 
JDBC TIME values
URL: https://github.com/apache/spark/pull/46662





Re: [PR] [SPARK-48320][CORE][DOCS] Add external third-party ecosystem access guide to the doc [spark]

2024-05-20 Thread via GitHub


panbingkun commented on code in PR #46634:
URL: https://github.com/apache/spark/pull/46634#discussion_r1607534427


##
common/utils/src/main/scala/org/apache/spark/internal/README.md:
##
@@ -45,3 +45,29 @@ logger.error("Failed to abort the writer after failing to write map output.", e)
 ## Exceptions
 
 To ensure logs are compatible with Spark SQL and log analysis tools, avoid `Exception.printStackTrace()`. Use `logError`, `logWarning`, and `logInfo` methods from the `Logging` trait to log exceptions, maintaining structured and parsable logs.
+
+## External third-party ecosystem access
+
+* If you want to output logs in `scala code` through the structured log framework, you can define `custom LogKey` and use it in `scala` code as follows:
+
+```scala
+// External third-party ecosystem `custom LogKey` must be `extends LogKey`
+case object CUSTOM_LOG_KEY extends LogKey
+```
+```scala
+import org.apache.spark.internal.MDC;
+
+logInfo(log"${MDC(CUSTOM_LOG_KEY, "key")}")
+```
+
+* If you want to output logs in `java code` through the structured log framework, you can define `custom LogKey` and use it in `java` code as follows:
+
+```java
+// External third-party ecosystem `custom LogKey` must be `implements LogKey`
+public static class CUSTOM_LOG_KEY implements LogKey { }
+```
+```java
+import org.apache.spark.internal.MDC;
+
+logger.error("Unable to delete key {} for cache", MDC.of(CUSTOM_LOG_KEY, "key"));
+```

Review Comment:
   Well, I will remove the above `README.md` file and write the above guide to 
the `scala/java` doc.






Re: [PR] [SPARK-48337][SQL] Fix precision loss for JDBC TIME values [spark]

2024-05-20 Thread via GitHub


dongjoon-hyun commented on PR #46662:
URL: https://github.com/apache/spark/pull/46662#issuecomment-2121629155

   Thank you, @yaooqinn .





Re: [PR] [SPARK-48337][SQL] Fix precision loss for JDBC TIME values [spark]

2024-05-20 Thread via GitHub


yaooqinn commented on PR #46662:
URL: https://github.com/apache/spark/pull/46662#issuecomment-2121627559

   cc @cloud-fan @LuciferYang @dongjoon-hyun thanks 





Re: [PR] [SPARK-47233][CONNECT][SS][2/2] Client & Server logic for Client side streaming query listener [spark]

2024-05-20 Thread via GitHub


HyukjinKwon commented on code in PR #46037:
URL: https://github.com/apache/spark/pull/46037#discussion_r1607514868


##
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##
@@ -65,8 +66,140 @@ def onQueryTerminated(self, event):
         df.write.mode("append").saveAsTable("listener_terminated_events_v2")
 
 
+class TestListenerLocal(StreamingQueryListener):
+    def __init__(self):
+        self.start = []
+        self.progress = []
+        self.terminated = []
+
+    def onQueryStarted(self, event):
+        self.start.append(event)
+
+    def onQueryProgress(self, event):
+        self.progress.append(event)
+
+    def onQueryIdle(self, event):
+        pass
+
+    def onQueryTerminated(self, event):
+        self.terminated.append(event)
+
+
 class StreamingListenerParityTests(StreamingListenerTestsMixin, ReusedConnectTestCase):
-    def test_listener_events(self):
+    def test_listener_management(self):
+        listener1 = TestListenerLocal()
+        listener2 = TestListenerLocal()
+
+        try:
+            self.spark.streams.addListener(listener1)
+            self.spark.streams.addListener(listener2)
+            q = self.spark.readStream.format("rate").load().writeStream.format("noop").start()
+
+            # Both listeners should have listener events already because onQueryStarted
+            # is always called before DataStreamWriter.start() returns
+            self.assertEqual(len(listener1.start), 1)
+            self.assertEqual(len(listener2.start), 1)
+
+            # removeListener is a blocking call, resources are cleaned up by the time it returns
+            self.spark.streams.removeListener(listener1)
+            self.spark.streams.removeListener(listener2)
+
+            # Add back the listener and stop the query, now should see a terminated event
+            self.spark.streams.addListener(listener1)
+            q.stop()
+
+            # need to wait a while before QueryTerminatedEvent reaches client
+            time.sleep(15)
+            self.assertEqual(len(listener1.terminated), 1)
+
+            self.check_start_event(listener1.start[0])
+            for event in listener1.progress:
+                self.check_progress_event(event)
+            self.check_terminated_event(listener1.terminated[0])
+
+        finally:
+            for listener in self.spark.streams._sqlb._listener_bus:
+                self.spark.streams.removeListener(listener)
+            for q in self.spark.streams.active:
+                q.stop()
+
+    def test_slow_query(self):

Review Comment:
   This seems flaky:
   
   ```
   ==
   FAIL [20.288s]: test_slow_query 
(pyspark.sql.tests.connect.streaming.test_parity_listener.StreamingListenerParityTests.test_slow_query)
   --
   Traceback (most recent call last):
 File 
"/__w/spark/spark/python/pyspark/sql/tests/connect/streaming/test_parity_listener.py",
 line 164, in test_slow_query
   self.assertTrue(fast_query.id in [str(e.id) for e in 
listener.terminated])
   AssertionError: False is not true
   
   --
   ```
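   
   One possible way to make this kind of test less timing-sensitive than a fixed `time.sleep(15)` is to poll with a deadline; the helper below is just a sketch, not something this PR adds.
   
   ```python
   import time
   
   
   def wait_until(condition, timeout=60.0, interval=0.5):
       """Poll `condition` until it returns True or the deadline passes."""
       deadline = time.time() + timeout
       while time.time() < deadline:
           if condition():
               return True
           time.sleep(interval)
       return False
   
   
   # e.g. instead of sleeping a fixed number of seconds before asserting:
   # assert wait_until(lambda: fast_query.id in [str(e.id) for e in listener.terminated])
   ```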






Re: [PR] [SPARK-48300][SQL] Codegen Support for `from_xml` [spark]

2024-05-20 Thread via GitHub


yaooqinn commented on code in PR #46609:
URL: https://github.com/apache/spark/pull/46609#discussion_r1607505328


##
sql/core/src/test/scala/org/apache/spark/sql/XmlFunctionsSuite.scala:
##
@@ -40,6 +40,17 @@ class XmlFunctionsSuite extends QueryTest with SharedSparkSession {
       Row(Row(1)) :: Nil)
   }
 
+  test("SPARK-48300: from_xml - Codegen Support") {
+    withTempView("XmlToStructsTable") {
+      val dataDF = Seq("""1""").toDF("value")
+      dataDF.createOrReplaceTempView("XmlToStructsTable")
+      val df = sql("SELECT from_xml(value, 'a INT') FROM XmlToStructsTable")
+      val plan = df.queryExecution.executedPlan
+      assert(plan.isInstanceOf[WholeStageCodegenExec])

Review Comment:
   ```suggestion
      assert(df.queryExecution.executedPlan.isInstanceOf[WholeStageCodegenExec])
   ```






Re: [PR] [SPARK-48300][SQL] Codegen Support for `from_xml` [spark]

2024-05-20 Thread via GitHub


panbingkun commented on PR #46609:
URL: https://github.com/apache/spark/pull/46609#issuecomment-2121582579

   It has been rebased onto master. At present, this PR is only for `codegen support for from_xml`.
   @sandip-db @HyukjinKwon @yaooqinn @cloud-fan 





Re: [PR] [SPARK-48363][SQL] Cleanup some redundant codes in `from_xml` [spark]

2024-05-20 Thread via GitHub


HyukjinKwon closed pull request #46674: [SPARK-48363][SQL] Cleanup some 
redundant codes in `from_xml`
URL: https://github.com/apache/spark/pull/46674





Re: [PR] [SPARK-48363][SQL] Cleanup some redundant codes in `from_xml` [spark]

2024-05-20 Thread via GitHub


HyukjinKwon commented on PR #46674:
URL: https://github.com/apache/spark/pull/46674#issuecomment-2121561163

   Merged to master.





Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

2024-05-20 Thread via GitHub


mkaravel commented on code in PR #46180:
URL: https://github.com/apache/spark/pull/46180#discussion_r1607446081


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -288,13 +338,24 @@ private static int collationNameToId(String 
collationName) throws SparkException
 
 private static class CollationSpecUTF8Binary extends CollationSpec {
 
-  private static final int CASE_SENSITIVITY_OFFSET = 0;
-  private static final int CASE_SENSITIVITY_MASK = 0b1;
-
+  /**
+   * Bit 0 in collation id having value
+   * 0 for plain UTF8_BINARY and 1 for UTF8_BINARY_LCASE collation.

Review Comment:
   Please use the full extent of the 100 characters per line that you are 
allowed for comments.
   I am not suggesting to pack everything together all the time, but in cases 
like the one above, I would have expected to have the first line "filled" 
before going to the second line.
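   
   For readers following the bit-layout comments in this hunk (definition origin in bit 30, implementation provider in bit 29, case sensitivity in bit 0), the extraction presumably amounts to a shift-and-mask. A tiny Python illustration, with the offsets and masks copied from the diff and the semantics of `SpecifierUtils.getSpecValue` assumed:
   
   ```python
   IMPLEMENTATION_PROVIDER_OFFSET, IMPLEMENTATION_PROVIDER_MASK = 29, 0b1
   DEFINITION_ORIGIN_OFFSET, DEFINITION_ORIGIN_MASK = 30, 0b1
   
   
   def get_spec_value(collation_id: int, offset: int, mask: int) -> int:
       # assumed semantics: shift the id right by the offset, then mask out the field
       return (collation_id >> offset) & mask
   
   
   icu_collation_id = 1 << 29  # provider bit set -> ICU family
   print(get_spec_value(icu_collation_id, IMPLEMENTATION_PROVIDER_OFFSET, IMPLEMENTATION_PROVIDER_MASK))  # 1
   print(get_spec_value(0, DEFINITION_ORIGIN_OFFSET, DEFINITION_ORIGIN_MASK))  # 0 -> predefined
   ```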



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -221,41 +231,80 @@ public Collation(
  */
 private abstract static class CollationSpec {
 
+  /**
+   * Bit 30 in collation id having value 0 for predefined and 1 for 
user-defined collation.
+   */
   private enum DefinitionOrigin {
 PREDEFINED, USER_DEFINED
   }
 
+  /**
+   * Bit 29 in collation id having value
+   * 0 for UTF8_BINARY family and 1 for ICU family of collations.
+   */
   protected enum ImplementationProvider {
 UTF8_BINARY, ICU
   }
 
+  /**
+   * Offset in binary collation id layout.
+   */
   private static final int DEFINITION_ORIGIN_OFFSET = 30;
+
+  /**
+   * Bitmask corresponding to width in bits in binary collation id layout.
+   */
   private static final int DEFINITION_ORIGIN_MASK = 0b1;
+
+  /**
+   * Offset in binary collation id layout.
+   */
   protected static final int IMPLEMENTATION_PROVIDER_OFFSET = 29;
+
+  /**
+   * Bitmask corresponding to width in bits in binary collation id layout.
+   */
   protected static final int IMPLEMENTATION_PROVIDER_MASK = 0b1;
 
   private static final int INDETERMINATE_COLLATION_ID = -1;
 
+  /**
+   * Thread-safe cache mapping collation ids to corresponding `Collation` 
instances.
+   * We add entries to this cache lazily as new `Collation` instances are 
requested.
+   */
   private static final Map collationMap = new 
ConcurrentHashMap<>();
 
+  /**
+   * Utility function to retrieve `ImplementationProvider` enum instance 
from collation id.
+   */
   private static ImplementationProvider getImplementationProvider(int 
collationId) {
 return 
ImplementationProvider.values()[SpecifierUtils.getSpecValue(collationId,
   IMPLEMENTATION_PROVIDER_OFFSET, IMPLEMENTATION_PROVIDER_MASK)];
   }
 
+  /**
+   * Utility function to retrieve `DefinitionOrigin` enum instance from 
collation id.
+   */
   private static DefinitionOrigin getDefinitionOrigin(int collationId) {
 return 
DefinitionOrigin.values()[SpecifierUtils.getSpecValue(collationId,
   DEFINITION_ORIGIN_OFFSET, DEFINITION_ORIGIN_MASK)];
   }
 
+  /**
+   * Main entry point for retrieving `Collation` instance from collation 
id.
+   */
   private static Collation fetchCollation(int collationId) {
+// User-defined collations and INDETERMINATE collations cannot produce 
`Collation` instance

Review Comment:
   ```suggestion
   // User-defined collations and INDETERMINATE collations cannot 
produce a `Collation` instance.
   ```



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -333,77 +397,136 @@ private static CollationSpecUTF8Binary 
fromCollationId(int collationId) {
   @Override
   protected Collation buildCollation() {
 if (collationId == UTF8_BINARY_COLLATION_ID) {
-  return new Collation("UTF8_BINARY", null, UTF8String::binaryCompare, 
"1.0",
-s -> (long) s.hashCode(), true, true, false);
+  return new Collation(
+"UTF8_BINARY",
+PROVIDER_SPARK,
+null,
+UTF8String::binaryCompare,
+"1.0",
+s -> (long) s.hashCode(),
+true,
+true,
+false);

Review Comment:
   Way better! Thank you!
   Since we are in Java and not Scala, I suggest using comments for the boolean 
constants. Something like:
   ```java
   /*isBinaryCollation=*/ true,
   ```



##
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala:
##
@@ -152,4 +219,238 @@ class CollationFactorySuite extends AnyFunSuite with 
Matchers { // scalastyle:ig
   }
 })
   }
+
+  test("test collation caching") {
+Seq(
+  "UTF8_BINARY",
+  

Re: [PR] [SPARK-43815][SQL] Wrap NPE with AnalysisException in CSV, XML, and JSON options [spark]

2024-05-20 Thread via GitHub


HyukjinKwon commented on code in PR #46626:
URL: https://github.com/apache/spark/pull/46626#discussion_r1607447215


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -149,7 +149,13 @@ class CSVOptions(
 parameters.getOrElse(DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId))
 
   // A language tag in IETF BCP 47 format
-  val locale: Locale = 
parameters.get(LOCALE).map(Locale.forLanguageTag).getOrElse(Locale.US)
+  val locale: Locale = parameters.get(LOCALE)
+.map {
+  case null =>

Review Comment:
   My concern is that in PySpark we ignore the option when it's `None`, whereas here we throw an exception. I really think we should fix the whole thing by defining the behaviour, instead of fixing one case alone.






Re: [PR] [MINOR][DOCS] correct the doc error in configuration page (fix rest to reset) [spark]

2024-05-20 Thread via GitHub


HyukjinKwon commented on PR #46663:
URL: https://github.com/apache/spark/pull/46663#issuecomment-2121501005

   Mind taking a look at 
https://github.com/apache/spark/pull/46663/checks?check_run_id=25172009176?





Re: [PR] [CONNECT] Allow plugins to use QueryTest in their tests [spark]

2024-05-20 Thread via GitHub


HyukjinKwon commented on PR #46667:
URL: https://github.com/apache/spark/pull/46667#issuecomment-2121499966

   Can we file a JIRA ticket please?





Re: [PR] [SPARK-48340][PYTHON] Support TimestampNTZ infer schema miss prefer_timestamp_ntz [spark]

2024-05-20 Thread via GitHub


HyukjinKwon closed pull request #4: [SPARK-48340][PYTHON] Support 
TimestampNTZ infer schema miss prefer_timestamp_ntz
URL: https://github.com/apache/spark/pull/4





Re: [PR] [SPARK-48340][PYTHON] Support TimestampNTZ infer schema miss prefer_timestamp_ntz [spark]

2024-05-20 Thread via GitHub


HyukjinKwon commented on PR #4:
URL: https://github.com/apache/spark/pull/4#issuecomment-2121500193

   Merged to master.





Re: [PR] [SPARK-48329][SQL] Turn on `spark.sql.sources.v2.bucketing.pushPartValues.enabled` by default [spark]

2024-05-20 Thread via GitHub


superdiaodiao closed pull request #46650: [SPARK-48329][SQL] Turn on 
`spark.sql.sources.v2.bucketing.pushPartValues.enabled` by default
URL: https://github.com/apache/spark/pull/46650





Re: [PR] [SPARK-48329][SQL] SPJ: Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true [spark]

2024-05-20 Thread via GitHub


superdiaodiao commented on PR #46673:
URL: https://github.com/apache/spark/pull/46673#issuecomment-2121484021

   > Thanks @dongjoon-hyun, sorry the PR was not ready. I was trying to integrate the changes from @superdiaodiao, who I saw also made a PR for the same, so we can be co-authors. Reverted the additional config change and test case; will check the test result.
   
   I will close my PR and leave the rest of the work to you; let's co-author this time.





Re: [PR] [SPARK-45579][CORE] Catch errors for FallbackStorage.copy [spark]

2024-05-20 Thread via GitHub


github-actions[bot] closed pull request #43409: [SPARK-45579][CORE] Catch 
errors for FallbackStorage.copy
URL: https://github.com/apache/spark/pull/43409





Re: [PR] [SPARK-45744][CORE] Switch `spark.history.store.serializer` to use `PROTOBUF` by default [spark]

2024-05-20 Thread via GitHub


github-actions[bot] commented on PR #43609:
URL: https://github.com/apache/spark/pull/43609#issuecomment-2121473646

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





Re: [PR] [SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]

2024-05-20 Thread via GitHub


HyukjinKwon closed pull request #46570: [SPARK-48258][PYTHON][CONNECT] 
Checkpoint and localCheckpoint in Spark Connect
URL: https://github.com/apache/spark/pull/46570





Re: [PR] [SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]

2024-05-20 Thread via GitHub


HyukjinKwon commented on PR #46570:
URL: https://github.com/apache/spark/pull/46570#issuecomment-2121464274

   Merged to master.
   
   I will follow up on the discussion if there is more to address, since we're releasing the preview soon.





Re: [PR] [SPARK-48300][SQL] Codegen Support for `from_xml` & remove some redundant codes [spark]

2024-05-20 Thread via GitHub


panbingkun commented on PR #46609:
URL: https://github.com/apache/spark/pull/46609#issuecomment-2121447497

   After the PR above is merged, I will rebase this PR again so that it covers just `codegen`.
   Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48363][SQL] Cleanup some redundant codes in `from_xml` [spark]

2024-05-20 Thread via GitHub


panbingkun commented on PR #46674:
URL: https://github.com/apache/spark/pull/46674#issuecomment-2121446653

   cc @sandip-db @HyukjinKwon @cloud-fan @yaooqinn





Re: [PR] [SPARK-48300][SQL] Codegen Support for `from_xml` & remove some redundant codes [spark]

2024-05-20 Thread via GitHub


panbingkun commented on PR #46609:
URL: https://github.com/apache/spark/pull/46609#issuecomment-2121446402

   > @panbingkun Thanks for submitting the PR. Can you please separate the `codegen` support and the cleanup into separate PRs?
   
   Sure, a new separate PR for `cleanup`: https://github.com/apache/spark/pull/46674





[PR] [SPARK-48363][SQL] Cleanup some redundant codes in `from_xml` [spark]

2024-05-20 Thread via GitHub


panbingkun opened a new pull request, #46674:
URL: https://github.com/apache/spark/pull/46674

   ### What changes were proposed in this pull request?
   The PR aims to clean up some redundant code (support for `ArrayType` & `MapType`).
   
   ### Why are the changes needed?
   As discussed below, we will not support `ArrayType` and `MapType` in `from_xml`, so we can remove the logic related to them.
   https://issues.apache.org/jira/browse/SPARK-44810
   https://github.com/apache/spark/assets/15246973/9b78abb2-7907-4ff5-9177-0e09b5f5a9b5
   https://github.com/apache/spark/assets/15246973/64f61727-80c2-4946-826c-0940b3ab288d
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   - Add new UT & existing UTs.
   - Pass GA.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





Re: [PR] [SPARK-47920][DOCS][SS][PYTHON] Add doc for python streaming data source API [spark]

2024-05-20 Thread via GitHub


chaoqin-li1123 commented on code in PR #46139:
URL: https://github.com/apache/spark/pull/46139#discussion_r1607401423


##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -84,9 +109,157 @@ Define the reader logic to generate synthetic data. Use 
the `faker` library to p
 row.append(value)
 yield tuple(row)
 
+Implementing Streaming Reader and Writer for Python Data Source
+---
+**Implement the Stream Reader**
+
+This is a dummy streaming data reader that generate 2 rows in every 
microbatch. The streamReader instance has a integer offset that increase by 2 
in every microbatch.
+
+.. code-block:: python
+
+class RangePartition(InputPartition):
+def __init__(self, start, end):
+self.start = start
+self.end = end
+
+class FakeStreamReader(DataSourceStreamReader):
+def __init__(self, schema, options):
+self.current = 0
+
+def initialOffset(self) -> dict:
+"""
+Return the initial start offset of the reader.
+"""
+return {"offset": 0}
+
+def latestOffset(self) -> dict:
+"""
+Return the current latest offset that the next microbatch will 
read to.
+"""
+self.current += 2
+return {"offset": self.current}
+
+def partitions(self, start: dict, end: dict):
+"""
+Plans the partitioning of the current microbatch defined by start 
and end offset,
+it needs to return a sequence of :class:`InputPartition` object.
+"""
+return [RangePartition(start["offset"], end["offset"])]
+
+def commit(self, end: dict):
+"""
+This is invoked when the query has finished processing data before 
end offset, this can be used to clean up resource.
+"""
+pass
+
+def read(self, partition) -> Iterator[Tuple]:
+"""
+Takes a partition as an input and read an iterator of tuples from 
the data source.
+"""
+start, end = partition.start, partition.end
+for i in range(start, end):
+yield (i, str(i))
+
+**Implement the Simple Stream Reader**
+
+If the data source has low throughput and doesn't require partitioning, you 
can implement SimpleDataSourceStreamReader instead of DataSourceStreamReader.
+
+One of simpleStreamReader() and streamReader() must be implemented for 
readable streaming data source. And simpleStreamReader() will only be invoked 
when streamReader() is not implemented.
+
+This is the same dummy streaming reader that generate 2 rows every batch 
implemented with SimpleDataSourceStreamReader interface.
+
+.. code-block:: python
+
+class SimpleStreamReader(SimpleDataSourceStreamReader):
+def initialOffset(self):
+"""
+Return the initial start offset of the reader.
+"""
+return {"offset": 0}
+
+def read(self, start: dict) -> (Iterator[Tuple], dict):
+"""
+Takes start offset as an input, return an iterator of tuples and 
the start offset of next read.
+"""
+start_idx = start["offset"]
+it = iter([(i,) for i in range(start_idx, start_idx + 2)])
+return (it, {"offset": start_idx + 2})
+
+def readBetweenOffsets(self, start: dict, end: dict) -> 
Iterator[Tuple]:

Review Comment:
   This is required because we need to repeat the read in a range after query 
restart.
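   
   A self-contained toy (ours, not Spark's classes) illustrating the contract described here: `read` plans and advances the offset, while `readBetweenOffsets` must be able to reproduce any previously planned range after a restart.
   
   ```python
   class TinyReader:
       """Minimal stand-in for a simple stream reader, for illustration only."""
   
       def initialOffset(self):
           return {"offset": 0}
   
       def read(self, start):
           # plan and read the next range, advancing the offset by two rows
           s = start["offset"]
           return [(i,) for i in range(s, s + 2)], {"offset": s + 2}
   
       def readBetweenOffsets(self, start, end):
           # must deterministically reproduce a previously planned range
           return [(i,) for i in range(start["offset"], end["offset"])]
   
   
   r = TinyReader()
   start = r.initialOffset()
   rows, end = r.read(start)
   # after a restart, the same range must be reproducible from its offsets:
   assert r.readBetweenOffsets(start, end) == rows
   ```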






Re: [PR] [SPARK-47920][DOCS][SS][PYTHON] Add doc for python streaming data source API [spark]

2024-05-20 Thread via GitHub


allisonwang-db commented on code in PR #46139:
URL: https://github.com/apache/spark/pull/46139#discussion_r1607398362


##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -84,9 +101,158 @@ Define the reader logic to generate synthetic data. Use 
the `faker` library to p
 row.append(value)
 yield tuple(row)
 
+Implementing Streaming Reader and Writer for Python Data Source
+---
+**Implement the Stream Reader**
+
+This is a dummy streaming data reader that generate 2 rows in every 
microbatch. The streamReader instance has a integer offset that increase by 2 
in every microbatch.
+
+.. code-block:: python
+
+class RangePartition(InputPartition):
+def __init__(self, start, end):
+self.start = start
+self.end = end
+
+class FakeStreamReader(DataSourceStreamReader):
+def __init__(self, schema, options):
+self.current = 0
+
+def initialOffset(self) -> dict:
+"""
+Return the initial start offset of the reader.
+"""
+return {"offset": 0}
+
+def latestOffset(self) -> dict:
+"""
+Return the current latest offset that the next microbatch will 
read to.
+"""
+self.current += 2
+return {"offset": self.current}
+
+def partitions(self, start: dict, end: dict):
+"""
+Plans the partitioning of the current microbatch defined by start 
and end offset,
+it needs to return a sequence of :class:`InputPartition` object.
+"""
+return [RangePartition(start["offset"], end["offset"])]
+
+def commit(self, end: dict):
+"""
+This is invoked when the query has finished processing data before 
end offset, this can be used to clean up resource.
+"""
+pass
+
+def read(self, partition) -> Iterator[Tuple]:
+"""
+Takes a partition as an input and read an iterator of tuples from 
the data source.
+"""
+start, end = partition.start, partition.end
+for i in range(start, end):
+yield (i, str(i))
+
+**Implement the Simple Stream Reader**
+
+If the data source has low throughput and doesn't require partitioning, you 
can implement SimpleDataSourceStreamReader instead of DataSourceStreamReader.
+
+One of simpleStreamReader() and streamReader() must be implemented for 
readable streaming data source. And simpleStreamReader() will only be invoked 
when streamReader() is not implemented.
+
+This is the same dummy streaming reader that generate 2 rows every batch 
implemented with SimpleDataSourceStreamReader interface.
+
+.. code-block:: python
+
+class SimpleStreamReader(SimpleDataSourceStreamReader):
+def initialOffset(self):
+"""
+Return the initial start offset of the reader.
+"""
+return {"offset": 0}
+
+def read(self, start: dict) -> (Iterator[Tuple], dict):
+"""
+Takes start offset as an input, return an iterator of tuples and 
the start offset of next read.
+"""
+start_idx = start["offset"]
+it = iter([(i,) for i in range(start_idx, start_idx + 2)])
+return (it, {"offset": start_idx + 2})
+
+def readBetweenOffsets(self, start: dict, end: dict) -> 
Iterator[Tuple]:
+"""
+Takes start and end offset as input and read an iterator of data 
deterministically.
+This is called whe query replay batches during restart or after 
failure.
+"""
+start_idx = start["offset"]
+end_idx = end["offset"]
+return iter([(i,) for i in range(start_idx, end_idx)])
+
+def commit(self, end):
+"""
+This is invoked when the query has finished processing data before 
end offset, this can be used to clean up resource.
+"""
+pass
+
+**Implement the Stream Writer**
+
+This is a streaming data writer that write the metadata information of each 
microbatch to a local path.
+
+.. code-block:: python
+
+class SimpleCommitMessage(WriterCommitMessage):
+   partition_id: int
+   count: int
+
+class FakeStreamWriter(DataSourceStreamWriter):
+   def __init__(self, options):
+   self.options = options
+   self.path = self.options.get("path")
+   assert self.path is not None
+
+   def write(self, iterator):
+   """
+   Write the data and return the commit message of that partition
+   """
+   from pyspark import TaskContext
+   context = TaskContext.get()
+   partition_id = context.partitionId()
+   cnt = 0
+   for row in iterator:
+   cnt += 1
+   return 

Re: [PR] [WIP][SPARK-48281][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringInStr, SubstringIndex) [spark]

2024-05-20 Thread via GitHub


mkaravel commented on PR #46589:
URL: https://github.com/apache/spark/pull/46589#issuecomment-2121389075

   Please update the PR description.





Re: [PR] [SPARK-48286] Fix analysis and creation of column with exists default expression [spark]

2024-05-20 Thread via GitHub


urosstan-db commented on PR #46594:
URL: https://github.com/apache/spark/pull/46594#issuecomment-2121385385

   @cloud-fan Sure, sorry, I did not see that this failed.





Re: [PR] [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-20 Thread via GitHub


mkaravel commented on PR #46511:
URL: https://github.com/apache/spark/pull/46511#issuecomment-2121384473

   Please fill in the PR description.





Re: [PR] [SPARK-48307][SQL] InlineCTE should keep not-inlined relations in the original WithCTE node [spark]

2024-05-20 Thread via GitHub


cloud-fan commented on PR #46617:
URL: https://github.com/apache/spark/pull/46617#issuecomment-2121381949

   > Repartition? Do you mean Relation?
   
   It's Repartition, because we rely on shuffle reuse to reuse CTE relations.
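   
   For anyone who wants to poke at the mechanism being referred to, a rough way (assuming a running `SparkSession`) is to reference a CTE twice and inspect the plan; what the plan actually shows depends on configs and the Spark version.
   
   ```python
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.getOrCreate()
   
   # a CTE referenced twice, with an explicit repartition inside it
   spark.sql("""
       WITH t AS (SELECT /*+ REPARTITION(8) */ id FROM range(10))
       SELECT * FROM t JOIN t AS t2 USING (id)
   """).explain()
   ```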





Re: [PR] [SPARK-48323][SQL] DB2: Map BooleanType to BOOLEAN instead of CHAR(1) [spark]

2024-05-20 Thread via GitHub


cloud-fan commented on PR #46637:
URL: https://github.com/apache/spark/pull/46637#issuecomment-2121379656

   late LGTM





Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-20 Thread via GitHub


cloud-fan commented on PR #46280:
URL: https://github.com/apache/spark/pull/46280#issuecomment-2121377372

   I think so. A string type with collation should be a normal string type in the Hive table schema, so that other engines can still read it. We only keep the collation info in the Spark-specific table schema JSON string in table properties.
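   
   A rough sketch of the idea in PySpark terms: the field's type stays a plain string, and the collation travels as field metadata inside Spark's own schema JSON. The metadata key below is illustrative, not necessarily the one this PR uses.
   
   ```python
   from pyspark.sql.types import StringType, StructField, StructType
   
   schema = StructType([
       StructField(
           "name",
           StringType(),                          # Hive and other engines still see a plain string
           metadata={"collation": "UNICODE_CI"},  # illustrative key, carried in Spark's schema JSON
       )
   ])
   print(schema.json())  # per the comment above, this JSON goes into table properties
   ```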


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48329][SQL] SPJ: Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true [spark]

2024-05-20 Thread via GitHub


szehon-ho commented on PR #46673:
URL: https://github.com/apache/spark/pull/46673#issuecomment-2121361991

   Thanks @dongjoon-hyun, sorry the PR was not ready. I was trying to integrate the changes from @superdiaodiao, who I saw also made a PR for the same change, so we can be co-authors. Reverted the additional config change and test case; will check the test result.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48286] Fix analysis and creation of column with exists default expression [spark]

2024-05-20 Thread via GitHub


cloud-fan commented on PR #46594:
URL: https://github.com/apache/spark/pull/46594#issuecomment-2121361107

   @urosstan-db  can you re-trigger the test?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48330][SS][PYTHON] Fix the python streaming data source timeout issue for large trigger interval [spark]

2024-05-20 Thread via GitHub


HeartSaVioR closed pull request #46651: [SPARK-48330][SS][PYTHON] Fix the 
python streaming data source timeout issue for large trigger interval
URL: https://github.com/apache/spark/pull/46651


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48329][SQL] SPJ: Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true [spark]

2024-05-20 Thread via GitHub


szehon-ho commented on code in PR #46673:
URL: https://github.com/apache/spark/pull/46673#discussion_r1607370785


##
docs/sql-migration-guide.md:
##
@@ -55,6 +55,7 @@ license: |
 - Since Spark 4.0, The default value for `spark.sql.legacy.timeParserPolicy` 
has been changed from `EXCEPTION` to `CORRECTED`. Instead of raising an 
`INCONSISTENT_BEHAVIOR_CROSS_VERSION` error, `CANNOT_PARSE_TIMESTAMP` will be 
raised if ANSI mode is enable. `NULL` will be returned if ANSI mode is 
disabled. See [Datetime Patterns for Formatting and 
Parsing](sql-ref-datetime-pattern.html).
 - Since Spark 4.0, A bug falsely allowing `!` instead of `NOT` when `!` is not 
a prefix operator has been fixed. Clauses such as `expr ! IN (...)`, `expr ! 
BETWEEN ...`, or `col ! NULL` now raise syntax errors. To restore the previous 
behavior, set `spark.sql.legacy.bangEqualsNot` to `true`. 
 - Since Spark 4.0, Views allow control over how they react to underlying query 
changes. By default views tolerate column type changes in the query and 
compensate with casts. To restore the previous behavior, allowing up-casts 
only, set `spark.sql.viewSchemaBindingMode` to `DISABLED`. This disables the 
feature and also disallows the `WITH SCHEMA` clause.
+- Since Spark 4.0, The Storage-Partitioned Join feature flag 
`spark.sql.sources.v2.bucketing.enabled` and 
`spark.sql.sources.v2.bucketing.pushPartValues.enabled` is set to `true`. To 
restore the previous behavior, set `spark.sql.sources.v2.bucketing.enabled` and 
`spark.sql.sources.v2.bucketing.pushPartValues.enabled` to `false`.

Review Comment:
   Ok, I took this from @superdiaodiao, who also worked on this PR; will revert.



##
sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala:
##
@@ -1169,7 +1169,6 @@ class KeyGroupedPartitioningSuite extends 
DistributionAndOrderingSuiteBase {
 Seq(true, false).foreach { shuffle =>
   withSQLConf(
 SQLConf.V2_BUCKETING_SHUFFLE_ENABLED.key -> shuffle.toString,
-SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true",

Review Comment:
   OK



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-20 Thread via GitHub


GideonPotok commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2121332325

   
   @uros-db I agree that we should avoid auxiliary structures. However, I don't see a good way to move the changes into the implementation of `merge` and `update` without keeping an auxiliary map from the collation key to the actual values seen (e.g. from "aa" to "aaaAAa", "AA" for a data frame with the values "aaaAAa" and "AA"). That would itself be an auxiliary structure, and there is a ton of scaffolding needed just to have that OpenHashMap available throughout the execution of the expression.

So, at least for now, I advise against pursuing this idea, even though it is good in theory. Having said that, such a prototype of the approach might look like this: https://github.com/GideonPotok/spark/pull/1. Thoughts?

   Also, I am done with adding the exception for unsupported complex types! Take a look.
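
   For what it's worth, a hedged sketch of that group-by-collation-key idea in plain Scala; the `collationKey` function and the use of a plain Map instead of Spark's OpenHashMap are illustrative assumptions:

```scala
// Hedged sketch of the idea discussed above: bucket values by their collation
// key, then return a representative of the largest bucket as the mode.
// `collationKey` stands in for whatever key function the collation provides.
def modeUnderCollation(values: Seq[String], collationKey: String => String): Option[String] = {
  if (values.isEmpty) None
  else {
    val buckets = values.groupBy(collationKey)   // auxiliary map: key -> values seen
    val (_, largest) = buckets.maxBy(_._2.size)  // most frequent equivalence class
    Some(largest.head)                           // any representative of that class
  }
}
```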


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48330][SS][PYTHON] Fix the python streaming data source timeout issue for large trigger interval [spark]

2024-05-20 Thread via GitHub


HeartSaVioR commented on PR #46651:
URL: https://github.com/apache/spark/pull/46651#issuecomment-2121324280

   Thanks! Merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48330][SS][PYTHON] Fix the python streaming data source timeout issue for large trigger interval [spark]

2024-05-20 Thread via GitHub


HeartSaVioR commented on PR #46651:
URL: https://github.com/apache/spark/pull/46651#issuecomment-2121324009

   Let's do post-review if there are remaining comments. Looks like the change 
is right and unavoidable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48351] JDBC Connectors - Add cast suite and fix found issue [spark]

2024-05-20 Thread via GitHub


urosstan-db closed pull request #46669: [SPARK-48351] JDBC Connectors - Add 
cast suite and fix found issue
URL: https://github.com/apache/spark/pull/46669


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48342] SQL Batch Lang Parser [spark]

2024-05-20 Thread via GitHub


davidm-db commented on code in PR #46665:
URL: https://github.com/apache/spark/pull/46665#discussion_r1607348639


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/BatchParserSuite.scala:
##
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.parser
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.plans.SQLHelper
+
+class BatchParserSuite extends SparkFunSuite with SQLHelper {
+  import CatalystSqlParser._
+
+  test("single select") {
+val batch = "SELECT 1;"
+val tree = parseBatch(batch)
+assert(tree.collection.length == 1)
+assert(tree.collection.head.isInstanceOf[SparkStatementWithPlan])
+val sparkStatement = 
tree.collection.head.asInstanceOf[SparkStatementWithPlan]
+assert(sparkStatement.getText(batch) == "SELECT 1;")
+  }
+
+  test("single select without ;") {
+val batch = "SELECT 1"
+val tree = parseBatch(batch)
+assert(tree.collection.length == 1)
+assert(tree.collection.head.isInstanceOf[SparkStatementWithPlan])
+val sparkStatement = 
tree.collection.head.asInstanceOf[SparkStatementWithPlan]
+assert(sparkStatement.getText(batch) == "SELECT 1")
+  }
+
+  test("multi select without ; - should fail") {
+val batch = "SELECT 1 SELECT 1"
+val e = intercept[ParseException] {
+  parseBatch(batch)
+}
+assert(e.getErrorClass === "PARSE_SYNTAX_ERROR")
+assert(e.getMessage.contains("Syntax error"))
+assert(e.getMessage.contains("SELECT 1 SELECT 1"))
+  }
+
+  test("multi select") {
+val batch = "BEGIN SELECT 1;SELECT 2; END"
+val tree = parseBatch(batch)
+assert(tree.collection.length == 2)
+assert(tree.collection.forall(_.isInstanceOf[SparkStatementWithPlan]))
+
+batch.split(";")
+  .map(cleanupStatementString)
+  .zip(tree.collection)
+  .foreach { case (expected, statement) =>
+val sparkStatement = statement.asInstanceOf[SparkStatementWithPlan]
+val statementText = sparkStatement.getText(batch)
+assert(statementText == expected)
+  }
+  }
+
+  test("multi statement") {
+val batch =
+  """
+|BEGIN
+|  SELECT 1;
+|  SELECT 2;
+|  INSERT INTO A VALUES (a, b, 3);
+|  SELECT a, b, c FROM T;
+|  SELECT * FROM T;
+|END""".stripMargin
+val tree = parseBatch(batch)
+assert(tree.collection.length == 5)
+assert(tree.collection.forall(_.isInstanceOf[SparkStatementWithPlan]))
+batch.split(";")
+  .map(cleanupStatementString)
+  .zip(tree.collection)
+  .foreach { case (expected, statement) =>
+val sparkStatement = statement.asInstanceOf[SparkStatementWithPlan]
+val statementText = sparkStatement.getText(batch)
+assert(statementText == expected)
+  }
+  }
+
+  test("multi statement without ; at the end") {
+val batch =
+  """
+|BEGIN
+|SELECT 1;
+|SELECT 2;
+|INSERT INTO A VALUES (a, b, 3);
+|SELECT a, b, c FROM T;
+|SELECT * FROM T
+|END""".stripMargin
+val tree = parseBatch(batch)
+assert(tree.collection.length == 5)
+assert(tree.collection.forall(_.isInstanceOf[SparkStatementWithPlan]))
+batch.split(";")
+  .map(cleanupStatementString)
+  .zip(tree.collection)
+  .foreach { case (expected, statement) =>
+val sparkStatement = statement.asInstanceOf[SparkStatementWithPlan]
+val statementText = sparkStatement.getText(batch)
+assert(statementText == expected)
+  }
+  }
+
+  test("nested begin end") {
+val batch =
+  """
+|BEGIN
+|  BEGIN
+|  SELECT 1;
+|  END;
+|  BEGIN
+|BEGIN
+|  SELECT 2;
+|  SELECT 3;
+|END;
+|  END;
+|END""".stripMargin
+val tree = parseBatch(batch)
+assert(tree.collection.length == 2)
+assert(tree.collection.head.isInstanceOf[BatchBody])
+val body1 = tree.collection.head.asInstanceOf[BatchBody]
+assert(body1.collection.length == 1)
+
assert(body1.collection.head.asInstanceOf[SparkStatementWithPlan].getText(batch)
 == 

Re: [PR] [SPARK-48342] SQL Batch Lang Parser [spark]

2024-05-20 Thread via GitHub


davidm-db commented on code in PR #46665:
URL: https://github.com/apache/spark/pull/46665#discussion_r1607347956


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##
@@ -116,6 +116,78 @@ class AstBuilder extends DataTypeAstBuilder with 
SQLConfHelper with Logging {
 }
   }
 
+  override def visitBatchOrSingleStatement(ctx: 
BatchOrSingleStatementContext): BatchBody = {
+if (ctx.batchCompound() != null) {
+  visit(ctx.batchCompound()).asInstanceOf[BatchBody]
+} else {
+  val logicalPlan = visitSingleStatement(ctx.singleStatement())
+  BatchBody(List(SparkStatementWithPlan(
+parsedPlan = logicalPlan,
+sourceStart = ctx.start.getStartIndex,
+sourceEnd = ctx.stop.getStopIndex + 1)))
+}
+  }
+
+  override def visitBatchCompound(ctx: BatchCompoundContext): BatchBody = {
+visitBatchBody(ctx.batchBody(), allowDeclareAtTop = true)
+  }
+
+  private def visitBatchBody(ctx: BatchBodyContext, allowDeclareAtTop: 
Boolean): BatchBody = {
+val buff = ListBuffer[BatchPlanStatement]()
+for (i <- 0 until ctx.getChildCount) {
+  val child = visit(ctx.getChild(i))
+  child match {
+case statement: BatchPlanStatement => buff += statement
+case null => // When terminal nodes are visited (like SEMICOLON, EOF, 
etc.)

Review Comment:
   Improved - I didn't find a better way to do it, but I think it's clean like this.
   Any other way, I think we would need to match the child before visiting and then again after visiting, which would turn out to be really messy.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48342] SQL Batch Lang Parser [spark]

2024-05-20 Thread via GitHub


davidm-db commented on code in PR #46665:
URL: https://github.com/apache/spark/pull/46665#discussion_r1607339764


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/BatchLangLogicalOperators.scala:
##
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.parser
+
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+
+sealed trait BatchPlanStatement

Review Comment:
   TODO: Figure out the naming of trait and classes here on the scripting sync.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48342] SQL Batch Lang Parser [spark]

2024-05-20 Thread via GitHub


davidm-db commented on code in PR #46665:
URL: https://github.com/apache/spark/pull/46665#discussion_r1607338991


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/BatchParserSuite.scala:
##
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.parser
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.plans.SQLHelper
+import org.apache.spark.sql.exceptions.SqlBatchLangException
+
+class BatchParserSuite extends SparkFunSuite with SQLHelper {
+  import CatalystSqlParser._
+
+  test("single select") {
+val batch = "SELECT 1;"
+val tree = parseBatch(batch)
+assert(tree.collection.length == 1)
+assert(tree.collection.head.isInstanceOf[SparkStatementWithPlan])
+val sparkStatement = 
tree.collection.head.asInstanceOf[SparkStatementWithPlan]
+assert(sparkStatement.getText(batch) == "SELECT 1;")
+  }
+
+  test("single select without ;") {
+val batch = "SELECT 1"
+val tree = parseBatch(batch)
+assert(tree.collection.length == 1)
+assert(tree.collection.head.isInstanceOf[SparkStatementWithPlan])
+val sparkStatement = 
tree.collection.head.asInstanceOf[SparkStatementWithPlan]
+assert(sparkStatement.getText(batch) == "SELECT 1")
+  }
+
+  test("multi select without ; - should fail") {
+val batch = "SELECT 1 SELECT 1"
+intercept[ParseException] {
+  parseBatch(batch)

Review Comment:
   Done.



##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/BatchParserSuite.scala:
##
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.parser
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.plans.SQLHelper
+import org.apache.spark.sql.exceptions.SqlBatchLangException
+
+class BatchParserSuite extends SparkFunSuite with SQLHelper {
+  import CatalystSqlParser._
+
+  test("single select") {
+val batch = "SELECT 1;"
+val tree = parseBatch(batch)
+assert(tree.collection.length == 1)
+assert(tree.collection.head.isInstanceOf[SparkStatementWithPlan])
+val sparkStatement = 
tree.collection.head.asInstanceOf[SparkStatementWithPlan]
+assert(sparkStatement.getText(batch) == "SELECT 1;")
+  }
+
+  test("single select without ;") {
+val batch = "SELECT 1"
+val tree = parseBatch(batch)
+assert(tree.collection.length == 1)
+assert(tree.collection.head.isInstanceOf[SparkStatementWithPlan])
+val sparkStatement = 
tree.collection.head.asInstanceOf[SparkStatementWithPlan]
+assert(sparkStatement.getText(batch) == "SELECT 1")
+  }
+
+  test("multi select without ; - should fail") {
+val batch = "SELECT 1 SELECT 1"
+intercept[ParseException] {
+  parseBatch(batch)
+}
+  }
+
+  test("multi select") {
+val batch = "BEGIN SELECT 1;SELECT 2; END"
+val tree = parseBatch(batch)
+assert(tree.collection.length == 2)
+assert(tree.collection.forall(_.isInstanceOf[SparkStatementWithPlan]))
+
+batch.split(";")
+  .map(_.replace("\n", "").replace("BEGIN", "").replace("END", "").trim)

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please 

Re: [PR] [SPARK-48342] SQL Batch Lang Parser [spark]

2024-05-20 Thread via GitHub


davidm-db commented on code in PR #46665:
URL: https://github.com/apache/spark/pull/46665#discussion_r1607338855


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##
@@ -116,6 +116,78 @@ class AstBuilder extends DataTypeAstBuilder with 
SQLConfHelper with Logging {
 }
   }
 
+  override def visitBatchOrSingleStatement(ctx: 
BatchOrSingleStatementContext): BatchBody = {
+if (ctx.batchCompound() != null) {
+  visit(ctx.batchCompound()).asInstanceOf[BatchBody]
+} else {
+  val logicalPlan = visitSingleStatement(ctx.singleStatement())
+  BatchBody(List(SparkStatementWithPlan(
+parsedPlan = logicalPlan,
+sourceStart = ctx.start.getStartIndex,
+sourceEnd = ctx.stop.getStopIndex + 1)))
+}
+  }
+
+  override def visitBatchCompound(ctx: BatchCompoundContext): BatchBody = {
+visitBatchBody(ctx.batchBody(), allowDeclareAtTop = true)
+  }
+
+  private def visitBatchBody(ctx: BatchBodyContext, allowDeclareAtTop: 
Boolean): BatchBody = {
+val buff = ListBuffer[BatchPlanStatement]()
+for (i <- 0 until ctx.getChildCount) {
+  val child = visit(ctx.getChild(i))
+  child match {
+case statement: BatchPlanStatement => buff += statement
+case null => // When terminal nodes are visited (like SEMICOLON, EOF, 
etc.)
+  }
+}
+
+val statements = buff.toList
+if (allowDeclareAtTop) {

Review Comment:
   Done. Removed exceptions as well, since we needed them only for variable 
stuff... Created a separate work item to add exception support (saved the code 
as well, so it's easy to do it).



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParserInterface.scala:
##
@@ -62,4 +62,10 @@ trait ParserInterface extends DataTypeParserInterface {
*/
   @throws[ParseException]("Text cannot be parsed to a LogicalPlan")
   def parseQuery(sqlText: String): LogicalPlan
+
+  /**
+   * Parse a query string to a [[BatchBody]].
+   */
+  @throws[ParseException]("")

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48342] SQL Batch Lang Parser [spark]

2024-05-20 Thread via GitHub


davidm-db commented on code in PR #46665:
URL: https://github.com/apache/spark/pull/46665#discussion_r1607338131


##
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##
@@ -42,6 +42,25 @@ options { tokenVocab = SqlBaseLexer; }
   public boolean double_quoted_identifiers = false;
 }
 
+batchOrSingleStatement
+: batchCompound SEMICOLON* EOF

Review Comment:
   Agreed offline on the naming.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP] Propagate column family information from executors to driver via Accumulator [spark]

2024-05-20 Thread via GitHub


ericm-db closed pull request #46644: [WIP] Propagate column family information 
from executors to driver via Accumulator
URL: https://github.com/apache/spark/pull/46644


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]

2024-05-20 Thread via GitHub


uros-db commented on PR #46649:
URL: https://github.com/apache/spark/pull/46649#issuecomment-2121183739

   fix should be ready https://github.com/apache/spark/pull/46661
   please review @cloud-fan @HyukjinKwon @dongjoon-hyun
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48031] Decompose viewSchemaMode config, add SHOW CREATE TABLE support [spark]

2024-05-20 Thread via GitHub


gengliangwang commented on code in PR #46652:
URL: https://github.com/apache/spark/pull/46652#discussion_r1607218983


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -1700,15 +1700,21 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
-  val VIEW_SCHEMA_BINDING_MODE = buildConf("spark.sql.viewSchemaBindingMode")
-.doc("Set to DISABLE to disable the WITH SCHEMA clause for view DDL and 
suppress the line in " +
-  " DESCRIBE EXTENDED. The default, and only other value, is COMPENSATION. 
Views without " +
-  " WITH SCHEMA clause are defaulted to WITH SCHEMA COMPENSATION.")
+  val VIEW_SCHEMA_BINDING_ENABLED = 
buildConf("spark.sql.legacy.viewSchemaBindingMode")
+.internal()
+.doc("Set to false to disable the WITH SCHEMA clause for view DDL and 
suppress the line in " +
+  "DESCRIBE EXTENDED and SHOW CREATE TABLE.")
 .version("4.0.0")
-.stringConf
-.transform(_.toUpperCase(Locale.ROOT))
-.checkValues(Set("COMPENSATION", "DISABLED"))
-.createWithDefault("COMPENSATION")
+.booleanConf
+.createWithDefault(true)
+
+  val VIEW_SCHEMA_COMPENSATION = 
buildConf("spark.sql.legacy.viewSchemaCompensation")

Review Comment:
   The COMPENSATION behavior should not be legacy.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48031] Decompose viewSchemaMode config, add SHOW CREATE TABLE support [spark]

2024-05-20 Thread via GitHub


gengliangwang commented on code in PR #46652:
URL: https://github.com/apache/spark/pull/46652#discussion_r1607218705


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -1700,15 +1700,21 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
-  val VIEW_SCHEMA_BINDING_MODE = buildConf("spark.sql.viewSchemaBindingMode")
-.doc("Set to DISABLE to disable the WITH SCHEMA clause for view DDL and 
suppress the line in " +
-  " DESCRIBE EXTENDED. The default, and only other value, is COMPENSATION. 
Views without " +
-  " WITH SCHEMA clause are defaulted to WITH SCHEMA COMPENSATION.")
+  val VIEW_SCHEMA_BINDING_ENABLED = 
buildConf("spark.sql.legacy.viewSchemaBindingMode")

Review Comment:
   It is a bit weird that the boolean config name ends with `Mode`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-43815][SQL] Wrap NPE with AnalysisException in CSV, XML, and JSON options [spark]

2024-05-20 Thread via GitHub


gengliangwang commented on code in PR #46626:
URL: https://github.com/apache/spark/pull/46626#discussion_r1607213012


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala:
##
@@ -107,7 +108,13 @@ class JSONOptions(
   val writeNullIfWithDefaultValue = SQLConf.get.jsonWriteNullIfWithDefaultValue
 
   // A language tag in IETF BCP 47 format
-  val locale: Locale = 
parameters.get(LOCALE).map(Locale.forLanguageTag).getOrElse(Locale.US)
+  val locale: Locale = parameters.get(LOCALE)

Review Comment:
   Let's reduce the duplicated code in CSV/JSON/XML options



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48329][SQL] SPJ: Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true [spark]

2024-05-20 Thread via GitHub


dongjoon-hyun commented on PR #46673:
URL: https://github.com/apache/spark/pull/46673#issuecomment-2121080890

   cc @viirya , @sunchao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48329][SQL] SPJ: Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true [spark]

2024-05-20 Thread via GitHub


dongjoon-hyun commented on code in PR #46673:
URL: https://github.com/apache/spark/pull/46673#discussion_r1607189545


##
sql/core/src/test/scala/org/apache/spark/sql/execution/exchange/EnsureRequirementsSuite.scala:
##
@@ -1024,93 +1024,92 @@ class EnsureRequirementsSuite extends 
SharedSparkSession {
   }
 
   test("SPARK-41413: check compatibility when partition values mismatch") {
-withSQLConf(SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true") {
-  val leftPartValues = Seq(Array[Any](1, 1), Array[Any](2, 2)).map(new 
GenericInternalRow(_))
-  val rightPartValues = Seq(Array[Any](1, 1), Array[Any](2, 2), 
Array[Any](3, 3))
-  .map(new GenericInternalRow(_))
+val leftPartValues = Seq(Array[Any](1, 1), Array[Any](2, 2)).map(new 
GenericInternalRow(_))
+val rightPartValues = Seq(Array[Any](1, 1), Array[Any](2, 2), 
Array[Any](3, 3))
+.map(new GenericInternalRow(_))

Review Comment:
   ditto. Please don't touch any test cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48329][SQL] SPJ: Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true [spark]

2024-05-20 Thread via GitHub


dongjoon-hyun commented on code in PR #46673:
URL: https://github.com/apache/spark/pull/46673#discussion_r1607189098


##
sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala:
##
@@ -1169,7 +1169,6 @@ class KeyGroupedPartitioningSuite extends 
DistributionAndOrderingSuiteBase {
 Seq(true, false).foreach { shuffle =>
   withSQLConf(
 SQLConf.V2_BUCKETING_SHUFFLE_ENABLED.key -> shuffle.toString,
-SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true",

Review Comment:
   Please revert all test case changes from this PR, @szehon-ho .



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48329][SQL] SPJ: Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true [spark]

2024-05-20 Thread via GitHub


dongjoon-hyun commented on code in PR #46673:
URL: https://github.com/apache/spark/pull/46673#discussion_r1607188807


##
docs/sql-migration-guide.md:
##
@@ -55,6 +55,7 @@ license: |
 - Since Spark 4.0, The default value for `spark.sql.legacy.timeParserPolicy` 
has been changed from `EXCEPTION` to `CORRECTED`. Instead of raising an 
`INCONSISTENT_BEHAVIOR_CROSS_VERSION` error, `CANNOT_PARSE_TIMESTAMP` will be 
raised if ANSI mode is enable. `NULL` will be returned if ANSI mode is 
disabled. See [Datetime Patterns for Formatting and 
Parsing](sql-ref-datetime-pattern.html).
 - Since Spark 4.0, A bug falsely allowing `!` instead of `NOT` when `!` is not 
a prefix operator has been fixed. Clauses such as `expr ! IN (...)`, `expr ! 
BETWEEN ...`, or `col ! NULL` now raise syntax errors. To restore the previous 
behavior, set `spark.sql.legacy.bangEqualsNot` to `true`. 
 - Since Spark 4.0, Views allow control over how they react to underlying query 
changes. By default views tolerate column type changes in the query and 
compensate with casts. To restore the previous behavior, allowing up-casts 
only, set `spark.sql.viewSchemaBindingMode` to `DISABLED`. This disables the 
feature and also disallows the `WITH SCHEMA` clause.
+- Since Spark 4.0, The Storage-Partitioned Join feature flag 
`spark.sql.sources.v2.bucketing.enabled` and 
`spark.sql.sources.v2.bucketing.pushPartValues.enabled` is set to `true`. To 
restore the previous behavior, set `spark.sql.sources.v2.bucketing.enabled` and 
`spark.sql.sources.v2.bucketing.pushPartValues.enabled` to `false`.

Review Comment:
   Please remove `The Storage-Partitioned Join feature flag`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48300][SQL] Codegen Support for `from_xml` & remove some redundant codes [spark]

2024-05-20 Thread via GitHub


sandip-db commented on PR #46609:
URL: https://github.com/apache/spark/pull/46609#issuecomment-2121041342

   @panbingkun Thanks for submitting the PR. Can you please split the `codegen` support and the cleanup into separate PRs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48320][CORE][DOCS] Add external third-party ecosystem access guide to the doc [spark]

2024-05-20 Thread via GitHub


mridulm commented on code in PR #46634:
URL: https://github.com/apache/spark/pull/46634#discussion_r1607157027


##
common/utils/src/main/scala/org/apache/spark/internal/README.md:
##
@@ -45,3 +45,29 @@ logger.error("Failed to abort the writer after failing to 
write map output.", e)
 ## Exceptions
 
 To ensure logs are compatible with Spark SQL and log analysis tools, avoid 
`Exception.printStackTrace()`. Use `logError`, `logWarning`, and `logInfo` 
methods from the `Logging` trait to log exceptions, maintaining structured and 
parsable logs.
+
+## External third-party ecosystem access
+
+* If you want to output logs in `scala code` through the structured log 
framework, you can define `custom LogKey` and use it in `scala` code as follows:
+
+```scala
+// External third-party ecosystem `custom LogKey` must be `extends LogKey`
+case object CUSTOM_LOG_KEY extends LogKey
+```
+```scala
+import org.apache.spark.internal.MDC;
+
+logInfo(log"${MDC(CUSTOM_LOG_KEY, "key")}")
+```
+
+* If you want to output logs in `java code` through the structured log 
framework, you can define `custom LogKey` and use it in `java` code as follows:
+
+```java
+// External third-party ecosystem `custom LogKey` must be `implements LogKey`
+public static class CUSTOM_LOG_KEY implements LogKey { }
+```
+```java
+import org.apache.spark.internal.MDC;
+
+logger.error("Unable to delete key {} for cache", MDC.of(CUSTOM_LOG_KEY, 
"key"));
+```

Review Comment:
   Let us drop this doc - this feature is essentially private to Spark. Let us avoid documenting it in a way that gives the impression that users can directly leverage it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]

2024-05-20 Thread via GitHub


dongjoon-hyun commented on PR #46649:
URL: https://github.com/apache/spark/pull/46649#issuecomment-2121013158

   Gentle ping, @uros-db . The CI is still broken.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-48329][SQL] SPJ: Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true [spark]

2024-05-20 Thread via GitHub


szehon-ho opened a new pull request, #46673:
URL: https://github.com/apache/spark/pull/46673

   ### What changes were proposed in this pull request?
   
   Change the default of 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' to true for the Spark 4.0 release.
   
   ### Why are the changes needed?
   This flag has consistently proven useful and no downsides have been found.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Existing unit tests (removed explicit setting of flag to true)
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
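
   For reference, a hedged sketch of restoring the previous behavior after the default flips; the config names come from the PR title and the migration-guide change in this PR:

```scala
// Hedged sketch: opting back out of the new defaults, e.g. in spark-shell.
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "false")
spark.conf.set("spark.sql.sources.v2.bucketing.pushPartValues.enabled", "false")
```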


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48328][BUILD] Upgrade `Arrow` to 16.1.0 [spark]

2024-05-20 Thread via GitHub


dongjoon-hyun closed pull request #46646: [SPARK-48328][BUILD] Upgrade `Arrow` 
to 16.1.0
URL: https://github.com/apache/spark/pull/46646


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48330][SS][PYTHON] Fix the python streaming data source timeout issue for large trigger interval [spark]

2024-05-20 Thread via GitHub


chaoqin-li1123 commented on code in PR #46651:
URL: https://github.com/apache/spark/pull/46651#discussion_r1607079327


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/python/PythonStreamingSinkCommitRunner.scala:
##
@@ -39,78 +35,22 @@ import org.apache.spark.sql.types.StructType
  * from the socket, then commit or abort a microbatch.
  */
 class PythonStreamingSinkCommitRunner(

Review Comment:
   It is similar, except that the streaming commit runner also takes the batch id as a parameter and throws a different type of exception.



##
python/pyspark/sql/streaming/python_streaming_source_runner.py:
##
@@ -210,7 +210,8 @@ def main(infile: IO, outfile: IO) -> None:
 # Read information about how to connect back to the JVM from the 
environment.
 java_port = int(os.environ["PYTHON_WORKER_FACTORY_PORT"])
 auth_secret = os.environ["PYTHON_WORKER_FACTORY_SECRET"]
-(sock_file, _) = local_connect_and_auth(java_port, auth_secret)
+(sock_file, sock) = local_connect_and_auth(java_port, auth_secret)
+sock.settimeout(None)

Review Comment:
   comment added.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48330][SS][PYTHON] Fix the python streaming data source timeout issue for large trigger interval [spark]

2024-05-20 Thread via GitHub


chaoqin-li1123 commented on code in PR #46651:
URL: https://github.com/apache/spark/pull/46651#discussion_r1607078379


##
python/pyspark/sql/worker/python_streaming_sink_runner.py:
##
@@ -34,22 +33,32 @@
 _parse_datatype_json_string,
 StructType,
 )
-from pyspark.util import handle_worker_exception
+from pyspark.util import handle_worker_exception, local_connect_and_auth
 from pyspark.worker_util import (
 check_python_version,
 read_command,
 pickleSer,
 send_accumulator_updates,
+setup_broadcasts,
 setup_memory_limits,
 setup_spark_files,
 utf8_deserializer,
 )
 
 
 def main(infile: IO, outfile: IO) -> None:
+"""
+Main method for committing or aborting a data source streaming write 
operation.
+
+This process is invoked from the 
`PythonStreamingSinkCommitRunner.runInPython`
+method in the StreamingWrite implementation of the PythonTableProvider. It 
is

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48320][CORE][DOCS] Add external third-party ecosystem access guide to the doc [spark]

2024-05-20 Thread via GitHub


gengliangwang commented on code in PR #46634:
URL: https://github.com/apache/spark/pull/46634#discussion_r1607076786


##
common/utils/src/main/scala/org/apache/spark/internal/README.md:
##
@@ -45,3 +45,29 @@ logger.error("Failed to abort the writer after failing to 
write map output.", e)
 ## Exceptions
 
 To ensure logs are compatible with Spark SQL and log analysis tools, avoid 
`Exception.printStackTrace()`. Use `logError`, `logWarning`, and `logInfo` 
methods from the `Logging` trait to log exceptions, maintaining structured and 
parsable logs.
+
+## External third-party ecosystem access
+
+* If you want to output logs in `scala code` through the structured log 
framework, you can define `custom LogKey` and use it in `scala` code as follows:
+
+```scala
+// External third-party ecosystem `custom LogKey` must be `extends LogKey`
+case object CUSTOM_LOG_KEY extends LogKey
+```
+```scala
+import org.apache.spark.internal.MDC;
+
+logInfo(log"${MDC(CUSTOM_LOG_KEY, "key")}")
+```
+
+* If you want to output logs in `java code` through the structured log 
framework, you can define `custom LogKey` and use it in `java` code as follows:
+
+```java
+// External third-party ecosystem `custom LogKey` must be `implements LogKey`
+public static class CUSTOM_LOG_KEY implements LogKey { }
+```
+```java
+import org.apache.spark.internal.MDC;
+
+logger.error("Unable to delete key {} for cache", MDC.of(CUSTOM_LOG_KEY, 
"key"));
+```

Review Comment:
   @mridulm the modified file here is 
`common/utils/src/main/scala/org/apache/spark/internal/README.md`, a README 
under the internal package.  It is not part of the Spark doc site.
   
   I don't have a strong opinion on this though. If you insist we should NOT 
put this into the doc here, we should enhance the scala/java doc instead.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48017] Add Spark application submission worker for operator [spark-kubernetes-operator]

2024-05-20 Thread via GitHub


dongjoon-hyun commented on PR #10:
URL: 
https://github.com/apache/spark-kubernetes-operator/pull/10#issuecomment-2120919240

   I wrote the current status summary here.
   - 
https://github.com/apache/spark-kubernetes-operator/pull/2#issuecomment-2120918277


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] Operator 0.1.0 [spark-kubernetes-operator]

2024-05-20 Thread via GitHub


dongjoon-hyun commented on PR #2:
URL: 
https://github.com/apache/spark-kubernetes-operator/pull/2#issuecomment-2120918277

   For the record, the `submission` module was finally merged today with SPARK-48326 (TODO).
   - https://github.com/apache/spark-kubernetes-operator/pull/10
   
   On the Spark repo side, @jiangzho also made the following additional improvements for the `submission` module.
   - https://github.com/apache/spark/pull/46373
   - https://github.com/apache/spark/pull/46371
   
   Thank you again, @jiangzho . Shall we move forward to the next modules?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48330][SS][PYTHON] Fix the python streaming data source timeout issue for large trigger interval [spark]

2024-05-20 Thread via GitHub


allisonwang-db commented on code in PR #46651:
URL: https://github.com/apache/spark/pull/46651#discussion_r1607067727


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/python/PythonStreamingSinkCommitRunner.scala:
##
@@ -39,78 +35,22 @@ import org.apache.spark.sql.types.StructType
  * from the socket, then commit or abort a microbatch.
  */
 class PythonStreamingSinkCommitRunner(

Review Comment:
   After this change, is the PythonStreamingSinkCommitRunner the same as the batch one now?



##
python/pyspark/sql/worker/python_streaming_sink_runner.py:
##
@@ -34,22 +33,32 @@
 _parse_datatype_json_string,
 StructType,
 )
-from pyspark.util import handle_worker_exception
+from pyspark.util import handle_worker_exception, local_connect_and_auth
 from pyspark.worker_util import (
 check_python_version,
 read_command,
 pickleSer,
 send_accumulator_updates,
+setup_broadcasts,
 setup_memory_limits,
 setup_spark_files,
 utf8_deserializer,
 )
 
 
 def main(infile: IO, outfile: IO) -> None:
+"""
+Main method for committing or aborting a data source streaming write 
operation.
+
+This process is invoked from the 
`PythonStreamingSinkCommitRunner.runInPython`
+method in the StreamingWrite implementation of the PythonTableProvider. It 
is

Review Comment:
   We don't have a PythonTableProvider. Do you mean PythonTable or 
PythonDataSourceV2?



##
python/pyspark/sql/streaming/python_streaming_source_runner.py:
##
@@ -210,7 +210,8 @@ def main(infile: IO, outfile: IO) -> None:
 # Read information about how to connect back to the JVM from the 
environment.
 java_port = int(os.environ["PYTHON_WORKER_FACTORY_PORT"])
 auth_secret = os.environ["PYTHON_WORKER_FACTORY_SECRET"]
-(sock_file, _) = local_connect_and_auth(java_port, auth_secret)
+(sock_file, sock) = local_connect_and_auth(java_port, auth_secret)
+sock.settimeout(None)

Review Comment:
   nit: can we add a short comment here on why we need to set this timeout?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48017] Add Spark application submission worker for operator [spark-kubernetes-operator]

2024-05-20 Thread via GitHub


dongjoon-hyun closed pull request #10: [SPARK-48017] Add Spark application 
submission worker for operator
URL: https://github.com/apache/spark-kubernetes-operator/pull/10


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

2024-05-20 Thread via GitHub


nikolamand-db commented on code in PR #46180:
URL: https://github.com/apache/spark/pull/46180#discussion_r1606959835


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -118,76 +119,433 @@ public Collation(
 }
 
 /**
- * Constructor with comparators that are inherited from the given collator.
+ * Collation id is defined as 32-bit integer.
+ * We specify binary layouts for different classes of collations.
+ * Classes of collations are differentiated by most significant 3 bits 
(bit 31, 30 and 29),
+ * bit 31 being most significant and bit 0 being least significant.
+ * ---
+ * INDETERMINATE collation id binary layout:
+ * bit 31-0: 1
+ * INDETERMINATE collation id is equal to -1
+ * ---
+ * user-defined collation id binary layout:
+ * bit 31:   0
+ * bit 30:   1
+ * bit 29-0: undefined, reserved for future use
+ * ---
+ * UTF8_BINARY collation id binary layout:
+ * bit 31-22: zeroes
+ * bit 21-18: zeroes, reserved for space trimming
+ * bit 17-16: zeroes, reserved for version
+ * bit 15-3:  zeroes
+ * bit 2: 0, reserved for accent sensitivity
+ * bit 1: 0, reserved for uppercase and case-insensitive
+ * bit 0: 0 = case-sensitive, 1 = lowercase
+ * ---
+ * ICU collation id binary layout:
+ * bit 31-30: zeroes
+ * bit 29:1
+ * bit 28-24: zeroes
+ * bit 23-22: zeroes, reserved for version
+ * bit 21-18: zeroes, reserved for space trimming
+ * bit 17:0 = case-sensitive, 1 = case-insensitive
+ * bit 16:0 = accent-sensitive, 1 = accent-insensitive
+ * bit 15-14: zeroes, reserved for punctuation sensitivity
+ * bit 13-12: zeroes, reserved for first letter preference
+ * bit 11-0:  locale id as specified in `ICULocaleToId` mapping
+ * ---
+ * Some illustrative examples of collation name to id mapping:
+ * - UTF8_BINARY   -> 0
+ * - UTF8_BINARY_LCASE -> 1
+ * - UNICODE   -> 0x2000
+ * - UNICODE_AI-> 0x2001
+ * - UNICODE_CI-> 0x2002
+ * - UNICODE_CI_AI -> 0x2003
+ * - af-> 0x2001
+ * - af_CI_AI  -> 0x20030001
  */
-public Collation(
-String collationName,
-Collator collator,
-String version,
-boolean supportsBinaryEquality,
-boolean supportsBinaryOrdering,
-boolean supportsLowercaseEquality) {
-  this(
-collationName,
-collator,
-(s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-version,
-s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-supportsBinaryEquality,
-supportsBinaryOrdering,
-supportsLowercaseEquality);
+private abstract static class CollationSpec {
+
+  private enum DefinitionOrigin {

Review Comment:
   Added multiple clarifications across `CollationFactory`, resolving this. If 
we need more clarification, let's discuss the individual problematic points in 
the code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

2024-05-20 Thread via GitHub


nikolamand-db commented on code in PR #46180:
URL: https://github.com/apache/spark/pull/46180#discussion_r1606956472


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -118,76 +119,433 @@ public Collation(
 }
 
 /**
- * Constructor with comparators that are inherited from the given collator.
+ * Collation id is defined as 32-bit integer.
+ * We specify binary layouts for different classes of collations.
+ * Classes of collations are differentiated by most significant 3 bits 
(bit 31, 30 and 29),
+ * bit 31 being most significant and bit 0 being least significant.
+ * ---
+ * INDETERMINATE collation id binary layout:
+ * bit 31-0: 1
+ * INDETERMINATE collation id is equal to -1
+ * ---
+ * user-defined collation id binary layout:
+ * bit 31:   0
+ * bit 30:   1
+ * bit 29-0: undefined, reserved for future use
+ * ---
+ * UTF8_BINARY collation id binary layout:
+ * bit 31-22: zeroes
+ * bit 21-18: zeroes, reserved for space trimming
+ * bit 17-16: zeroes, reserved for version
+ * bit 15-3:  zeroes
+ * bit 2: 0, reserved for accent sensitivity
+ * bit 1: 0, reserved for uppercase and case-insensitive
+ * bit 0: 0 = case-sensitive, 1 = lowercase
+ * ---
+ * ICU collation id binary layout:
+ * bit 31-30: zeroes
+ * bit 29:1
+ * bit 28-24: zeroes
+ * bit 23-22: zeroes, reserved for version

Review Comment:
   Thanks for catching this. I refactored the comment to unify the bits used 
for versioning and space trimming in both `UTF8_BINARY` and ICU collations. 
Added a bit-centric view as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48238][BUILD][YARN] Replace YARN AmIpFilter with a forked implementation [spark]

2024-05-20 Thread via GitHub


dongjoon-hyun commented on PR #46611:
URL: https://github.com/apache/spark/pull/46611#issuecomment-2120728328

   Thank you, @pan3793 and all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

2024-05-20 Thread via GitHub


nikolamand-db commented on code in PR #46180:
URL: https://github.com/apache/spark/pull/46180#discussion_r1606952858


##
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala:
##
@@ -152,4 +219,218 @@ class CollationFactorySuite extends AnyFunSuite with 
Matchers { // scalastyle:ig
   }
 })
   }
+
+  test("test collation caching") {
+Seq(
+  "UTF8_BINARY",
+  "UTF8_BINARY_LCASE",
+  "UNICODE",
+  "UNICODE_CI",
+  "UNICODE_AI",
+  "UNICODE_CI_AI",
+  "UNICODE_AI_CI"
+).foreach(collationId => {
+  val col1 = fetchCollation(collationId)
+  val col2 = fetchCollation(collationId)
+  assert(col1 eq col2) // reference equality
+})
+  }
+
+  test("collations with ICU non-root localization") {
+Seq(
+  // language only
+  "en",
+  "en_CS",
+  "en_CI",
+  "en_AS",
+  "en_AI",
+  // language + 3-letter country code
+  "en_USA",
+  "en_USA_CS",
+  "en_USA_CI",
+  "en_USA_AS",
+  "en_USA_AI",
+  // language + script code
+  "sr_Cyrl",
+  "sr_Cyrl_CS",
+  "sr_Cyrl_CI",
+  "sr_Cyrl_AS",
+  "sr_Cyrl_AI",
+  // language + script code + 3-letter country code
+  "sr_Cyrl_SRB",
+  "sr_Cyrl_SRB_CS",
+  "sr_Cyrl_SRB_CI",
+  "sr_Cyrl_SRB_AS",
+  "sr_Cyrl_SRB_AI"
+).foreach(collationICU => {
+  val col = fetchCollation(collationICU)
+  assert(col.collator.getLocale(ULocale.VALID_LOCALE) != ULocale.ROOT)
+})
+  }
+
+  test("invalid names of collations with ICU non-root localization") {

Review Comment:
   Added invalid ordering 
[checks](https://github.com/apache/spark/pull/46180/files#diff-9c12d32db9d55dd6ecb5b10f2fc57c7ba7de7275cab57bf157fa42cbc09f3876R291-R310)
 as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48352][SQL]set max file counter through spark conf [spark]

2024-05-20 Thread via GitHub


guixiaowen commented on PR #46668:
URL: https://github.com/apache/spark/pull/46668#issuecomment-2120702247

   @dongjoon-hyun  Do you have time? Can you help me review this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-20 Thread via GitHub


stefankandic commented on PR #46280:
URL: https://github.com/apache/spark/pull/46280#issuecomment-2120702050

   @cloud-fan I looked into HMS a bit, and it seems that we can't save column 
metadata there, so I guess we will still have to keep converting a schema with 
collations to one without them when creating a table in Hive, even though 
collations are no longer a type?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48354][SQL] JDBC Connectors predicate pushdown testing [spark]

2024-05-20 Thread via GitHub


stefanbuk-db commented on code in PR #46642:
URL: https://github.com/apache/spark/pull/46642#discussion_r1606936510


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerPushdownIntegrationSuite.scala:
##
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.jdbc.v2
+
+import java.sql.Connection
+
+import test.scala.org.apache.spark.sql.jdbc.v2.V2JDBCPushdownTest
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog
+import org.apache.spark.sql.jdbc.{DatabaseOnDocker, 
DockerJDBCIntegrationSuite, MsSQLServerDatabaseOnDocker}
+
+class MsSqlServerPushdownIntegrationSuite

Review Comment:
   Same answer as above



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48354][SQL] JDBC Connectors predicate pushdown testing [spark]

2024-05-20 Thread via GitHub


stefanbuk-db commented on code in PR #46642:
URL: https://github.com/apache/spark/pull/46642#discussion_r1606935984


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCPushdownTest.scala:
##
@@ -0,0 +1,387 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package test.scala.org.apache.spark.sql.jdbc.v2
+
+import scala.collection.immutable.Seq
+
+import org.apache.spark.sql.{DataFrame, QueryTest, Row}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LocalLimit}
+import org.apache.spark.sql.execution.FilterExec
+import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.tags.DockerTest
+
+@DockerTest

Review Comment:
   Well, it seemed like a fitting place, but the trait can be used for more 
than Docker integration tests; where would be a better home for it?
   Also, we could add a way to filter out tests here, but suites implementing 
this trait already extend `SparkFunSuite` and use `override def excluded` from 
there. Do we need another method for that in this trait?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48342] SQL Batch Lang Parser [spark]

2024-05-20 Thread via GitHub


dbatomic commented on code in PR #46665:
URL: https://github.com/apache/spark/pull/46665#discussion_r1606932943


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/BatchParserSuite.scala:
##
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.parser
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.plans.SQLHelper
+import org.apache.spark.sql.exceptions.SqlBatchLangException
+
+class BatchParserSuite extends SparkFunSuite with SQLHelper {
+  import CatalystSqlParser._
+
+  test("single select") {
+val batch = "SELECT 1;"
+val tree = parseBatch(batch)
+assert(tree.collection.length == 1)
+assert(tree.collection.head.isInstanceOf[SparkStatementWithPlan])
+val sparkStatement = 
tree.collection.head.asInstanceOf[SparkStatementWithPlan]
+assert(sparkStatement.getText(batch) == "SELECT 1;")
+  }
+
+  test("single select without ;") {
+val batch = "SELECT 1"
+val tree = parseBatch(batch)
+assert(tree.collection.length == 1)
+assert(tree.collection.head.isInstanceOf[SparkStatementWithPlan])
+val sparkStatement = 
tree.collection.head.asInstanceOf[SparkStatementWithPlan]
+assert(sparkStatement.getText(batch) == "SELECT 1")
+  }
+
+  test("multi select without ; - should fail") {
+val batch = "SELECT 1 SELECT 1"
+intercept[ParseException] {
+  parseBatch(batch)
+}
+  }
+
+  test("multi select") {
+val batch = "BEGIN SELECT 1;SELECT 2; END"
+val tree = parseBatch(batch)
+assert(tree.collection.length == 2)
+assert(tree.collection.forall(_.isInstanceOf[SparkStatementWithPlan]))
+
+batch.split(";")
+  .map(_.replace("\n", "").replace("BEGIN", "").replace("END", "").trim)

Review Comment:
   Can we create a helper method called `splitStatementText`, or something 
along those lines, that does the map/replace/trim steps? A possible shape is 
sketched below.
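
   For example (just a sketch; the name and the exact filtering are up for debate):

```scala
// Hypothetical helper for this suite; `splitStatementText` is only a suggested
// name and does not exist in the PR. It splits a batch on ';', strips the
// BEGIN/END markers and newlines, trims each piece, and drops empty entries.
private def splitStatementText(batch: String): Seq[String] =
  batch.split(";")
    .map(_.replace("\n", "").replace("BEGIN", "").replace("END", "").trim)
    .filter(_.nonEmpty)
    .toSeq
```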



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48342] SQL Batch Lang Parser [spark]

2024-05-20 Thread via GitHub


dbatomic commented on code in PR #46665:
URL: https://github.com/apache/spark/pull/46665#discussion_r1606931359


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/BatchParserSuite.scala:
##
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.parser
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.plans.SQLHelper
+import org.apache.spark.sql.exceptions.SqlBatchLangException
+
+class BatchParserSuite extends SparkFunSuite with SQLHelper {
+  import CatalystSqlParser._
+
+  test("single select") {
+val batch = "SELECT 1;"
+val tree = parseBatch(batch)
+assert(tree.collection.length == 1)
+assert(tree.collection.head.isInstanceOf[SparkStatementWithPlan])
+val sparkStatement = 
tree.collection.head.asInstanceOf[SparkStatementWithPlan]
+assert(sparkStatement.getText(batch) == "SELECT 1;")
+  }
+
+  test("single select without ;") {
+val batch = "SELECT 1"
+val tree = parseBatch(batch)
+assert(tree.collection.length == 1)
+assert(tree.collection.head.isInstanceOf[SparkStatementWithPlan])
+val sparkStatement = 
tree.collection.head.asInstanceOf[SparkStatementWithPlan]
+assert(sparkStatement.getText(batch) == "SELECT 1")
+  }
+
+  test("multi select without ; - should fail") {
+val batch = "SELECT 1 SELECT 1"
+intercept[ParseException] {
+  parseBatch(batch)

Review Comment:
   Can we also check error message?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48342] SQL Batch Lang Parser [spark]

2024-05-20 Thread via GitHub


dbatomic commented on code in PR #46665:
URL: https://github.com/apache/spark/pull/46665#discussion_r1606930293


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParserInterface.scala:
##
@@ -62,4 +62,10 @@ trait ParserInterface extends DataTypeParserInterface {
*/
   @throws[ParseException]("Text cannot be parsed to a LogicalPlan")
   def parseQuery(sqlText: String): LogicalPlan
+
+  /**
+   * Parse a query string to a [[BatchBody]].
+   */
+  @throws[ParseException]("")

Review Comment:
   Please add a message here.
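
   For example, mirroring the wording used for `parseQuery` above (the exact 
phrasing and the parameter name are only suggestions):

```scala
@throws[ParseException]("Text cannot be parsed to a BatchBody")
def parseBatch(batchText: String): BatchBody
```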



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48351] JDBC Connectors - Add cast suite and fix found issue [spark]

2024-05-20 Thread via GitHub


urosstan-db commented on code in PR #46669:
URL: https://github.com/apache/spark/pull/46669#discussion_r1606929242


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala:
##
@@ -52,6 +52,27 @@ private case class MySQLDialect() extends JdbcDialect with 
SQLConfHelper {
 supportedFunctions.contains(funcName)
 
   class MySQLSQLBuilder extends JDBCSQLBuilder {
+override def visitCast(expr: String, exprDataType: DataType, dataType: 
DataType): String = {
+  dataType match {
+case _: IntegralType =>
+  // MySQL does not support cast to SHORT INT, BIGINT

Review Comment:
   nit: SHORT, INT



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48354][SQL] JDBC Connectors predicate pushdown testing [spark]

2024-05-20 Thread via GitHub


stefanbuk-db commented on code in PR #46642:
URL: https://github.com/apache/spark/pull/46642#discussion_r1606926117


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala:
##
@@ -141,6 +160,9 @@ private case class MsSqlServerDialect() extends JdbcDialect 
{
 case ShortType if !SQLConf.get.legacyMsSqlServerNumericMappingEnabled =>
   Some(JdbcType("SMALLINT", java.sql.Types.SMALLINT))
 case ByteType => Some(JdbcType("SMALLINT", java.sql.Types.TINYINT))
+case LongType => Some(JdbcType("BIGINT", java.sql.Types.BIGINT))
+case DoubleType => Some(JdbcType("FLOAT", java.sql.Types.FLOAT))
+case _ if !SQLConf.get.legacyMsSqlServerNumericMappingEnabled => 
JdbcUtils.getCommonJDBCType(dt)

Review Comment:
   If the question is about `!SQLConf.get.legacyMsSqlServerNumericMappingEnabled`: 
it is added here because that config (false by default) disables some of the 
type mappings when it is set (I'm not sure which ones, or why the config 
exists). Without this check, some tests that enable the config would fail, 
because we would, for example, convert ShortType to `SMALLINT` even though we 
shouldn't.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48354][SQL] JDBC Connectors predicate pushdown testing [spark]

2024-05-20 Thread via GitHub


stefanbuk-db commented on code in PR #46642:
URL: https://github.com/apache/spark/pull/46642#discussion_r1606922807


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala:
##
@@ -155,7 +162,8 @@ private case class PostgresDialect() extends JdbcDialect 
with SQLConfHelper {
   getJDBCType(et).map(_.databaseTypeDefinition)
 .orElse(JdbcUtils.getCommonJDBCType(et).map(_.databaseTypeDefinition))
 .map(typeName => JdbcType(s"$typeName[]", java.sql.Types.ARRAY))
-case _ => None
+case LongType => Some(JdbcType("BIGINT", Types.BIGINT))
+case _ => JdbcUtils.getCommonJDBCType(dt)

Review Comment:
   Not sure if this is what you mean, but in `visitCast` we have 
`getJDBCType(dataType).map(_.databaseTypeDefinition).getOrElse(dataType.typeName)`, 
so if we return None here, `JdbcUtils.getCommonJDBCType` won't be called. A 
minimal sketch of that fallback is below.
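
   For illustration only (stub types and match arms, not the real 
`PostgresDialect` members), the fallback chain looks like this:

```scala
object CastFallbackSketch {
  // Stand-in for the dialect's JdbcType; only databaseTypeDefinition matters here.
  case class JdbcTypeStub(databaseTypeDefinition: String)

  // Stand-in for the dialect's getJDBCType: returning None triggers the fallback.
  def getJDBCType(sparkTypeName: String): Option[JdbcTypeStub] = sparkTypeName match {
    case "long" => Some(JdbcTypeStub("BIGINT"))
    case _      => None
  }

  // Mirrors the expression quoted from visitCast: when getJDBCType returns None,
  // it falls back to the Spark type name, so JdbcUtils.getCommonJDBCType is never reached.
  def castTypeName(sparkTypeName: String): String =
    getJDBCType(sparkTypeName).map(_.databaseTypeDefinition).getOrElse(sparkTypeName)

  def main(args: Array[String]): Unit = {
    println(castTypeName("long"))   // BIGINT
    println(castTypeName("string")) // string
  }
}
```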



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48354][SQL] JDBC Connectors predicate pushdown testing [spark]

2024-05-20 Thread via GitHub


stefanbuk-db commented on code in PR #46642:
URL: https://github.com/apache/spark/pull/46642#discussion_r1606917652


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySqlPushdownIntegrationSuite.scala:
##
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.jdbc.v2
+
+import java.sql.Connection
+
+import test.scala.org.apache.spark.sql.jdbc.v2.V2JDBCPushdownTest
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog
+import org.apache.spark.sql.jdbc.{DatabaseOnDocker, 
DockerJDBCIntegrationSuite, MySQLDatabaseOnDocker}
+
+class MySqlPushdownIntegrationSuite
+  extends DockerJDBCIntegrationSuite
+with V2JDBCPushdownTest {

Review Comment:
   We could, but that would run all the tests from `V2JDBCTest` as well, and 
I'm not sure we want that.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-48360] Simplify conditionals with predicate branches [spark]

2024-05-20 Thread via GitHub


tom-s-powell opened a new pull request, #46671:
URL: https://github.com/apache/spark/pull/46671

   ### What changes were proposed in this pull request?
   I am proposing additional optimizer rules for conditionals (`If` and 
`CaseWhen` expressions) whose branches are themselves predicates. These could 
be written as Boolean logic and would thus allow filters containing such 
expressions to be pushed to a data source. 
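
   As a rough illustration of the rewrite this enables (not the exact rules in 
this PR, and ignoring three-valued NULL semantics, which the actual rules must 
preserve), a minimal sketch:

```scala
// Self-contained sketch of the Boolean rewrite described above, checked on
// non-null inputs only; the real optimizer rules must also handle NULLs.
object PredicateBranchRewriteSketch {
  // CASE WHEN cond THEN p ELSE q END, where p and q are themselves predicates.
  def caseWhen(cond: Boolean, p: Boolean, q: Boolean): Boolean =
    if (cond) p else q

  // The equivalent Boolean form, which filter pushdown can hand to a data source.
  def rewritten(cond: Boolean, p: Boolean, q: Boolean): Boolean =
    (cond && p) || (!cond && q)

  def main(args: Array[String]): Unit = {
    val bools = Seq(true, false)
    val allAgree = (for (c <- bools; p <- bools; q <- bools)
      yield caseWhen(c, p, q) == rewritten(c, p, q)).forall(identity)
    println(s"rewrite agrees with CASE WHEN on all non-null inputs: $allAgree")
  }
}
```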
   
   ### Why are the changes needed?
   The primary motivation is to allow filters to be pushed and thus potentially 
handled by a datasource.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Unit tests have been added.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [WIP] Don't review: E2e [spark]

2024-05-20 Thread via GitHub


GideonPotok opened a new pull request, #46670:
URL: https://github.com/apache/spark/pull/46670

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


