Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]
pan3793 commented on PR #43338: URL: https://github.com/apache/spark/pull/43338#issuecomment-1759035172
> The RocksDB Team recommend LZ4 or ZSTD ...

Why choose LZ4 instead of ZSTD? I suppose ZSTD is a more future-proof algorithm.
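For context on the discussion, a minimal sketch of where this choice surfaces in the RocksDB Java API (illustrative only, not the PR's code; Spark's state store wires the codec up through its own configuration):

```scala
import org.rocksdb.{CompressionType, Options}

// Illustrative only: the codec under discussion is selected via Options.setCompressionType.
val options = new Options()
  .setCreateIfMissing(true)
  .setCompressionType(CompressionType.LZ4_COMPRESSION) // the PR's choice; ZSTD_COMPRESSION is the alternative raised above
```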
Re: [PR] [SPARK-45484][SQL][3.5] Deprecated the incorrect parquet compression codec lz4raw [spark]
beliefer commented on PR #43330: URL: https://github.com/apache/spark/pull/43330#issuecomment-1758993812
> So this change is only needed in 3.5, and we already fixed it differently in 4.0?

Yes. This PR is only for 3.5.1, and https://github.com/apache/spark/pull/43310 fixes it in 4.0.0.
Re: [PR] [SPARK-44649][SQL] Runtime Filter supports passing equivalent creation side expressions [spark]
beliefer commented on code in PR #42317: URL: https://github.com/apache/spark/pull/42317#discussion_r1356124917
## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala:
## @@ -121,14 +128,44 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J */ private def extractSelectiveFilterOverScan( plan: LogicalPlan, - filterCreationSideExp: Expression): Option[LogicalPlan] = { -@tailrec + filterCreationSideExp: Expression): Option[(Expression, LogicalPlan)] = {
Review Comment: OK
Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]
cloud-fan commented on code in PR #41855: URL: https://github.com/apache/spark/pull/41855#discussion_r1356116942
## sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala:
## @@ -193,6 +193,22 @@ abstract class JdbcDialect extends Serializable with Logging { statement.executeUpdate(s"CREATE TABLE $tableName ($strSchema) $createTableOptions") } + /** + * Returns an Insert SQL statement for inserting a row into the target table via JDBC conn.
Review Comment: We need more documentation here, to explain the placeholder thing.
## sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala:
## @@ -193,6 +193,22 @@ abstract class JdbcDialect extends Serializable with Logging { statement.executeUpdate(s"CREATE TABLE $tableName ($strSchema) $createTableOptions") } + /** + * Returns an Insert SQL statement for inserting a row into the target table via JDBC conn. + * + * @param table The name of the table. + * @param fields The fields of the row that will be inserted. + * @return The SQL query to use for insert data into table. + */ + @Since("4.0.0") + def insertIntoTable( +table: String,
Review Comment: nit: 4 spaces indentation
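To make the "placeholder thing" concrete for readers of this thread, here is a hypothetical sketch (not the PR's implementation) of what such an insert-statement builder produces: one JDBC `?` placeholder per column, to be bound later through a `PreparedStatement`.

```scala
// Hypothetical helper, for illustration only.
def buildInsertStatement(table: String, columns: Seq[String]): String = {
  val columnList = columns.mkString(", ")
  val placeholders = columns.map(_ => "?").mkString(", ")
  s"INSERT INTO $table ($columnList) VALUES ($placeholders)"
}

// buildInsertStatement("people", Seq("name", "age"))
//   => "INSERT INTO people (name, age) VALUES (?, ?)"
```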
Re: [PR] [SPARK-44649][SQL] Runtime Filter supports passing equivalent creation side expressions [spark]
beliefer commented on code in PR #42317: URL: https://github.com/apache/spark/pull/42317#discussion_r1356096817
## sql/core/src/test/scala/org/apache/spark/sql/InjectRuntimeFilterSuite.scala:
## @@ -390,34 +390,58 @@ class InjectRuntimeFilterSuite extends QueryTest with SQLTestUtils with SharedSp test("Runtime bloom filter join: two joins") { withSQLConf(SQLConf.RUNTIME_BLOOM_FILTER_APPLICATION_SIDE_SCAN_SIZE_THRESHOLD.key -> "3000", SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2000") { - assertRewroteWithBloomFilter("select * from bf1 join bf2 join bf3 on bf1.c1 = bf2.c2 " +
Review Comment: This controls whether the join has a shuffle.
Re: [PR] [SPARK-45327][BUILD] Upgrade zstd-jni to 1.5.5-6 [spark]
panbingkun commented on PR #43113: URL: https://github.com/apache/spark/pull/43113#issuecomment-1758950109
waiting for https://github.com/apache/spark/pull/43345
Re: [PR] [SPARK-45442][PYTHON][DOCS] Refine docstring of DataFrame.show [spark]
zhengruifeng commented on PR #43252: URL: https://github.com/apache/spark/pull/43252#issuecomment-1758918774
merged to master
Re: [PR] [SPARK-45442][PYTHON][DOCS] Refine docstring of DataFrame.show [spark]
zhengruifeng closed pull request #43252: [SPARK-45442][PYTHON][DOCS] Refine docstring of DataFrame.show URL: https://github.com/apache/spark/pull/43252
Re: [PR] [SPARK-45430] Fix for FramelessOffsetWindowFunction when IGNORE NULLS and offset > rowCount [spark]
cloud-fan commented on code in PR #43236: URL: https://github.com/apache/spark/pull/43236#discussion_r1356054361
## sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala:
## @@ -200,15 +200,19 @@ class FrameLessOffsetWindowFunctionFrame( override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { resetStates(rows) -if (ignoreNulls) { - findNextRowWithNonNullInput() +if (Math.abs(offset) > rows.length) { + inputIndex = offset
Review Comment: what does this do? skip everything and return default value?
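For context on the case being discussed, a hedged illustration (made-up data, not from the PR): when the offset exceeds the number of rows in the partition, every row falls outside the frame and should receive the default value, with or without IGNORE NULLS.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only; builds a local session to run a made-up query.
val spark = SparkSession.builder().master("local[1]").appName("offset-demo").getOrCreate()
val df = spark.sql(
  """
    |SELECT a,
    |       LEAD(a, 4, -1) IGNORE NULLS OVER (ORDER BY a) AS led
    |FROM VALUES (1), (2), (CAST(NULL AS INT)) AS t(a)
  """.stripMargin)
df.show()
// Expected: `led` is -1 for every row, because the offset (4) exceeds the partition row count (3).
```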
Re: [PR] [SPARK-45427][CORE] Add RPC SSL settings to SSLOptions and SparkTransportConf [spark]
hasnain-db commented on PR #43238: URL: https://github.com/apache/spark/pull/43238#issuecomment-1758897493
@JoshRosen I updated the docs PR to hopefully add more clarity around the various encryption options (and, to your request, clarified how they overlap with the authentication setting).

> There may be some advantages to the hierarchical namespacing of SSL configurations...

Agreed. If we reuse the `spark.ssl.rpc` namespace as is done in the PR, the implementation is also much easier -- but we can also make the settings inheritance work if we use a different namespace for it; implementation complexity shouldn't hinder UX, IMO.

> I'm wondering if we use some long-term vision of a future end state to guide us here.

I don't have enough context to make a call here and defer to @mridulm. I will say that, as a random observer who hasn't had to manage too many Spark deployments just yet, I feel the current encryption mechanism has a strong point in its favor: it's quite simple to configure just a shared secret. SSL requires provisioning keys and certs, which is always more work.
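For readers unfamiliar with the namespacing being debated, a hedged sketch of what configuration under the `spark.ssl.rpc` namespace could look like. The key names follow the existing `spark.ssl.<namespace>.*` pattern; which keys the RPC namespace ultimately supports is defined by this PR, and the paths below are placeholders.

```scala
// Illustrative settings only; not an authoritative list of supported keys.
val sslRpcConf = Map(
  "spark.ssl.rpc.enabled" -> "true",
  "spark.ssl.rpc.keyStore" -> "/path/to/keystore.jks",       // placeholder path
  "spark.ssl.rpc.keyStorePassword" -> "********",
  "spark.ssl.rpc.trustStore" -> "/path/to/truststore.jks"    // placeholder path
)
```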
Re: [PR] [SPARK-45132][SQL] Fix IDENTIFIER for function invocation [spark]
cloud-fan commented on code in PR #42888: URL: https://github.com/apache/spark/pull/42888#discussion_r1356029023
## sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
## @@ -1196,6 +1195,7 @@ qualifiedNameList functionName : IDENTIFIER_KW LEFT_PAREN expression RIGHT_PAREN +| identFunc=IDENTIFIER_KW // IDENTIFIER itself is also a valid function name.
Review Comment:
```suggestion
| identFunc=IDENTIFIER_KW // IDENTIFIER itself is also a valid function name.
```
[PR] [SPARK-45510][SQL] Replace `scala.collection.generic.Growable` to `scala.collection.mutable.Growable` [spark]
Hisoka-X opened a new pull request, #43347: URL: https://github.com/apache/spark/pull/43347

### What changes were proposed in this pull request?
Since Scala 2.13.0, `scala.collection.generic.Growable` is marked as deprecated. This PR changes it to `scala.collection.mutable.Growable`.

### Why are the changes needed?
Remove a deprecated API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No
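As a hedged illustration of the change (a made-up call site, not a file from the PR), the move is essentially an import swap, since in Scala 2.13 the type lives under `scala.collection.mutable`:

```scala
import scala.collection.mutable.Growable // previously: scala.collection.generic.Growable (deprecated alias in 2.13)

// Hypothetical helper showing that code written against Growable is otherwise unchanged.
def appendAll[A](target: Growable[A], items: Iterable[A]): Unit =
  items.foreach(target += _)

// Example: appendAll(scala.collection.mutable.ArrayBuffer.empty[Int], Seq(1, 2, 3))
```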
Re: [PR] [SPARK-45488][SQL] XML: Add support for value in 'rowTag' element [spark]
shujingyang-db commented on code in PR #43319: URL: https://github.com/apache/spark/pull/43319#discussion_r1356022084
## sql/core/src/test/resources/test-data/xml-resources/root-level-value.xml:
## @@ -0,0 +1,9 @@ + + +value1 +value2 + +value3 + + +
Review Comment: Added a test `root-level value tag for not attributes-only object`👍
Re: [PR] [SPARK-44649][SQL] Runtime Filter supports passing equivalent creation side expressions [spark]
beliefer commented on code in PR #42317: URL: https://github.com/apache/spark/pull/42317#discussion_r1356017720
## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala:
## @@ -29,7 +29,14 @@ import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ /** - * Insert a filter on one side of the join if the other side has a selective predicate. + * Insert a runtime filter on one side of the join (we call this side the application side) if + * the plan of other side satisfy one of the following scenes:
Review Comment: To be precise, `the other side` may not be the `creation side`; the creation side may be a sub-plan of the other side. So I adjusted this comment to:
```
Insert a runtime filter on one side of the join (we call this side the application side) if
we can extract a plan (we call this plan the creation side) from the other side and construct
a runtime filter with the creation side. A simple case is that the creation side is a table
scan with a selective filter.
```
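To ground the "creation side" terminology for readers of this thread, a hedged, made-up example (table names are hypothetical): the selective filter on `fact` makes the `fact` scan a good creation side, and the resulting bloom filter is applied to the scan of `dim` on the application side.

```scala
// Illustrative query only; assumes tables `dim(k, ...)` and `fact(k, dt, ...)` exist.
val query =
  """
    |SELECT *
    |FROM dim JOIN fact f ON dim.k = f.k
    |WHERE f.dt = '2023-10-01'   -- selective predicate on the creation side
  """.stripMargin
```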
Re: [PR] [SPARK-45488][SQL] XML: Add support for value in 'rowTag' element [spark]
HyukjinKwon commented on PR #43319: URL: https://github.com/apache/spark/pull/43319#issuecomment-1758848151
The test failures look unrelated, but mind retriggering https://github.com/shujingyang-db/spark/runs/17621262726 again?
Re: [PR] [SPARK-45508][CORE] Add "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" so Platform can access Cleaner on Java 9+ [spark]
LuciferYang commented on code in PR #43344: URL: https://github.com/apache/spark/pull/43344#discussion_r1356005221
## common/unsafe/src/test/java/org/apache/spark/unsafe/PlatformUtilSuite.java:
## @@ -157,4 +157,11 @@ public void heapMemoryReuse() { Assertions.assertEquals(1024 * 1024 + 7, onheap4.size()); Assertions.assertEquals(obj3, onheap4.getBaseObject()); } + + @Test + public void cleanerCreateMethodIsDefined() { +// Regression test for SPARK-45508: we don't expect the "no cleaner" fallback +// path to be hit in normal usage. +Assertions.assertTrue(Platform.cleanerCreateMethodIsDefined());
Review Comment:
> If we want to backport this change then we will have to prepare a separate PR because #43074 changed the JUnit assertions helper names so the same source can't compile in all branches.

sorry for making this change ...
Re: [PR] [SPARK-42881][SQL][FOLLOWUP] Update the results of JsonBenchmark-jdk21 after get_json_object supports codegen [spark]
LuciferYang commented on PR #43346: URL: https://github.com/apache/spark/pull/43346#issuecomment-1758843965
This just updates the benchmark result. Merged into master for Spark 4.0, thanks @panbingkun ~
Re: [PR] [SPARK-42881][SQL][FOLLOWUP] Update the results of JsonBenchmark-jdk21 after get_json_object supports codegen [spark]
LuciferYang closed pull request #43346: [SPARK-42881][SQL][FOLLOWUP] Update the results of JsonBenchmark-jdk21 after get_json_object supports codegen URL: https://github.com/apache/spark/pull/43346
Re: [PR] [SPARK-42881][SQL] Codegen Support for get_json_object [spark]
panbingkun commented on code in PR #40506: URL: https://github.com/apache/spark/pull/40506#discussion_r1355992806
## sql/core/benchmarks/JsonBenchmark-results.txt:
## @@ -3,127 +3,128 @@ Benchmark for performance of JSON parsing Preparing data for benchmarking ... -OpenJDK 64-Bit Server VM 17.0.8+7-LTS on Linux 5.15.0-1046-azure +OpenJDK 64-Bit Server VM 17.0.8+7-LTS on Linux 5.15.0-1047-azure
Review Comment: Follow-up PR: https://github.com/apache/spark/pull/43346
Re: [PR] [SPARK-42881][SQL][FOLLOWUP] Codegen Support for get_json_object [spark]
panbingkun commented on PR #43346: URL: https://github.com/apache/spark/pull/43346#issuecomment-1758833377
cc @LuciferYang @MaxGekk
Re: [PR] [SPARK-45499][CORE][TESTS][FOLLOWUP] Use `ReferenceQueue#remove` instead of `ReferenceQueue#poll` [spark]
LuciferYang commented on PR #43345: URL: https://github.com/apache/spark/pull/43345#issuecomment-1758833221
OK, added a `1000ms` timeout.
[PR] [SPARK-42881][SQL][FOLLOWUP] Codegen Support for get_json_object [spark]
panbingkun opened a new pull request, #43346: URL: https://github.com/apache/spark/pull/43346

### What changes were proposed in this pull request?
The PR follows up https://github.com/apache/spark/pull/40506 and updates JsonBenchmark-jdk21-results.txt for it.

### Why are the changes needed?
Update JsonBenchmark-jdk21-results.txt.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Only updates the results of the benchmark.

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-45501][CORE][SQL] Use pattern matching for type checking and conversion [spark]
LuciferYang commented on PR #43327: URL: https://github.com/apache/spark/pull/43327#issuecomment-1758831231
GA passed
Re: [PR] [SPARK-45499][CORE][TESTS][FOLLOWUP] Use `ReferenceQueue#remove` instead of `ReferenceQueue#poll` [spark]
LuciferYang commented on code in PR #43345: URL: https://github.com/apache/spark/pull/43345#discussion_r1355987310
## core/src/test/scala/org/apache/spark/util/CompletionIteratorSuite.scala:
## @@ -64,6 +64,6 @@ class CompletionIteratorSuite extends SparkFunSuite { } } assert(ref.refersTo(null)) -assert(refQueue.poll() === ref) +assert(refQueue.remove() === ref)
Review Comment: also cc @srowen
Re: [PR] [SPARK-45439][SQL][UI] Reduce memory usage of LiveStageMetrics.accumIdsToMetricType [spark]
JoshRosen commented on code in PR #43250: URL: https://github.com/apache/spark/pull/43250#discussion_r1355983329
## sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala:
## @@ -490,7 +490,12 @@ private class LiveExecutionData(val executionId: Long) extends LiveEntity { var details: String = null var physicalPlanDescription: String = null var modifiedConfigs: Map[String, String] = _ - var metrics = collection.Seq[SQLPlanMetric]() + private var _metrics = collection.Seq[SQLPlanMetric]() + def metrics: collection.Seq[SQLPlanMetric] = _metrics + // This mapping is shared across all LiveStageMetrics instances associated with + // this LiveExecutionData, helping to reduce memory overhead by avoiding waste + // from separate immutable maps with largely overlapping sets of entries. + val metricAccumulatorIdToMetricType = new ConcurrentHashMap[Long, String]()
Review Comment: Our Spark fork contains a codepath that performs multi-threaded access to this data structure, so that's why I used a thread-safe data structure here. If you would prefer that I avoid leaking those proprietary thread-safety concerns into OSS Apache Spark then I would be glad to refactor this to use a `mutable.Map` instead and keep the thread-safe version in our fork.
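For readers skimming the thread, a hedged sketch of the sharing idea described in the diff comment (simplified, hypothetical class names; the real code lives in `SQLAppStatusListener`): all stage-level views of one execution read from a single shared map instead of each holding its own largely overlapping immutable map.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical, simplified classes for illustration only.
class SharedMetricTypes {
  // One map per execution, shared by every stage that belongs to it.
  val accumIdToMetricType = new ConcurrentHashMap[Long, String]()
}

class StageMetricsView(shared: SharedMetricTypes) {
  def metricTypeOf(accumId: Long): Option[String] =
    Option(shared.accumIdToMetricType.get(accumId))
}
```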
Re: [PR] [SPARK-45499][CORE][TESTS][FOLLOWUP] Use `ReferenceQueue#remove` instead of `ReferenceQueue#poll` [spark]
LuciferYang commented on code in PR #43345: URL: https://github.com/apache/spark/pull/43345#discussion_r1355979774
## core/src/test/scala/org/apache/spark/util/CompletionIteratorSuite.scala:
## @@ -64,6 +64,6 @@ class CompletionIteratorSuite extends SparkFunSuite { } } assert(ref.refersTo(null)) -assert(refQueue.poll() === ref) +assert(refQueue.remove() === ref)
Review Comment: `remove` is a blocking method; should we add a timeout (ms), like `.remove(1000)`?
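For context, a minimal sketch of the difference being discussed (not the suite's exact code): `poll()` returns immediately and may return `null` if the reference has not been enqueued yet, while `remove(timeout)` blocks for up to the timeout waiting for it.

```scala
import java.lang.ref.{PhantomReference, ReferenceQueue}

val queue = new ReferenceQueue[Object]
var obj: Object = new Object
val ref = new PhantomReference[Object](obj, queue)
obj = null
System.gc()

val nonBlocking = queue.poll()      // may be null if GC has not enqueued the reference yet
val blocking = queue.remove(1000L)  // waits up to 1000 ms for the reference to be enqueued
```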
Re: [PR] [SPARK-45499][CORE][TESTS][FOLLOWUP] Use `ReferenceQueue#remove` instead of `ReferenceQueue#poll` [spark]
LuciferYang commented on code in PR #43345: URL: https://github.com/apache/spark/pull/43345#discussion_r1355977615
## core/src/test/scala/org/apache/spark/util/CompletionIteratorSuite.scala:
## @@ -64,6 +64,6 @@ class CompletionIteratorSuite extends SparkFunSuite { } } assert(ref.refersTo(null)) -assert(refQueue.poll() === ref) +assert(refQueue.remove() === ref)
Review Comment: cc @dongjoon-hyun @yaooqinn FYI
[PR] [SPARK-45499][CORE][TESTS][FOLLOWUP] Use `ReferenceQueue#remove` instead of `ReferenceQueue#poll` [spark]
LuciferYang opened a new pull request, #43345: URL: https://github.com/apache/spark/pull/43345

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]
wbo4958 commented on code in PR #43323: URL: https://github.com/apache/spark/pull/43323#discussion_r1355964012
## core/src/test/scala/org/apache/spark/resource/ResourceProfileManagerSuite.scala:
## @@ -138,7 +138,7 @@ class ResourceProfileManagerSuite extends SparkFunSuite { rpmanager.isSupported(taskProf) }.getMessage assert(error === "TaskResourceProfiles are only supported for Standalone " + - "and Yarn cluster for now when dynamic allocation is disabled.") + "and Yarn and Kubernetes cluster for now when dynamic allocation is disabled.")
Review Comment: Good suggestion, done.
Re: [PR] [SPARK-45508][CORE] Add "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" so Platform can access cleaner on Java 9+ [spark]
JoshRosen commented on code in PR #43344: URL: https://github.com/apache/spark/pull/43344#discussion_r1355953702
## common/unsafe/src/test/java/org/apache/spark/unsafe/PlatformUtilSuite.java:
## @@ -157,4 +157,11 @@ public void heapMemoryReuse() { Assertions.assertEquals(1024 * 1024 + 7, onheap4.size()); Assertions.assertEquals(obj3, onheap4.getBaseObject()); } + + @Test + public void cleanerCreateMethodIsDefined() { +// Regression test for SPARK-45508: we don't expect the "no cleaner" fallback +// path to be hit in normal usage. +Assertions.assertTrue(Platform.cleanerCreateMethodIsDefined());
Review Comment: If we want to backport this change then we will have to prepare a separate PR because https://github.com/apache/spark/pull/43074 changed the JUnit assertions helper names so the same source can't compile in all branches.
Re: [PR] [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result [spark]
ueshin closed pull request #43204: [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result URL: https://github.com/apache/spark/pull/43204
Re: [PR] [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result [spark]
ueshin commented on PR #43204: URL: https://github.com/apache/spark/pull/43204#issuecomment-1758789340
Thanks! Merging to master.
Re: [PR] [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result [spark]
ueshin commented on PR #43204: URL: https://github.com/apache/spark/pull/43204#issuecomment-1758789225
The failed tests seem unrelated to this PR. Let me merge this now.
[PR] [SPARK-45508][CORE] Add "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" so Platform can access cleaner on Java 9+ [spark]
JoshRosen opened a new pull request, #43344: URL: https://github.com/apache/spark/pull/43344

### What changes were proposed in this pull request?
This PR adds `--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED` to our JVM flags so that we can access `jdk.internal.ref.Cleaner` in JDK 9+.

### Why are the changes needed?
This allows Spark to allocate direct memory while ignoring the JVM's MaxDirectMemorySize limit. Spark uses JDK internal APIs to directly construct DirectByteBuffers while bypassing that limit, but there is a fallback path at https://github.com/apache/spark/blob/v3.5.0/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java#L213 that is used if we cannot reflectively access the `Cleaner` API.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added a unit test in `PlatformUtilSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.
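As a hedged illustration of why the flag matters (a simplified sketch, similar in spirit to but not copied from `Platform`): without opening `java.base/jdk.internal.ref`, making the `Cleaner.create` method accessible via reflection fails on JDK 9+, and Spark falls back to the "no cleaner" allocation path that respects `MaxDirectMemorySize`.

```scala
// Simplified sketch; the real logic lives in org.apache.spark.unsafe.Platform.
val cleanerCreate: Option[java.lang.reflect.Method] =
  try {
    val cls = Class.forName("jdk.internal.ref.Cleaner")
    val m = cls.getMethod("create", classOf[Object], classOf[Runnable])
    m.setAccessible(true) // throws unless java.base/jdk.internal.ref is opened to this module
    Some(m)
  } catch {
    case _: Throwable => None // the "no cleaner" fallback path this PR aims to avoid
  }
```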
Re: [PR] [SPARK-45505][PYTHON] Refactor analyzeInPython to make it reusable [spark]
ueshin commented on code in PR #43340: URL: https://github.com/apache/spark/pull/43340#discussion_r1355948844 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonPlannerRunner.scala: ## @@ -0,0 +1,177 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.python + +import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream, DataOutputStream, EOFException, InputStream} +import java.nio.ByteBuffer +import java.nio.channels.SelectionKey +import java.util.HashMap + +import scala.jdk.CollectionConverters._ + +import net.razorvine.pickle.Pickler + +import org.apache.spark.{JobArtifactSet, SparkEnv, SparkException} +import org.apache.spark.api.python.{PythonFunction, PythonWorker, PythonWorkerUtils, SpecialLengths} +import org.apache.spark.internal.config.BUFFER_SIZE +import org.apache.spark.internal.config.Python._ +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.DirectByteBufferOutputStream + +/** + * A helper class to run Python functions in Spark driver. + */ +abstract class PythonPlannerRunner[T](func: PythonFunction) { + + protected val workerModule: String + + protected def writeToPython(dataOut: DataOutputStream, pickler: Pickler): Unit + + protected def receiveFromPython(dataIn: DataInputStream): T + + def runInPython(): T = { +val env = SparkEnv.get +val bufferSize: Int = env.conf.get(BUFFER_SIZE) +val authSocketTimeout = env.conf.get(PYTHON_AUTH_SOCKET_TIMEOUT) +val reuseWorker = env.conf.get(PYTHON_WORKER_REUSE) +val localdir = env.blockManager.diskBlockManager.localDirs.map(f => f.getPath()).mkString(",") +val simplifiedTraceback: Boolean = SQLConf.get.pysparkSimplifiedTraceback +val workerMemoryMb = SQLConf.get.pythonUDTFAnalyzerMemory + +val jobArtifactUUID = JobArtifactSet.getCurrentJobArtifactState.map(_.uuid) + +val envVars = new HashMap[String, String](func.envVars) +val pythonExec = func.pythonExec +val pythonVer = func.pythonVer +val pythonIncludes = func.pythonIncludes.asScala.toSet +val broadcastVars = func.broadcastVars.asScala.toSeq +val maybeAccumulator = Option(func.accumulator).map(_.copyAndReset()) + +envVars.put("SPARK_LOCAL_DIRS", localdir) +if (reuseWorker) { + envVars.put("SPARK_REUSE_WORKER", "1") +} +if (simplifiedTraceback) { + envVars.put("SPARK_SIMPLIFIED_TRACEBACK", "1") +} +workerMemoryMb.foreach { memoryMb => + envVars.put("PYSPARK_UDTF_ANALYZER_MEMORY_MB", memoryMb.toString) +} Review Comment: This part seems to be UDTF analyzer specific so not reusable. should use a more generic name, or add another abstract method to additionally setup the `envVars`? -- This is an automated message from the Apache Git Service. 
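One way to read the suggestion above (a rough sketch under assumed names, not the PR's final shape): give the shared runner an overridable hook so subclasses contribute their own environment variables, instead of hard-coding the UDTF-analyzer-specific ones in the generic code path.

```scala
// Hypothetical sketch for illustration; class and method names are assumptions.
abstract class PlannerRunnerSketch {
  // Subclasses (e.g. a UDTF analyzer runner) override this to add their own env vars.
  protected def extraEnvVars: Map[String, String] = Map.empty

  def buildEnv(base: Map[String, String]): Map[String, String] = base ++ extraEnvVars
}
```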
Re: [PR] [SPARK-45113][PYTHON][DOCS][FOLLOWUP] Make doctests deterministic [spark]
zhengruifeng commented on PR #43331: URL: https://github.com/apache/spark/pull/43331#issuecomment-1758768732
merged to master, thank you all
Re: [PR] [SPARK-45113][PYTHON][DOCS][FOLLOWUP] Make doctests deterministic [spark]
zhengruifeng closed pull request #43331: [SPARK-45113][PYTHON][DOCS][FOLLOWUP] Make doctests deterministic URL: https://github.com/apache/spark/pull/43331
Re: [PR] [SPARK-45220][PYTHON][DOCS] Refine docstring of DataFrame.join [spark]
allisonwang-db commented on code in PR #43039: URL: https://github.com/apache/spark/pull/43039#discussion_r1355919218 ## python/pyspark/sql/dataframe.py: ## @@ -2646,67 +2647,147 @@ def join( Examples -The following performs a full outer join between ``df1`` and ``df2``. +The following examples demonstrate various join types between ``df1`` and ``df2``. +>>> import pyspark.sql.functions as sf >>> from pyspark.sql import Row ->>> from pyspark.sql.functions import desc ->>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")]).toDF("age", "name") ->>> df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")]) ->>> df3 = spark.createDataFrame([Row(age=2, name="Alice"), Row(age=5, name="Bob")]) ->>> df4 = spark.createDataFrame([ -... Row(age=10, height=80, name="Alice"), -... Row(age=5, height=None, name="Bob"), -... Row(age=None, height=None, name="Tom"), -... Row(age=None, height=None, name=None), +>>> df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)]) +>>> df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", height=85)]) +>>> df3 = spark.createDataFrame([ +... Row(name="Alice", age=10, height=80), +... Row(name="Bob", age=5, height=None), +... Row(name="Tom", age=None, height=None), +... Row(name=None, age=None, height=None), ... ]) Inner join on columns (default) ->>> df.join(df2, 'name').select(df.name, df2.height).show() -++--+ -|name|height| -++--+ -| Bob|85| -++--+ ->>> df.join(df4, ['name', 'age']).select(df.name, df.age).show() -++---+ -|name|age| -++---+ -| Bob| 5| -++---+ - -Outer join for both DataFrames on the 'name' column. - ->>> df.join(df2, df.name == df2.name, 'outer').select( -... df.name, df2.height).sort(desc("name")).show() +>>> df.join(df2, "name").show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5|85| +++---+--+ + +>>> df.join(df3, ["name", "age"]).show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5| NULL| +++---+--+ + +Outer join on a single column with an explicit join condition. + +When the join condition is explicited stated: `df.name == df2.name`, this will +produce all records where the names match, as well as those that don't (since +it's an outer join). If there are names in `df2` that are not present in `df`, +they will appear with `NULL` in the `name` column of `df`, and vice versa for `df2`. + +>>> joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) +>>> joined.show() Review Comment: Sounds good. Created [SPARK-45509](https://issues.apache.org/jira/browse/SPARK-45509) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45433][SQL][3.4] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat [spark]
Hisoka-X commented on PR #43343: URL: https://github.com/apache/spark/pull/43343#issuecomment-1758754571
cc @MaxGekk
[PR] [SPARK-45433][SQL][3.4] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat [spark]
Hisoka-X opened a new pull request, #43343: URL: https://github.com/apache/spark/pull/43343

### What changes were proposed in this pull request?
This is a backport PR of #43243. It fixes the bug in schema inference when timestamps do not match the specified timestampFormat. Please check #43243 for details.

### Why are the changes needed?
Fix a schema inference bug on 3.4.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a new test.

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-45009][SQL] Decorrelate predicate subqueries in join condition [spark]
andylam-db commented on code in PR #42725: URL: https://github.com/apache/spark/pull/42725#discussion_r1355858190
## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
## @@ -4378,6 +4378,25 @@ object SQLConf { .booleanConf .createWithDefault(true) + val DECORRELATE_PREDICATE_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.decorrelatePredicateSubqueriesInJoinPredicate.enabled") + .internal() + .doc("Decorrelate predicate (in and exists) subqueries with correlated references in join " + +"predicates.") + .version("4.0.0") + .booleanConf + .createWithDefault(true) + + val OPTIMIZE_UNCORRELATED_IN_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.optimizeUncorrelatedInSubqueriesInJoinCondition.enabled") + .internal() + .doc("When true, optimize uncorrelated IN subqueries in join predicates by rewriting them " + +s"to joins. This interacts with ${LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR.key} because it " + +"can rewrite IN predicates.") + .version("4.0.0") + .booleanConf + .createWithDefault(false)
Review Comment: Is it easier to just rely on LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR instead of creating this new flag?
[PR] [SPARK-45487] Fix SQLSTATEs and temp errors [spark]
srielau opened a new pull request, #43342: URL: https://github.com/apache/spark/pull/43342

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-45220][PYTHON][DOCS] Refine docstring of DataFrame.join [spark]
allisonwang-db commented on code in PR #43039: URL: https://github.com/apache/spark/pull/43039#discussion_r1355919218 ## python/pyspark/sql/dataframe.py: ## @@ -2646,67 +2647,147 @@ def join( Examples -The following performs a full outer join between ``df1`` and ``df2``. +The following examples demonstrate various join types between ``df1`` and ``df2``. +>>> import pyspark.sql.functions as sf >>> from pyspark.sql import Row ->>> from pyspark.sql.functions import desc ->>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")]).toDF("age", "name") ->>> df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")]) ->>> df3 = spark.createDataFrame([Row(age=2, name="Alice"), Row(age=5, name="Bob")]) ->>> df4 = spark.createDataFrame([ -... Row(age=10, height=80, name="Alice"), -... Row(age=5, height=None, name="Bob"), -... Row(age=None, height=None, name="Tom"), -... Row(age=None, height=None, name=None), +>>> df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)]) +>>> df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", height=85)]) +>>> df3 = spark.createDataFrame([ +... Row(name="Alice", age=10, height=80), +... Row(name="Bob", age=5, height=None), +... Row(name="Tom", age=None, height=None), +... Row(name=None, age=None, height=None), ... ]) Inner join on columns (default) ->>> df.join(df2, 'name').select(df.name, df2.height).show() -++--+ -|name|height| -++--+ -| Bob|85| -++--+ ->>> df.join(df4, ['name', 'age']).select(df.name, df.age).show() -++---+ -|name|age| -++---+ -| Bob| 5| -++---+ - -Outer join for both DataFrames on the 'name' column. - ->>> df.join(df2, df.name == df2.name, 'outer').select( -... df.name, df2.height).sort(desc("name")).show() +>>> df.join(df2, "name").show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5|85| +++---+--+ + +>>> df.join(df3, ["name", "age"]).show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5| NULL| +++---+--+ + +Outer join on a single column with an explicit join condition. + +When the join condition is explicited stated: `df.name == df2.name`, this will +produce all records where the names match, as well as those that don't (since +it's an outer join). If there are names in `df2` that are not present in `df`, +they will appear with `NULL` in the `name` column of `df`, and vice versa for `df2`. + +>>> joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) +>>> joined.show() Review Comment: Sounds good. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-45507][SQL] Correctness fix for correlated scalar subqueries with COUNT aggregates [spark]
andylam-db commented on PR #43341: URL: https://github.com/apache/spark/pull/43341#issuecomment-1758745012
Pinging for first round of reviews :-) @jchen5 @agubichev
Re: [PR] [WIP][SPARK-45507][SQL] Correctness fix for correlated scalar subqueries with COUNT aggregates [spark]
andylam-db commented on code in PR #43341: URL: https://github.com/apache/spark/pull/43341#discussion_r1355919290
## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
## @@ -360,8 +360,31 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper plan.transformExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION)) { case ScalarSubquery(sub, children, exprId, conditions, hint, mayHaveCountBugOld) if children.nonEmpty => -val (newPlan, newCond) = decorrelate(sub, plan) -val mayHaveCountBug = if (mayHaveCountBugOld.isEmpty) { + +def mayHaveCountBugAgg(a: Aggregate): Boolean = { + a.groupingExpressions.isEmpty && a.aggregateExpressions.exists(_.exists { +case a: AggregateExpression => a.aggregateFunction.defaultResult.isDefined +case _ => false + }) +} + +// We want to handle count bug for scalar subqueries, except for the cases where the +// subquery is a simple top level Aggregate which can have a count bug (note: the below +// logic also takes into account nested COUNTs). This is because for these cases, we don't +// want to introduce redundant left outer joins in [[DecorrelateInnerQuery]], when the +// necessary left outer join will be added in [[RewriteCorrelatedScalarSubquery]]. +val shouldHandleCountBug = !(sub match {
Review Comment: Should we flag this with a config?
[PR] [SPARK-45507] Correctness fix for correlated scalar subqueries with COUNT aggregates [spark]
andylam-db opened a new pull request, #43341: URL: https://github.com/apache/spark/pull/43341

### What changes were proposed in this pull request?
We want to use the count bug handling in `DecorrelateInnerQuery` to detect potential count bugs in scalar subqueries. It is always safe to use `DecorrelateInnerQuery` to handle count bugs, but for efficiency reasons, like for the common case of COUNT on top of the scalar subquery, we would like to avoid an extra left outer join. This PR therefore introduces a simple check to detect such cases before `decorrelate()`: if true, then don't do count bug handling in `decorrelate()`; otherwise, do count bug handling.

### Why are the changes needed?
This PR fixes correctness issues for correlated scalar subqueries pertaining to the COUNT bug. Examples can be found in the JIRA ticket.

### Does this PR introduce _any_ user-facing change?
Yes, results will change.

### How was this patch tested?
Added SQL end-to-end tests in `count.sql`.

### Was this patch authored or co-authored using generative AI tooling?
No.
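For readers unfamiliar with the COUNT bug, a hedged, made-up example (hypothetical tables `t1` and `t2`, not from the PR): for outer rows of `t1` with no matching row in `t2`, the correlated `COUNT(*)` should evaluate to 0, but a naive decorrelation into a plain join plus aggregate loses those rows and yields `NULL` (or drops the row) instead, which is the class of bug this PR targets.

```scala
// Illustrative query only; table names are assumptions.
val countBugExample =
  """
    |SELECT t1.a,
    |       (SELECT COUNT(*) FROM t2 WHERE t2.b = t1.a) AS cnt
    |FROM t1
  """.stripMargin
```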
Re: [PR] [SPARK-45499][CORE][TESTS] Replace `Reference#isEnqueued` with `Reference#refersTo(null)` [spark]
LuciferYang commented on PR #43325: URL: https://github.com/apache/spark/pull/43325#issuecomment-1758737300
Sorry, it seems that the changed test cases have a chance of failing. I will investigate this today, and if there is no progress, I'll revert this PR first.

https://github.com/apache/spark/actions/runs/6484510414/job/17608536854

```
2023-10-11T18:19:40.4344178Z [info] - reference to sub iterator should not be available after completion *** FAILED *** (226 milliseconds)
2023-10-11T18:19:40.4353343Z [info]   null did not equal java.lang.ref.PhantomReference@49631390 (CompletionIteratorSuite.scala:67)
2023-10-11T18:19:40.4355215Z [info]   org.scalatest.exceptions.TestFailedException:
2023-10-11T18:19:40.4357722Z [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
2023-10-11T18:19:40.4360203Z [info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
2023-10-11T18:19:40.4362462Z [info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
2023-10-11T18:19:40.4364350Z [info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
2023-10-11T18:19:40.4366469Z [info]   at org.apache.spark.util.CompletionIteratorSuite.$anonfun$new$3(CompletionIteratorSuite.scala:67)
2023-10-11T18:19:40.4369159Z [info]   at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
2023-10-11T18:19:40.4372116Z [info]   at org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
2023-10-11T18:19:40.4377482Z [info]   at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
2023-10-11T18:19:40.4380196Z [info]   at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
2023-10-11T18:19:40.4382087Z [info]   at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
2023-10-11T18:19:40.4383931Z [info]   at org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
```
Re: [PR] [SPARK-45486][CONNECT] Make add_artifact request idempotent [spark]
HyukjinKwon commented on PR #43314: URL: https://github.com/apache/spark/pull/43314#issuecomment-1758732953 Mind retriggering the failed tests https://github.com/cdkrot/apache_spark/runs/17602108213 ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45220][PYTHON][DOCS] Refine docstring of DataFrame.join [spark]
HyukjinKwon commented on code in PR #43039: URL: https://github.com/apache/spark/pull/43039#discussion_r1355911443 ## python/pyspark/sql/dataframe.py: ## @@ -2646,67 +2647,147 @@ def join( Examples -The following performs a full outer join between ``df1`` and ``df2``. +The following examples demonstrate various join types between ``df1`` and ``df2``. +>>> import pyspark.sql.functions as sf >>> from pyspark.sql import Row ->>> from pyspark.sql.functions import desc ->>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")]).toDF("age", "name") ->>> df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")]) ->>> df3 = spark.createDataFrame([Row(age=2, name="Alice"), Row(age=5, name="Bob")]) ->>> df4 = spark.createDataFrame([ -... Row(age=10, height=80, name="Alice"), -... Row(age=5, height=None, name="Bob"), -... Row(age=None, height=None, name="Tom"), -... Row(age=None, height=None, name=None), +>>> df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)]) +>>> df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", height=85)]) +>>> df3 = spark.createDataFrame([ +... Row(name="Alice", age=10, height=80), +... Row(name="Bob", age=5, height=None), +... Row(name="Tom", age=None, height=None), +... Row(name=None, age=None, height=None), ... ]) Inner join on columns (default) ->>> df.join(df2, 'name').select(df.name, df2.height).show() -++--+ -|name|height| -++--+ -| Bob|85| -++--+ ->>> df.join(df4, ['name', 'age']).select(df.name, df.age).show() -++---+ -|name|age| -++---+ -| Bob| 5| -++---+ - -Outer join for both DataFrames on the 'name' column. - ->>> df.join(df2, df.name == df2.name, 'outer').select( -... df.name, df2.height).sort(desc("name")).show() +>>> df.join(df2, "name").show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5|85| +++---+--+ + +>>> df.join(df3, ["name", "age"]).show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5| NULL| +++---+--+ + +Outer join on a single column with an explicit join condition. + +When the join condition is explicited stated: `df.name == df2.name`, this will +produce all records where the names match, as well as those that don't (since +it's an outer join). If there are names in `df2` that are not present in `df`, +they will appear with `NULL` in the `name` column of `df`, and vice versa for `df2`. + +>>> joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) +>>> joined.show() Review Comment: Can we exclude those examples in this PRs, and mind filing JIRAs for both issues @allisonwang-db? ## python/pyspark/sql/dataframe.py: ## @@ -2646,67 +2647,147 @@ def join( Examples -The following performs a full outer join between ``df1`` and ``df2``. +The following examples demonstrate various join types between ``df1`` and ``df2``. +>>> import pyspark.sql.functions as sf >>> from pyspark.sql import Row ->>> from pyspark.sql.functions import desc ->>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")]).toDF("age", "name") ->>> df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")]) ->>> df3 = spark.createDataFrame([Row(age=2, name="Alice"), Row(age=5, name="Bob")]) ->>> df4 = spark.createDataFrame([ -... Row(age=10, height=80, name="Alice"), -... Row(age=5, height=None, name="Bob"), -... Row(age=None, height=None, name="Tom"), -... 
Row(age=None, height=None, name=None), +>>> df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)]) +>>> df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", height=85)]) +>>> df3 = spark.createDataFrame([ +... Row(name="Alice", age=10, height=80), +... Row(name="Bob", age=5, height=None), +... Row(name="Tom", age=None, height=None), +... Row(name=None, age=None, height=None), ... ]) Inner join on columns (default) ->>> df.join(df2, 'name').select(df.name, df2.height).show() -++--+ -|name|height| -++--+ -| Bob|85| -++--+ ->>> df.join(df4, ['name', 'age']).select(df.name, df.age).show() -+
Re: [PR] [SPARK-45221][PYTHON][DOCS] Refine docstring of DataFrameReader.parquet [spark]
HyukjinKwon commented on PR #43301: URL: https://github.com/apache/spark/pull/43301#issuecomment-1758730466 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45221][PYTHON][DOCS] Refine docstring of DataFrameReader.parquet [spark]
HyukjinKwon closed pull request #43301: [SPARK-45221][PYTHON][DOCS] Refine docstring of DataFrameReader.parquet URL: https://github.com/apache/spark/pull/43301 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45442][PYTHON][DOCS] Refine docstring of DataFrame.show [spark]
HyukjinKwon commented on PR #43252: URL: https://github.com/apache/spark/pull/43252#issuecomment-1758731045 @allisonwang-db mind rebasing? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10-sql [spark]
dependabot[bot] commented on PR #43335: URL: https://github.com/apache/spark/pull/43335#issuecomment-1758729602 OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. If you change your mind, just re-open this PR and I'll resolve any conflicts on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10-sql [spark]
HyukjinKwon closed pull request #43335: Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10-sql URL: https://github.com/apache/spark/pull/43335 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10 [spark]
dependabot[bot] commented on PR #43336: URL: https://github.com/apache/spark/pull/43336#issuecomment-1758729567 OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. If you change your mind, just re-open this PR and I'll resolve any conflicts on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10 [spark]
HyukjinKwon closed pull request #43336: Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10 URL: https://github.com/apache/spark/pull/43336 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 [spark]
dependabot[bot] commented on PR #43337: URL: https://github.com/apache/spark/pull/43337#issuecomment-1758729539 OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. If you change your mind, just re-open this PR and I'll resolve any conflicts on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 [spark]
HyukjinKwon closed pull request #43337: Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 URL: https://github.com/apache/spark/pull/43337 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45505][PYTHON] Refactor analyzeInPython to make it reusable [spark]
allisonwang-db commented on PR #43340: URL: https://github.com/apache/spark/pull/43340#issuecomment-1758727468 cc @ueshin -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45505][PYTHON] Refactor analyzeInPython to make it reusable [spark]
allisonwang-db opened a new pull request, #43340: URL: https://github.com/apache/spark/pull/43340 ### What changes were proposed in this pull request? Currently, the `analyzeInPython` method in the UserDefinedPythonTableFunction object can start a Python process on the driver and run a Python function in that process. This PR aims to refactor this logic into a reusable runner class. ### Why are the changes needed? To make the code more reusable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
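As a rough illustration of that refactoring direction, a driver-side runner can own the "start a Python worker, send one request, read one result" lifecycle while each use case only supplies the (de)serialization. Every name and signature below is hypothetical and is not the API introduced by the PR:

```scala
import java.io.{DataInputStream, DataOutputStream}

// Hypothetical sketch only: names, signatures, and helpers are illustrative.
abstract class DriverPythonRunner[T] {
  // Each caller writes its own request payload...
  protected def writeRequest(out: DataOutputStream): Unit
  // ...and parses its own response.
  protected def readResult(in: DataInputStream): T

  // Assumed helpers that start/stop a Python worker on the driver.
  protected def startWorker(): (DataInputStream, DataOutputStream)
  protected def stopWorker(): Unit

  final def run(): T = {
    val (in, out) = startWorker()
    try {
      writeRequest(out)
      out.flush()
      readResult(in)
    } finally {
      stopWorker()
    }
  }
}
```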
Re: [PR] [SPARK-42584][CONNECT] Improve output of Column.explain [spark]
github-actions[bot] commented on PR #40528: URL: https://github.com/apache/spark/pull/40528#issuecomment-1758724597 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43396][CORE] Add config to control max ratio of decommissioning executors [spark]
github-actions[bot] commented on PR #41076: URL: https://github.com/apache/spark/pull/41076#issuecomment-1758724577 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump golang.org/x/net from 0.10.0 to 0.17.0 [spark-connect-go]
HyukjinKwon commented on PR #16: URL: https://github.com/apache/spark-connect-go/pull/16#issuecomment-1758722272 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump golang.org/x/net from 0.10.0 to 0.17.0 [spark-connect-go]
dependabot[bot] commented on PR #16: URL: https://github.com/apache/spark-connect-go/pull/16#issuecomment-1758722205 OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. If you change your mind, just re-open this PR and I'll resolve any conflicts on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump golang.org/x/net from 0.10.0 to 0.17.0 [spark-connect-go]
HyukjinKwon closed pull request #16: Bump golang.org/x/net from 0.10.0 to 0.17.0 URL: https://github.com/apache/spark-connect-go/pull/16 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45483][CONNECT] Correct the function groups in connect.functions [spark]
zhengruifeng commented on PR #43309: URL: https://github.com/apache/spark/pull/43309#issuecomment-1758721600 thank you all for reviews -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore [spark]
HeartSaVioR closed pull request #43202: [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore URL: https://github.com/apache/spark/pull/43202 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore [spark]
HeartSaVioR commented on PR #43202: URL: https://github.com/apache/spark/pull/43202#issuecomment-1758701984 Thanks! Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45504][SS] Lower CPU Priority Of RocksDB Background Threads [spark]
siying opened a new pull request, #43339: URL: https://github.com/apache/spark/pull/43339 ### What changes were proposed in this pull request? Lower RocksDB background threads' CPU priority. ### Why are the changes needed? RocksDB background threads execute background tasks such as compactions and flushes. They should in general run at lower priority than Spark task execution. In the case where a task waits for some RocksDB background work to finish, such as checkpointing or waiting for async checkpointing to complete, the task slot is idle anyway, so there is likely enough CPU left for the low-priority threads. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Ran benchmarks and checked the results. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
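Purely as an illustration of the mechanism (the exact calls used by the PR are not shown here), the RocksDB Java API exposes per-thread-pool priority controls on `org.rocksdb.Env`:

```scala
import org.rocksdb.{Env, Options, Priority, RocksDB}

RocksDB.loadLibrary()

// Illustrative sketch, assuming the default Env is used by the instance:
// lower the OS CPU priority of the background (compaction) pool so it yields
// to Spark task threads.
val env = Env.getDefault
env.lowerThreadPoolCPUPriority(Priority.LOW)

// The Env is then attached to the options used to open the database.
val options = new Options().setCreateIfMissing(true).setEnv(env)
```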
Re: [PR] [SPARK-45009][SQL] Decorrelate predicate subqueries in join condition [spark]
andylam-db commented on code in PR #42725: URL: https://github.com/apache/spark/pull/42725#discussion_r1355858190 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -4378,6 +4378,25 @@ object SQLConf { .booleanConf .createWithDefault(true) + val DECORRELATE_PREDICATE_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.decorrelatePredicateSubqueriesInJoinPredicate.enabled") + .internal() + .doc("Decorrelate predicate (in and exists) subqueries with correlated references in join " + +"predicates.") + .version("4.0.0") + .booleanConf + .createWithDefault(true) + + val OPTIMIZE_UNCORRELATED_IN_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.optimizeUncorrelatedInSubqueriesInJoinCondition.enabled") + .internal() + .doc("When true, optimize uncorrelated IN subqueries in join predicates by rewriting them " + +s"to joins. This interacts with ${LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR.key} because it " + +"can rewrite IN predicates.") + .version("4.0.0") + .booleanConf + .createWithDefault(false) Review Comment: Is it easier to just rely on LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR instead of creating this new flag? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]
siying opened a new pull request, #43338: URL: https://github.com/apache/spark/pull/43338 ### What changes were proposed in this pull request? Switch the RocksDB state store to use LZ4. ### Why are the changes needed? LZ4 is generally faster than Snappy. That is probably why we already use LZ4 for changelogs and in other places by default. However, we don't change RocksDB's default compression style, which is Snappy. The RocksDB team recommends LZ4 or ZSTD, and the default is kept at Snappy only for backward-compatibility reasons. We should use LZ4 instead. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests should cover it. Some benchmarks were run and show positive improvements. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
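For reference, switching the compression codec in the RocksDB Java API is a single option; the snippet below is an illustrative sketch rather than the PR's actual change:

```scala
import org.rocksdb.{ColumnFamilyOptions, CompressionType}

// Illustrative: use LZ4 instead of RocksDB's default Snappy compression.
val cfOptions = new ColumnFamilyOptions()
  .setCompressionType(CompressionType.LZ4_COMPRESSION)
```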
[PR] Bump golang.org/x/net from 0.10.0 to 0.17.0 [spark-connect-go]
dependabot[bot] opened a new pull request, #16: URL: https://github.com/apache/spark-connect-go/pull/16 Bumps [golang.org/x/net](https://github.com/golang/net) from 0.10.0 to 0.17.0. Commits https://github.com/golang/net/commit/b225e7ca6dde1ef5a5ae5ce922861bda011cfabd";>b225e7c http2: limit maximum handler goroutines to MaxConcurrentStreams https://github.com/golang/net/commit/88194ad8ab44a02ea952c169883c3f57db6cf9f4";>88194ad go.mod: update golang.org/x dependencies https://github.com/golang/net/commit/2b60a61f1e4cf3a5ecded0bd7e77ea168289e6de";>2b60a61 quic: fix several bugs in flow control accounting https://github.com/golang/net/commit/73d82efb96cacc0c378bc150b56675fc191894b9";>73d82ef quic: handle DATA_BLOCKED frames https://github.com/golang/net/commit/5d5a036a503f8accd748f7453c0162115187be13";>5d5a036 quic: handle streams moving from the data queue to the meta queue https://github.com/golang/net/commit/350aad2603e57013fafb1a9e2089a382fe67dc80";>350aad2 quic: correctly extend peer's flow control window after MAX_DATA https://github.com/golang/net/commit/21814e71db756f39b69fb1a3e06350fa555a79b1";>21814e7 quic: validate connection id transport parameters https://github.com/golang/net/commit/a600b3518eed7a9a4e24380b4b249cb986d9b64d";>a600b35 quic: avoid redundant MAX_DATA updates https://github.com/golang/net/commit/ea633599b58dc6a50d33c7f5438edfaa8bc313df";>ea63359 http2: check stream body is present on read timeout https://github.com/golang/net/commit/ddd8598e5694aa5e966e44573a53e895f6fa5eb2";>ddd8598 quic: version negotiation Additional commits viewable in https://github.com/golang/net/compare/v0.10.0...v0.17.0";>compare view [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=golang.org/x/net&package-manager=go_modules&previous-version=0.10.0&new-version=0.17.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/apache/spark-connect-go/network/alerts). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-42881][SQL] Codegen Support for get_json_object [spark]
panbingkun commented on code in PR #40506: URL: https://github.com/apache/spark/pull/40506#discussion_r1355839029 ## sql/core/benchmarks/JsonBenchmark-results.txt: ## @@ -3,127 +3,128 @@ Benchmark for performance of JSON parsing Preparing data for benchmarking ... -OpenJDK 64-Bit Server VM 17.0.8+7-LTS on Linux 5.15.0-1046-azure +OpenJDK 64-Bit Server VM 17.0.8+7-LTS on Linux 5.15.0-1047-azure Review Comment: Okay, I will submit a follow-up PR to complete it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore [spark]
HeartSaVioR commented on code in PR #43202: URL: https://github.com/apache/spark/pull/43202#discussion_r1355833024 ## docs/structured-streaming-programming-guide.md: ## @@ -2385,6 +2385,11 @@ Here are the configs regarding to RocksDB instance of the state store provider: Total memory to be occupied by blocks in high priority pool as a fraction of memory allocated across all RocksDB instances on a single node using maxMemoryUsageMB. 0.1 + +spark.sql.streaming.stateStore.rocksdb.allowFAllocate Review Comment: Nice, we can move forward. :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10 [spark]
dependabot[bot] opened a new pull request, #43336: URL: https://github.com/apache/spark/pull/43336 Bumps org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2. [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=org.apache.zookeeper:zookeeper&package-manager=maven&previous-version=3.5.7&new-version=3.7.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/apache/spark/network/alerts). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 [spark]
dependabot[bot] opened a new pull request, #43337: URL: https://github.com/apache/spark/pull/43337 Bumps org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2. [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=org.apache.zookeeper:zookeeper&package-manager=maven&previous-version=3.5.7&new-version=3.7.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/apache/spark/network/alerts). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10-sql [spark]
dependabot[bot] opened a new pull request, #43335: URL: https://github.com/apache/spark/pull/43335 Bumps org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2. [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=org.apache.zookeeper:zookeeper&package-manager=maven&previous-version=3.5.7&new-version=3.7.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/apache/spark/network/alerts). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45009][SQL] Decorrelate predicate subqueries in join condition [spark]
jchen5 commented on code in PR #42725: URL: https://github.com/apache/spark/pull/42725#discussion_r1355693885 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -4378,6 +4378,25 @@ object SQLConf { .booleanConf .createWithDefault(true) + val DECORRELATE_PREDICATE_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.decorrelatePredicateSubqueriesInJoinPredicate.enabled") + .internal() + .doc("Decorrelate predicate (in and exists) subqueries with correlated references in join " + +"predicates.") + .version("4.0.0") + .booleanConf + .createWithDefault(true) + + val OPTIMIZE_UNCORRELATED_IN_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.optimizeUncorrelatedInSubqueriesInJoinCondition.enabled") + .internal() + .doc("When true, optimize uncorrelated IN subqueries in join predicates by rewriting them " + +s"to joins. This interacts with ${LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR.key} because it " + +"can rewrite IN predicates.") + .version("4.0.0") + .booleanConf + .createWithDefault(false) Review Comment: We can have this true by default now - LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR was enabled by default recently. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [WIP][SPARK-45022][SQL] Provide context for dataset API errors [spark]
MaxGekk opened a new pull request, #43334: URL: https://github.com/apache/spark/pull/43334 ### What changes were proposed in this pull request? This PR captures the dataset APIs used by the user code and the call site in the user code and provides better error messages. E.g. consider the following Spark app `SimpleApp.scala`: ```scala 1 import org.apache.spark.sql.SparkSession 2 import org.apache.spark.sql.functions._ 3 4 object SimpleApp { 5def main(args: Array[String]) { 6 val spark = SparkSession.builder.appName("Simple Application").config("spark.sql.ansi.enabled", true).getOrCreate() 7 import spark.implicits._ 8 9 val c = col("a") / col("b") 10 11 Seq((1, 0)).toDF("a", "b").select(c).show() 12 13 spark.stop() 14} 15 } ``` After this PR the error message contains the error context (which Spark Dataset API is called from where in the user code) in the following form: ``` Exception in thread "main" org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. == Dataset == "div" was called from SimpleApp$.main(SimpleApp.scala:9) at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) at org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672 ... ``` which is similar to the already provided context in case of SQL queries: ``` org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. == SQL(line 1, position 1) == a / b ^ at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) at org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala) ... ``` Please note that stack trace in `spark-shell` doesn't contain meaningful elements: ``` scala> Thread.currentThread().getStackTrace.foreach(println) java.base/java.lang.Thread.getStackTrace(Thread.java:1602) $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:23) $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:27) $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw.(:29) $line15.$read$$iw$$iw$$iw$$iw$$iw.(:31) $line15.$read$$iw$$iw$$iw$$iw.(:33) $line15.$read$$iw$$iw$$iw.(:35) $line15.$read$$iw$$iw.(:37) $line15.$read$$iw.(:39) $line15.$read.(:41) $line15.$read$.(:45) $line15.$read$.() $line15.$eval$.$print$lzycompute(:7) $line15.$eval$.$print(:6) $line15.$eval.$print() java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ... ``` so this change doesn't help with that usecase. ### Why are the changes needed? To provide more user friendly errors. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Added new UTs to `QueryExecutionAnsiErrorsSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44855][CONNECT] Small tweaks to attaching ExecuteGrpcResponseSender to ExecuteResponseObserver [spark]
hvanhovell closed pull request #43181: [SPARK-44855][CONNECT] Small tweaks to attaching ExecuteGrpcResponseSender to ExecuteResponseObserver URL: https://github.com/apache/spark/pull/43181 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result [spark]
dtenedor commented on code in PR #43204: URL: https://github.com/apache/spark/pull/43204#discussion_r1355605235 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala: ## @@ -284,6 +285,14 @@ object UserDefinedPythonTableFunction { val schema = DataType.fromJson( PythonWorkerUtils.readUTF(length, dataIn)).asInstanceOf[StructType] + // Receive the pickled AnalyzeResult buffer, if any. + val pickledAnalyzeResult: Array[Byte] = dataIn.readInt() match { +case length => + val obj = new Array[Byte](length) + dataIn.readFully(obj) + obj + } Review Comment: Sure, this is done. ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonUDTFExec.scala: ## @@ -132,7 +135,15 @@ object PythonUDTFRunner { case None => dataOut.writeInt(0) } +// Write the pickled AnalyzeResult buffer from the UDTF "analyze" method, if any. +dataOut.writeBoolean(udtf.pickledAnalyzeResult.nonEmpty) +udtf.pickledAnalyzeResult.map { p => + dataOut.writeInt(p.length) + dataOut.write(p) Review Comment: Done. ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonUDTFExec.scala: ## @@ -132,7 +135,15 @@ object PythonUDTFRunner { case None => dataOut.writeInt(0) } +// Write the pickled AnalyzeResult buffer from the UDTF "analyze" method, if any. +dataOut.writeBoolean(udtf.pickledAnalyzeResult.nonEmpty) +udtf.pickledAnalyzeResult.map { p => Review Comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]
tgravescs commented on code in PR #43323: URL: https://github.com/apache/spark/pull/43323#discussion_r1355509647 ## core/src/test/scala/org/apache/spark/resource/ResourceProfileManagerSuite.scala: ## @@ -138,7 +138,7 @@ class ResourceProfileManagerSuite extends SparkFunSuite { rpmanager.isSupported(taskProf) }.getMessage assert(error === "TaskResourceProfiles are only supported for Standalone " + - "and Yarn cluster for now when dynamic allocation is disabled.") + "and Yarn and Kubernetes cluster for now when dynamic allocation is disabled.") Review Comment: nit could change this to "Standalone, YARN, and Kubernetes" -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
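For background on the feature being extended to Kubernetes here, a task-level resource profile is built roughly as follows; the resource names and amounts are illustrative:

```scala
import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}

// Illustrative: ask for 2 CPUs and 1 GPU per task for a specific stage,
// without changing the executor-level resources (a "task resource profile").
val taskReqs = new TaskResourceRequests().cpus(2).resource("gpu", 1)
val rp = new ResourceProfileBuilder().require(taskReqs).build()

// Attached to an RDD; stages computed from it use this profile.
// rdd.withResources(rp)
```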
Re: [PR] [SPARK-45488] XML: Add support for value in 'rowTag' element [spark]
shujingyang-db commented on code in PR #43319: URL: https://github.com/apache/spark/pull/43319#discussion_r1355376309 ## sql/core/src/test/resources/test-data/xml-resources/root-level-value.xml: ## @@ -0,0 +1,9 @@ + + +value1 +value2 + +value3 + + + Review Comment: Sure I will add another test case for this. In this case, `5` and `6` will also be skipped. We only read values if the struct only consists of attributes and `valueTag`. If there's another field (i.e. `tag`), the value in between the tags will be ignored. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result [spark]
ueshin commented on code in PR #43204: URL: https://github.com/apache/spark/pull/43204#discussion_r1355450724 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala: ## @@ -284,6 +285,14 @@ object UserDefinedPythonTableFunction { val schema = DataType.fromJson( PythonWorkerUtils.readUTF(length, dataIn)).asInstanceOf[StructType] + // Receive the pickled AnalyzeResult buffer, if any. + val pickledAnalyzeResult: Array[Byte] = dataIn.readInt() match { +case length => + val obj = new Array[Byte](length) + dataIn.readFully(obj) + obj + } Review Comment: Shall we use `PythonWorkerUtils.readBytes` as https://github.com/apache/spark/pull/43321 was merged? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result [spark]
ueshin commented on code in PR #43204: URL: https://github.com/apache/spark/pull/43204#discussion_r1355452243 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonUDTFExec.scala: ## @@ -132,7 +135,15 @@ object PythonUDTFRunner { case None => dataOut.writeInt(0) } +// Write the pickled AnalyzeResult buffer from the UDTF "analyze" method, if any. +dataOut.writeBoolean(udtf.pickledAnalyzeResult.nonEmpty) +udtf.pickledAnalyzeResult.map { p => Review Comment: nit: `foreach` instead of `map`? ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala: ## @@ -284,6 +285,14 @@ object UserDefinedPythonTableFunction { val schema = DataType.fromJson( PythonWorkerUtils.readUTF(length, dataIn)).asInstanceOf[StructType] + // Receive the pickled AnalyzeResult buffer, if any. + val pickledAnalyzeResult: Array[Byte] = dataIn.readInt() match { +case length => + val obj = new Array[Byte](length) + dataIn.readFully(obj) + obj + } Review Comment: Shall we use `PythonWorkerUtils.readBytes` as https://github.com/apache/spark/pull/43321 is merged? ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonUDTFExec.scala: ## @@ -132,7 +135,15 @@ object PythonUDTFRunner { case None => dataOut.writeInt(0) } +// Write the pickled AnalyzeResult buffer from the UDTF "analyze" method, if any. +dataOut.writeBoolean(udtf.pickledAnalyzeResult.nonEmpty) +udtf.pickledAnalyzeResult.map { p => + dataOut.writeInt(p.length) + dataOut.write(p) Review Comment: ditto, `PythonWorkerUtils.writeBytes`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45009][SQL] Decorrelate predicate subqueries in join condition [spark]
andylam-db commented on code in PR #42725: URL: https://github.com/apache/spark/pull/42725#discussion_r1355441657 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala: ## @@ -159,6 +160,66 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper { Project(p.output, Filter(newCond.get, inputPlan)) } +// This case takes care of predicate subqueries in join conditions that are not pushed down +// to the children nodes by [[PushDownPredicates]]. Review Comment: Yes -- pushdown rules are in `operatorOptimizationBatch` and `RewriteSubquery` batch runs after those (almost the last batch in the Optimizer) ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -4378,6 +4378,25 @@ object SQLConf { .booleanConf .createWithDefault(true) + val DECORRELATE_PREDICATE_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.decorrelatePredicateSubqueriesInJoinPredicate.enabled") + .internal() + .doc("Decorrelate predicate (in and exists) subqueries with correlated references in join " + +"predicates.") + .version("4.0.0") + .booleanConf + .createWithDefault(true) + + val OPTIMIZE_UNCORRELATED_IN_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.optimizeUncorrelatedInSubqueriesInJoinCondition.enabled") + .internal() + .doc("When true, optimize uncorrelated IN subqueries in join predicates by rewriting them " + +s"to joins. This interacts with ${LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR.key} because it " + Review Comment: This happens because with this change, IN subqueries are rewritten to joins and therefore force the "correct" NULL-in-empty-list behavior. This config is added in case users want to use the legacy behavior. This has nothing to do with EXISTS. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
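To make the flag interaction concrete, here is an illustrative example of the NULL-in-empty-list difference the comment refers to (the data is made up):

```scala
// Under the non-legacy semantics, NULL IN (<empty result>) is false, so the
// NOT IN below is true and the row is kept. Under the legacy behavior the
// predicate evaluated to NULL, so the row was filtered out. Rewriting the IN
// subquery to a join forces the non-legacy semantics, hence the interaction
// between the two configs.
spark.sql("""
  SELECT *
  FROM VALUES (CAST(NULL AS INT)) AS t(x)
  WHERE x NOT IN (SELECT id FROM range(0))
""").show()
// non-legacy behavior: one row with x = NULL; legacy behavior: empty result
```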
Re: [PR] [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore [spark]
siying commented on code in PR #43202: URL: https://github.com/apache/spark/pull/43202#discussion_r1355420314 ## docs/structured-streaming-programming-guide.md: ## @@ -2385,6 +2385,11 @@ Here are the configs regarding to RocksDB instance of the state store provider: Total memory to be occupied by blocks in high priority pool as a fraction of memory allocated across all RocksDB instances on a single node using maxMemoryUsageMB. 0.1 + +spark.sql.streaming.stateStore.rocksdb.allowFAllocate Review Comment: Keeping it consistent with RocksDB conf name sounds good to me. All good for me then. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45488] XML: Add support for value in 'rowTag' element [spark]
shujingyang-db commented on code in PR #43319: URL: https://github.com/apache/spark/pull/43319#discussion_r1355401700 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala: ## @@ -219,11 +219,27 @@ private[sql] object XmlInferSchema { dataTypes += inferredType nameToDataType += (field -> dataTypes) +case c: Characters if !c.isWhiteSpace => + // This can be an attribute-only object + val valueTagType = inferFrom(c.getData, options) + nameToDataType += options.valueTag -> ArrayBuffer(valueTagType) + case _: EndElement => shouldStop = StaxXmlParserUtils.checkEndElement(parser) case _ => // do nothing } +} + // A structure object is an attribute-only element + // if it only consists of attributes and valueTags. +// If not, we will remove the valueTag field from the schema +val attributesOnly = nameToDataType.forall { + case (fieldName, dataTypes) => + dataTypes.length == 1 && + (fieldName == options.valueTag || fieldName.startsWith(options.attributePrefix)) Review Comment: I agree on it. However this is our current [behavior](https://github.com/shujingyang-db/spark/blob/8fd10a40641c831155ffd644e331f0b818f72700/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala#L198), we might need to keep aligned with it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45488] XML: Add support for value in 'rowTag' element [spark]
shujingyang-db commented on PR #43319: URL: https://github.com/apache/spark/pull/43319#issuecomment-1758122788 @srowen Thanks for bringing up this case! In the case of ``` <book>Great <foo>Book</foo> to read!</book> ``` if `rowTag` is `book`, the resulting schema will be `foo STRING`, and the characters `Great` and ` to read!` in between the tags will be skipped. This is because we only read values in between tags if the struct _only_ consists of attributes and valueTag. Our definition of the `valueTag` is: > valueTag: The tag used for the value when there are attributes in the element having no child. Default is _VALUE. If there's another field, the value in between the tags will be ignored. If the user specifies the schema to be `foo STRING, _VALUE STRING`, we will also ignore the value and leave `_VALUE` empty. ([current behavior](https://github.com/shujingyang-db/spark/blob/8fd10a40641c831155ffd644e331f0b818f72700/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala#L198)) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
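A small sketch of the behavior described above, assuming the built-in XML data source with its default `valueTag` (`_VALUE`); the file path and contents are hypothetical:

```scala
// /tmp/books.xml is assumed to contain rows shaped like the example above, e.g.
//   <books><book>Great <foo>Book</foo> to read!</book></books>
val df = spark.read
  .format("xml")
  .option("rowTag", "book")
  .load("/tmp/books.xml")

df.printSchema()
// Expected per the discussion: only `foo: string` is inferred; the character data
// around the <foo> tag is dropped because the struct contains a field other than
// attributes and the value tag.
```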
Re: [PR] [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore [spark]
schenksj commented on code in PR #43202: URL: https://github.com/apache/spark/pull/43202#discussion_r1355389886 ## docs/structured-streaming-programming-guide.md: ## @@ -2385,6 +2385,11 @@ Here are the configs regarding to RocksDB instance of the state store provider: Total memory to be occupied by blocks in high priority pool as a fraction of memory allocated across all RocksDB instances on a single node using maxMemoryUsageMB. 0.1 + +spark.sql.streaming.stateStore.rocksdb.allowFAllocate Review Comment: Thank you @siying ! Is it the capitalization that you are finding bothersome? I carried through the `allowFAllocate` casing from the RocksDB DBOptions class: https://javadoc.io/doc/org.rocksdb/rocksdbjni/6.8.1/org/rocksdb/DBOptions.html#allowFAllocate() I'm definitely not tied to it, but do like maintaining consistency across the two packages. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
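For context, the RocksDB Java API uses the same casing; a minimal sketch in Scala against rocksdbjni, with the setter name assumed to mirror the getter in the linked javadoc:

```scala
import org.rocksdb.{DBOptions, RocksDB}

RocksDB.loadLibrary()
// The option carries the "FAllocate" casing that the Spark config name mirrors.
val opts = new DBOptions().setAllowFAllocate(false)
assert(!opts.allowFAllocate())
```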
Re: [PR] [SPARK-45488] XML: Add support for value in 'rowTag' element [spark]
shujingyang-db commented on code in PR #43319: URL: https://github.com/apache/spark/pull/43319#discussion_r1355376309 ## sql/core/src/test/resources/test-data/xml-resources/root-level-value.xml: ## @@ -0,0 +1,9 @@ + + +value1 +value2 + +value3 + + + Review Comment: Sure I will add another test case for this. In this case, `5` and `6` will also be skipped. We only read values if the struct only consists of attributes and `valueTag`. If there's another field (i.e. `tag`), the value in between the tags will be ignored. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45488] XML: Add support for value in 'rowTag' element [spark]
shujingyang-db commented on code in PR #43319: URL: https://github.com/apache/spark/pull/43319#discussion_r1355375216 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala: ## @@ -219,11 +219,27 @@ private[sql] object XmlInferSchema { dataTypes += inferredType nameToDataType += (field -> dataTypes) +case c: Characters if !c.isWhiteSpace => + // This can be an attribute-only object + val valueTagType = inferFrom(c.getData, options) + nameToDataType += options.valueTag -> ArrayBuffer(valueTagType) Review Comment: IMHO it will make the loop much more complicated as we need to examine the case you mentioned above ``` 45 67 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45488] XML: Add support for value in 'rowTag' element [spark]
shujingyang-db commented on code in PR #43319: URL: https://github.com/apache/spark/pull/43319#discussion_r1355372857 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala: ## @@ -122,7 +122,12 @@ class StaxXmlParser( } val parser = StaxXmlParserUtils.filteredReader(xml) val rootAttributes = StaxXmlParserUtils.gatherRootAttributes(parser) - Some(convertObject(parser, schema, options, rootAttributes)) + // A structure object is an attribute-only element + // if it only consists of attributes and valueTags. + val isRootAttributesOnly = schema.fields.forall { f => +f.name == options.valueTag || f.name.startsWith(options.attributePrefix) + } + Some(convertObject(parser, schema, options, rootAttributes, isRootAttributesOnly)) Review Comment: We only read values if the struct only consists of attributes and `valueTag`. If there's another field, the value in between the tags will be ignored. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
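A standalone sketch of the attributes-only check being discussed, assuming the default option values (`valueTag = "_VALUE"`, `attributePrefix = "_"`) and illustrative field names:

```scala
val valueTag = "_VALUE"
val attributePrefix = "_"

// A struct is "attributes-only" if every field is either the value tag or an attribute.
def isAttributesOnly(fieldNames: Seq[String]): Boolean =
  fieldNames.forall(n => n == valueTag || n.startsWith(attributePrefix))

isAttributesOnly(Seq("_id", "_VALUE"))  // true  -> character data is kept under _VALUE
isAttributesOnly(Seq("foo", "_VALUE"))  // false -> character data between tags is ignored
```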
Re: [PR] [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore [spark]
siying commented on code in PR #43202: URL: https://github.com/apache/spark/pull/43202#discussion_r1355366397 ## docs/structured-streaming-programming-guide.md: ## @@ -2385,6 +2385,11 @@ Here are the configs regarding to RocksDB instance of the state store provider: Total memory to be occupied by blocks in high priority pool as a fraction of memory allocated across all RocksDB instances on a single node using maxMemoryUsageMB. 0.1 + +spark.sql.streaming.stateStore.rocksdb.allowFAllocate Review Comment: Is it just me, or does "FAllocate" feel odd? Why not "allowFallocate"? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45204][CONNECT] Add optional ExecuteHolder to SparkConnectPlanner [spark]
hvanhovell closed pull request #43311: [SPARK-45204][CONNECT] Add optional ExecuteHolder to SparkConnectPlanner URL: https://github.com/apache/spark/pull/43311 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45204][CONNECT] Add optional ExecuteHolder to SparkConnectPlanner [spark]
hvanhovell commented on PR #43311: URL: https://github.com/apache/spark/pull/43311#issuecomment-1758090274 Merging. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45204][CONNECT] Add optional ExecuteHolder to SparkConnectPlanner [spark]
hvanhovell commented on code in PR #43311: URL: https://github.com/apache/spark/pull/43311#discussion_r1355364109 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -86,14 +86,29 @@ final case class InvalidCommandInput( private val cause: Throwable = null) extends Exception(message, cause) -class SparkConnectPlanner(val sessionHolder: SessionHolder) extends Logging { +class SparkConnectPlanner( +val sessionHolder: SessionHolder, +val executeHolderOpt: Option[ExecuteHolder] = None) Review Comment: Ok, makes sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45430] Fix for FramelessOffsetWindowFunction when IGNORE NULLS and offset > rowCount [spark]
vitaliili-db commented on code in PR #43236: URL: https://github.com/apache/spark/pull/43236#discussion_r1355356343 ## sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala: ## @@ -200,15 +200,19 @@ class FrameLessOffsetWindowFunctionFrame( override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { resetStates(rows) -if (ignoreNulls) { - findNextRowWithNonNullInput() +if (Math.abs(offset) > rows.length) { + inputIndex = offset } else { - // drain the first few rows if offset is larger than zero - while (inputIndex < offset) { -if (inputIterator.hasNext) inputIterator.next() -inputIndex += 1 + if (ignoreNulls) { +findNextRowWithNonNullInput() + } else { +// drain the first few rows if offset is larger than zero +while (inputIndex < offset) { + if (inputIterator.hasNext) inputIterator.next() + inputIndex += 1 +} +inputIndex = offset } - inputIndex = offset Review Comment: Yes, it works when `ignoreNulls == true` but breaks when `ignoreNulls == false`. In particular, it will trigger the last `else` branch in `doWrite`: ``` } else { (current: InternalRow) => if (inputIndex >= 0 && inputIndex < input.length) { val r = WindowFunctionFrame.getNextOrNull(inputIterator) projection(r) } else { // Use default values since the offset row does not exist. fillDefaultValue(current) } inputIndex += 1 } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
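A hedged sketch of the scenario the fix targets (offset larger than the partition's row count), using made-up data; the expectation is the default value (NULL) for both the plain and IGNORE NULLS variants:

```scala
import spark.implicits._

Seq(("a", 1), ("a", 2)).toDF("grp", "v").createOrReplaceTempView("t")

spark.sql("""
  SELECT grp, v,
         lead(v, 5)              OVER (PARTITION BY grp ORDER BY v) AS lead_plain,
         lead(v, 5) IGNORE NULLS OVER (PARTITION BY grp ORDER BY v) AS lead_ignore_nulls
  FROM t
""").show()
// Only two rows per partition but an offset of 5: both lead columns should be NULL.
```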
Re: [PR] [SPARK-45430] Fix for FramelessOffsetWindowFunction when IGNORE NULLS and offset > rowCount [spark]
vitaliili-db commented on code in PR #43236: URL: https://github.com/apache/spark/pull/43236#discussion_r1355356343 ## sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala: ## @@ -200,15 +200,19 @@ class FrameLessOffsetWindowFunctionFrame( override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { resetStates(rows) -if (ignoreNulls) { - findNextRowWithNonNullInput() +if (Math.abs(offset) > rows.length) { + inputIndex = offset } else { - // drain the first few rows if offset is larger than zero - while (inputIndex < offset) { -if (inputIterator.hasNext) inputIterator.next() -inputIndex += 1 + if (ignoreNulls) { +findNextRowWithNonNullInput() + } else { +// drain the first few rows if offset is larger than zero +while (inputIndex < offset) { + if (inputIterator.hasNext) inputIterator.next() + inputIndex += 1 +} +inputIndex = offset } - inputIndex = offset Review Comment: Yes, it works when `ignoreNulls == true` but breaks when `ignoreNulls == false`. ## sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala: ## @@ -200,15 +200,19 @@ class FrameLessOffsetWindowFunctionFrame( override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { resetStates(rows) -if (ignoreNulls) { - findNextRowWithNonNullInput() +if (Math.abs(offset) > rows.length) { + inputIndex = offset } else { - // drain the first few rows if offset is larger than zero - while (inputIndex < offset) { -if (inputIterator.hasNext) inputIterator.next() -inputIndex += 1 + if (ignoreNulls) { Review Comment: When `offset > rows.length`, it is safe to just set `inputIndex` to its value. I can add a conditional inside `if (ignoreNulls)` if you feel strongly about it :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org