Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]
pan3793 commented on PR #43338: URL: https://github.com/apache/spark/pull/43338#issuecomment-1759035172
> The RocksDB Team recommend LZ4 or ZSTD ...

Why choose LZ4 instead of ZSTD? I suppose ZSTD is a more future-proof algorithm.
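For context on the discussion, a minimal sketch of where this choice surfaces in the RocksDB Java API (illustrative only, not the PR's code; Spark's state store wires the codec up through its own configuration):

```scala
import org.rocksdb.{CompressionType, Options}

// Illustrative only: the codec under discussion is selected via Options.setCompressionType.
val options = new Options()
  .setCreateIfMissing(true)
  .setCompressionType(CompressionType.LZ4_COMPRESSION) // the PR's choice; ZSTD_COMPRESSION is the alternative raised above
```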
Re: [PR] [SPARK-45484][SQL][3.5] Deprecated the incorrect parquet compression codec lz4raw [spark]
beliefer commented on PR #43330: URL: https://github.com/apache/spark/pull/43330#issuecomment-1758993812
> So this change is only needed in 3.5, and we already fixed it differently in 4.0?

Yes. This PR is only for 3.5.1, and https://github.com/apache/spark/pull/43310 fixes it in 4.0.0.
Re: [PR] [SPARK-44649][SQL] Runtime Filter supports passing equivalent creation side expressions [spark]
beliefer commented on code in PR #42317: URL: https://github.com/apache/spark/pull/42317#discussion_r1356124917
## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala:
## @@ -121,14 +128,44 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J */ private def extractSelectiveFilterOverScan( plan: LogicalPlan, - filterCreationSideExp: Expression): Option[LogicalPlan] = { -@tailrec + filterCreationSideExp: Expression): Option[(Expression, LogicalPlan)] = {
Review Comment: OK
Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]
cloud-fan commented on code in PR #41855: URL: https://github.com/apache/spark/pull/41855#discussion_r1356116942
## sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala:
## @@ -193,6 +193,22 @@ abstract class JdbcDialect extends Serializable with Logging { statement.executeUpdate(s"CREATE TABLE $tableName ($strSchema) $createTableOptions") } + /** + * Returns an Insert SQL statement for inserting a row into the target table via JDBC conn.
Review Comment: We need more documentation here, to explain the placeholder thing.
## sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala:
## @@ -193,6 +193,22 @@ abstract class JdbcDialect extends Serializable with Logging { statement.executeUpdate(s"CREATE TABLE $tableName ($strSchema) $createTableOptions") } + /** + * Returns an Insert SQL statement for inserting a row into the target table via JDBC conn. + * + * @param table The name of the table. + * @param fields The fields of the row that will be inserted. + * @return The SQL query to use for insert data into table. + */ + @Since("4.0.0") + def insertIntoTable( +table: String,
Review Comment: nit: 4 spaces indentation
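To make the "placeholder thing" concrete for readers of this thread, here is a hypothetical sketch (not the PR's implementation) of what such an insert-statement builder produces: one JDBC `?` placeholder per column, to be bound later through a `PreparedStatement`.

```scala
// Hypothetical helper, for illustration only.
def buildInsertStatement(table: String, columns: Seq[String]): String = {
  val columnList = columns.mkString(", ")
  val placeholders = columns.map(_ => "?").mkString(", ")
  s"INSERT INTO $table ($columnList) VALUES ($placeholders)"
}

// buildInsertStatement("people", Seq("name", "age"))
//   => "INSERT INTO people (name, age) VALUES (?, ?)"
```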
Re: [PR] [SPARK-44649][SQL] Runtime Filter supports passing equivalent creation side expressions [spark]
beliefer commented on code in PR #42317: URL: https://github.com/apache/spark/pull/42317#discussion_r1356096817
## sql/core/src/test/scala/org/apache/spark/sql/InjectRuntimeFilterSuite.scala:
## @@ -390,34 +390,58 @@ class InjectRuntimeFilterSuite extends QueryTest with SQLTestUtils with SharedSp test("Runtime bloom filter join: two joins") { withSQLConf(SQLConf.RUNTIME_BLOOM_FILTER_APPLICATION_SIDE_SCAN_SIZE_THRESHOLD.key -> "3000", SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2000") { - assertRewroteWithBloomFilter("select * from bf1 join bf2 join bf3 on bf1.c1 = bf2.c2 " +
Review Comment: This controls whether the join has a shuffle.
Re: [PR] [SPARK-45327][BUILD] Upgrade zstd-jni to 1.5.5-6 [spark]
panbingkun commented on PR #43113: URL: https://github.com/apache/spark/pull/43113#issuecomment-1758950109
waiting for https://github.com/apache/spark/pull/43345
Re: [PR] [SPARK-45442][PYTHON][DOCS] Refine docstring of DataFrame.show [spark]
zhengruifeng commented on PR #43252: URL: https://github.com/apache/spark/pull/43252#issuecomment-1758918774
merged to master
Re: [PR] [SPARK-45442][PYTHON][DOCS] Refine docstring of DataFrame.show [spark]
zhengruifeng closed pull request #43252: [SPARK-45442][PYTHON][DOCS] Refine docstring of DataFrame.show URL: https://github.com/apache/spark/pull/43252
Re: [PR] [SPARK-45430] Fix for FramelessOffsetWindowFunction when IGNORE NULLS and offset > rowCount [spark]
cloud-fan commented on code in PR #43236: URL: https://github.com/apache/spark/pull/43236#discussion_r1356054361
## sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala:
## @@ -200,15 +200,19 @@ class FrameLessOffsetWindowFunctionFrame( override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { resetStates(rows) -if (ignoreNulls) { - findNextRowWithNonNullInput() +if (Math.abs(offset) > rows.length) { + inputIndex = offset
Review Comment: what does this do? skip everything and return default value?
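For context on the case being discussed, a hedged illustration (made-up data, not from the PR): when the offset exceeds the number of rows in the partition, every row falls outside the frame and should receive the default value, with or without IGNORE NULLS.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only; builds a local session to run a made-up query.
val spark = SparkSession.builder().master("local[1]").appName("offset-demo").getOrCreate()
val df = spark.sql(
  """
    |SELECT a,
    |       LEAD(a, 4, -1) IGNORE NULLS OVER (ORDER BY a) AS led
    |FROM VALUES (1), (2), (CAST(NULL AS INT)) AS t(a)
  """.stripMargin)
df.show()
// Expected: `led` is -1 for every row, because the offset (4) exceeds the partition row count (3).
```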
Re: [PR] [SPARK-45427][CORE] Add RPC SSL settings to SSLOptions and SparkTransportConf [spark]
hasnain-db commented on PR #43238: URL: https://github.com/apache/spark/pull/43238#issuecomment-1758897493
@JoshRosen I updated the docs PR to hopefully add more clarity around the various encryption options (and, to your request, clarified how they overlap with the authentication setting).

> There may be some advantages to the hierarchical namespacing of SSL configurations...

Agreed. If we reuse the `spark.ssl.rpc` namespace as is done in the PR, the implementation is also much easier -- but we can also make the settings inheritance work if we use a different namespace for it; implementation complexity shouldn't hinder UX, IMO.

> I'm wondering if we use some long-term vision of a future end state to guide us here.

I don't have enough context to make a call here and defer to @mridulm. I will say that, as a random observer who hasn't had to manage too many Spark deployments just yet, I feel the current encryption mechanism has a strong point in its favor: it's quite simple to configure just a shared secret. SSL requires provisioning keys and certs, which is always more work.
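For readers unfamiliar with the namespacing being debated, a hedged sketch of what configuration under the `spark.ssl.rpc` namespace could look like. The key names follow the existing `spark.ssl.<namespace>.*` pattern; which keys the RPC namespace ultimately supports is defined by this PR, and the paths below are placeholders.

```scala
// Illustrative settings only; not an authoritative list of supported keys.
val sslRpcConf = Map(
  "spark.ssl.rpc.enabled" -> "true",
  "spark.ssl.rpc.keyStore" -> "/path/to/keystore.jks",       // placeholder path
  "spark.ssl.rpc.keyStorePassword" -> "********",
  "spark.ssl.rpc.trustStore" -> "/path/to/truststore.jks"    // placeholder path
)
```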
Re: [PR] [SPARK-45132][SQL] Fix IDENTIFIER for function invocation [spark]
cloud-fan commented on code in PR #42888: URL: https://github.com/apache/spark/pull/42888#discussion_r1356029023
## sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
## @@ -1196,6 +1195,7 @@ qualifiedNameList functionName : IDENTIFIER_KW LEFT_PAREN expression RIGHT_PAREN +| identFunc=IDENTIFIER_KW // IDENTIFIER itself is also a valid function name.
Review Comment:
```suggestion
| identFunc=IDENTIFIER_KW // IDENTIFIER itself is also a valid function name.
```
[PR] [SPARK-45510][SQL] Replace `scala.collection.generic.Growable` to `scala.collection.mutable.Growable` [spark]
Hisoka-X opened a new pull request, #43347: URL: https://github.com/apache/spark/pull/43347

### What changes were proposed in this pull request?
Since Scala 2.13.0, `scala.collection.generic.Growable` is marked as deprecated. This PR changes it to `scala.collection.mutable.Growable`.

### Why are the changes needed?
Remove a deprecated API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No
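As a hedged illustration of the change (a made-up call site, not a file from the PR), the move is essentially an import swap, since in Scala 2.13 the type lives under `scala.collection.mutable`:

```scala
import scala.collection.mutable.Growable // previously: scala.collection.generic.Growable (deprecated alias in 2.13)

// Hypothetical helper showing that code written against Growable is otherwise unchanged.
def appendAll[A](target: Growable[A], items: Iterable[A]): Unit =
  items.foreach(target += _)

// Example: appendAll(scala.collection.mutable.ArrayBuffer.empty[Int], Seq(1, 2, 3))
```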
Re: [PR] [SPARK-45488][SQL] XML: Add support for value in 'rowTag' element [spark]
shujingyang-db commented on code in PR #43319: URL: https://github.com/apache/spark/pull/43319#discussion_r1356022084
## sql/core/src/test/resources/test-data/xml-resources/root-level-value.xml:
## @@ -0,0 +1,9 @@ + + +value1 +value2 + +value3 + + +
Review Comment: Added a test `root-level value tag for not attributes-only object`👍
Re: [PR] [SPARK-44649][SQL] Runtime Filter supports passing equivalent creation side expressions [spark]
beliefer commented on code in PR #42317: URL: https://github.com/apache/spark/pull/42317#discussion_r1356017720
## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala:
## @@ -29,7 +29,14 @@ import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ /** - * Insert a filter on one side of the join if the other side has a selective predicate. + * Insert a runtime filter on one side of the join (we call this side the application side) if + * the plan of other side satisfy one of the following scenes:
Review Comment: To be precise, `the other side` may not be the `creation side`; the creation side may be a sub-plan of the other side. So I adjusted this comment to:
```
Insert a runtime filter on one side of the join (we call this side the application side) if
we can extract a plan (we call this plan the creation side) from the other side and construct
a runtime filter with the creation side. A simple case is that the creation side is a table
scan with a selective filter.
```
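To ground the "creation side" terminology for readers of this thread, a hedged, made-up example (table names are hypothetical): the selective filter on `fact` makes the `fact` scan a good creation side, and the resulting bloom filter is applied to the scan of `dim` on the application side.

```scala
// Illustrative query only; assumes tables `dim(k, ...)` and `fact(k, dt, ...)` exist.
val query =
  """
    |SELECT *
    |FROM dim JOIN fact f ON dim.k = f.k
    |WHERE f.dt = '2023-10-01'   -- selective predicate on the creation side
  """.stripMargin
```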
Re: [PR] [SPARK-45488][SQL] XML: Add support for value in 'rowTag' element [spark]
HyukjinKwon commented on PR #43319: URL: https://github.com/apache/spark/pull/43319#issuecomment-1758848151
The test failures look unrelated, but mind retriggering https://github.com/shujingyang-db/spark/runs/17621262726 again?
Re: [PR] [SPARK-45508][CORE] Add "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" so Platform can access Cleaner on Java 9+ [spark]
LuciferYang commented on code in PR #43344: URL: https://github.com/apache/spark/pull/43344#discussion_r1356005221
## common/unsafe/src/test/java/org/apache/spark/unsafe/PlatformUtilSuite.java:
## @@ -157,4 +157,11 @@ public void heapMemoryReuse() { Assertions.assertEquals(1024 * 1024 + 7, onheap4.size()); Assertions.assertEquals(obj3, onheap4.getBaseObject()); } + + @Test + public void cleanerCreateMethodIsDefined() { +// Regression test for SPARK-45508: we don't expect the "no cleaner" fallback +// path to be hit in normal usage. +Assertions.assertTrue(Platform.cleanerCreateMethodIsDefined());
Review Comment:
> If we want to backport this change then we will have to prepare a separate PR because #43074 changed the JUnit assertions helper names so the same source can't compile in all branches.

sorry for making this change ...
Re: [PR] [SPARK-42881][SQL][FOLLOWUP] Update the results of JsonBenchmark-jdk21 after get_json_object supports codegen [spark]
LuciferYang commented on PR #43346: URL: https://github.com/apache/spark/pull/43346#issuecomment-1758843965
This just updates the benchmark result. Merged into master for Spark 4.0, thanks @panbingkun ~
Re: [PR] [SPARK-42881][SQL][FOLLOWUP] Update the results of JsonBenchmark-jdk21 after get_json_object supports codegen [spark]
LuciferYang closed pull request #43346: [SPARK-42881][SQL][FOLLOWUP] Update the results of JsonBenchmark-jdk21 after get_json_object supports codegen URL: https://github.com/apache/spark/pull/43346
Re: [PR] [SPARK-42881][SQL] Codegen Support for get_json_object [spark]
panbingkun commented on code in PR #40506: URL: https://github.com/apache/spark/pull/40506#discussion_r1355992806
## sql/core/benchmarks/JsonBenchmark-results.txt:
## @@ -3,127 +3,128 @@ Benchmark for performance of JSON parsing Preparing data for benchmarking ... -OpenJDK 64-Bit Server VM 17.0.8+7-LTS on Linux 5.15.0-1046-azure +OpenJDK 64-Bit Server VM 17.0.8+7-LTS on Linux 5.15.0-1047-azure
Review Comment: Follow-up PR: https://github.com/apache/spark/pull/43346
Re: [PR] [SPARK-42881][SQL][FOLLOWUP] Codegen Support for get_json_object [spark]
panbingkun commented on PR #43346: URL: https://github.com/apache/spark/pull/43346#issuecomment-1758833377
cc @LuciferYang @MaxGekk
Re: [PR] [SPARK-45499][CORE][TESTS][FOLLOWUP] Use `ReferenceQueue#remove` instead of `ReferenceQueue#poll` [spark]
LuciferYang commented on PR #43345: URL: https://github.com/apache/spark/pull/43345#issuecomment-1758833221
OK, added a `1000ms` timeout.
[PR] [SPARK-42881][SQL][FOLLOWUP] Codegen Support for get_json_object [spark]
panbingkun opened a new pull request, #43346: URL: https://github.com/apache/spark/pull/43346

### What changes were proposed in this pull request?
The PR follows up https://github.com/apache/spark/pull/40506 and updates JsonBenchmark-jdk21-results.txt for it.

### Why are the changes needed?
Update JsonBenchmark-jdk21-results.txt.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Only updates the results of the benchmark.

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-45501][CORE][SQL] Use pattern matching for type checking and conversion [spark]
LuciferYang commented on PR #43327: URL: https://github.com/apache/spark/pull/43327#issuecomment-1758831231
GA passed
Re: [PR] [SPARK-45499][CORE][TESTS][FOLLOWUP] Use `ReferenceQueue#remove` instead of `ReferenceQueue#poll` [spark]
LuciferYang commented on code in PR #43345: URL: https://github.com/apache/spark/pull/43345#discussion_r1355987310
## core/src/test/scala/org/apache/spark/util/CompletionIteratorSuite.scala:
## @@ -64,6 +64,6 @@ class CompletionIteratorSuite extends SparkFunSuite { } } assert(ref.refersTo(null)) -assert(refQueue.poll() === ref) +assert(refQueue.remove() === ref)
Review Comment: also cc @srowen
Re: [PR] [SPARK-45439][SQL][UI] Reduce memory usage of LiveStageMetrics.accumIdsToMetricType [spark]
JoshRosen commented on code in PR #43250: URL: https://github.com/apache/spark/pull/43250#discussion_r1355983329
## sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala:
## @@ -490,7 +490,12 @@ private class LiveExecutionData(val executionId: Long) extends LiveEntity { var details: String = null var physicalPlanDescription: String = null var modifiedConfigs: Map[String, String] = _ - var metrics = collection.Seq[SQLPlanMetric]() + private var _metrics = collection.Seq[SQLPlanMetric]() + def metrics: collection.Seq[SQLPlanMetric] = _metrics + // This mapping is shared across all LiveStageMetrics instances associated with + // this LiveExecutionData, helping to reduce memory overhead by avoiding waste + // from separate immutable maps with largely overlapping sets of entries. + val metricAccumulatorIdToMetricType = new ConcurrentHashMap[Long, String]()
Review Comment: Our Spark fork contains a codepath that performs multi-threaded access to this data structure, so that's why I used a thread-safe data structure here. If you would prefer that I avoid leaking those proprietary thread-safety concerns into OSS Apache Spark then I would be glad to refactor this to use a `mutable.Map` instead and keep the thread-safe version in our fork.
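For readers skimming the thread, a hedged sketch of the sharing idea described in the diff comment (simplified, hypothetical class names; the real code lives in `SQLAppStatusListener`): all stage-level views of one execution read from a single shared map instead of each holding its own largely overlapping immutable map.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical, simplified classes for illustration only.
class SharedMetricTypes {
  // One map per execution, shared by every stage that belongs to it.
  val accumIdToMetricType = new ConcurrentHashMap[Long, String]()
}

class StageMetricsView(shared: SharedMetricTypes) {
  def metricTypeOf(accumId: Long): Option[String] =
    Option(shared.accumIdToMetricType.get(accumId))
}
```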
Re: [PR] [SPARK-45499][CORE][TESTS][FOLLOWUP] Use `ReferenceQueue#remove` instead of `ReferenceQueue#poll` [spark]
LuciferYang commented on code in PR #43345: URL: https://github.com/apache/spark/pull/43345#discussion_r1355979774
## core/src/test/scala/org/apache/spark/util/CompletionIteratorSuite.scala:
## @@ -64,6 +64,6 @@ class CompletionIteratorSuite extends SparkFunSuite { } } assert(ref.refersTo(null)) -assert(refQueue.poll() === ref) +assert(refQueue.remove() === ref)
Review Comment: `remove` is a blocking method; should we add a timeout (ms), like `.remove(1000)`?
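For context, a minimal sketch of the difference being discussed (not the suite's exact code): `poll()` returns immediately and may return `null` if the reference has not been enqueued yet, while `remove(timeout)` blocks for up to the timeout waiting for it.

```scala
import java.lang.ref.{PhantomReference, ReferenceQueue}

val queue = new ReferenceQueue[Object]
var obj: Object = new Object
val ref = new PhantomReference[Object](obj, queue)
obj = null
System.gc()

val nonBlocking = queue.poll()      // may be null if GC has not enqueued the reference yet
val blocking = queue.remove(1000L)  // waits up to 1000 ms for the reference to be enqueued
```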
Re: [PR] [SPARK-45499][CORE][TESTS][FOLLOWUP] Use `ReferenceQueue#remove` instead of `ReferenceQueue#poll` [spark]
LuciferYang commented on code in PR #43345: URL: https://github.com/apache/spark/pull/43345#discussion_r1355977615
## core/src/test/scala/org/apache/spark/util/CompletionIteratorSuite.scala:
## @@ -64,6 +64,6 @@ class CompletionIteratorSuite extends SparkFunSuite { } } assert(ref.refersTo(null)) -assert(refQueue.poll() === ref) +assert(refQueue.remove() === ref)
Review Comment: cc @dongjoon-hyun @yaooqinn FYI
[PR] [SPARK-45499][CORE][TESTS][FOLLOWUP] Use `ReferenceQueue#remove` instead of `ReferenceQueue#poll` [spark]
LuciferYang opened a new pull request, #43345: URL: https://github.com/apache/spark/pull/43345

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]
wbo4958 commented on code in PR #43323: URL: https://github.com/apache/spark/pull/43323#discussion_r1355964012
## core/src/test/scala/org/apache/spark/resource/ResourceProfileManagerSuite.scala:
## @@ -138,7 +138,7 @@ class ResourceProfileManagerSuite extends SparkFunSuite { rpmanager.isSupported(taskProf) }.getMessage assert(error === "TaskResourceProfiles are only supported for Standalone " + - "and Yarn cluster for now when dynamic allocation is disabled.") + "and Yarn and Kubernetes cluster for now when dynamic allocation is disabled.")
Review Comment: Good suggestion, done.
Re: [PR] [SPARK-45508][CORE] Add "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" so Platform can access cleaner on Java 9+ [spark]
JoshRosen commented on code in PR #43344: URL: https://github.com/apache/spark/pull/43344#discussion_r1355953702
## common/unsafe/src/test/java/org/apache/spark/unsafe/PlatformUtilSuite.java:
## @@ -157,4 +157,11 @@ public void heapMemoryReuse() { Assertions.assertEquals(1024 * 1024 + 7, onheap4.size()); Assertions.assertEquals(obj3, onheap4.getBaseObject()); } + + @Test + public void cleanerCreateMethodIsDefined() { +// Regression test for SPARK-45508: we don't expect the "no cleaner" fallback +// path to be hit in normal usage. +Assertions.assertTrue(Platform.cleanerCreateMethodIsDefined());
Review Comment: If we want to backport this change then we will have to prepare a separate PR because https://github.com/apache/spark/pull/43074 changed the JUnit assertions helper names so the same source can't compile in all branches.
Re: [PR] [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result [spark]
ueshin closed pull request #43204: [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result URL: https://github.com/apache/spark/pull/43204
Re: [PR] [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result [spark]
ueshin commented on PR #43204: URL: https://github.com/apache/spark/pull/43204#issuecomment-1758789340
Thanks! Merging to master.
Re: [PR] [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result [spark]
ueshin commented on PR #43204: URL: https://github.com/apache/spark/pull/43204#issuecomment-1758789225
The failed tests seem unrelated to this PR. Let me merge this now.
[PR] [SPARK-45508][CORE] Add "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" so Platform can access cleaner on Java 9+ [spark]
JoshRosen opened a new pull request, #43344: URL: https://github.com/apache/spark/pull/43344

### What changes were proposed in this pull request?
This PR adds `--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED` to our JVM flags so that we can access `jdk.internal.ref.Cleaner` in JDK 9+.

### Why are the changes needed?
This allows Spark to allocate direct memory while ignoring the JVM's MaxDirectMemorySize limit. Spark uses JDK internal APIs to directly construct DirectByteBuffers while bypassing that limit, but there is a fallback path at https://github.com/apache/spark/blob/v3.5.0/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java#L213 that is used if we cannot reflectively access the `Cleaner` API.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added a unit test in `PlatformUtilSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.
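As a hedged illustration of why the flag matters (a simplified sketch, similar in spirit to but not copied from `Platform`): without opening `java.base/jdk.internal.ref`, making the `Cleaner.create` method accessible via reflection fails on JDK 9+, and Spark falls back to the "no cleaner" allocation path that respects `MaxDirectMemorySize`.

```scala
// Simplified sketch; the real logic lives in org.apache.spark.unsafe.Platform.
val cleanerCreate: Option[java.lang.reflect.Method] =
  try {
    val cls = Class.forName("jdk.internal.ref.Cleaner")
    val m = cls.getMethod("create", classOf[Object], classOf[Runnable])
    m.setAccessible(true) // throws unless java.base/jdk.internal.ref is opened to this module
    Some(m)
  } catch {
    case _: Throwable => None // the "no cleaner" fallback path this PR aims to avoid
  }
```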
Re: [PR] [SPARK-45505][PYTHON] Refactor analyzeInPython to make it reusable [spark]
ueshin commented on code in PR #43340: URL: https://github.com/apache/spark/pull/43340#discussion_r1355948844 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonPlannerRunner.scala: ## @@ -0,0 +1,177 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.python + +import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream, DataOutputStream, EOFException, InputStream} +import java.nio.ByteBuffer +import java.nio.channels.SelectionKey +import java.util.HashMap + +import scala.jdk.CollectionConverters._ + +import net.razorvine.pickle.Pickler + +import org.apache.spark.{JobArtifactSet, SparkEnv, SparkException} +import org.apache.spark.api.python.{PythonFunction, PythonWorker, PythonWorkerUtils, SpecialLengths} +import org.apache.spark.internal.config.BUFFER_SIZE +import org.apache.spark.internal.config.Python._ +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.DirectByteBufferOutputStream + +/** + * A helper class to run Python functions in Spark driver. + */ +abstract class PythonPlannerRunner[T](func: PythonFunction) { + + protected val workerModule: String + + protected def writeToPython(dataOut: DataOutputStream, pickler: Pickler): Unit + + protected def receiveFromPython(dataIn: DataInputStream): T + + def runInPython(): T = { +val env = SparkEnv.get +val bufferSize: Int = env.conf.get(BUFFER_SIZE) +val authSocketTimeout = env.conf.get(PYTHON_AUTH_SOCKET_TIMEOUT) +val reuseWorker = env.conf.get(PYTHON_WORKER_REUSE) +val localdir = env.blockManager.diskBlockManager.localDirs.map(f => f.getPath()).mkString(",") +val simplifiedTraceback: Boolean = SQLConf.get.pysparkSimplifiedTraceback +val workerMemoryMb = SQLConf.get.pythonUDTFAnalyzerMemory + +val jobArtifactUUID = JobArtifactSet.getCurrentJobArtifactState.map(_.uuid) + +val envVars = new HashMap[String, String](func.envVars) +val pythonExec = func.pythonExec +val pythonVer = func.pythonVer +val pythonIncludes = func.pythonIncludes.asScala.toSet +val broadcastVars = func.broadcastVars.asScala.toSeq +val maybeAccumulator = Option(func.accumulator).map(_.copyAndReset()) + +envVars.put("SPARK_LOCAL_DIRS", localdir) +if (reuseWorker) { + envVars.put("SPARK_REUSE_WORKER", "1") +} +if (simplifiedTraceback) { + envVars.put("SPARK_SIMPLIFIED_TRACEBACK", "1") +} +workerMemoryMb.foreach { memoryMb => + envVars.put("PYSPARK_UDTF_ANALYZER_MEMORY_MB", memoryMb.toString) +} Review Comment: This part seems to be UDTF analyzer specific so not reusable. should use a more generic name, or add another abstract method to additionally setup the `envVars`? -- This is an automated message from the Apache Git Service. 
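One way to read the suggestion above (a rough sketch under assumed names, not the PR's final shape): give the shared runner an overridable hook so subclasses contribute their own environment variables, instead of hard-coding the UDTF-analyzer-specific ones in the generic code path.

```scala
// Hypothetical sketch for illustration; class and method names are assumptions.
abstract class PlannerRunnerSketch {
  // Subclasses (e.g. a UDTF analyzer runner) override this to add their own env vars.
  protected def extraEnvVars: Map[String, String] = Map.empty

  def buildEnv(base: Map[String, String]): Map[String, String] = base ++ extraEnvVars
}
```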
Re: [PR] [SPARK-45113][PYTHON][DOCS][FOLLOWUP] Make doctests deterministic [spark]
zhengruifeng commented on PR #43331: URL: https://github.com/apache/spark/pull/43331#issuecomment-1758768732
merged to master, thank you all
Re: [PR] [SPARK-45113][PYTHON][DOCS][FOLLOWUP] Make doctests deterministic [spark]
zhengruifeng closed pull request #43331: [SPARK-45113][PYTHON][DOCS][FOLLOWUP] Make doctests deterministic URL: https://github.com/apache/spark/pull/43331
Re: [PR] [SPARK-45220][PYTHON][DOCS] Refine docstring of DataFrame.join [spark]
allisonwang-db commented on code in PR #43039: URL: https://github.com/apache/spark/pull/43039#discussion_r1355919218 ## python/pyspark/sql/dataframe.py: ## @@ -2646,67 +2647,147 @@ def join( Examples -The following performs a full outer join between ``df1`` and ``df2``. +The following examples demonstrate various join types between ``df1`` and ``df2``. +>>> import pyspark.sql.functions as sf >>> from pyspark.sql import Row ->>> from pyspark.sql.functions import desc ->>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")]).toDF("age", "name") ->>> df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")]) ->>> df3 = spark.createDataFrame([Row(age=2, name="Alice"), Row(age=5, name="Bob")]) ->>> df4 = spark.createDataFrame([ -... Row(age=10, height=80, name="Alice"), -... Row(age=5, height=None, name="Bob"), -... Row(age=None, height=None, name="Tom"), -... Row(age=None, height=None, name=None), +>>> df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)]) +>>> df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", height=85)]) +>>> df3 = spark.createDataFrame([ +... Row(name="Alice", age=10, height=80), +... Row(name="Bob", age=5, height=None), +... Row(name="Tom", age=None, height=None), +... Row(name=None, age=None, height=None), ... ]) Inner join on columns (default) ->>> df.join(df2, 'name').select(df.name, df2.height).show() -++--+ -|name|height| -++--+ -| Bob|85| -++--+ ->>> df.join(df4, ['name', 'age']).select(df.name, df.age).show() -++---+ -|name|age| -++---+ -| Bob| 5| -++---+ - -Outer join for both DataFrames on the 'name' column. - ->>> df.join(df2, df.name == df2.name, 'outer').select( -... df.name, df2.height).sort(desc("name")).show() +>>> df.join(df2, "name").show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5|85| +++---+--+ + +>>> df.join(df3, ["name", "age"]).show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5| NULL| +++---+--+ + +Outer join on a single column with an explicit join condition. + +When the join condition is explicited stated: `df.name == df2.name`, this will +produce all records where the names match, as well as those that don't (since +it's an outer join). If there are names in `df2` that are not present in `df`, +they will appear with `NULL` in the `name` column of `df`, and vice versa for `df2`. + +>>> joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) +>>> joined.show() Review Comment: Sounds good. Created [SPARK-45509](https://issues.apache.org/jira/browse/SPARK-45509) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45433][SQL][3.4] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat [spark]
Hisoka-X commented on PR #43343: URL: https://github.com/apache/spark/pull/43343#issuecomment-1758754571
cc @MaxGekk
[PR] [SPARK-45433][SQL][3.4] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat [spark]
Hisoka-X opened a new pull request, #43343: URL: https://github.com/apache/spark/pull/43343

### What changes were proposed in this pull request?
This is a backport PR of #43243. It fixes the bug in schema inference when timestamps do not match the specified timestampFormat. Please check #43243 for details.

### Why are the changes needed?
Fix a schema inference bug on 3.4.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a new test.

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-45009][SQL] Decorrelate predicate subqueries in join condition [spark]
andylam-db commented on code in PR #42725: URL: https://github.com/apache/spark/pull/42725#discussion_r1355858190
## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
## @@ -4378,6 +4378,25 @@ object SQLConf { .booleanConf .createWithDefault(true) + val DECORRELATE_PREDICATE_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.decorrelatePredicateSubqueriesInJoinPredicate.enabled") + .internal() + .doc("Decorrelate predicate (in and exists) subqueries with correlated references in join " + +"predicates.") + .version("4.0.0") + .booleanConf + .createWithDefault(true) + + val OPTIMIZE_UNCORRELATED_IN_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.optimizeUncorrelatedInSubqueriesInJoinCondition.enabled") + .internal() + .doc("When true, optimize uncorrelated IN subqueries in join predicates by rewriting them " + +s"to joins. This interacts with ${LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR.key} because it " + +"can rewrite IN predicates.") + .version("4.0.0") + .booleanConf + .createWithDefault(false)
Review Comment: Is it easier to just rely on LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR instead of creating this new flag?
[PR] [SPARK-45487] Fix SQLSTATEs and temp errors [spark]
srielau opened a new pull request, #43342: URL: https://github.com/apache/spark/pull/43342

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-45220][PYTHON][DOCS] Refine docstring of DataFrame.join [spark]
allisonwang-db commented on code in PR #43039: URL: https://github.com/apache/spark/pull/43039#discussion_r1355919218 ## python/pyspark/sql/dataframe.py: ## @@ -2646,67 +2647,147 @@ def join( Examples -The following performs a full outer join between ``df1`` and ``df2``. +The following examples demonstrate various join types between ``df1`` and ``df2``. +>>> import pyspark.sql.functions as sf >>> from pyspark.sql import Row ->>> from pyspark.sql.functions import desc ->>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")]).toDF("age", "name") ->>> df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")]) ->>> df3 = spark.createDataFrame([Row(age=2, name="Alice"), Row(age=5, name="Bob")]) ->>> df4 = spark.createDataFrame([ -... Row(age=10, height=80, name="Alice"), -... Row(age=5, height=None, name="Bob"), -... Row(age=None, height=None, name="Tom"), -... Row(age=None, height=None, name=None), +>>> df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)]) +>>> df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", height=85)]) +>>> df3 = spark.createDataFrame([ +... Row(name="Alice", age=10, height=80), +... Row(name="Bob", age=5, height=None), +... Row(name="Tom", age=None, height=None), +... Row(name=None, age=None, height=None), ... ]) Inner join on columns (default) ->>> df.join(df2, 'name').select(df.name, df2.height).show() -++--+ -|name|height| -++--+ -| Bob|85| -++--+ ->>> df.join(df4, ['name', 'age']).select(df.name, df.age).show() -++---+ -|name|age| -++---+ -| Bob| 5| -++---+ - -Outer join for both DataFrames on the 'name' column. - ->>> df.join(df2, df.name == df2.name, 'outer').select( -... df.name, df2.height).sort(desc("name")).show() +>>> df.join(df2, "name").show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5|85| +++---+--+ + +>>> df.join(df3, ["name", "age"]).show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5| NULL| +++---+--+ + +Outer join on a single column with an explicit join condition. + +When the join condition is explicited stated: `df.name == df2.name`, this will +produce all records where the names match, as well as those that don't (since +it's an outer join). If there are names in `df2` that are not present in `df`, +they will appear with `NULL` in the `name` column of `df`, and vice versa for `df2`. + +>>> joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) +>>> joined.show() Review Comment: Sounds good. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-45507][SQL] Correctness fix for correlated scalar subqueries with COUNT aggregates [spark]
andylam-db commented on PR #43341: URL: https://github.com/apache/spark/pull/43341#issuecomment-1758745012
Pinging for first round of reviews :-) @jchen5 @agubichev
Re: [PR] [WIP][SPARK-45507][SQL] Correctness fix for correlated scalar subqueries with COUNT aggregates [spark]
andylam-db commented on code in PR #43341: URL: https://github.com/apache/spark/pull/43341#discussion_r1355919290
## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
## @@ -360,8 +360,31 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper plan.transformExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION)) { case ScalarSubquery(sub, children, exprId, conditions, hint, mayHaveCountBugOld) if children.nonEmpty => -val (newPlan, newCond) = decorrelate(sub, plan) -val mayHaveCountBug = if (mayHaveCountBugOld.isEmpty) { + +def mayHaveCountBugAgg(a: Aggregate): Boolean = { + a.groupingExpressions.isEmpty && a.aggregateExpressions.exists(_.exists { +case a: AggregateExpression => a.aggregateFunction.defaultResult.isDefined +case _ => false + }) +} + +// We want to handle count bug for scalar subqueries, except for the cases where the +// subquery is a simple top level Aggregate which can have a count bug (note: the below +// logic also takes into account nested COUNTs). This is because for these cases, we don't +// want to introduce redundant left outer joins in [[DecorrelateInnerQuery]], when the +// necessary left outer join will be added in [[RewriteCorrelatedScalarSubquery]]. +val shouldHandleCountBug = !(sub match {
Review Comment: Should we flag this with a config?
[PR] [SPARK-45507] Correctness fix for correlated scalar subqueries with COUNT aggregates [spark]
andylam-db opened a new pull request, #43341: URL: https://github.com/apache/spark/pull/43341

### What changes were proposed in this pull request?
We want to use the count bug handling in `DecorrelateInnerQuery` to detect potential count bugs in scalar subqueries. It is always safe to use `DecorrelateInnerQuery` to handle count bugs, but for efficiency reasons, like for the common case of COUNT on top of the scalar subquery, we would like to avoid an extra left outer join. This PR therefore introduces a simple check to detect such cases before `decorrelate()`: if true, then don't do count bug handling in `decorrelate()`; otherwise, do count bug handling.

### Why are the changes needed?
This PR fixes correctness issues for correlated scalar subqueries pertaining to the COUNT bug. Examples can be found in the JIRA ticket.

### Does this PR introduce _any_ user-facing change?
Yes, results will change.

### How was this patch tested?
Added SQL end-to-end tests in `count.sql`.

### Was this patch authored or co-authored using generative AI tooling?
No.
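For readers unfamiliar with the COUNT bug, a hedged, made-up example (hypothetical tables `t1` and `t2`, not from the PR): for outer rows of `t1` with no matching row in `t2`, the correlated `COUNT(*)` should evaluate to 0, but a naive decorrelation into a plain join plus aggregate loses those rows and yields `NULL` (or drops the row) instead, which is the class of bug this PR targets.

```scala
// Illustrative query only; table names are assumptions.
val countBugExample =
  """
    |SELECT t1.a,
    |       (SELECT COUNT(*) FROM t2 WHERE t2.b = t1.a) AS cnt
    |FROM t1
  """.stripMargin
```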
Re: [PR] [SPARK-45499][CORE][TESTS] Replace `Reference#isEnqueued` with `Reference#refersTo(null)` [spark]
LuciferYang commented on PR #43325: URL: https://github.com/apache/spark/pull/43325#issuecomment-1758737300
Sorry, it seems that the changed test cases have a chance of failing. I will investigate this today, and if there is no progress, I'll revert this PR first.

https://github.com/apache/spark/actions/runs/6484510414/job/17608536854

```
2023-10-11T18:19:40.4344178Z [info] - reference to sub iterator should not be available after completion *** FAILED *** (226 milliseconds)
2023-10-11T18:19:40.4353343Z [info]   null did not equal java.lang.ref.PhantomReference@49631390 (CompletionIteratorSuite.scala:67)
2023-10-11T18:19:40.4355215Z [info]   org.scalatest.exceptions.TestFailedException:
2023-10-11T18:19:40.4357722Z [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
2023-10-11T18:19:40.4360203Z [info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
2023-10-11T18:19:40.4362462Z [info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
2023-10-11T18:19:40.4364350Z [info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
2023-10-11T18:19:40.4366469Z [info]   at org.apache.spark.util.CompletionIteratorSuite.$anonfun$new$3(CompletionIteratorSuite.scala:67)
2023-10-11T18:19:40.4369159Z [info]   at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
2023-10-11T18:19:40.4372116Z [info]   at org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
2023-10-11T18:19:40.4377482Z [info]   at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
2023-10-11T18:19:40.4380196Z [info]   at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
2023-10-11T18:19:40.4382087Z [info]   at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
2023-10-11T18:19:40.4383931Z [info]   at org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
```
Re: [PR] [SPARK-45486][CONNECT] Make add_artifact request idempotent [spark]
HyukjinKwon commented on PR #43314: URL: https://github.com/apache/spark/pull/43314#issuecomment-1758732953 Mind retriggering the failed tests https://github.com/cdkrot/apache_spark/runs/17602108213 ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45220][PYTHON][DOCS] Refine docstring of DataFrame.join [spark]
HyukjinKwon commented on code in PR #43039: URL: https://github.com/apache/spark/pull/43039#discussion_r1355911443 ## python/pyspark/sql/dataframe.py: ## @@ -2646,67 +2647,147 @@ def join( Examples -The following performs a full outer join between ``df1`` and ``df2``. +The following examples demonstrate various join types between ``df1`` and ``df2``. +>>> import pyspark.sql.functions as sf >>> from pyspark.sql import Row ->>> from pyspark.sql.functions import desc ->>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")]).toDF("age", "name") ->>> df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")]) ->>> df3 = spark.createDataFrame([Row(age=2, name="Alice"), Row(age=5, name="Bob")]) ->>> df4 = spark.createDataFrame([ -... Row(age=10, height=80, name="Alice"), -... Row(age=5, height=None, name="Bob"), -... Row(age=None, height=None, name="Tom"), -... Row(age=None, height=None, name=None), +>>> df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)]) +>>> df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", height=85)]) +>>> df3 = spark.createDataFrame([ +... Row(name="Alice", age=10, height=80), +... Row(name="Bob", age=5, height=None), +... Row(name="Tom", age=None, height=None), +... Row(name=None, age=None, height=None), ... ]) Inner join on columns (default) ->>> df.join(df2, 'name').select(df.name, df2.height).show() -++--+ -|name|height| -++--+ -| Bob|85| -++--+ ->>> df.join(df4, ['name', 'age']).select(df.name, df.age).show() -++---+ -|name|age| -++---+ -| Bob| 5| -++---+ - -Outer join for both DataFrames on the 'name' column. - ->>> df.join(df2, df.name == df2.name, 'outer').select( -... df.name, df2.height).sort(desc("name")).show() +>>> df.join(df2, "name").show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5|85| +++---+--+ + +>>> df.join(df3, ["name", "age"]).show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5| NULL| +++---+--+ + +Outer join on a single column with an explicit join condition. + +When the join condition is explicited stated: `df.name == df2.name`, this will +produce all records where the names match, as well as those that don't (since +it's an outer join). If there are names in `df2` that are not present in `df`, +they will appear with `NULL` in the `name` column of `df`, and vice versa for `df2`. + +>>> joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) +>>> joined.show() Review Comment: Can we exclude those examples in this PRs, and mind filing JIRAs for both issues @allisonwang-db? ## python/pyspark/sql/dataframe.py: ## @@ -2646,67 +2647,147 @@ def join( Examples -The following performs a full outer join between ``df1`` and ``df2``. +The following examples demonstrate various join types between ``df1`` and ``df2``. +>>> import pyspark.sql.functions as sf >>> from pyspark.sql import Row ->>> from pyspark.sql.functions import desc ->>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")]).toDF("age", "name") ->>> df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")]) ->>> df3 = spark.createDataFrame([Row(age=2, name="Alice"), Row(age=5, name="Bob")]) ->>> df4 = spark.createDataFrame([ -... Row(age=10, height=80, name="Alice"), -... Row(age=5, height=None, name="Bob"), -... Row(age=None, height=None, name="Tom"), -... 
Row(age=None, height=None, name=None), +>>> df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)]) +>>> df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", height=85)]) +>>> df3 = spark.createDataFrame([ +... Row(name="Alice", age=10, height=80), +... Row(name="Bob", age=5, height=None), +... Row(name="Tom", age=None, height=None), +... Row(name=None, age=None, height=None), ... ]) Inner join on columns (default) ->>> df.join(df2, 'name').select(df.name, df2.height).show() -++--+ -|name|height| -++--+ -| Bob|85| -++--+ ->>> df.join(df4, ['name', 'age']).select(df.name, df.age).show() -+
Re: [PR] [SPARK-45221][PYTHON][DOCS] Refine docstring of DataFrameReader.parquet [spark]
HyukjinKwon commented on PR #43301: URL: https://github.com/apache/spark/pull/43301#issuecomment-1758730466 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45221][PYTHON][DOCS] Refine docstring of DataFrameReader.parquet [spark]
HyukjinKwon closed pull request #43301: [SPARK-45221][PYTHON][DOCS] Refine docstring of DataFrameReader.parquet URL: https://github.com/apache/spark/pull/43301 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45442][PYTHON][DOCS] Refine docstring of DataFrame.show [spark]
HyukjinKwon commented on PR #43252: URL: https://github.com/apache/spark/pull/43252#issuecomment-1758731045 @allisonwang-db mind rebasing? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10-sql [spark]
dependabot[bot] commented on PR #43335: URL: https://github.com/apache/spark/pull/43335#issuecomment-1758729602 OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. If you change your mind, just re-open this PR and I'll resolve any conflicts on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10-sql [spark]
HyukjinKwon closed pull request #43335: Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10-sql URL: https://github.com/apache/spark/pull/43335 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10 [spark]
dependabot[bot] commented on PR #43336: URL: https://github.com/apache/spark/pull/43336#issuecomment-1758729567 OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. If you change your mind, just re-open this PR and I'll resolve any conflicts on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10 [spark]
HyukjinKwon closed pull request #43336: Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10 URL: https://github.com/apache/spark/pull/43336 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 [spark]
dependabot[bot] commented on PR #43337: URL: https://github.com/apache/spark/pull/43337#issuecomment-1758729539 OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. If you change your mind, just re-open this PR and I'll resolve any conflicts on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 [spark]
HyukjinKwon closed pull request #43337: Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 URL: https://github.com/apache/spark/pull/43337 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45505][PYTHON] Refactor analyzeInPython to make it reusable [spark]
allisonwang-db commented on PR #43340: URL: https://github.com/apache/spark/pull/43340#issuecomment-1758727468 cc @ueshin -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45505][PYTHON] Refactor analyzeInPython to make it reusable [spark]
allisonwang-db opened a new pull request, #43340: URL: https://github.com/apache/spark/pull/43340 ### What changes were proposed in this pull request? Currently, the `analyzeInPython` method in the UserDefinedPythonTableFunction object can start a Python process on the driver and run a Python function in that process. This PR aims to refactor this logic into a reusable runner class. ### Why are the changes needed? To make the code more reusable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
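As a rough illustration of that refactoring direction, a driver-side runner can own the "start a Python worker, send one request, read one result" lifecycle while each use case only supplies the (de)serialization. Every name and signature below is hypothetical and is not the API introduced by the PR:

```scala
import java.io.{DataInputStream, DataOutputStream}

// Hypothetical sketch only: names, signatures, and helpers are illustrative.
abstract class DriverPythonRunner[T] {
  // Each caller writes its own request payload...
  protected def writeRequest(out: DataOutputStream): Unit
  // ...and parses its own response.
  protected def readResult(in: DataInputStream): T

  // Assumed helpers that start/stop a Python worker on the driver.
  protected def startWorker(): (DataInputStream, DataOutputStream)
  protected def stopWorker(): Unit

  final def run(): T = {
    val (in, out) = startWorker()
    try {
      writeRequest(out)
      out.flush()
      readResult(in)
    } finally {
      stopWorker()
    }
  }
}
```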
Re: [PR] [SPARK-42584][CONNECT] Improve output of Column.explain [spark]
github-actions[bot] commented on PR #40528: URL: https://github.com/apache/spark/pull/40528#issuecomment-1758724597 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43396][CORE] Add config to control max ratio of decommissioning executors [spark]
github-actions[bot] commented on PR #41076: URL: https://github.com/apache/spark/pull/41076#issuecomment-1758724577 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump golang.org/x/net from 0.10.0 to 0.17.0 [spark-connect-go]
HyukjinKwon commented on PR #16: URL: https://github.com/apache/spark-connect-go/pull/16#issuecomment-1758722272 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump golang.org/x/net from 0.10.0 to 0.17.0 [spark-connect-go]
dependabot[bot] commented on PR #16: URL: https://github.com/apache/spark-connect-go/pull/16#issuecomment-1758722205 OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. If you change your mind, just re-open this PR and I'll resolve any conflicts on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump golang.org/x/net from 0.10.0 to 0.17.0 [spark-connect-go]
HyukjinKwon closed pull request #16: Bump golang.org/x/net from 0.10.0 to 0.17.0 URL: https://github.com/apache/spark-connect-go/pull/16 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45483][CONNECT] Correct the function groups in connect.functions [spark]
zhengruifeng commented on PR #43309: URL: https://github.com/apache/spark/pull/43309#issuecomment-1758721600 thank you all for reviews -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore [spark]
HeartSaVioR closed pull request #43202: [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore URL: https://github.com/apache/spark/pull/43202 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore [spark]
HeartSaVioR commented on PR #43202: URL: https://github.com/apache/spark/pull/43202#issuecomment-1758701984 Thanks! Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45504][SS] Lower CPU Priority Of RocksDB Background Threads [spark]
siying opened a new pull request, #43339: URL: https://github.com/apache/spark/pull/43339 ### What changes were proposed in this pull request? Lower RocksDB background threads' CPU priority. ### Why are the changes needed? RocksDB background threads execute background tasks such as compactions and flushes. They should in general run at lower priority than Spark task execution. In the case where a task waits for some RocksDB background work to finish, such as checkpointing or waiting for async checkpointing to complete, the task slot is idle anyway, so there is likely enough CPU left for the low-priority threads. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Ran benchmarks and checked the results. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
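Purely as an illustration of the mechanism (the exact calls used by the PR are not shown here), the RocksDB Java API exposes per-thread-pool priority controls on `org.rocksdb.Env`:

```scala
import org.rocksdb.{Env, Options, Priority, RocksDB}

RocksDB.loadLibrary()

// Illustrative sketch, assuming the default Env is used by the instance:
// lower the OS CPU priority of the background (compaction) pool so it yields
// to Spark task threads.
val env = Env.getDefault
env.lowerThreadPoolCPUPriority(Priority.LOW)

// The Env is then attached to the options used to open the database.
val options = new Options().setCreateIfMissing(true).setEnv(env)
```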
Re: [PR] [SPARK-45009][SQL] Decorrelate predicate subqueries in join condition [spark]
andylam-db commented on code in PR #42725: URL: https://github.com/apache/spark/pull/42725#discussion_r1355858190 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -4378,6 +4378,25 @@ object SQLConf { .booleanConf .createWithDefault(true) + val DECORRELATE_PREDICATE_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.decorrelatePredicateSubqueriesInJoinPredicate.enabled") + .internal() + .doc("Decorrelate predicate (in and exists) subqueries with correlated references in join " + +"predicates.") + .version("4.0.0") + .booleanConf + .createWithDefault(true) + + val OPTIMIZE_UNCORRELATED_IN_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.optimizeUncorrelatedInSubqueriesInJoinCondition.enabled") + .internal() + .doc("When true, optimize uncorrelated IN subqueries in join predicates by rewriting them " + +s"to joins. This interacts with ${LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR.key} because it " + +"can rewrite IN predicates.") + .version("4.0.0") + .booleanConf + .createWithDefault(false) Review Comment: Is it easier to just rely on LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR instead of creating this new flag? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]
siying opened a new pull request, #43338: URL: https://github.com/apache/spark/pull/43338 ### What changes were proposed in this pull request? Switch the RocksDB state store to use LZ4. ### Why are the changes needed? LZ4 is generally faster than Snappy. That is probably why we already use LZ4 for changelogs and in other places by default. However, we don't change RocksDB's default compression style, which is Snappy. The RocksDB team recommends LZ4 or ZSTD, and the default is kept at Snappy only for backward-compatibility reasons. We should use LZ4 instead. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests should cover it. Some benchmarks were run and show positive improvements. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
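For reference, switching the compression codec in the RocksDB Java API is a single option; the snippet below is an illustrative sketch rather than the PR's actual change:

```scala
import org.rocksdb.{ColumnFamilyOptions, CompressionType}

// Illustrative: use LZ4 instead of RocksDB's default Snappy compression.
val cfOptions = new ColumnFamilyOptions()
  .setCompressionType(CompressionType.LZ4_COMPRESSION)
```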
[PR] Bump golang.org/x/net from 0.10.0 to 0.17.0 [spark-connect-go]
dependabot[bot] opened a new pull request, #16: URL: https://github.com/apache/spark-connect-go/pull/16 Bumps [golang.org/x/net](https://github.com/golang/net) from 0.10.0 to 0.17.0. Commits https://github.com/golang/net/commit/b225e7ca6dde1ef5a5ae5ce922861bda011cfabd";>b225e7c http2: limit maximum handler goroutines to MaxConcurrentStreams https://github.com/golang/net/commit/88194ad8ab44a02ea952c169883c3f57db6cf9f4";>88194ad go.mod: update golang.org/x dependencies https://github.com/golang/net/commit/2b60a61f1e4cf3a5ecded0bd7e77ea168289e6de";>2b60a61 quic: fix several bugs in flow control accounting https://github.com/golang/net/commit/73d82efb96cacc0c378bc150b56675fc191894b9";>73d82ef quic: handle DATA_BLOCKED frames https://github.com/golang/net/commit/5d5a036a503f8accd748f7453c0162115187be13";>5d5a036 quic: handle streams moving from the data queue to the meta queue https://github.com/golang/net/commit/350aad2603e57013fafb1a9e2089a382fe67dc80";>350aad2 quic: correctly extend peer's flow control window after MAX_DATA https://github.com/golang/net/commit/21814e71db756f39b69fb1a3e06350fa555a79b1";>21814e7 quic: validate connection id transport parameters https://github.com/golang/net/commit/a600b3518eed7a9a4e24380b4b249cb986d9b64d";>a600b35 quic: avoid redundant MAX_DATA updates https://github.com/golang/net/commit/ea633599b58dc6a50d33c7f5438edfaa8bc313df";>ea63359 http2: check stream body is present on read timeout https://github.com/golang/net/commit/ddd8598e5694aa5e966e44573a53e895f6fa5eb2";>ddd8598 quic: version negotiation Additional commits viewable in https://github.com/golang/net/compare/v0.10.0...v0.17.0";>compare view [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=golang.org/x/net&package-manager=go_modules&previous-version=0.10.0&new-version=0.17.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/apache/spark-connect-go/network/alerts). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-42881][SQL] Codegen Support for get_json_object [spark]
panbingkun commented on code in PR #40506: URL: https://github.com/apache/spark/pull/40506#discussion_r1355839029 ## sql/core/benchmarks/JsonBenchmark-results.txt: ## @@ -3,127 +3,128 @@ Benchmark for performance of JSON parsing Preparing data for benchmarking ... -OpenJDK 64-Bit Server VM 17.0.8+7-LTS on Linux 5.15.0-1046-azure +OpenJDK 64-Bit Server VM 17.0.8+7-LTS on Linux 5.15.0-1047-azure Review Comment: Okay, I will submit a follow-up PR to complete it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore [spark]
HeartSaVioR commented on code in PR #43202: URL: https://github.com/apache/spark/pull/43202#discussion_r1355833024 ## docs/structured-streaming-programming-guide.md: ## @@ -2385,6 +2385,11 @@ Here are the configs regarding to RocksDB instance of the state store provider: Total memory to be occupied by blocks in high priority pool as a fraction of memory allocated across all RocksDB instances on a single node using maxMemoryUsageMB. 0.1 + +spark.sql.streaming.stateStore.rocksdb.allowFAllocate Review Comment: Nice, we can move forward. :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10 [spark]
dependabot[bot] opened a new pull request, #43336: URL: https://github.com/apache/spark/pull/43336 Bumps org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2. [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=org.apache.zookeeper:zookeeper&package-manager=maven&previous-version=3.5.7&new-version=3.7.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/apache/spark/network/alerts). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 [spark]
dependabot[bot] opened a new pull request, #43337: URL: https://github.com/apache/spark/pull/43337 Bumps org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2. [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=org.apache.zookeeper:zookeeper&package-manager=maven&previous-version=3.5.7&new-version=3.7.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/apache/spark/network/alerts). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 in /connector/kafka-0-10-sql [spark]
dependabot[bot] opened a new pull request, #43335: URL: https://github.com/apache/spark/pull/43335 Bumps org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2. [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=org.apache.zookeeper:zookeeper&package-manager=maven&previous-version=3.5.7&new-version=3.7.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/apache/spark/network/alerts). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45009][SQL] Decorrelate predicate subqueries in join condition [spark]
jchen5 commented on code in PR #42725: URL: https://github.com/apache/spark/pull/42725#discussion_r1355693885 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -4378,6 +4378,25 @@ object SQLConf { .booleanConf .createWithDefault(true) + val DECORRELATE_PREDICATE_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.decorrelatePredicateSubqueriesInJoinPredicate.enabled") + .internal() + .doc("Decorrelate predicate (in and exists) subqueries with correlated references in join " + +"predicates.") + .version("4.0.0") + .booleanConf + .createWithDefault(true) + + val OPTIMIZE_UNCORRELATED_IN_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.optimizeUncorrelatedInSubqueriesInJoinCondition.enabled") + .internal() + .doc("When true, optimize uncorrelated IN subqueries in join predicates by rewriting them " + +s"to joins. This interacts with ${LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR.key} because it " + +"can rewrite IN predicates.") + .version("4.0.0") + .booleanConf + .createWithDefault(false) Review Comment: We can have this true by default now - LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR was enabled by default recently. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [WIP][SPARK-45022][SQL] Provide context for dataset API errors [spark]
MaxGekk opened a new pull request, #43334: URL: https://github.com/apache/spark/pull/43334 ### What changes were proposed in this pull request? This PR captures the dataset APIs used by the user code and the call site in the user code and provides better error messages. E.g. consider the following Spark app `SimpleApp.scala`: ```scala 1 import org.apache.spark.sql.SparkSession 2 import org.apache.spark.sql.functions._ 3 4 object SimpleApp { 5def main(args: Array[String]) { 6 val spark = SparkSession.builder.appName("Simple Application").config("spark.sql.ansi.enabled", true).getOrCreate() 7 import spark.implicits._ 8 9 val c = col("a") / col("b") 10 11 Seq((1, 0)).toDF("a", "b").select(c).show() 12 13 spark.stop() 14} 15 } ``` After this PR the error message contains the error context (which Spark Dataset API is called from where in the user code) in the following form: ``` Exception in thread "main" org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. == Dataset == "div" was called from SimpleApp$.main(SimpleApp.scala:9) at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) at org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672 ... ``` which is similar to the already provided context in case of SQL queries: ``` org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. == SQL(line 1, position 1) == a / b ^ at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) at org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala) ... ``` Please note that stack trace in `spark-shell` doesn't contain meaningful elements: ``` scala> Thread.currentThread().getStackTrace.foreach(println) java.base/java.lang.Thread.getStackTrace(Thread.java:1602) $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:23) $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:27) $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw.(:29) $line15.$read$$iw$$iw$$iw$$iw$$iw.(:31) $line15.$read$$iw$$iw$$iw$$iw.(:33) $line15.$read$$iw$$iw$$iw.(:35) $line15.$read$$iw$$iw.(:37) $line15.$read$$iw.(:39) $line15.$read.(:41) $line15.$read$.(:45) $line15.$read$.() $line15.$eval$.$print$lzycompute(:7) $line15.$eval$.$print(:6) $line15.$eval.$print() java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ... ``` so this change doesn't help with that usecase. ### Why are the changes needed? To provide more user friendly errors. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Added new UTs to `QueryExecutionAnsiErrorsSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44855][CONNECT] Small tweaks to attaching ExecuteGrpcResponseSender to ExecuteResponseObserver [spark]
hvanhovell closed pull request #43181: [SPARK-44855][CONNECT] Small tweaks to attaching ExecuteGrpcResponseSender to ExecuteResponseObserver URL: https://github.com/apache/spark/pull/43181 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result [spark]
dtenedor commented on code in PR #43204: URL: https://github.com/apache/spark/pull/43204#discussion_r1355605235 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala: ## @@ -284,6 +285,14 @@ object UserDefinedPythonTableFunction { val schema = DataType.fromJson( PythonWorkerUtils.readUTF(length, dataIn)).asInstanceOf[StructType] + // Receive the pickled AnalyzeResult buffer, if any. + val pickledAnalyzeResult: Array[Byte] = dataIn.readInt() match { +case length => + val obj = new Array[Byte](length) + dataIn.readFully(obj) + obj + } Review Comment: Sure, this is done. ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonUDTFExec.scala: ## @@ -132,7 +135,15 @@ object PythonUDTFRunner { case None => dataOut.writeInt(0) } +// Write the pickled AnalyzeResult buffer from the UDTF "analyze" method, if any. +dataOut.writeBoolean(udtf.pickledAnalyzeResult.nonEmpty) +udtf.pickledAnalyzeResult.map { p => + dataOut.writeInt(p.length) + dataOut.write(p) Review Comment: Done. ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonUDTFExec.scala: ## @@ -132,7 +135,15 @@ object PythonUDTFRunner { case None => dataOut.writeInt(0) } +// Write the pickled AnalyzeResult buffer from the UDTF "analyze" method, if any. +dataOut.writeBoolean(udtf.pickledAnalyzeResult.nonEmpty) +udtf.pickledAnalyzeResult.map { p => Review Comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]
tgravescs commented on code in PR #43323: URL: https://github.com/apache/spark/pull/43323#discussion_r1355509647 ## core/src/test/scala/org/apache/spark/resource/ResourceProfileManagerSuite.scala: ## @@ -138,7 +138,7 @@ class ResourceProfileManagerSuite extends SparkFunSuite { rpmanager.isSupported(taskProf) }.getMessage assert(error === "TaskResourceProfiles are only supported for Standalone " + - "and Yarn cluster for now when dynamic allocation is disabled.") + "and Yarn and Kubernetes cluster for now when dynamic allocation is disabled.") Review Comment: nit could change this to "Standalone, YARN, and Kubernetes" -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
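For background on the feature being extended to Kubernetes here, a task-level resource profile is built roughly as follows; the resource names and amounts are illustrative:

```scala
import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}

// Illustrative: ask for 2 CPUs and 1 GPU per task for a specific stage,
// without changing the executor-level resources (a "task resource profile").
val taskReqs = new TaskResourceRequests().cpus(2).resource("gpu", 1)
val rp = new ResourceProfileBuilder().require(taskReqs).build()

// Attached to an RDD; stages computed from it use this profile.
// rdd.withResources(rp)
```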
Re: [PR] [SPARK-45488] XML: Add support for value in 'rowTag' element [spark]
shujingyang-db commented on code in PR #43319: URL: https://github.com/apache/spark/pull/43319#discussion_r1355376309 ## sql/core/src/test/resources/test-data/xml-resources/root-level-value.xml: ## @@ -0,0 +1,9 @@ + + +value1 +value2 + +value3 + + + Review Comment: Sure I will add another test case for this. In this case, `5` and `6` will also be skipped. We only read values if the struct only consists of attributes and `valueTag`. If there's another field (i.e. `tag`), the value in between the tags will be ignored. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result [spark]
ueshin commented on code in PR #43204: URL: https://github.com/apache/spark/pull/43204#discussion_r1355450724 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala: ## @@ -284,6 +285,14 @@ object UserDefinedPythonTableFunction { val schema = DataType.fromJson( PythonWorkerUtils.readUTF(length, dataIn)).asInstanceOf[StructType] + // Receive the pickled AnalyzeResult buffer, if any. + val pickledAnalyzeResult: Array[Byte] = dataIn.readInt() match { +case length => + val obj = new Array[Byte](length) + dataIn.readFully(obj) + obj + } Review Comment: Shall we use `PythonWorkerUtils.readBytes` as https://github.com/apache/spark/pull/43321 was merged? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result [spark]
ueshin commented on code in PR #43204: URL: https://github.com/apache/spark/pull/43204#discussion_r1355452243 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonUDTFExec.scala: ## @@ -132,7 +135,15 @@ object PythonUDTFRunner { case None => dataOut.writeInt(0) } +// Write the pickled AnalyzeResult buffer from the UDTF "analyze" method, if any. +dataOut.writeBoolean(udtf.pickledAnalyzeResult.nonEmpty) +udtf.pickledAnalyzeResult.map { p => Review Comment: nit: `foreach` instead of `map`? ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala: ## @@ -284,6 +285,14 @@ object UserDefinedPythonTableFunction { val schema = DataType.fromJson( PythonWorkerUtils.readUTF(length, dataIn)).asInstanceOf[StructType] + // Receive the pickled AnalyzeResult buffer, if any. + val pickledAnalyzeResult: Array[Byte] = dataIn.readInt() match { +case length => + val obj = new Array[Byte](length) + dataIn.readFully(obj) + obj + } Review Comment: Shall we use `PythonWorkerUtils.readBytes` as https://github.com/apache/spark/pull/43321 is merged? ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonUDTFExec.scala: ## @@ -132,7 +135,15 @@ object PythonUDTFRunner { case None => dataOut.writeInt(0) } +// Write the pickled AnalyzeResult buffer from the UDTF "analyze" method, if any. +dataOut.writeBoolean(udtf.pickledAnalyzeResult.nonEmpty) +udtf.pickledAnalyzeResult.map { p => + dataOut.writeInt(p.length) + dataOut.write(p) Review Comment: ditto, `PythonWorkerUtils.writeBytes`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45009][SQL] Decorrelate predicate subqueries in join condition [spark]
andylam-db commented on code in PR #42725: URL: https://github.com/apache/spark/pull/42725#discussion_r1355441657 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala: ## @@ -159,6 +160,66 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper { Project(p.output, Filter(newCond.get, inputPlan)) } +// This case takes care of predicate subqueries in join conditions that are not pushed down +// to the children nodes by [[PushDownPredicates]]. Review Comment: Yes -- pushdown rules are in `operatorOptimizationBatch` and `RewriteSubquery` batch runs after those (almost the last batch in the Optimizer) ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -4378,6 +4378,25 @@ object SQLConf { .booleanConf .createWithDefault(true) + val DECORRELATE_PREDICATE_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.decorrelatePredicateSubqueriesInJoinPredicate.enabled") + .internal() + .doc("Decorrelate predicate (in and exists) subqueries with correlated references in join " + +"predicates.") + .version("4.0.0") + .booleanConf + .createWithDefault(true) + + val OPTIMIZE_UNCORRELATED_IN_SUBQUERIES_IN_JOIN_CONDITION = + buildConf("spark.sql.optimizer.optimizeUncorrelatedInSubqueriesInJoinCondition.enabled") + .internal() + .doc("When true, optimize uncorrelated IN subqueries in join predicates by rewriting them " + +s"to joins. This interacts with ${LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR.key} because it " + Review Comment: This happens because with this change, IN subqueries are rewritten to joins and therefore force the "correct" NULL-in-empty-list behavior. This config is added in case users want to use the legacy behavior. This has nothing to do with EXISTS. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
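To make the flag interaction concrete, here is an illustrative example of the NULL-in-empty-list difference the comment refers to (the data is made up):

```scala
// Under the non-legacy semantics, NULL IN (<empty result>) is false, so the
// NOT IN below is true and the row is kept. Under the legacy behavior the
// predicate evaluated to NULL, so the row was filtered out. Rewriting the IN
// subquery to a join forces the non-legacy semantics, hence the interaction
// between the two configs.
spark.sql("""
  SELECT *
  FROM VALUES (CAST(NULL AS INT)) AS t(x)
  WHERE x NOT IN (SELECT id FROM range(0))
""").show()
// non-legacy behavior: one row with x = NULL; legacy behavior: empty result
```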
Re: [PR] [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore [spark]
siying commented on code in PR #43202: URL: https://github.com/apache/spark/pull/43202#discussion_r1355420314 ## docs/structured-streaming-programming-guide.md: ## @@ -2385,6 +2385,11 @@ Here are the configs regarding to RocksDB instance of the state store provider: Total memory to be occupied by blocks in high priority pool as a fraction of memory allocated across all RocksDB instances on a single node using maxMemoryUsageMB. 0.1 + +spark.sql.streaming.stateStore.rocksdb.allowFAllocate Review Comment: Keeping it consistent with RocksDB conf name sounds good to me. All good for me then. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45488] XML: Add support for value in 'rowTag' element [spark]
shujingyang-db commented on code in PR #43319: URL: https://github.com/apache/spark/pull/43319#discussion_r1355401700 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala: ## @@ -219,11 +219,27 @@ private[sql] object XmlInferSchema { dataTypes += inferredType nameToDataType += (field -> dataTypes) +case c: Characters if !c.isWhiteSpace => + // This can be an attribute-only object + val valueTagType = inferFrom(c.getData, options) + nameToDataType += options.valueTag -> ArrayBuffer(valueTagType) + case _: EndElement => shouldStop = StaxXmlParserUtils.checkEndElement(parser) case _ => // do nothing } +} + // A structure object is an attribute-only element + // if it only consists of attributes and valueTags. +// If not, we will remove the valueTag field from the schema +val attributesOnly = nameToDataType.forall { + case (fieldName, dataTypes) => + dataTypes.length == 1 && + (fieldName == options.valueTag || fieldName.startsWith(options.attributePrefix)) Review Comment: I agree on it. However this is our current [behavior](https://github.com/shujingyang-db/spark/blob/8fd10a40641c831155ffd644e331f0b818f72700/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala#L198), we might need to keep aligned with it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45488] XML: Add support for value in 'rowTag' element [spark]
shujingyang-db commented on PR #43319: URL: https://github.com/apache/spark/pull/43319#issuecomment-1758122788 @srowen Thanks for bringing up this case! In the case of ``` <book>Great <foo>Book</foo> to read!</book> ``` if `rowTag` is `book`, the resulting schema will be `foo STRING`, and the characters `Great` and ` to read!` in between the tags will be skipped. This is because we only read values in between tags if the struct _only_ consists of attributes and valueTag. Our definition of the `valueTag` is: > valueTag: The tag used for the value when there are attributes in the element having no child. Default is _VALUE. If there's another field, the value in between the tags will be ignored. If the user specifies the schema to be `foo STRING, _VALUE STRING`, we will also ignore the value and leave `_VALUE` empty. ([current behavior](https://github.com/shujingyang-db/spark/blob/8fd10a40641c831155ffd644e331f0b818f72700/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala#L198)) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
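A small sketch of the behavior described above, assuming the built-in XML data source with its default `valueTag` (`_VALUE`); the file path and contents are hypothetical:

```scala
// /tmp/books.xml is assumed to contain rows shaped like the example above, e.g.
//   <books><book>Great <foo>Book</foo> to read!</book></books>
val df = spark.read
  .format("xml")
  .option("rowTag", "book")
  .load("/tmp/books.xml")

df.printSchema()
// Expected per the discussion: only `foo: string` is inferred; the character data
// around the <foo> tag is dropped because the struct contains a field other than
// attributes and the value tag.
```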
Re: [PR] [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore [spark]
schenksj commented on code in PR #43202: URL: https://github.com/apache/spark/pull/43202#discussion_r1355389886 ## docs/structured-streaming-programming-guide.md: ## @@ -2385,6 +2385,11 @@ Here are the configs regarding to RocksDB instance of the state store provider: Total memory to be occupied by blocks in high priority pool as a fraction of memory allocated across all RocksDB instances on a single node using maxMemoryUsageMB. 0.1 + +spark.sql.streaming.stateStore.rocksdb.allowFAllocate Review Comment: Thank you @siying ! Is it the capitalization that you are finding bothersome? I carried through the `allowFAllocate` casing from the RocksDB DBOptions class: https://javadoc.io/doc/org.rocksdb/rocksdbjni/6.8.1/org/rocksdb/DBOptions.html#allowFAllocate() I'm definitely not tied to it, but do like maintaining consistency across the two packages. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
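For context, the RocksDB Java API uses the same casing; a minimal sketch in Scala against rocksdbjni, with the setter name assumed to mirror the getter in the linked javadoc:

```scala
import org.rocksdb.{DBOptions, RocksDB}

RocksDB.loadLibrary()
// The option carries the "FAllocate" casing that the Spark config name mirrors.
val opts = new DBOptions().setAllowFAllocate(false)
assert(!opts.allowFAllocate())
```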
Re: [PR] [SPARK-45488] XML: Add support for value in 'rowTag' element [spark]
shujingyang-db commented on code in PR #43319: URL: https://github.com/apache/spark/pull/43319#discussion_r1355376309 ## sql/core/src/test/resources/test-data/xml-resources/root-level-value.xml: ## @@ -0,0 +1,9 @@ + + +value1 +value2 + +value3 + + + Review Comment: Sure I will add another test case for this. In this case, `5` and `6` will also be skipped. We only read values if the struct only consists of attributes and `valueTag`. If there's another field (i.e. `tag`), the value in between the tags will be ignored. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45488] XML: Add support for value in 'rowTag' element [spark]
shujingyang-db commented on code in PR #43319: URL: https://github.com/apache/spark/pull/43319#discussion_r1355375216 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala: ## @@ -219,11 +219,27 @@ private[sql] object XmlInferSchema { dataTypes += inferredType nameToDataType += (field -> dataTypes) +case c: Characters if !c.isWhiteSpace => + // This can be an attribute-only object + val valueTagType = inferFrom(c.getData, options) + nameToDataType += options.valueTag -> ArrayBuffer(valueTagType) Review Comment: IMHO it will make the loop much more complicated as we need to examine the case you mentioned above ``` 45 67 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45488] XML: Add support for value in 'rowTag' element [spark]
shujingyang-db commented on code in PR #43319: URL: https://github.com/apache/spark/pull/43319#discussion_r1355372857 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala: ## @@ -122,7 +122,12 @@ class StaxXmlParser( } val parser = StaxXmlParserUtils.filteredReader(xml) val rootAttributes = StaxXmlParserUtils.gatherRootAttributes(parser) - Some(convertObject(parser, schema, options, rootAttributes)) + // A structure object is an attribute-only element + // if it only consists of attributes and valueTags. + val isRootAttributesOnly = schema.fields.forall { f => +f.name == options.valueTag || f.name.startsWith(options.attributePrefix) + } + Some(convertObject(parser, schema, options, rootAttributes, isRootAttributesOnly)) Review Comment: We only read values if the struct only consists of attributes and `valueTag`. If there's another field, the value in between the tags will be ignored. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
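A standalone sketch of the attributes-only check being discussed, assuming the default option values (`valueTag = "_VALUE"`, `attributePrefix = "_"`) and illustrative field names:

```scala
val valueTag = "_VALUE"
val attributePrefix = "_"

// A struct is "attributes-only" if every field is either the value tag or an attribute.
def isAttributesOnly(fieldNames: Seq[String]): Boolean =
  fieldNames.forall(n => n == valueTag || n.startsWith(attributePrefix))

isAttributesOnly(Seq("_id", "_VALUE"))  // true  -> character data is kept under _VALUE
isAttributesOnly(Seq("foo", "_VALUE"))  // false -> character data between tags is ignored
```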
Re: [PR] [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore [spark]
siying commented on code in PR #43202: URL: https://github.com/apache/spark/pull/43202#discussion_r1355366397 ## docs/structured-streaming-programming-guide.md: ## @@ -2385,6 +2385,11 @@ Here are the configs regarding to RocksDB instance of the state store provider: Total memory to be occupied by blocks in high priority pool as a fraction of memory allocated across all RocksDB instances on a single node using maxMemoryUsageMB. 0.1 + +spark.sql.streaming.stateStore.rocksdb.allowFAllocate Review Comment: Is it just me, or does "FAllocate" feel odd? Why not "allowFallocate"? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45204][CONNECT] Add optional ExecuteHolder to SparkConnectPlanner [spark]
hvanhovell closed pull request #43311: [SPARK-45204][CONNECT] Add optional ExecuteHolder to SparkConnectPlanner URL: https://github.com/apache/spark/pull/43311 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45204][CONNECT] Add optional ExecuteHolder to SparkConnectPlanner [spark]
hvanhovell commented on PR #43311: URL: https://github.com/apache/spark/pull/43311#issuecomment-1758090274 Merging. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45204][CONNECT] Add optional ExecuteHolder to SparkConnectPlanner [spark]
hvanhovell commented on code in PR #43311: URL: https://github.com/apache/spark/pull/43311#discussion_r1355364109 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -86,14 +86,29 @@ final case class InvalidCommandInput( private val cause: Throwable = null) extends Exception(message, cause) -class SparkConnectPlanner(val sessionHolder: SessionHolder) extends Logging { +class SparkConnectPlanner( +val sessionHolder: SessionHolder, +val executeHolderOpt: Option[ExecuteHolder] = None) Review Comment: Ok, makes sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45430] Fix for FramelessOffsetWindowFunction when IGNORE NULLS and offset > rowCount [spark]
vitaliili-db commented on code in PR #43236: URL: https://github.com/apache/spark/pull/43236#discussion_r1355356343 ## sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala: ## @@ -200,15 +200,19 @@ class FrameLessOffsetWindowFunctionFrame( override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { resetStates(rows) -if (ignoreNulls) { - findNextRowWithNonNullInput() +if (Math.abs(offset) > rows.length) { + inputIndex = offset } else { - // drain the first few rows if offset is larger than zero - while (inputIndex < offset) { -if (inputIterator.hasNext) inputIterator.next() -inputIndex += 1 + if (ignoreNulls) { +findNextRowWithNonNullInput() + } else { +// drain the first few rows if offset is larger than zero +while (inputIndex < offset) { + if (inputIterator.hasNext) inputIterator.next() + inputIndex += 1 +} +inputIndex = offset } - inputIndex = offset Review Comment: Yes, it works when `ignoreNulls == true` but breaks when `ignoreNulls == false`. In particular, it will trigger the last `else` branch in `doWrite`: ``` } else { (current: InternalRow) => if (inputIndex >= 0 && inputIndex < input.length) { val r = WindowFunctionFrame.getNextOrNull(inputIterator) projection(r) } else { // Use default values since the offset row does not exist. fillDefaultValue(current) } inputIndex += 1 } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
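A hedged sketch of the scenario the fix targets (offset larger than the partition's row count), using made-up data; the expectation is the default value (NULL) for both the plain and IGNORE NULLS variants:

```scala
import spark.implicits._

Seq(("a", 1), ("a", 2)).toDF("grp", "v").createOrReplaceTempView("t")

spark.sql("""
  SELECT grp, v,
         lead(v, 5)              OVER (PARTITION BY grp ORDER BY v) AS lead_plain,
         lead(v, 5) IGNORE NULLS OVER (PARTITION BY grp ORDER BY v) AS lead_ignore_nulls
  FROM t
""").show()
// Only two rows per partition but an offset of 5: both lead columns should be NULL.
```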
Re: [PR] [SPARK-45430] Fix for FramelessOffsetWindowFunction when IGNORE NULLS and offset > rowCount [spark]
vitaliili-db commented on code in PR #43236: URL: https://github.com/apache/spark/pull/43236#discussion_r1355356343 ## sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala: ## @@ -200,15 +200,19 @@ class FrameLessOffsetWindowFunctionFrame( override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { resetStates(rows) -if (ignoreNulls) { - findNextRowWithNonNullInput() +if (Math.abs(offset) > rows.length) { + inputIndex = offset } else { - // drain the first few rows if offset is larger than zero - while (inputIndex < offset) { -if (inputIterator.hasNext) inputIterator.next() -inputIndex += 1 + if (ignoreNulls) { +findNextRowWithNonNullInput() + } else { +// drain the first few rows if offset is larger than zero +while (inputIndex < offset) { + if (inputIterator.hasNext) inputIterator.next() + inputIndex += 1 +} +inputIndex = offset } - inputIndex = offset Review Comment: Yes, it works when `ignoreNulls == true` but breaks when `ignoreNulls == false`. ## sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala: ## @@ -200,15 +200,19 @@ class FrameLessOffsetWindowFunctionFrame( override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { resetStates(rows) -if (ignoreNulls) { - findNextRowWithNonNullInput() +if (Math.abs(offset) > rows.length) { + inputIndex = offset } else { - // drain the first few rows if offset is larger than zero - while (inputIndex < offset) { -if (inputIterator.hasNext) inputIterator.next() -inputIndex += 1 + if (ignoreNulls) { Review Comment: When `offset > rows.length`, it is safe to just set `inputIndex` to its value. I can add a conditional inside `if (ignoreNulls)` if you feel strongly about it :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org