Re: [PR] [SPARK-48791][CORE] Fix perf regression caused by the accumulators registration overhead using CopyOnWriteArrayList [spark]

2024-07-02 Thread via GitHub
Ngone51 commented on code in PR #47197: URL: https://github.com/apache/spark/pull/47197#discussion_r1663616953 ## core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala: ## @@ -264,12 +264,37 @@ class TaskMetrics private[spark] () extends Serializable { /** * Ext

Re: [PR] [SPARK-48791][CORE] Fix perf regression caused by the accumulators registration overhead using CopyOnWriteArrayList [spark]

2024-07-02 Thread via GitHub
Ngone51 commented on code in PR #47197: URL: https://github.com/apache/spark/pull/47197#discussion_r1663611672 ## core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala: ## @@ -340,7 +365,7 @@ private[spark] object TaskMetrics extends Logging { externalAccums.a

Re: [PR] [SPARK-48791][CORE] Fix perf regression caused by the accumulators registration overhead using CopyOnWriteArrayList [spark]

2024-07-02 Thread via GitHub
Ngone51 commented on code in PR #47197: URL: https://github.com/apache/spark/pull/47197#discussion_r1663609924 ## core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala: ## @@ -264,12 +264,37 @@ class TaskMetrics private[spark] () extends Serializable { /** * Ext

Re: [PR] [SPARK-48791][CORE] Fix perf regression caused by the accumulators registration overhead using CopyOnWriteArrayList [spark]

2024-07-02 Thread via GitHub
cloud-fan commented on code in PR #47197: URL: https://github.com/apache/spark/pull/47197#discussion_r1663603187 ## core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala: ## @@ -340,7 +365,7 @@ private[spark] object TaskMetrics extends Logging { externalAccums

Re: [PR] [SPARK-48791][CORE] Fix perf regression caused by the accumulators registration overhead using CopyOnWriteArrayList [spark]

2024-07-02 Thread via GitHub
mridulm commented on code in PR #47197: URL: https://github.com/apache/spark/pull/47197#discussion_r1663603110 ## core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala: ## @@ -264,12 +264,37 @@ class TaskMetrics private[spark] () extends Serializable { /** * Ext

[PR] [SPARK-48791][CORE] Fix perf regression caused by the accumulators registration overhead using CopyOnWriteArrayList [spark]

2024-07-02 Thread via GitHub
Ngone51 opened a new pull request, #47197: URL: https://github.com/apache/spark/pull/47197 ### What changes were proposed in this pull request? This PR proposes to use the `ArrayBuffer` together with the read/write lock rather than `CopyOnWriteArrayList` for `TaskMetrics._
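The tradeoff the PR title describes can be illustrated outside Spark. Below is a Python sketch (the class names `LockedBuffer` and `CopyOnWriteBuffer` are made up here, not Spark or JDK APIs): a lock-guarded growable buffer appends in amortized O(1), whereas a copy-on-write list rebuilds its entire backing array on every add, so registering n accumulators costs O(n^2) total.

```python
import threading

class LockedBuffer:
    """Append-heavy list guarded by a mutex (the ArrayBuffer + lock idea).

    Each append is O(1) amortized; readers take a consistent copy under
    the same lock.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._items = []

    def append(self, item):
        # The hot path: analogous to per-task accumulator registration.
        with self._lock:
            self._items.append(item)

    def snapshot(self):
        with self._lock:
            return list(self._items)

class CopyOnWriteBuffer:
    """Naive copy-on-write list: every append copies the whole array."""
    def __init__(self):
        self._lock = threading.Lock()
        self._items = ()

    def append(self, item):
        with self._lock:
            # Full copy on each add: O(n) per append, O(n^2) for n appends.
            self._items = self._items + (item,)

    def snapshot(self):
        # Reads are lock-free, the upside of copy-on-write structures.
        return self._items
```

The sketch shows why copy-on-write is attractive when reads dominate but becomes a regression when the write (registration) path is hot, which is the scenario the PR addresses.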

Re: [PR] [SPARK-48780][SQL] Make errors in NamedParametersSupport generic to handle functions and procedures [spark]

2024-07-02 Thread via GitHub
aokolnychyi commented on PR #47189: URL: https://github.com/apache/spark/pull/47189#issuecomment-2205197284 Hm, seems like a failure in streaming. I'll check tomorrow. ``` [info] - SPARK-41224: collect data using arrow *** FAILED *** (39 milliseconds) [info] VerifyEvents.this.

Re: [PR] [SPARK-48787][BUILD] Upgrade Kafka to 3.7.1 [spark]

2024-07-02 Thread via GitHub
HyukjinKwon commented on PR #47191: URL: https://github.com/apache/spark/pull/47191#issuecomment-2205127779 cc @HeartSaVioR -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

Re: [PR] FileStreamSource maxCachedFiles set to 0 causes batch with no files to be processed [spark]

2024-07-02 Thread via GitHub
HyukjinKwon commented on PR #47195: URL: https://github.com/apache/spark/pull/47195#issuecomment-2205127608 Mind filing a JIRA? See also https://spark.apache.org/contributing.html

Re: [PR] [Only Test] Make HiveGenericUDF's DeferredObject lazy [spark]

2024-07-02 Thread via GitHub
jackylee-ch commented on code in PR #47193: URL: https://github.com/apache/spark/pull/47193#discussion_r1663475103 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFEvaluators.scala: ## @@ -129,7 +129,7 @@ class HiveGenericUDFEvaluator( override def returnType: Data

Re: [PR] [SPARK-48790][TESTING] Use checkDatasetUnorderly in DeprecatedDatasetAggregatorSuite [spark]

2024-07-02 Thread via GitHub
amaliujia commented on PR #47196: URL: https://github.com/apache/spark/pull/47196#issuecomment-2205080743 @cloud-fan

[PR] [SPARK-48790][TESTING] Use checkDatasetUnorderly in DeprecatedDatasetAggregatorSuite [spark]

2024-07-02 Thread via GitHub
amaliujia opened a new pull request, #47196: URL: https://github.com/apache/spark/pull/47196 ### What changes were proposed in this pull request? Use `checkDatasetUnorderly` in DeprecatedDatasetAggregatorSuite. The tests do not need depending on the ordering of the result.
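The motivation — asserting on Dataset results without depending on row order — amounts to a multiset comparison. A minimal Python analogue (the helper name `check_unorderly` is hypothetical; Spark's actual helper is the Scala `checkDatasetUnorderly`):

```python
from collections import Counter

def check_unorderly(actual, expected):
    """Assert two result collections match as multisets.

    Duplicates still matter, ordering does not -- the same contract the
    Spark test helper provides for Dataset results.
    """
    assert Counter(actual) == Counter(expected), (
        f"results differ as multisets: {actual!r} vs {expected!r}")

# Passes even though the rows arrive in a different order:
check_unorderly([("b", 2), ("a", 1)], [("a", 1), ("b", 2)])
```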

Re: [PR] [SPARK-48307][SQL][FOLLOWUP] Eliminate the use of mutable.ArrayBuffer [spark]

2024-07-02 Thread via GitHub
yaooqinn commented on code in PR #47185: URL: https://github.com/apache/spark/pull/47185#discussion_r1663462271 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala: ## @@ -140,26 +140,31 @@ case class InlineCTE( cteMap: mutable.Map[Long,

Re: [PR] [SPARK-48307][SQL][FOLLOWUP] Eliminate the use of mutable.ArrayBuffer [spark]

2024-07-02 Thread via GitHub
yaooqinn commented on code in PR #47185: URL: https://github.com/apache/spark/pull/47185#discussion_r1663462046 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala: ## @@ -140,26 +140,31 @@ case class InlineCTE( cteMap: mutable.Map[Long,

Re: [PR] [SPARK-48773] Document config "spark.default.parallelism" by config builder framework [spark]

2024-07-02 Thread via GitHub
amaliujia commented on code in PR #47171: URL: https://github.com/apache/spark/pull/47171#discussion_r1663461576 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -42,6 +42,18 @@ package object config { private[spark] val SPARK_TASK_PREFIX = "spark.

Re: [PR] [SPARK-48307][SQL][FOLLOWUP] Eliminate the use of mutable.ArrayBuffer [spark]

2024-07-02 Thread via GitHub
amaliujia commented on code in PR #47185: URL: https://github.com/apache/spark/pull/47185#discussion_r1663459563 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala: ## @@ -140,26 +140,31 @@ case class InlineCTE( cteMap: mutable.Map[Long

[PR] FileStreamSource maxCachedFiles set to 0 causes batch with no files to be processed [spark]

2024-07-02 Thread via GitHub
ragnarok56 opened a new pull request, #47195: URL: https://github.com/apache/spark/pull/47195 ### What changes were proposed in this pull request? Fix an issue that was identified during testing with changes from https://github.com/apache/spark/pull/45362. When setting `maxCachedFile

Re: [PR] [SPARK-48714][PYTHON] Implement `DataFrame.mergeInto` in PySpark [spark]

2024-07-02 Thread via GitHub
viirya commented on code in PR #47086: URL: https://github.com/apache/spark/pull/47086#discussion_r1663449734 ## python/pyspark/sql/dataframe.py: ## @@ -5984,6 +5984,41 @@ def writeTo(self, table: str) -> DataFrameWriterV2: """ ... +@dispatch_df_method +

Re: [PR] [Only Test] Make HiveGenericUDF's DeferredObject lazy [spark]

2024-07-02 Thread via GitHub
cloud-fan commented on code in PR #47193: URL: https://github.com/apache/spark/pull/47193#discussion_r1663441930 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFEvaluators.scala: ## @@ -129,7 +129,7 @@ class HiveGenericUDFEvaluator( override def returnType: DataTy

Re: [PR] [SPARK-48773] Document config "spark.default.parallelism" by config builder framework [spark]

2024-07-02 Thread via GitHub
yaooqinn commented on code in PR #47171: URL: https://github.com/apache/spark/pull/47171#discussion_r1663441737 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -42,6 +42,18 @@ package object config { private[spark] val SPARK_TASK_PREFIX = "spark.t

Re: [PR] [SPARK-48760][SQL] Introduce ALTER TABLE ... CLUSTER BY SQL syntax to change clustering columns [spark]

2024-07-02 Thread via GitHub
cloud-fan closed pull request #47156: [SPARK-48760][SQL] Introduce ALTER TABLE ... CLUSTER BY SQL syntax to change clustering columns URL: https://github.com/apache/spark/pull/47156

Re: [PR] [SPARK-48774][SQL] Use SparkSession in SQLImplicits [spark]

2024-07-02 Thread via GitHub
HyukjinKwon closed pull request #47173: [SPARK-48774][SQL] Use SparkSession in SQLImplicits URL: https://github.com/apache/spark/pull/47173

Re: [PR] [SPARK-48774][SQL] Use SparkSession in SQLImplicits [spark]

2024-07-02 Thread via GitHub
HyukjinKwon commented on PR #47173: URL: https://github.com/apache/spark/pull/47173#issuecomment-2205012789 Merged to master.

Re: [PR] [SPARK-48787][BUILD] Upgrade Kafka to 3.7.1 [spark]

2024-07-02 Thread via GitHub
panbingkun commented on PR #47191: URL: https://github.com/apache/spark/pull/47191#issuecomment-2205008709 cc @dongjoon-hyun

Re: [PR] [SPARK-48760][SQL] Introduce ALTER TABLE ... CLUSTER BY SQL syntax to change clustering columns [spark]

2024-07-02 Thread via GitHub
cloud-fan commented on PR #47156: URL: https://github.com/apache/spark/pull/47156#issuecomment-2205008789 thanks, merging to master!

Re: [PR] [MINOR][TESTS] Replace `getResource` with `getWorkspaceFilePath` to enable `HiveUDFSuite` to run successfully in the IDE [spark]

2024-07-02 Thread via GitHub
panbingkun commented on PR #47194: URL: https://github.com/apache/spark/pull/47194#issuecomment-2205000451 cc @HyukjinKwon @LuciferYang

Re: [PR] [MINOR][TESTS] Replace `getResource` with `getWorkspaceFilePath` to enable `HiveUDFSuite` to run successfully in the IDE [spark]

2024-07-02 Thread via GitHub
panbingkun commented on PR #47194: URL: https://github.com/apache/spark/pull/47194#issuecomment-2205000216 Error detail: ``` Cannot invoke "java.net.URL.getFile()" because the return value of "java.lang.ClassLoader.getResource(String)" is null java.lang.NullPointerException: Cannot

[PR] [MINOR][TESTS] Replace `getResource` with `getWorkspaceFilePath` to enable `HiveUDFSuite` to run successfully in the IDE [spark]

2024-07-02 Thread via GitHub
panbingkun opened a new pull request, #47194: URL: https://github.com/apache/spark/pull/47194 ### What changes were proposed in this pull request? The pr aims to replace `getResource` with `getWorkspaceFilePath` to enable `HiveUDFSuite` to run successfully in the IDE. ### Why are t

Re: [PR] [SPARK-39901][CORE][SQL] Redesign `ignoreCorruptFiles` to make it more accurate by adding a new config `spark.files.ignoreCorruptFiles.errorClasses` [spark]

2024-07-02 Thread via GitHub
LuciferYang commented on PR #47090: URL: https://github.com/apache/spark/pull/47090#issuecomment-2204997073 cc @cloud-fan

Re: [PR] [Only Test] Make HiveGenericUDF's DeferredObject lazy [spark]

2024-07-02 Thread via GitHub
LuciferYang commented on code in PR #47193: URL: https://github.com/apache/spark/pull/47193#discussion_r1663418649 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFEvaluators.scala: ## @@ -129,7 +129,7 @@ class HiveGenericUDFEvaluator( override def returnType: Data

Re: [PR] [SPARK-42051][SQL] Codegen Support for HiveGenericUDF [spark]

2024-07-02 Thread via GitHub
yaooqinn commented on code in PR #39555: URL: https://github.com/apache/spark/pull/39555#discussion_r1663417611 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala: ## @@ -120,19 +121,18 @@ private[hive] class DeferredObjectAdapter(oi: ObjectInspector, dataType:

Re: [PR] [SPARK-48784][SQL] Add ::: syntax as a shorthand for try_cast [spark]

2024-07-02 Thread via GitHub
yaooqinn commented on PR #47186: URL: https://github.com/apache/spark/pull/47186#issuecomment-2204987052 It looks very uncommon and incurs a cognitive cost for users. Also cc @dongjoon-hyun

Re: [PR] [SPARK-42051][SQL] Codegen Support for HiveGenericUDF [spark]

2024-07-02 Thread via GitHub
panbingkun commented on code in PR #39555: URL: https://github.com/apache/spark/pull/39555#discussion_r1663411103 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala: ## @@ -120,19 +121,18 @@ private[hive] class DeferredObjectAdapter(oi: ObjectInspector, dataTyp

[PR] [Only Test] Make HiveGenericUDF lazy [spark]

2024-07-02 Thread via GitHub
panbingkun opened a new pull request, #47193: URL: https://github.com/apache/spark/pull/47193 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How

Re: [PR] [SPARK-48307][SQL][FOLLOWUP] Eliminate the use of mutable.ArrayBuffer [spark]

2024-07-02 Thread via GitHub
yaooqinn commented on code in PR #47185: URL: https://github.com/apache/spark/pull/47185#discussion_r1663401781 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala: ## @@ -140,26 +140,31 @@ case class InlineCTE( cteMap: mutable.Map[Long,

Re: [PR] [SPARK-48307][SQL][FOLLOWUP] Eliminate the use of mutable.ArrayBuffer [spark]

2024-07-02 Thread via GitHub
cloud-fan commented on code in PR #47185: URL: https://github.com/apache/spark/pull/47185#discussion_r1663398411 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala: ## @@ -140,26 +140,31 @@ case class InlineCTE( cteMap: mutable.Map[Long

Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]

2024-07-02 Thread via GitHub
AngersZh commented on PR #44767: URL: https://github.com/apache/spark/pull/44767#issuecomment-2204898273 > what was behaviour before? Would be great to show the result before/after For the query in `cache.sql` ``` EXPLAIN EXTENDED SELECT * FROM cache_nested_cte_table ```

Re: [PR] [SPARK-48720][SQL] Clarify the command `ALTER TABLE ... UNSET TBLPROPERTIES ...` diff between v1 and v2 in doc [spark]

2024-07-02 Thread via GitHub
panbingkun commented on code in PR #47097: URL: https://github.com/apache/spark/pull/47097#discussion_r1663363438 ## docs/sql-ref-syntax-ddl-alter-table.md: ## @@ -236,20 +236,30 @@ ALTER TABLE table_identifier DROP [ IF EXISTS ] partition_spec [PURGE] ### SET AND UNSET

[PR] [SPARK-48628] Add task peak on/off heap memory metrics [spark]

2024-07-02 Thread via GitHub
liuzqt opened a new pull request, #47192: URL: https://github.com/apache/spark/pull/47192 ### What changes were proposed in this pull request? Add task on/off heap execution memory in `TaskMetrics`, tracked in `TaskMemoryManager`. ### Why are the changes needed?

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
anishshri-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663358230 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SchemaHelper.scala: ## @@ -28,6 +33,61 @@ import org.apache.spark.util.Utils /** * Hel

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
anishshri-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663357917 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala: ## @@ -340,11 +370,48 @@ case class TransformWithStateExec(

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
anishshri-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663357402 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala: ## @@ -92,6 +93,35 @@ case class TransformWithStateExec( }

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
anishshri-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663356046 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateSchemaV3File.scala: ## @@ -0,0 +1,117 @@ +/* + * Licensed to the Apache Software Fou

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
anishshri-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663348445 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala: ## @@ -187,23 +187,33 @@ class IncrementalExecution( } }

Re: [PR] [SPARK-48785][DOCS] Add a simple Python data source example in the user guide [spark]

2024-07-02 Thread via GitHub
HyukjinKwon closed pull request #47187: [SPARK-48785][DOCS] Add a simple Python data source example in the user guide URL: https://github.com/apache/spark/pull/47187

Re: [PR] [SPARK-48785][DOCS] Add a simple Python data source example in the user guide [spark]

2024-07-02 Thread via GitHub
HyukjinKwon commented on PR #47187: URL: https://github.com/apache/spark/pull/47187#issuecomment-2204817803 Merged to master.

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
anishshri-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663347349 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala: ## @@ -189,7 +189,9 @@ trait FlatMapGroupsWithStateExecBase

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
anishshri-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663344634 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ColumnFamilySchemaFactory.scala: ## @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Fo

Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]

2024-07-02 Thread via GitHub
HyukjinKwon commented on PR #44767: URL: https://github.com/apache/spark/pull/44767#issuecomment-2204787858 what was behaviour before? Would be great to show the result before/after

Re: [PR] [SPARK-48510][2/2] Support UDAF `toColumn` API in Spark Connect [spark]

2024-07-02 Thread via GitHub
HyukjinKwon commented on code in PR #46849: URL: https://github.com/apache/spark/pull/46849#discussion_r1663332341 ## connector/connect/common/src/main/protobuf/spark/connect/expressions.proto: ## @@ -401,6 +402,11 @@ message JavaUDF { bool aggregate = 3; } +message TypedA

Re: [PR] [SPARK-44167][SQL] Add Catalog APIs for loading stored procedures [spark]

2024-07-02 Thread via GitHub
HyukjinKwon commented on PR #47190: URL: https://github.com/apache/spark/pull/47190#issuecomment-2204778320 cc @cloud-fan and @allisonwang-db is this related to something you're working on?

Re: [PR] [SPARK-48714][PYTHON] Implement `DataFrame.mergeInto` in PySpark [spark]

2024-07-02 Thread via GitHub
HyukjinKwon closed pull request #47086: [SPARK-48714][PYTHON] Implement `DataFrame.mergeInto` in PySpark URL: https://github.com/apache/spark/pull/47086

Re: [PR] [SPARK-48714][PYTHON] Implement `DataFrame.mergeInto` in PySpark [spark]

2024-07-02 Thread via GitHub
HyukjinKwon commented on PR #47086: URL: https://github.com/apache/spark/pull/47086#issuecomment-2204769072 Merged to master.

Re: [PR] [SPARK-48777][BUILD] Migrate build system to Bazel [spark-connect-go]

2024-07-02 Thread via GitHub
HyukjinKwon closed pull request #23: [SPARK-48777][BUILD] Migrate build system to Bazel URL: https://github.com/apache/spark-connect-go/pull/23

Re: [PR] [SPARK-48777][BUILD] Migrate build system to Bazel [spark-connect-go]

2024-07-02 Thread via GitHub
HyukjinKwon commented on PR #23: URL: https://github.com/apache/spark-connect-go/pull/23#issuecomment-2204767584 Merged to master

Re: [PR] [SPARK-48710][PYTHON] Use NumPy 2.0 compatible types [spark]

2024-07-02 Thread via GitHub
HyukjinKwon closed pull request #47083: [SPARK-48710][PYTHON] Use NumPy 2.0 compatible types URL: https://github.com/apache/spark/pull/47083

Re: [PR] [SPARK-48710][PYTHON][3.5] Limit NumPy version to supported range (>=1.15,<2) [spark]

2024-07-02 Thread via GitHub
HyukjinKwon closed pull request #47175: [SPARK-48710][PYTHON][3.5] Limit NumPy version to supported range (>=1.15,<2) URL: https://github.com/apache/spark/pull/47175

Re: [PR] [SPARK-48710][PYTHON][3.5] Limit NumPy version to supported range (>=1.15,<2) [spark]

2024-07-02 Thread via GitHub
HyukjinKwon commented on PR #47175: URL: https://github.com/apache/spark/pull/47175#issuecomment-2204766403 Merged to `branch-3.5` and `branch-3.4`.

Re: [PR] [SPARK-48710][PYTHON] Use NumPy 2.0 compatible types [spark]

2024-07-02 Thread via GitHub
HyukjinKwon commented on PR #47083: URL: https://github.com/apache/spark/pull/47083#issuecomment-2204766546 Merged to master.

Re: [PR] [SPARK-48710][PYTHON][3.5] Limit NumPy version to supported range (>=1.15,<2) [spark]

2024-07-02 Thread via GitHub
HyukjinKwon commented on PR #47175: URL: https://github.com/apache/spark/pull/47175#issuecomment-2204765055 > Do we need this in branch-3.4 too? I will backport this to `branch-3.4`

[PR] [SPARK-48787][BUILD] Upgrade Kafka to 3.7.1 [spark]

2024-07-02 Thread via GitHub
panbingkun opened a new pull request, #47191: URL: https://github.com/apache/spark/pull/47191 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How

Re: [PR] Op state metadata [spark]

2024-07-02 Thread via GitHub
ericm-db closed pull request #47109: Op state metadata URL: https://github.com/apache/spark/pull/47109

Re: [PR] [SPARK-48720][SQL] Clarify the command `ALTER TABLE ... UNSET TBLPROPERTIES ...` diff between v1 and v2 in doc [spark]

2024-07-02 Thread via GitHub
panbingkun commented on code in PR #47097: URL: https://github.com/apache/spark/pull/47097#discussion_r1663267483 ## docs/sql-ref-syntax-ddl-alter-table.md: ## @@ -236,20 +236,30 @@ ALTER TABLE table_identifier DROP [ IF EXISTS ] partition_spec [PURGE] ### SET AND UNSET

Re: [PR] [SPARK-44167][SQL] Add Catalog APIs for loading stored procedures [spark]

2024-07-02 Thread via GitHub
aokolnychyi commented on code in PR #47190: URL: https://github.com/apache/spark/pull/47190#discussion_r1663250258 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Procedure.java: ## @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) un

Re: [PR] [SPARK-44167][SQL] Add Catalog APIs for loading stored procedures [spark]

2024-07-02 Thread via GitHub
aokolnychyi commented on code in PR #47190: URL: https://github.com/apache/spark/pull/47190#discussion_r1663248731 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ProcedureParameterImpl.java: ## @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Founda

[PR] [SPARK-48786] Define Conf Properties for Spark Operator Controller [spark-kubernetes-operator]

2024-07-02 Thread via GitHub
jiangzho opened a new pull request, #17: URL: https://github.com/apache/spark-kubernetes-operator/pull/17 ### What changes were proposed in this pull request? This is a breakdown PR of #12 - introducing config properties for controller module. Also, the config package incl

Re: [PR] [SPARK-44167][SQL] Add Catalog APIs for loading stored procedures [spark]

2024-07-02 Thread via GitHub
aokolnychyi commented on code in PR #47190: URL: https://github.com/apache/spark/pull/47190#discussion_r1663231799 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ProcedureParameter.java: ## @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation

Re: [PR] [SPARK-48780][SQL] Make errors in NamedParametersSupport generic to handle functions and procedures [spark]

2024-07-02 Thread via GitHub
aokolnychyi commented on PR #47189: URL: https://github.com/apache/spark/pull/47189#issuecomment-2204564355 cc @cloud-fan @viirya @dongjoon-hyun @huaxingao @sunchao @HyukjinKwon @gengliangwang

Re: [PR] [SPARK-44167][SQL] Add Catalog APIs for loading stored procedures [spark]

2024-07-02 Thread via GitHub
aokolnychyi commented on PR #47190: URL: https://github.com/apache/spark/pull/47190#issuecomment-2204563955 cc @cloud-fan @viirya @dongjoon-hyun @huaxingao @sunchao @HyukjinKwon @gengliangwang

[PR] [SPARK-44167][SQL] Add Catalog APIs for loading stored procedures [spark]

2024-07-02 Thread via GitHub
aokolnychyi opened a new pull request, #47190: URL: https://github.com/apache/spark/pull/47190 ### What changes were proposed in this pull request? This PR contains new connector APIs for loading stored procedures per [discussed and voted](https://lists.apache.org/thread/

Re: [PR] [SPARK-48710][PYTHON] Use NumPy 2.0 compatible types [spark]

2024-07-02 Thread via GitHub
jakirkham commented on PR #47083: URL: https://github.com/apache/spark/pull/47083#issuecomment-2204550456 It should be possible to write code that is compatible with NumPy 1 & 2. That is what most projects are doing Would look over [the migration guide]( https://numpy.org/devdocs/num
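For reference, a few of the breaking renames called out in the NumPy 2.0 migration guide, written with spellings that work on both 1.x and 2.x (a sketch of the commonly hit items, not the full list):

```python
import numpy as np

# Aliases removed in NumPy 2.0 and their version-portable replacements:
#   np.float_ -> np.float64
#   np.NaN    -> np.nan
#   np.Inf    -> np.inf
x = np.array([1.0, np.nan, np.inf], dtype=np.float64)

# NEP 51 changed the repr of NumPy scalars in 2.0 (e.g. "np.float64(1.0)").
# Converting via .item() yields a plain Python float whose string form is
# stable across both major versions.
s = str(x[0].item())
```

Code written this way runs unchanged against the pinned 1.x range in `branch-3.5` and against NumPy 2.0 on master.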

Re: [PR] [SPARK-48177][BUILD] Upgrade `Apache Parquet` to 1.14.1 [spark]

2024-07-02 Thread via GitHub
sunchao commented on PR #46447: URL: https://github.com/apache/spark/pull/46447#issuecomment-2204544790 The benchmark results look OK to me as well - there is no big deviation from the previous result. Thanks @Fokko for the PR!

Re: [PR] [SPARK-48177][BUILD] Upgrade `Apache Parquet` to 1.14.1 [spark]

2024-07-02 Thread via GitHub
sunchao commented on code in PR #46447: URL: https://github.com/apache/spark/pull/46447#discussion_r1663221944 ## sql/core/benchmarks/DataSourceReadBenchmark-results.txt: ## @@ -1,431 +1,431 @@ -

Re: [PR] [SPARK-48780][SQL] Make errors in NamedParametersSupport generic to handle functions and procedures [spark]

2024-07-02 Thread via GitHub
aokolnychyi commented on code in PR #47189: URL: https://github.com/apache/spark/pull/47189#discussion_r1663218593 ## common/utils/src/main/resources/error/error-conditions.json: ## @@ -3578,7 +3578,7 @@ }, "REQUIRED_PARAMETER_NOT_FOUND" : { "message" : [ - "Cann

[PR] [SPARK-48780][SQL] Make errors in NamedParametersSupport generic to handle functions and procedures [spark]

2024-07-02 Thread via GitHub
aokolnychyi opened a new pull request, #47189: URL: https://github.com/apache/spark/pull/47189 ### What changes were proposed in this pull request? This PR makes errors in `NamedParametersSupport` generic so that we can reuse that class to handle argument rearrangement bot

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
ericm-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663216919 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulProcessorHandleImpl.scala: ## @@ -232,14 +209,14 @@ class StatefulProcessorHandleImpl(

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
anishshri-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663216827 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateSchemaV3File.scala: ## @@ -0,0 +1,112 @@ +/* + * Licensed to the Apache Software Fou

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
anishshri-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663215623 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SchemaHelper.scala: ## @@ -28,6 +33,61 @@ import org.apache.spark.util.Utils /** * Hel

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
anishshri-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663213134 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala: ## @@ -340,11 +370,54 @@ case class TransformWithStateExec(

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
anishshri-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663212263 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala: ## @@ -246,7 +246,12 @@ case class StreamingSymmetricHash

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
anishshri-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663211544 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulProcessorHandleImpl.scala: ## @@ -232,14 +209,14 @@ class StatefulProcessorHandleImpl(

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
anishshri-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663210224 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala: ## @@ -187,23 +187,33 @@ class IncrementalExecution( } }

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
anishshri-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663208852 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala: ## @@ -325,6 +338,23 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](s

Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

2024-07-02 Thread via GitHub
anishshri-db commented on code in PR #47104: URL: https://github.com/apache/spark/pull/47104#discussion_r1663207877 ## common/utils/src/main/resources/error/error-conditions.json: ## @@ -3803,6 +3803,12 @@ ], "sqlState" : "XXKST" }, + "STATE_STORE_NEW_COLUMN_FAMILY

Re: [PR] [SPARK-48493][PYTHON] Enhance Python Datasource Reader with direct Arrow Batch support for improved performance [spark]

2024-07-02 Thread via GitHub
allisonwang-db commented on code in PR #46826: URL: https://github.com/apache/spark/pull/46826#discussion_r1663198085 ## python/pyspark/sql/datasource.py: ## @@ -332,8 +331,13 @@ def partitions(self) -> Sequence[InputPartition]: message_parameters={"feature": "parti

[PR] [SPARK-48772][SS][SQL] State Source Change Feed Reader Mode [spark]

2024-07-02 Thread via GitHub
eason-yuchen-liu opened a new pull request, #47188: URL: https://github.com/apache/spark/pull/47188 ### What changes were proposed in this pull request? This PR adds the ability to expose the evolution of state in Change Data Capture (CDC) format through the state data source. An e
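To illustrate what a change-feed (CDC) view of state buys over a snapshot read, here is a small self-contained sketch; the record shape below is hypothetical and is not the PR's actual output schema:

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical CDC record: (batch_id, change_type, key, value).
# change_type is "update" or "delete"; value is None for deletes.
ChangeRecord = Tuple[int, str, str, Optional[int]]


def replay_changes(changes: List[ChangeRecord]) -> Dict[str, int]:
    """Fold a change feed into the final state snapshot.

    Applying every change in batch order reconstructs the state a
    point-in-time read would return, while the feed itself additionally
    records *when* each key changed.
    """
    state: Dict[str, int] = {}
    for _batch_id, change_type, key, value in sorted(changes):
        if change_type == "delete":
            state.pop(key, None)
        else:
            state[key] = value
    return state
```

A change-feed reader is therefore strictly more informative than a snapshot reader: the snapshot is recoverable from the feed, but not vice versa.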

Re: [PR] [DRAFT] Virtual Column Family for RocksDB - StateStore API change [spark]

2024-07-02 Thread via GitHub
jingz-db closed pull request #47169: [DRAFT] Virtual Column Family for RocksDB - StateStore API change URL: https://github.com/apache/spark/pull/47169

Re: [PR] [SPARK-48783][DOCS] Update the table-valued function docs [spark]

2024-07-02 Thread via GitHub
allisonwang-db commented on PR #47184: URL: https://github.com/apache/spark/pull/47184#issuecomment-2204413706 cc @srielau @cloud-fan

[PR] [SPARK-48785][DOCS] Add a simple Python data source example in the user guide [spark]

2024-07-02 Thread via GitHub
allisonwang-db opened a new pull request, #47187: URL: https://github.com/apache/spark/pull/47187 ### What changes were proposed in this pull request? This PR adds a self-contained, simple example implementation of a Python data source in the user guide to help users get start

Re: [PR] [SPARK-48770][SS] Change to read operator metadata once on driver to check if we can find info for numColsPrefixKey used for session window agg queries [spark]

2024-07-02 Thread via GitHub
HeartSaVioR closed pull request #47167: [SPARK-48770][SS] Change to read operator metadata once on driver to check if we can find info for numColsPrefixKey used for session window agg queries URL: https://github.com/apache/spark/pull/47167

Re: [PR] [SPARK-48770][SS] Change to read operator metadata once on driver to check if we can find info for numColsPrefixKey used for session window agg queries [spark]

2024-07-02 Thread via GitHub
HeartSaVioR commented on PR #47167: URL: https://github.com/apache/spark/pull/47167#issuecomment-2204327011 Thanks! Merging to master.

Re: [PR] [SPARK-48773] Document config "spark.default.parallelism" by config builder framework [spark]

2024-07-02 Thread via GitHub
amaliujia commented on code in PR #47171: URL: https://github.com/apache/spark/pull/47171#discussion_r1663120895 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -42,6 +42,17 @@ package object config { private[spark] val SPARK_TASK_PREFIX = "spark.

Re: [PR] [SPARK-48589][SQL][SS] Add option snapshotStartBatchId and snapshotPartitionId to state data source [spark]

2024-07-02 Thread via GitHub
HeartSaVioR closed pull request #46944: [SPARK-48589][SQL][SS] Add option snapshotStartBatchId and snapshotPartitionId to state data source URL: https://github.com/apache/spark/pull/46944

Re: [PR] [SPARK-48589][SQL][SS] Add option snapshotStartBatchId and snapshotPartitionId to state data source [spark]

2024-07-02 Thread via GitHub
HeartSaVioR commented on PR #46944: URL: https://github.com/apache/spark/pull/46944#issuecomment-2204313492 Thanks! Merging to master.

[PR] [SPARK-48784][SQL] Add ::: syntax as a shorthand for try_cast [spark]

2024-07-02 Thread via GitHub
gene-db opened a new pull request, #47186: URL: https://github.com/apache/spark/pull/47186 ### What changes were proposed in this pull request? Add the `:::` (triple colon) as syntactic sugar for `try_cast`. ### Why are the changes needed? This syntactic sugar makes it ea
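For readers unfamiliar with `try_cast`: unlike `CAST` under ANSI mode, it returns NULL rather than raising an error when the conversion fails, which is the behavior the `:::` sugar would abbreviate. A rough Python sketch of that semantic (not Spark's implementation):

```python
from typing import Optional


def try_cast_int(value: str) -> Optional[int]:
    """Mimic TRY_CAST(value AS INT): None on failure instead of an error.

    Spark's CAST under ANSI mode throws on malformed input; try_cast
    swallows the failure and yields NULL, modeled here as None.
    """
    try:
        return int(value.strip())
    except (AttributeError, ValueError):
        return None
```

Under this semantic, an expression like `'42' ::: INT` would evaluate to 42, while `'abc' ::: INT` would quietly yield NULL.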
