[jira] [Assigned] (SPARK-45938) Add `utils` to the dependency list of the `core` module in `module.py`
[ https://issues.apache.org/jira/browse/SPARK-45938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Jie reassigned SPARK-45938:
--------------------------------

    Assignee: Yang Jie

> Add `utils` to the dependency list of the `core` module in `module.py`
> -----------------------------------------------------------------------
>
> Key: SPARK-45938
> URL: https://issues.apache.org/jira/browse/SPARK-45938
> Project: Spark
> Issue Type: Improvement
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Major
> Labels: pull-request-available
[jira] [Resolved] (SPARK-45938) Add `utils` to the dependency list of the `core` module in `module.py`
[ https://issues.apache.org/jira/browse/SPARK-45938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Jie resolved SPARK-45938.
------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 43818
[https://github.com/apache/spark/pull/43818]

> Add `utils` to the dependency list of the `core` module in `module.py`
> -----------------------------------------------------------------------
>
> Key: SPARK-45938
> URL: https://issues.apache.org/jira/browse/SPARK-45938
> Project: Spark
> Issue Type: Improvement
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-32246) Have a way to optionally run streaming-kinesis-asl
[ https://issues.apache.org/jira/browse/SPARK-32246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-32246:
-----------------------------------
    Labels: pull-request-available  (was: )

> Have a way to optionally run streaming-kinesis-asl
> --------------------------------------------------
>
> Key: SPARK-32246
> URL: https://issues.apache.org/jira/browse/SPARK-32246
> Project: Spark
> Issue Type: Bug
> Components: Project Infra
> Affects Versions: 2.4.6, 3.0.0, 3.1.0
> Reporter: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
>
> See https://github.com/HyukjinKwon/spark/pull/4. The Kinesis tests depend
> on the external Amazon Kinesis service, so we should have a way to run them
> optionally. Currently they are not run in GitHub Actions.
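For context on what "optionally" means here: Spark's Kinesis suites already gate themselves on an environment variable (to the best of my recollection of KinesisTestUtils it is `ENABLE_KINESIS_TESTS`; treat the exact name and value as assumptions, not something confirmed by this thread). A minimal sketch of that opt-in guard pattern:

{code:scala}
import org.scalatest.funsuite.AnyFunSuite

// Hedged sketch of the opt-in guard pattern for suites that need an external
// service; the ENABLE_KINESIS_TESTS variable name is an assumption.
abstract class OptionalKinesisSuite extends AnyFunSuite {
  private val enabled: Boolean =
    sys.env.get("ENABLE_KINESIS_TESTS").contains("1")

  // Run the test only when explicitly enabled; otherwise register it as ignored,
  // so a CI job opts in simply by exporting the variable.
  def testIfEnabled(name: String)(body: => Unit): Unit =
    if (enabled) test(name)(body) else ignore(name)(body)
}
{code}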
[jira] [Updated] (SPARK-45948) Make single-pod spark jobs respect spark.app.id
[ https://issues.apache.org/jira/browse/SPARK-45948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45948:
-----------------------------------
    Labels: pull-request-available  (was: )

> Make single-pod spark jobs respect spark.app.id
> -----------------------------------------------
>
> Key: SPARK-45948
> URL: https://issues.apache.org/jira/browse/SPARK-45948
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-45948) Make single-pod spark jobs respect spark.app.id
Dongjoon Hyun created SPARK-45948:
----------------------------------

    Summary: Make single-pod spark jobs respect spark.app.id
    Key: SPARK-45948
    URL: https://issues.apache.org/jira/browse/SPARK-45948
    Project: Spark
    Issue Type: Sub-task
    Components: Kubernetes
    Affects Versions: 4.0.0
    Reporter: Dongjoon Hyun
[jira] [Commented] (SPARK-45946) Fix use of deprecated FileUtils write in RocksDBSuite
[ https://issues.apache.org/jira/browse/SPARK-45946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786629#comment-17786629 ]

Anish Shrigondekar commented on SPARK-45946:
--------------------------------------------

[~kabhwan] - PR here - [GitHub Pull Request #43832|https://github.com/apache/spark/pull/43832]
PTAL, thx

> Fix use of deprecated FileUtils write in RocksDBSuite
> -----------------------------------------------------
>
> Key: SPARK-45946
> URL: https://issues.apache.org/jira/browse/SPARK-45946
> Project: Spark
> Issue Type: Task
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Anish Shrigondekar
> Priority: Major
> Labels: pull-request-available
>
> Fix use of deprecated FileUtils write in RocksDBSuite
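The ticket and PR link do not spell out the replacement, but the deprecation in question is almost certainly commons-io's charset-less `FileUtils.write(File, CharSequence)`, which has been deprecated in favor of the overload taking an explicit `java.nio.charset.Charset`. A minimal sketch of the before/after shape (file name and contents are made up for illustration):

{code:scala}
import java.io.File
import java.nio.charset.StandardCharsets.UTF_8

import org.apache.commons.io.FileUtils

object FileUtilsWriteExample {
  def main(args: Array[String]): Unit = {
    val target = new File(sys.props("java.io.tmpdir"), "rocksdb-suite-example.txt")
    // Deprecated: FileUtils.write(target, "contents") relies on the platform
    // default charset. The non-deprecated overload names the charset explicitly:
    FileUtils.write(target, "contents", UTF_8)
  }
}
{code}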
[jira] [Updated] (SPARK-45947) Set a human readable description for Dataset api
[ https://issues.apache.org/jira/browse/SPARK-45947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45947:
--------------------------------
    Description: 
We should pass the view name to sparkSession.sparkContext.setJobDescription("xxx")
!screenshot-1.png!

  was:
Need to call sparkSession.sparkContext.setJobDescription("xxx")
!screenshot-1.png!

> Set a human readable description for Dataset api
> ------------------------------------------------
>
> Key: SPARK-45947
> URL: https://issues.apache.org/jira/browse/SPARK-45947
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Priority: Major
> Attachments: screenshot-1.png
>
> We should pass the view name to sparkSession.sparkContext.setJobDescription("xxx")
> !screenshot-1.png!
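For readers unfamiliar with the API the ticket refers to: `SparkContext.setJobDescription` sets a thread-local property that the Spark UI displays for every job triggered afterwards on that thread. A minimal usage sketch (the description string and row count are placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession

object JobDescriptionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    // Every job triggered after this call shows the given string in the UI
    // instead of an opaque call site.
    spark.sparkContext.setJobDescription("materialize view v_orders")
    spark.range(1000).count()
    spark.sparkContext.setJobDescription(null) // clear the thread-local label
    spark.stop()
  }
}
{code}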
[jira] [Updated] (SPARK-45947) Set a human readable description for Dataset api
[ https://issues.apache.org/jira/browse/SPARK-45947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45947:
--------------------------------
    Description: 
Need to call sparkSession.sparkContext.setJobDescription("xxx")
!screenshot-1.png!

> Set a human readable description for Dataset api
> ------------------------------------------------
>
> Key: SPARK-45947
> URL: https://issues.apache.org/jira/browse/SPARK-45947
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Priority: Major
> Attachments: screenshot-1.png
>
> Need to call sparkSession.sparkContext.setJobDescription("xxx")
> !screenshot-1.png!
[jira] [Created] (SPARK-45947) Set a human readable description for Dataset api
Yuming Wang created SPARK-45947:
--------------------------------

    Summary: Set a human readable description for Dataset api
    Key: SPARK-45947
    URL: https://issues.apache.org/jira/browse/SPARK-45947
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Affects Versions: 4.0.0
    Reporter: Yuming Wang
    Attachments: screenshot-1.png
[jira] [Updated] (SPARK-45947) Set a human readable description for Dataset api
[ https://issues.apache.org/jira/browse/SPARK-45947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45947:
--------------------------------
    Attachment: screenshot-1.png

> Set a human readable description for Dataset api
> ------------------------------------------------
>
> Key: SPARK-45947
> URL: https://issues.apache.org/jira/browse/SPARK-45947
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Priority: Major
> Attachments: screenshot-1.png
[jira] [Updated] (SPARK-45946) Fix use of deprecated FileUtils write in RocksDBSuite
[ https://issues.apache.org/jira/browse/SPARK-45946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45946:
-----------------------------------
    Labels: pull-request-available  (was: )

> Fix use of deprecated FileUtils write in RocksDBSuite
> -----------------------------------------------------
>
> Key: SPARK-45946
> URL: https://issues.apache.org/jira/browse/SPARK-45946
> Project: Spark
> Issue Type: Task
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Anish Shrigondekar
> Priority: Major
> Labels: pull-request-available
>
> Fix use of deprecated FileUtils write in RocksDBSuite
[jira] [Created] (SPARK-45946) Fix use of deprecated FileUtils write in RocksDBSuite
Anish Shrigondekar created SPARK-45946:
---------------------------------------

    Summary: Fix use of deprecated FileUtils write in RocksDBSuite
    Key: SPARK-45946
    URL: https://issues.apache.org/jira/browse/SPARK-45946
    Project: Spark
    Issue Type: Task
    Components: Structured Streaming
    Affects Versions: 4.0.0
    Reporter: Anish Shrigondekar

Fix use of deprecated FileUtils write in RocksDBSuite
[jira] [Resolved] (SPARK-33393) Support SHOW TABLE EXTENDED in DSv2
[ https://issues.apache.org/jira/browse/SPARK-33393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-33393.
---------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 37588
[https://github.com/apache/spark/pull/37588]

> Support SHOW TABLE EXTENDED in DSv2
> -----------------------------------
>
> Key: SPARK-33393
> URL: https://issues.apache.org/jira/browse/SPARK-33393
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Max Gekk
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> The current implementation of DSv2 SHOW TABLE doesn't support the EXTENDED mode in:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExec.scala#L33
> which is supported in DSv1:
> https://github.com/apache/spark/blob/7e99fcd64efa425f3c985df4fe957a3be274a49a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L870
> We need to add the same functionality to ShowTablesExec.
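For a sense of the surface area involved: `SHOW TABLE EXTENDED` returns a detailed `information` column per matched table, and with this change the statement works against DSv2 catalogs rather than only the v1 session catalog. A minimal sketch against the default catalog (the table name is a placeholder):

{code:scala}
import org.apache.spark.sql.SparkSession

object ShowTableExtendedExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    spark.sql("CREATE TABLE IF NOT EXISTS t1(id INT) USING parquet")
    // EXTENDED adds a multi-line `information` column (catalog, schema,
    // provider, location, ...) on top of namespace/tableName/isTemporary.
    spark.sql("SHOW TABLE EXTENDED LIKE 't1'").show(truncate = false)
    spark.stop()
  }
}
{code}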
[jira] [Updated] (SPARK-45866) Reuse of exchange in AQE does not happen when run time filters are pushed down to the underlying Scan ( like iceberg )
[ https://issues.apache.org/jira/browse/SPARK-45866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asif updated SPARK-45866:
-------------------------
    Labels: pull-request-available  (was: )

> Reuse of exchange in AQE does not happen when run time filters are pushed
> down to the underlying Scan ( like iceberg )
> --------------------------------------------------------------------------
>
> Key: SPARK-45866
> URL: https://issues.apache.org/jira/browse/SPARK-45866
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.1
> Reporter: Asif
> Priority: Major
> Labels: pull-request-available
>
> In certain types of queries, for example TPCDS Query 14b, the reuse of
> exchange does not happen in AQE, resulting in perf degradation.
> The Spark TPCDS tests are unable to catch the problem because the
> InMemoryScan used for testing does not implement equals & hashCode
> correctly, in the sense that it does not take into account the pushed-down
> runtime filters.
> In concrete Scan implementations, for example iceberg's SparkBatchQueryScan,
> the equality check, apart from other things, also involves the runtime
> filters pushed (which is correct).
> In Spark the issue is this:
> For a given stage being materialized, just before materialization starts,
> the runtime filters are confined to the BatchScanExec level. Only when the
> actual RDD corresponding to the BatchScanExec is being evaluated do the
> runtime filters get pushed to the underlying Scan.
> Now if a new stage is created and it checks the stageCache using its
> canonicalized plan to see if a stage can be reused, it fails to find the
> reusable stage even if the stage exists, because the canonicalized spark
> plan present in the stage cache now has the runtime filters pushed to the
> Scan, so the incoming canonicalized spark plan does not match the key as
> their underlying scans differ: the incoming spark plan's scan does not have
> runtime filters, while the canonicalized spark plan present as key in the
> stage cache has the scan with runtime filters pushed.
> The fix as I have worked it is to provide two methods in the
> SupportsRuntimeV2Filtering interface:
> default boolean equalToIgnoreRuntimeFilters(Scan other) {
>   return this.equals(other);
> }
> default int hashCodeIgnoreRuntimeFilters() {
>   return this.hashCode();
> }
> In BatchScanExec, if the scan implements SupportsRuntimeV2Filtering, then
> instead of batch.equals it should call scan.equalToIgnoreRuntimeFilters,
> and the underlying Scan implementations should provide equality that
> excludes runtime filters.
> Similarly the hashCode of BatchScanExec should use
> scan.hashCodeIgnoreRuntimeFilters instead of batch.hashCode.
> Will be creating a PR with a bug test for review.
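To make the proposal concrete, here is a hedged sketch of the shape described above. `equalToIgnoreRuntimeFilters` and `hashCodeIgnoreRuntimeFilters` are the ticket's proposed additions, not shipped Spark API, so the sketch models them as a standalone trait rather than touching `SupportsRuntimeV2Filtering` itself:

{code:scala}
import org.apache.spark.sql.connector.read.Scan

// Hypothetical stand-in for the default methods the ticket proposes adding
// to SupportsRuntimeV2Filtering (not part of the shipped API).
trait IgnoresRuntimeFilters { self: Scan =>
  def equalToIgnoreRuntimeFilters(other: Scan): Boolean = self == other
  def hashCodeIgnoreRuntimeFilters(): Int = self.hashCode()
}

// How BatchScanExec-style equality would consult it, per the description:
// scans that opt in are compared modulo runtime filters; everything else
// keeps plain equality, so the stage-cache key stays stable either way.
object ScanEquality {
  def scansMatch(left: Scan, right: Scan): Boolean = left match {
    case s: IgnoresRuntimeFilters => s.equalToIgnoreRuntimeFilters(right)
    case _                        => left == right
  }
}
{code}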
[jira] [Resolved] (SPARK-45747) Support session window aggregation in state reader
[ https://issues.apache.org/jira/browse/SPARK-45747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved SPARK-45747.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 43788
[https://github.com/apache/spark/pull/43788]

> Support session window aggregation in state reader
> --------------------------------------------------
>
> Key: SPARK-45747
> URL: https://issues.apache.org/jira/browse/SPARK-45747
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Chaoqin Li
> Assignee: Chaoqin Li
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> We are introducing a state reader in SPARK-45511, but currently the session
> window operator is not supported because the numColPrefixKey is unknown. We
> can read the operator state metadata introduced in SPARK-45558 to determine
> the number of prefix columns and load the state of the session window
> correctly.
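For orientation, the state reader this builds on is exposed as a data source. The sketch below shows the rough usage shape; the `statestore` format name and `path` option follow my reading of SPARK-45511, and the checkpoint location is made up, so treat all the details as assumptions:

{code:scala}
import org.apache.spark.sql.SparkSession

object StateReaderExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    // Point the reader at a streaming query's checkpoint; with this change it
    // can infer the prefix-column count for session-window state from the
    // operator metadata instead of requiring it up front.
    val state = spark.read
      .format("statestore")
      .option("path", "/tmp/checkpoints/session-window-query")
      .load()
    state.printSchema() // key / value struct columns of the state store
  }
}
{code}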
[jira] [Assigned] (SPARK-45747) Support session window aggregation in state reader
[ https://issues.apache.org/jira/browse/SPARK-45747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim reassigned SPARK-45747:
------------------------------------

    Assignee: Chaoqin Li

> Support session window aggregation in state reader
> --------------------------------------------------
>
> Key: SPARK-45747
> URL: https://issues.apache.org/jira/browse/SPARK-45747
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Chaoqin Li
> Assignee: Chaoqin Li
> Priority: Major
> Labels: pull-request-available
>
> We are introducing a state reader in SPARK-45511, but currently the session
> window operator is not supported because the numColPrefixKey is unknown. We
> can read the operator state metadata introduced in SPARK-45558 to determine
> the number of prefix columns and load the state of the session window
> correctly.
[jira] [Comment Edited] (SPARK-45943) DataSourceV2Relation.computeStats throws IllegalStateException in test mode
[ https://issues.apache.org/jira/browse/SPARK-45943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786584#comment-17786584 ]

Zhen Wang edited comment on SPARK-45943 at 11/16/23 3:24 AM:
-------------------------------------------------------------

I encountered the same problem. After debugging, I found that the
RewriteMergeIntoTable rule rewrites MergeIntoTable with ReplaceData, and
there is a HiveTableRelation without tableStats in
ReplaceData.groupFilterCondition. Since DetermineTableStats is applied after
RewriteMergeIntoTable, it does not set tableStats for that HiveTableRelation.

Reproduce:
{code:java}
create table sample.hive_table (id int, name string);

create table iceberg_catalog.sample.iceberg_table (id int, name string) USING iceberg;

MERGE INTO iceberg_catalog.sample.iceberg_table t
USING (SELECT * FROM sample.hive_table) u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.name = u.name
WHEN NOT MATCHED THEN INSERT *;
{code}

error:
{code:java}
ERROR ExecuteStatement: Error operating ExecuteStatement:
java.lang.IllegalStateException: Table stats must be specified.
  at org.apache.spark.sql.catalyst.catalog.HiveTableRelation.$anonfun$computeStats$3(interface.scala:845)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.catalyst.catalog.HiveTableRelation.computeStats(interface.scala:845)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:56)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:28)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:49)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:28)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitUnaryNode(SizeInBytesOnlyStatsPlanVisitor.scala:40)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitFilter(SizeInBytesOnlyStatsPlanVisitor.scala:80)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitFilter(SizeInBytesOnlyStatsPlanVisitor.scala:28)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:28)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitUnaryNode(SizeInBytesOnlyStatsPlanVisitor.scala:40)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitProject(SizeInBytesOnlyStatsPlanVisitor.scala:149)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitProject(SizeInBytesOnlyStatsPlanVisitor.scala:28)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:38)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:28)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(Lo
[jira] [Updated] (SPARK-45945) Add a helper function for `parser`
[ https://issues.apache.org/jira/browse/SPARK-45945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45945:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add a helper function for `parser`
> ----------------------------------
>
> Key: SPARK-45945
> URL: https://issues.apache.org/jira/browse/SPARK-45945
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Minor
> Labels: pull-request-available
[jira] [Created] (SPARK-45945) Add a helper function for `parser`
Ruifeng Zheng created SPARK-45945:
----------------------------------

    Summary: Add a helper function for `parser`
    Key: SPARK-45945
    URL: https://issues.apache.org/jira/browse/SPARK-45945
    Project: Spark
    Issue Type: Improvement
    Components: Connect
    Affects Versions: 4.0.0
    Reporter: Ruifeng Zheng
[jira] [Commented] (SPARK-45943) DataSourceV2Relation.computeStats throws IllegalStateException in test mode
[ https://issues.apache.org/jira/browse/SPARK-45943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786584#comment-17786584 ]

Zhen Wang commented on SPARK-45943:
-----------------------------------

I encountered the same problem. After debugging, I found that the
RewriteMergeIntoTable rule rewrites MergeIntoTable with ReplaceData, and
there is a HiveTableRelation without tableStats in
ReplaceData.groupFilterCondition. Since DetermineTableStats is applied after
RewriteMergeIntoTable, it does not set tableStats for that HiveTableRelation.

Reproduce:
{code:java}
create table sample.hive_table (id int, name string);

create table iceberg_catalog.sample.iceberg_table (id int, name string) USING iceberg;

MERGE INTO iceberg_table t
USING (SELECT * FROM hive_table) u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.name = u.name
WHEN NOT MATCHED THEN INSERT *;
{code}

error:
{code:java}
ERROR ExecuteStatement: Error operating ExecuteStatement:
java.lang.IllegalStateException: Table stats must be specified.
  at org.apache.spark.sql.catalyst.catalog.HiveTableRelation.$anonfun$computeStats$3(interface.scala:845)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.catalyst.catalog.HiveTableRelation.computeStats(interface.scala:845)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:56)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:28)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:49)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:28)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitUnaryNode(SizeInBytesOnlyStatsPlanVisitor.scala:40)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitFilter(SizeInBytesOnlyStatsPlanVisitor.scala:80)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitFilter(SizeInBytesOnlyStatsPlanVisitor.scala:28)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:28)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitUnaryNode(SizeInBytesOnlyStatsPlanVisitor.scala:40)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitProject(SizeInBytesOnlyStatsPlanVisitor.scala:149)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitProject(SizeInBytesOnlyStatsPlanVisitor.scala:28)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:38)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:28)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPl
[jira] [Commented] (SPARK-45861) Add user guide for dataframe creation
[ https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786581#comment-17786581 ]

BingKun Pan commented on SPARK-45861:
-------------------------------------

Okay, I see. Let me try it.

> Add user guide for dataframe creation
> --------------------------------------
>
> Key: SPARK-45861
> URL: https://issues.apache.org/jira/browse/SPARK-45861
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, PySpark
> Affects Versions: 4.0.0
> Reporter: Allison Wang
> Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
> Add a simple user guide for data frame creation.
> This user guide should cover the following APIs:
> # df.createDataFrame
> # spark.read.format(...) (can be csv, json, parquet
[jira] [Resolved] (SPARK-45930) Allow non-deterministic Python UDFs in MapInPandas/MapInArrow
[ https://issues.apache.org/jira/browse/SPARK-45930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-45930.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 43810
[https://github.com/apache/spark/pull/43810]

> Allow non-deterministic Python UDFs in MapInPandas/MapInArrow
> -------------------------------------------------------------
>
> Key: SPARK-45930
> URL: https://issues.apache.org/jira/browse/SPARK-45930
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Currently if a Python udf is non-deterministic, the analyzer will fail with
> this error: [INVALID_NON_DETERMINISTIC_EXPRESSIONS] The operator expects a
> deterministic expression, but the actual expression is "pyUDF()", "a".
> SQLSTATE: 42K0E;
[jira] [Assigned] (SPARK-45930) Allow non-deterministic Python UDFs in MapInPandas/MapInArrow
[ https://issues.apache.org/jira/browse/SPARK-45930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-45930:
------------------------------------

    Assignee: Allison Wang

> Allow non-deterministic Python UDFs in MapInPandas/MapInArrow
> -------------------------------------------------------------
>
> Key: SPARK-45930
> URL: https://issues.apache.org/jira/browse/SPARK-45930
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
> Labels: pull-request-available
>
> Currently if a Python udf is non-deterministic, the analyzer will fail with
> this error: [INVALID_NON_DETERMINISTIC_EXPRESSIONS] The operator expects a
> deterministic expression, but the actual expression is "pyUDF()", "a".
> SQLSTATE: 42K0E;
[jira] [Resolved] (SPARK-45931) Refine docstring of `mapInPandas`
[ https://issues.apache.org/jira/browse/SPARK-45931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-45931.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 43811
[https://github.com/apache/spark/pull/43811]

> Refine docstring of `mapInPandas`
> ---------------------------------
>
> Key: SPARK-45931
> URL: https://issues.apache.org/jira/browse/SPARK-45931
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, PySpark
> Affects Versions: 4.0.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Refine the docstring of the mapInPandas function.
[jira] [Assigned] (SPARK-45931) Refine docstring of `mapInPandas`
[ https://issues.apache.org/jira/browse/SPARK-45931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-45931:
------------------------------------

    Assignee: Allison Wang

> Refine docstring of `mapInPandas`
> ---------------------------------
>
> Key: SPARK-45931
> URL: https://issues.apache.org/jira/browse/SPARK-45931
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, PySpark
> Affects Versions: 4.0.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
> Labels: pull-request-available
>
> Refine the docstring of the mapInPandas function.
[jira] [Assigned] (SPARK-45936) Optimize `Index.symmetric_difference`
[ https://issues.apache.org/jira/browse/SPARK-45936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-45936:
------------------------------------

    Assignee: Ruifeng Zheng

> Optimize `Index.symmetric_difference`
> -------------------------------------
>
> Key: SPARK-45936
> URL: https://issues.apache.org/jira/browse/SPARK-45936
> Project: Spark
> Issue Type: Improvement
> Components: PS
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
[jira] [Resolved] (SPARK-45936) Optimize `Index.symmetric_difference`
[ https://issues.apache.org/jira/browse/SPARK-45936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-45936.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 43816
[https://github.com/apache/spark/pull/43816]

> Optimize `Index.symmetric_difference`
> -------------------------------------
>
> Key: SPARK-45936
> URL: https://issues.apache.org/jira/browse/SPARK-45936
> Project: Spark
> Issue Type: Improvement
> Components: PS
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-45935) Fix RST files link substitutions error
[ https://issues.apache.org/jira/browse/SPARK-45935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BingKun Pan updated SPARK-45935:
--------------------------------
    Affects Version/s: 3.5.0
                       3.4.1
                       3.3.3
                       (was: 3.4.2)
                       (was: 3.5.1)
                       (was: 3.3.4)

> Fix RST files link substitutions error
> --------------------------------------
>
> Key: SPARK-45935
> URL: https://issues.apache.org/jira/browse/SPARK-45935
> Project: Spark
> Issue Type: Bug
> Components: Documentation, PySpark
> Affects Versions: 3.3.3, 3.4.1, 3.5.0, 4.0.0
> Reporter: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (SPARK-45935) Fix RST files link substitutions error
[ https://issues.apache.org/jira/browse/SPARK-45935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BingKun Pan updated SPARK-45935:
--------------------------------
    Affects Version/s: 3.4.2
                       3.5.1
                       3.3.4

> Fix RST files link substitutions error
> --------------------------------------
>
> Key: SPARK-45935
> URL: https://issues.apache.org/jira/browse/SPARK-45935
> Project: Spark
> Issue Type: Bug
> Components: Documentation, PySpark
> Affects Versions: 3.4.2, 4.0.0, 3.5.1, 3.3.4
> Reporter: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (SPARK-44685) Remove deprecated Catalog#createExternalTable
[ https://issues.apache.org/jira/browse/SPARK-44685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-44685:
-----------------------------------
    Labels: pull-request-available release-notes  (was: release-notes)

> Remove deprecated Catalog#createExternalTable
> ---------------------------------------------
>
> Key: SPARK-44685
> URL: https://issues.apache.org/jira/browse/SPARK-44685
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Jia Fan
> Priority: Major
> Labels: pull-request-available, release-notes
>
> We should remove Catalog#createExternalTable because it has been deprecated
> since 2.2.0.
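For anyone tracking the migration: `createExternalTable` was superseded by `Catalog#createTable` back in 2.2.0, so removal mostly means callers switch overloads. A hedged sketch of the replacement call (the path and table name are placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession

object CreateTableExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    // Deprecated since 2.2.0:
    //   spark.catalog.createExternalTable("t_ext", "/data/t", "parquet")
    // Replacement with the same effect (path-based tables stay external):
    spark.catalog.createTable("t_ext", "/data/t", "parquet")
    spark.stop()
  }
}
{code}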
[jira] [Updated] (SPARK-44699) Add logging for complete write events to file in EventLogFileWriter.closeWriter
[ https://issues.apache.org/jira/browse/SPARK-44699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-44699:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add logging for complete write events to file in
> EventLogFileWriter.closeWriter
> --------------------------------------------------------------------------
>
> Key: SPARK-44699
> URL: https://issues.apache.org/jira/browse/SPARK-44699
> Project: Spark
> Issue Type: Task
> Components: Spark Core
> Affects Versions: 3.4.1
> Reporter: shuyouZZ
> Priority: Major
> Labels: pull-request-available
>
> Sometimes we want to know when logging of events to the event log file has
> finished, so we should add a log message to make this clearer.
[jira] [Updated] (SPARK-45944) Leaked file streams in ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-45944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45944:
----------------------------------
    Description: 
- [https://github.com/apache/spark/actions/runs/6859020738/job/18650698085]
- [https://github.com/apache/spark/actions/runs/6872747886/job/18691717269]

{code:java}
Cause: java.lang.IllegalStateException: There are 1 possibly leaked file streams.
29975 [info]   at org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:54)
29976 [info]   at org.apache.spark.sql.test.SharedSparkSessionBase.$anonfun$afterEach$1(SharedSparkSession.scala:165)
...
29977 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.eventually(ParquetFileFormatSuite.scala:31)
29984 [info]   at org.apache.spark.sql.test.SharedSparkSessionBase.afterEach(SharedSparkSession.scala:164)
29985 [info]   at org.apache.spark.sql.test.SharedSparkSessionBase.afterEach$(SharedSparkSession.scala:158)
29986 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.afterEach(ParquetFileFormatSuite.scala:31)
29987 [info]   at org.scalatest.BeforeAndAfterEach.$anonfun$runTest$1(BeforeAndAfterEach.scala:247)
29988 [info]   at ...
30025 [info] Cause: java.lang.Throwable:
30026 [info]   at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35)
30027 [info]   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:75)
30028 [info]   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:997)
30029 [info]   at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
30030 [info]   at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:796)
30031 [info]   at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
30032 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
30033 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:76)
30034 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:450)
{code}

  was:
- https://github.com/apache/spark/actions/runs/6872747886/job/18691717269

{code:java}
Cause: java.lang.IllegalStateException: There are 1 possibly leaked file streams.
29975 [info]   at org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:54)
29976 [info]   at org.apache.spark.sql.test.SharedSparkSessionBase.$anonfun$afterEach$1(SharedSparkSession.scala:165)
...
29977 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.eventually(ParquetFileFormatSuite.scala:31)
29984 [info]   at org.apache.spark.sql.test.SharedSparkSessionBase.afterEach(SharedSparkSession.scala:164)
29985 [info]   at org.apache.spark.sql.test.SharedSparkSessionBase.afterEach$(SharedSparkSession.scala:158)
29986 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.afterEach(ParquetFileFormatSuite.scala:31)
29987 [info]   at org.scalatest.BeforeAndAfterEach.$anonfun$runTest$1(BeforeAndAfterEach.scala:247)
29988 [info]   at ...
30025 [info] Cause: java.lang.Throwable:
30026 [info]   at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35)
30027 [info]   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:75)
30028 [info]   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:997)
30029 [info]   at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
30030 [info]   at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:796)
30031 [info]   at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
30032 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
30033 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:76)
30034 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:450)
{code}

> Leaked file streams in ParquetFileFormat
> ----------------------------------------
>
> Key: SPARK-45944
> URL: https://issues.apache.org/jira/browse/SPARK-45944
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
>
> - [https://github.com/apache/spark/actions/runs/6859020738/job/18650698085]
> - [https://github.com/apache/spark/actions/runs/6872747886/job/18691717269]
> {code:java}
> Cause: java.lang.IllegalStateException: There are 1 possibly leaked file streams.
> 29975 [info]   at org.apache.spark.DebugFilesystem$.assertNoOp
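The bottom frames above point at the footer-reading path. A hedged sketch of the usual remedy for this class of leak: close the reader (and hence its underlying stream) in a finally block so an exception between open and close cannot leave the stream dangling. This is an illustration of the pattern, not the actual fix in the PR:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

object FooterReadingExample {
  // Read a Parquet footer and always release the input stream, even on error.
  def readFooterSafely(conf: Configuration, file: Path) = {
    val reader = ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))
    try reader.getFooter finally reader.close()
  }
}
{code}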
[jira] [Updated] (SPARK-45944) Leaked file streams in ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-45944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45944:
----------------------------------
    Summary: Leaked file streams in ParquetFileFormat  (was: Leaked file streams in ParquetFileFormatV1Suite)

> Leaked file streams in ParquetFileFormat
> ----------------------------------------
>
> Key: SPARK-45944
> URL: https://issues.apache.org/jira/browse/SPARK-45944
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
>
> - https://github.com/apache/spark/actions/runs/6872747886/job/18691717269
> {code:java}
> Cause: java.lang.IllegalStateException: There are 1 possibly leaked file streams.
> 29975 [info]   at org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:54)
> 29976 [info]   at org.apache.spark.sql.test.SharedSparkSessionBase.$anonfun$afterEach$1(SharedSparkSession.scala:165)
> ...
> 29977 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.eventually(ParquetFileFormatSuite.scala:31)
> 29984 [info]   at org.apache.spark.sql.test.SharedSparkSessionBase.afterEach(SharedSparkSession.scala:164)
> 29985 [info]   at org.apache.spark.sql.test.SharedSparkSessionBase.afterEach$(SharedSparkSession.scala:158)
> 29986 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.afterEach(ParquetFileFormatSuite.scala:31)
> 29987 [info]   at org.scalatest.BeforeAndAfterEach.$anonfun$runTest$1(BeforeAndAfterEach.scala:247)
> 29988 [info]   at ...
> 30025 [info] Cause: java.lang.Throwable:
> 30026 [info]   at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35)
> 30027 [info]   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:75)
> 30028 [info]   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:997)
> 30029 [info]   at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
> 30030 [info]   at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:796)
> 30031 [info]   at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
> 30032 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
> 30033 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:76)
> 30034 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:450)
> {code}
[jira] [Created] (SPARK-45944) Leaked file streams in ParquetFileFormatV1Suite
Dongjoon Hyun created SPARK-45944:
----------------------------------

    Summary: Leaked file streams in ParquetFileFormatV1Suite
    Key: SPARK-45944
    URL: https://issues.apache.org/jira/browse/SPARK-45944
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 4.0.0
    Reporter: Dongjoon Hyun

- https://github.com/apache/spark/actions/runs/6872747886/job/18691717269

{code:java}
Cause: java.lang.IllegalStateException: There are 1 possibly leaked file streams.
29975 [info]   at org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:54)
29976 [info]   at org.apache.spark.sql.test.SharedSparkSessionBase.$anonfun$afterEach$1(SharedSparkSession.scala:165)
...
29977 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.eventually(ParquetFileFormatSuite.scala:31)
29984 [info]   at org.apache.spark.sql.test.SharedSparkSessionBase.afterEach(SharedSparkSession.scala:164)
29985 [info]   at org.apache.spark.sql.test.SharedSparkSessionBase.afterEach$(SharedSparkSession.scala:158)
29986 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.afterEach(ParquetFileFormatSuite.scala:31)
29987 [info]   at org.scalatest.BeforeAndAfterEach.$anonfun$runTest$1(BeforeAndAfterEach.scala:247)
29988 [info]   at ...
30025 [info] Cause: java.lang.Throwable:
30026 [info]   at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35)
30027 [info]   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:75)
30028 [info]   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:997)
30029 [info]   at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
30030 [info]   at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:796)
30031 [info]   at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
30032 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
30033 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:76)
30034 [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:450)
{code}
[jira] [Updated] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug
[ https://issues.apache.org/jira/browse/SPARK-45592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45592:
----------------------------------
    Fix Version/s: 3.4.2

> AQE and InMemoryTableScanExec correctness bug
> ---------------------------------------------
>
> Key: SPARK-45592
> URL: https://issues.apache.org/jira/browse/SPARK-45592
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.1, 3.5.0
> Reporter: Emil Ejbyfeldt
> Assignee: Emil Ejbyfeldt
> Priority: Blocker
> Labels: correctness, pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
> The following query should return 100:
> {code:java}
> import org.apache.spark.storage.StorageLevel
>
> val df = spark.range(0, 100, 1, 5).map(l => (l, l))
> val ee = df.select($"_1".as("src"), $"_2".as("dst"))
>   .persist(StorageLevel.MEMORY_AND_DISK)
> ee.count()
> val minNbrs1 = ee
>   .groupBy("src").agg(min(col("dst")).as("min_number"))
>   .persist(StorageLevel.MEMORY_AND_DISK)
> val join = ee.join(minNbrs1, "src")
> join.count(){code}
> but on spark 3.5.0 there is a correctness bug causing it to return `104800`
> or some other incorrect value.
[jira] [Updated] (SPARK-45934) Fix `Spark Standalone` documentation table layout
[ https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45934:
----------------------------------
    Fix Version/s: 3.5.1

> Fix `Spark Standalone` documentation table layout
> -------------------------------------------------
>
> Key: SPARK-45934
> URL: https://issues.apache.org/jira/browse/SPARK-45934
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 3.5.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
[jira] [Updated] (SPARK-44488) Support deserializing long fields into `Metadata` object
[ https://issues.apache.org/jira/browse/SPARK-44488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-44488:
-----------------------------------
    Labels: pull-request-available  (was: )

> Support deserializing long fields into `Metadata` object
> --------------------------------------------------------
>
> Key: SPARK-44488
> URL: https://issues.apache.org/jira/browse/SPARK-44488
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 3.4.1
> Reporter: Richard Chen
> Assignee: Richard Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Created] (SPARK-45943) DataSourceV2Relation.computeStats throws IllegalStateException in test mode
Asif created SPARK-45943:
-------------------------

    Summary: DataSourceV2Relation.computeStats throws IllegalStateException in test mode
    Key: SPARK-45943
    URL: https://issues.apache.org/jira/browse/SPARK-45943
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 3.5.1
    Reporter: Asif

This issue surfaces when the new unit test of the PR for
[SPARK-45866|https://github.com/apache/spark/pull/43824] is added.
[jira] [Resolved] (SPARK-45934) Fix `Spark Standalone` documentation table layout
[ https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-45934.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 43814
[https://github.com/apache/spark/pull/43814]

> Fix `Spark Standalone` documentation table layout
> -------------------------------------------------
>
> Key: SPARK-45934
> URL: https://issues.apache.org/jira/browse/SPARK-45934
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 3.5.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-45719) Upgrade AWS SDK to v2 for Kubernetes integration tests
[ https://issues.apache.org/jira/browse/SPARK-45719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45719:
----------------------------------
    Component/s: Tests
                 (was: Spark Core)

> Upgrade AWS SDK to v2 for Kubernetes integration tests
> ------------------------------------------------------
>
> Key: SPARK-45719
> URL: https://issues.apache.org/jira/browse/SPARK-45719
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes, Tests
> Affects Versions: 4.0.0
> Reporter: Lantao Jin
> Assignee: junyuc25
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Sub-task of [SPARK-44124|https://issues.apache.org/jira/browse/SPARK-44124].
> In this issue, we will upgrade the AWS SDK in credentials providers, AWS
> clients, and the related Kubernetes integration tests.
[jira] [Updated] (SPARK-45719) Upgrade AWS SDK to v2 for Kubernetes integration tests
[ https://issues.apache.org/jira/browse/SPARK-45719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45719:
----------------------------------
    Target Version/s:   (was: 4.0.0)

> Upgrade AWS SDK to v2 for Kubernetes integration tests
> ------------------------------------------------------
>
> Key: SPARK-45719
> URL: https://issues.apache.org/jira/browse/SPARK-45719
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes, Spark Core
> Affects Versions: 4.0.0
> Reporter: Lantao Jin
> Assignee: junyuc25
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Sub-task of [SPARK-44124|https://issues.apache.org/jira/browse/SPARK-44124].
> In this issue, we will upgrade the AWS SDK in credentials providers, AWS
> clients, and the related Kubernetes integration tests.
[jira] [Resolved] (SPARK-45810) Create API to stop consuming rows from the input table
[ https://issues.apache.org/jira/browse/SPARK-45810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin resolved SPARK-45810.
-----------------------------------
      Assignee: Daniel
    Resolution: Fixed

Issue resolved by pull request 43682
https://github.com/apache/spark/pull/43682

> Create API to stop consuming rows from the input table
> ------------------------------------------------------
>
> Key: SPARK-45810
> URL: https://issues.apache.org/jira/browse/SPARK-45810
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Daniel
> Assignee: Daniel
> Priority: Major
> Labels: pull-request-available
[jira] [Assigned] (SPARK-45868) Make spark.table use the same parser with vanilla spark
[ https://issues.apache.org/jira/browse/SPARK-45868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-45868:
-------------------------------------

    Assignee: Ruifeng Zheng

> Make spark.table use the same parser with vanilla spark
> -------------------------------------------------------
>
> Key: SPARK-45868
> URL: https://issues.apache.org/jira/browse/SPARK-45868
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
[jira] [Resolved] (SPARK-45868) Make spark.table use the same parser with vanilla spark
[ https://issues.apache.org/jira/browse/SPARK-45868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-45868.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 43741
[https://github.com/apache/spark/pull/43741]

> Make spark.table use the same parser with vanilla spark
> -------------------------------------------------------
>
> Key: SPARK-45868
> URL: https://issues.apache.org/jira/browse/SPARK-45868
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Closed] (SPARK-45924) Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not equivalent with SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-45924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asif closed SPARK-45924.
------------------------

This is not a bug.

> Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not
> equivalent with SubqueryBroadcastExec
> ----------------------------------------------------------------------
>
> Key: SPARK-45924
> URL: https://issues.apache.org/jira/browse/SPARK-45924
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.1
> Reporter: Asif
> Priority: Major
> Labels: pull-request-available
>
> While writing the bug test for
> [SPARK-45866|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-45866],
> I found that canonicalization of SubqueryAdaptiveBroadcastExec is broken,
> in the sense that its buildPlan: LogicalPlan is not canonicalized, which
> causes batch scans to differ when the reuse-of-exchange check happens in AQE.
> Moreover, SubqueryAdaptiveBroadcastExec and SubqueryBroadcastExec are not
> treated as equivalent, which also keeps reuse of exchange in AQE broken.
[jira] [Closed] (SPARK-45925) SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec causing re-use of exchange not happening in AQE
[ https://issues.apache.org/jira/browse/SPARK-45925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif closed SPARK-45925. This is not an issue. > SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec > causing re-use of exchange not happening in AQE > -- > > Key: SPARK-45925 > URL: https://issues.apache.org/jira/browse/SPARK-45925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > A created stage may contain SubqueryAdaptiveBroadcastExec while the incoming > exchange may contain SubqueryBroadcastExec; although they are equivalent, > the match does not happen because equals/hashCode do not match, so the > exchange is not reused. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45924) Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not equivalent with SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-45924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif resolved SPARK-45924. -- Resolution: Not A Bug > Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not > equivalent with SubqueryBroadcastExec > > > Key: SPARK-45924 > URL: https://issues.apache.org/jira/browse/SPARK-45924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > While writing a bug test for > [SPARK-45866|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-45866], > I found that canonicalization of SubqueryAdaptiveBroadcastExec is broken: > its buildPlan: LogicalPlan is not canonicalized, which causes > batch scans to differ when the reuse-of-exchange check happens in AQE. > Moreover, SubqueryAdaptiveBroadcastExec and > SubqueryBroadcastExec are not treated as equivalent, which further breaks > the reuse of exchanges in AQE. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45925) SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec causing re-use of exchange not happening in AQE
[ https://issues.apache.org/jira/browse/SPARK-45925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif resolved SPARK-45925. -- Resolution: Not A Problem > SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec > causing re-use of exchange not happening in AQE > -- > > Key: SPARK-45925 > URL: https://issues.apache.org/jira/browse/SPARK-45925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > A created stage may contain SubqueryAdaptiveBroadcastExec while the incoming > exchange may contain SubqueryBroadcastExec; although they are equivalent, > the match does not happen because equals/hashCode do not match, so the > exchange is not reused. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
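To make the reuse failure described in SPARK-45924/45925 concrete, here is a minimal, self-contained Python model of the argument. The class and field names mirror the plan nodes discussed above, but this is only an illustration of the canonicalization/equality reasoning, not Spark's actual Scala internals.

{code:python}
from dataclasses import dataclass, replace

# AQE decides to reuse an exchange by comparing canonicalized plans. If a
# nested field (like buildPlan) is left un-canonicalized, two otherwise
# identical subqueries compare unequal and the exchange is not reused.

@dataclass(frozen=True)
class LogicalPlan:
    expr: str
    expr_id: int  # cosmetic; differs between otherwise identical plans

    def canonicalized(self) -> "LogicalPlan":
        return replace(self, expr_id=0)  # erase cosmetic differences

@dataclass(frozen=True)
class SubqueryAdaptiveBroadcast:
    build_plan: LogicalPlan

    def canonicalized_ok(self) -> "SubqueryAdaptiveBroadcast":
        # Correct: canonicalize the nested plan before comparing.
        return replace(self, build_plan=self.build_plan.canonicalized())

    def canonicalized_broken(self) -> "SubqueryAdaptiveBroadcast":
        # The bug described above: build_plan keeps its expression IDs.
        return self

a = SubqueryAdaptiveBroadcast(LogicalPlan("part_col = 1", expr_id=1))
b = SubqueryAdaptiveBroadcast(LogicalPlan("part_col = 1", expr_id=2))

print(a.canonicalized_ok() == b.canonicalized_ok())          # True  -> reuse
print(a.canonicalized_broken() == b.canonicalized_broken())  # False -> no reuse
{code}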
[jira] [Updated] (SPARK-45942) Only do the thread interruption check for putIterator on executors
[ https://issues.apache.org/jira/browse/SPARK-45942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45942: --- Labels: pull-request-available (was: ) > Only do the thread interruption check for putIterator on executors > -- > > Key: SPARK-45942 > URL: https://issues.apache.org/jira/browse/SPARK-45942 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Huanli Wang >Priority: Major > Labels: pull-request-available > > https://issues.apache.org/jira/browse/SPARK-45025 > introduced graceful handling of thread interruption. However, there is an edge > case: when a streaming query is stopped on the driver, it interrupts the > stream execution thread. If the streaming query is doing memory store > operations on the driver and performs {{doPutIterator}} at the same time, the > [unroll process will be > broken|https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L224] > and [used memory will be > returned|https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L245-L247]. > This can result in {{ClosedChannelException}}, as the put falls into this [case > clause|https://github.com/apache/spark/blob/aa646d3050028272f7333deaef52f20e6975e0ed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1614-L1622], > which opens an I/O channel and persists the data to disk. However, > because the thread is interrupted, the channel is closed right at the start: > [https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/nio/channels/spi/AbstractInterruptibleChannel.java#L172] > and {{ClosedChannelException}} is thrown. On executors, [the task will be killed if the thread is > interrupted|https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L374]; > however, we don't do this on the driver. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45942) Only do the thread interruption check for putIterator on executors
Huanli Wang created SPARK-45942: --- Summary: Only do the thread interruption check for putIterator on executors Key: SPARK-45942 URL: https://issues.apache.org/jira/browse/SPARK-45942 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Huanli Wang https://issues.apache.org/jira/browse/SPARK-45025 introduced graceful handling of thread interruption. However, there is an edge case: when a streaming query is stopped on the driver, it interrupts the stream execution thread. If the streaming query is doing memory store operations on the driver and performs {{doPutIterator}} at the same time, the [unroll process will be broken|https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L224] and [used memory will be returned|https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L245-L247]. This can result in {{ClosedChannelException}}, as the put falls into this [case clause|https://github.com/apache/spark/blob/aa646d3050028272f7333deaef52f20e6975e0ed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1614-L1622], which opens an I/O channel and persists the data to disk. However, because the thread is interrupted, the channel is closed right at the start: [https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/nio/channels/spi/AbstractInterruptibleChannel.java#L172] and {{ClosedChannelException}} is thrown. On executors, [the task will be killed if the thread is interrupted|https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L374]; however, we don't do this on the driver. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
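A rough Python analogue of the behavior SPARK-45942 proposes. Spark's actual code is Scala and checks the JVM thread-interrupt flag; the threading.Event below is only a stand-in for that flag, and the function name is illustrative.

{code:python}
import threading

interrupted = threading.Event()  # stand-in for Thread.currentThread().isInterrupted

def should_abort_unroll(is_executor: bool) -> bool:
    # Proposed rule: only executors treat interruption as fatal to the unroll;
    # the driver keeps going, so it never spills through an interruptible
    # channel that would raise ClosedChannelException.
    return is_executor and interrupted.is_set()

interrupted.set()  # simulate the stream-execution thread being interrupted
print(should_abort_unroll(is_executor=False))  # False -> driver continues the put
print(should_abort_unroll(is_executor=True))   # True  -> executor kills the task
{code}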
[jira] [Assigned] (SPARK-45941) Update pandas to 2.1.3
[ https://issues.apache.org/jira/browse/SPARK-45941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45941: - Assignee: Bjørn Jørgensen > Update pandas to 2.1.3 > -- > > Key: SPARK-45941 > URL: https://issues.apache.org/jira/browse/SPARK-45941 > Project: Spark > Issue Type: Dependency upgrade > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > > https://pandas.pydata.org/docs/whatsnew/v2.1.3.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45941) Update pandas to 2.1.3
[ https://issues.apache.org/jira/browse/SPARK-45941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45941. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43822 [https://github.com/apache/spark/pull/43822] > Update pandas to 2.1.3 > -- > > Key: SPARK-45941 > URL: https://issues.apache.org/jira/browse/SPARK-45941 > Project: Spark > Issue Type: Dependency upgrade > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > https://pandas.pydata.org/docs/whatsnew/v2.1.3.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45941) Update pandas to 2.1.3
[ https://issues.apache.org/jira/browse/SPARK-45941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45941: --- Labels: pull-request-available (was: ) > Update pandas to 2.1.3 > -- > > Key: SPARK-45941 > URL: https://issues.apache.org/jira/browse/SPARK-45941 > Project: Spark > Issue Type: Dependency upgrade > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > > https://pandas.pydata.org/docs/whatsnew/v2.1.3.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45941) Update pandas to 2.1.3
Bjørn Jørgensen created SPARK-45941: --- Summary: Update pandas to 2.1.3 Key: SPARK-45941 URL: https://issues.apache.org/jira/browse/SPARK-45941 Project: Spark Issue Type: Dependency upgrade Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Bjørn Jørgensen https://pandas.pydata.org/docs/whatsnew/v2.1.3.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45940) Add InputPartition to DataSourceReader interface
Allison Wang created SPARK-45940: Summary: Add InputPartition to DataSourceReader interface Key: SPARK-45940 URL: https://issues.apache.org/jira/browse/SPARK-45940 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add InputPartition class and make the partitions method return a list of input partitions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
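A sketch of the shape this API could take. The class and method names come from the ticket summary; the real interface lives in pyspark.sql.datasource and may differ, so the stand-in classes below are illustrative only.

{code:python}
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class InputPartition:
    value: int  # opaque partition descriptor shipped to executors

class DataSourceReader:
    def partitions(self) -> List[InputPartition]:
        # Per the ticket: return a list of input partitions rather than a count.
        return [InputPartition(i) for i in range(3)]

    def read(self, partition: InputPartition) -> Iterator[tuple]:
        # Each executor reads only the rows belonging to its partition.
        yield (partition.value, f"row-from-partition-{partition.value}")

reader = DataSourceReader()
for part in reader.partitions():
    print(list(reader.read(part)))
{code}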
[jira] [Commented] (SPARK-45861) Add user guide for dataframe creation
[ https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786487#comment-17786487 ] Allison Wang commented on SPARK-45861: -- [~panbingkun] again, thanks for working on this. Let me give you more details. When people search on Google for, say, "spark create dataframe", there are many results, one of them being the PySpark documentation for createDataFrame. But there are many other ways to create a DataFrame, for example from various data sources (CSV, JDBC, Parquet, etc.), from a pandas DataFrame, from `spark.sql`, etc. We want to create a new documentation page under `{*}User Guides{*}` to explain all the ways to create a Spark DataFrame. It's different from the quickstart in that the user guide will provide more comprehensive examples. Feel free to take a look at the results when you search "spark create dataframe" or even "create dataframe" for more inspiration. cc [~afolting] [~smilegator] > Add user guide for dataframe creation > - > > Key: SPARK-45861 > URL: https://issues.apache.org/jira/browse/SPARK-45861 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Add a simple user guide for data frame creation. > This user guide should cover the following APIs: > # df.createDataFrame > # spark.read.format(...) (can be csv, json, parquet -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
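For reference, the creation paths the comment enumerates, in PySpark. The CSV path is a placeholder; also note the ticket writes df.createDataFrame, but the method actually lives on the SparkSession.

{code:python}
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# From local rows.
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

# From a data source (CSV here; JSON, Parquet, JDBC, etc. work the same way).
df2 = spark.read.format("csv").option("header", "true").load("/tmp/example.csv")

# From SQL.
df3 = spark.sql("SELECT 1 AS id, 'a' AS letter")

# From a pandas DataFrame (requires pandas, and pyarrow for fast conversion).
df4 = spark.createDataFrame(pd.DataFrame({"id": [1, 2]}))

df1.show()
{code}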
[jira] [Comment Edited] (SPARK-45390) Remove `distutils` usage
[ https://issues.apache.org/jira/browse/SPARK-45390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786455#comment-17786455 ] Dongjoon Hyun edited comment on SPARK-45390 at 11/15/23 6:00 PM: - Your concern is legit; we had the same concern for Apache Spark 2.4.x. :) For the following, let me ask you this way. Do you think Python 3.12 supports all of the minimum Python package requirements of Spark 3.5? Have you tested your branch-3.5 with Python 3.12 + SPARK-45390? bq. I'm not sure what you mean here. was (Author: dongjoon): Your concern is legit like we had the same concern at Apache Spark 2.4.x. :) For the following, let me ask you in this way. Do you think Python 3.12 support all minimum Python package requirements of Spark 3.5? Have you test your branch-3.5 with Python 3.12 + SPARK-45390 ? > I'm not sure what you mean here. > Remove `distutils` usage > > > Key: SPARK-45390 > URL: https://issues.apache.org/jira/browse/SPARK-45390 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [PEP-632|https://peps.python.org/pep-0632] deprecated {{distutils}} module in > Python {{3.10}} and dropped in Python {{3.12}} in favor of {{packaging}} > package. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45390) Remove `distutils` usage
[ https://issues.apache.org/jira/browse/SPARK-45390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786455#comment-17786455 ] Dongjoon Hyun commented on SPARK-45390: --- Your concern is legit; we had the same concern for Apache Spark 2.4.x. :) For the following, let me ask you this way. Do you think Python 3.12 supports all of the minimum Python package requirements of Spark 3.5? Have you tested your branch-3.5 with Python 3.12 + SPARK-45390? > I'm not sure what you mean here. > Remove `distutils` usage > > > Key: SPARK-45390 > URL: https://issues.apache.org/jira/browse/SPARK-45390 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [PEP-632|https://peps.python.org/pep-0632] deprecated {{distutils}} module in > Python {{3.10}} and dropped in Python {{3.12}} in favor of {{packaging}} > package. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45208) Website doesn't have horizontal scrollbar
[ https://issues.apache.org/jira/browse/SPARK-45208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45208: -- Description: This was reported in the dev mailing list. - https://lists.apache.org/thread/cfhqgltx0f4flrtb1p5c40zoopdy5yt9 I find a recent issue with the official Spark documentation on the website. Specifically, the Kubernetes configuration lists on the right-hand side are not visible and doc doesn't have a horizontal scrollbar. - [https://spark.apache.org/docs/3.5.0/running-on-kubernetes.html#configuration] - [https://spark.apache.org/docs/3.4.1/running-on-kubernetes.html#configuration] Wide tables are broken in the same way. was: I find a recent issue with the official Spark documentation on the website. Specifically, the Kubernetes configuration lists on the right-hand side are not visible and doc doesn't have a horizontal scrollbar. - [https://spark.apache.org/docs/3.5.0/running-on-kubernetes.html#configuration] - [https://spark.apache.org/docs/3.4.1/running-on-kubernetes.html#configuration] Wide tables are broken in the same way. - https://spark.apache.org/docs/latest/spark-standalone.html > Website doesn't have horizontal scrollbar > - > > Key: SPARK-45208 > URL: https://issues.apache.org/jira/browse/SPARK-45208 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Qian Sun >Priority: Major > > This was reported in the dev mailing list. > - https://lists.apache.org/thread/cfhqgltx0f4flrtb1p5c40zoopdy5yt9 > I find a recent issue with the official Spark documentation on the website. > Specifically, the Kubernetes configuration lists on the right-hand side are > not visible and doc doesn't have a horizontal scrollbar. > > - > [https://spark.apache.org/docs/3.5.0/running-on-kubernetes.html#configuration] > - > [https://spark.apache.org/docs/3.4.1/running-on-kubernetes.html#configuration] > Wide tables are broken in the same way. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45934) Fix `Spark Standalone` documentation table layout
[ https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45934: -- Parent: SPARK-45869 Issue Type: Sub-task (was: Bug) > Fix `Spark Standalone` documentation table layout > - > > Key: SPARK-45934 > URL: https://issues.apache.org/jira/browse/SPARK-45934 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45934) Fix `Spark Standalone` documentation table layout
[ https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45934: - Assignee: Dongjoon Hyun > Fix `Spark Standalone` documentation table layout > - > > Key: SPARK-45934 > URL: https://issues.apache.org/jira/browse/SPARK-45934 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45934) Fix `Spark Standalone` documentation table layout
[ https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45934: -- Summary: Fix `Spark Standalone` documentation table layout (was: Fix `spark-standalone.md` table layout) > Fix `Spark Standalone` documentation table layout > - > > Key: SPARK-45934 > URL: https://issues.apache.org/jira/browse/SPARK-45934 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.5.0, 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45934) Fix `Spark Standalone` documentation table layout
[ https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45934: -- Affects Version/s: (was: 4.0.0) > Fix `Spark Standalone` documentation table layout > - > > Key: SPARK-45934 > URL: https://issues.apache.org/jira/browse/SPARK-45934 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45934) Fix `spark-standalone.md` table layout
[ https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45934: -- Summary: Fix `spark-standalone.md` table layout (was: Fix `spark-standalone.md` and `running-on-kubernetes.md` table layout) > Fix `spark-standalone.md` table layout > -- > > Key: SPARK-45934 > URL: https://issues.apache.org/jira/browse/SPARK-45934 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.5.0, 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43393) Sequence expression can overflow
[ https://issues.apache.org/jira/browse/SPARK-43393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786438#comment-17786438 ] Dongjoon Hyun commented on SPARK-43393: --- Due to the compilation failure, this was reverted from branch-3.5 via https://github.com/apache/spark/commit/e38310c74e6cae8c8c8489ffcbceb80ed37a7cae. > Sequence expression can overflow > > > Key: SPARK-43393 > URL: https://issues.apache.org/jira/browse/SPARK-43393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Deepayan Patra >Assignee: Deepayan Patra >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Spark has a (long-standing) overflow bug in the {{sequence}} expression. > > Consider the following operations: > {{spark.sql("CREATE TABLE foo (l LONG);")}} > {{spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});")}} > {{spark.sql("SELECT sequence(0, l) FROM foo;").collect()}} > > The result of these operations will be: > {{Array[org.apache.spark.sql.Row] = Array([WrappedArray()])}} > an unintended consequence of overflow. > > The sequence is applied to values {{0}} and {{Long.MaxValue}} with a step > size of {{1}}, which uses a length computation defined > [here|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451]. > In this calculation, with {{start = 0}}, {{stop = Long.MaxValue}}, > and {{step = 1}}, the calculated {{len}} overflows to > {{Long.MinValue}}. The computation, in 64-bit binary, looks like: > 0111...1111 (Long.MaxValue) > - 0000...0000 (start) > -- > 0111...1111 > / 0000...0001 (step) > -- > 0111...1111 > + 0000...0001 > -- > 1000...0000 (Long.MinValue) > The following > [check|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454] > passes, as the negative {{Long.MinValue}} is still {{<= MAX_ROUNDED_ARRAY_LENGTH}}. The following cast to {{toInt}} uses this > representation and [truncates the upper > bits|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457], > resulting in an empty length of 0. > Other overflows are similarly problematic. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43393) Sequence expression can overflow
[ https://issues.apache.org/jira/browse/SPARK-43393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-43393: -- Fix Version/s: (was: 3.5.1) > Sequence expression can overflow > > > Key: SPARK-43393 > URL: https://issues.apache.org/jira/browse/SPARK-43393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Deepayan Patra >Assignee: Deepayan Patra >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Spark has a (long-standing) overflow bug in the {{sequence}} expression. > > Consider the following operations: > {{spark.sql("CREATE TABLE foo (l LONG);")}} > {{spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});")}} > {{spark.sql("SELECT sequence(0, l) FROM foo;").collect()}} > > The result of these operations will be: > {{Array[org.apache.spark.sql.Row] = Array([WrappedArray()])}} > an unintended consequence of overflow. > > The sequence is applied to values {{0}} and {{Long.MaxValue}} with a step > size of {{1}}, which uses a length computation defined > [here|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451]. > In this calculation, with {{start = 0}}, {{stop = Long.MaxValue}}, > and {{step = 1}}, the calculated {{len}} overflows to > {{Long.MinValue}}. The computation, in 64-bit binary, looks like: > 0111...1111 (Long.MaxValue) > - 0000...0000 (start) > -- > 0111...1111 > / 0000...0001 (step) > -- > 0111...1111 > + 0000...0001 > -- > 1000...0000 (Long.MinValue) > The following > [check|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454] > passes, as the negative {{Long.MinValue}} is still {{<= MAX_ROUNDED_ARRAY_LENGTH}}. The following cast to {{toInt}} uses this > representation and [truncates the upper > bits|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457], > resulting in an empty length of 0. > Other overflows are similarly problematic. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
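The overflow is easy to reproduce outside Spark. The sketch below redoes the length computation ((stop - start) / step + 1) with 64-bit wraparound made explicit, since Python integers do not overflow on their own; MAX_ROUNDED_ARRAY_LENGTH is assumed to be Int.MaxValue - 15, as in ByteArrayMethods.

{code:python}
LONG_MAX = 2**63 - 1
MAX_ROUNDED_ARRAY_LENGTH = 2**31 - 1 - 15  # assumed value from ByteArrayMethods

def to_int64(x: int) -> int:
    """Wrap a Python int to a signed 64-bit value, like a JVM Long."""
    x &= 2**64 - 1
    return x - 2**64 if x >= 2**63 else x

start, stop, step = 0, LONG_MAX, 1
length = to_int64((stop - start) // step + 1)
print(length)                              # -9223372036854775808 = Long.MinValue
print(length <= MAX_ROUNDED_ARRAY_LENGTH)  # True: the bounds check passes anyway
print(to_int64(length) & 0xFFFFFFFF)       # 0: the toInt truncation -> empty array
{code}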
[jira] [Updated] (SPARK-45934) Fix `spark-standalone.md` and `running-on-kubernetes.md` table layout
[ https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45934: -- Affects Version/s: 3.5.0 > Fix `spark-standalone.md` and `running-on-kubernetes.md` table layout > - > > Key: SPARK-45934 > URL: https://issues.apache.org/jira/browse/SPARK-45934 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.5.0, 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45938) Add `utils` to the dependency list of the `core` module in `module.py`
[ https://issues.apache.org/jira/browse/SPARK-45938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45938: --- Labels: pull-request-available (was: ) > Add `utils` to the dependency list of the `core` module in `module.py` > -- > > Key: SPARK-45938 > URL: https://issues.apache.org/jira/browse/SPARK-45938 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45938) Add `utils` to the dependency list of the `core` module in `module.py`
Yang Jie created SPARK-45938: Summary: Add `utils` to the dependency list of the `core` module in `module.py` Key: SPARK-45938 URL: https://issues.apache.org/jira/browse/SPARK-45938 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45390) Remove `distutils` usage
[ https://issues.apache.org/jira/browse/SPARK-45390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786401#comment-17786401 ] Nicholas Chammas commented on SPARK-45390: -- {quote}We don't promise to support all future unreleased Python versions {quote} "all future unreleased versions" is a tall ask that no-one is making. :) The relevant circumstances here are that a) Python 3.12 is already out and the backwards-incompatible changes are known and [very limited|https://docs.python.org/3/whatsnew/3.12.html], and b) Spark 4.0 may be a disruptive change and so many people may remain on Spark 3.5 for longer than usual. If we expect 3.5 -> 4.0 to be an easy migration, then backporting a fix like this to 3.5 is not as important. {quote}we need much more validation because all Python package ecosystem should work there without any issues {quote} I'm not sure what you mean here. Anyway, I suppose we could just wait and see. Maybe I'm wrong, but I suspect many users will find it surprising that Spark 3.5 doesn't work on Python 3.12, especially if this is the only (or close to the only) fix required. > Remove `distutils` usage > > > Key: SPARK-45390 > URL: https://issues.apache.org/jira/browse/SPARK-45390 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [PEP-632|https://peps.python.org/pep-0632] deprecated {{distutils}} module in > Python {{3.10}} and dropped in Python {{3.12}} in favor of {{packaging}} > package. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45937) Fix documentation of spark.executor.maxNumFailures
[ https://issues.apache.org/jira/browse/SPARK-45937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786395#comment-17786395 ] Thomas Graves commented on SPARK-45937: --- @Cheng Pan Could you fix this as a follow-up? > Fix documentation of spark.executor.maxNumFailures > -- > > Key: SPARK-45937 > URL: https://issues.apache.org/jira/browse/SPARK-45937 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Thomas Graves >Priority: Critical > > https://issues.apache.org/jira/browse/SPARK-41210 added support for > spark.executor.maxNumFailures on Kubernetes; it made this config generic and > deprecated the YARN version. Neither the config nor its defaults are > documented. > > [https://github.com/apache/spark/commit/40872e9a094f8459b0b6f626937ced48a8d98efb] > > It also added spark.executor.failuresValidityInterval. > > Both need to have default values specified for YARN and K8s, > and the YARN documentation for the equivalent config > spark.yarn.max.executor.failures needs to be removed. > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45937) Fix documentation of spark.executor.maxNumFailures
Thomas Graves created SPARK-45937: - Summary: Fix documentation of spark.executor.maxNumFailures Key: SPARK-45937 URL: https://issues.apache.org/jira/browse/SPARK-45937 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.0 Reporter: Thomas Graves https://issues.apache.org/jira/browse/SPARK-41210 added support for spark.executor.maxNumFailures on Kubernetes; it made this config generic and deprecated the YARN version. Neither the config nor its defaults are documented. [https://github.com/apache/spark/commit/40872e9a094f8459b0b6f626937ced48a8d98efb] It also added spark.executor.failuresValidityInterval. Both need to have default values specified for YARN and K8s, and the YARN documentation for the equivalent config spark.yarn.max.executor.failures needs to be removed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
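For anyone landing here before the docs are fixed, the configs can be set like any other Spark conf. The values below are illustrative, not the undocumented defaults this ticket asks to have spelled out.

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Generic replacement for the deprecated spark.yarn.max.executor.failures.
    .config("spark.executor.maxNumFailures", "10")
    # Failures older than this window stop counting toward the limit.
    .config("spark.executor.failuresValidityInterval", "1h")
    .getOrCreate()
)
{code}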
[jira] [Resolved] (SPARK-45905) least common type between decimal types should retain integral digits first
[ https://issues.apache.org/jira/browse/SPARK-45905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-45905. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43781 [https://github.com/apache/spark/pull/43781] > least common type between decimal types should retain integral digits first > --- > > Key: SPARK-45905 > URL: https://issues.apache.org/jira/browse/SPARK-45905 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
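One way to read "retain integral digits first" is sketched below: when the widened type would exceed the 38-digit precision cap, fractional digits are sacrificed before digits left of the decimal point. This is an illustration of the idea only; the authoritative rule is whatever the linked PR implements.

{code:python}
MAX_PRECISION = 38  # Spark's DecimalType cap

def widen_decimal(p1: int, s1: int, p2: int, s2: int) -> tuple:
    # Keep enough integral digits for both inputs first...
    integral = max(p1 - s1, p2 - s2)
    # ...then spend whatever precision remains on the fractional part.
    scale = max(s1, s2)
    if integral + scale > MAX_PRECISION:
        scale = max(MAX_PRECISION - integral, 0)
    return (min(integral + scale, MAX_PRECISION), scale)

# decimal(38, 0) vs decimal(10, 10): the integral digits survive intact.
print(widen_decimal(38, 0, 10, 10))  # (38, 0)
# decimal(20, 2) vs decimal(10, 8): room for both, nothing is trimmed.
print(widen_decimal(20, 2, 10, 8))   # (26, 8)
{code}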
[jira] [Commented] (SPARK-45390) Remove `distutils` usage
[ https://issues.apache.org/jira/browse/SPARK-45390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786312#comment-17786312 ] Dongjoon Hyun commented on SPARK-45390: --- Apache Spark 3.5 was first released on 2023-09-13 and Python 3.12 was released on 2023-10-02. We don't promise to support all future unreleased Python versions, [~nchammas]. As you pointed out, this is designed to be an improvement for Apache Spark 4.0.0 only. BTW, in order to support Python 3.12 in Apache Spark 3.5.x, we need much more validation because the whole Python package ecosystem should work there without any issues. > Remove `distutils` usage > > > Key: SPARK-45390 > URL: https://issues.apache.org/jira/browse/SPARK-45390 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [PEP-632|https://peps.python.org/pep-0632] deprecated {{distutils}} module in > Python {{3.10}} and dropped in Python {{3.12}} in favor of {{packaging}} > package. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
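The mechanical part of the migration is small: version parsing moves from the removed distutils module to the third-party packaging distribution (pip install packaging), roughly as follows.

{code:python}
# Before (removed in Python 3.12):
#   from distutils.version import LooseVersion
# After:
from packaging.version import Version

assert Version("3.12.0") >= Version("3.10")
assert Version("2.1.3") > Version("2.1.2")
print(Version("4.0.0.dev0") < Version("4.0.0"))  # True: pre-releases sort first
{code}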
[jira] [Resolved] (SPARK-45915) Treat decimal(x, 0) the same as IntegralType in PromoteStrings
[ https://issues.apache.org/jira/browse/SPARK-45915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45915. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43812 [https://github.com/apache/spark/pull/43812] > Treat decimal(x, 0) the same as IntegralType in PromoteStrings > -- > > Key: SPARK-45915 > URL: https://issues.apache.org/jira/browse/SPARK-45915 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45935) Fix RST files link substitutions error
[ https://issues.apache.org/jira/browse/SPARK-45935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45935: --- Labels: pull-request-available (was: ) > Fix RST files link substitutions error > -- > > Key: SPARK-45935 > URL: https://issues.apache.org/jira/browse/SPARK-45935 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45935) Fix RST files link substitutions error
BingKun Pan created SPARK-45935: --- Summary: Fix RST files link substitutions error Key: SPARK-45935 URL: https://issues.apache.org/jira/browse/SPARK-45935 Project: Spark Issue Type: Bug Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45934) Fix `spark-standalone.md` and `running-on-kubernetes.md` table layout
[ https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45934: --- Labels: pull-request-available (was: ) > Fix `spark-standalone.md` and `running-on-kubernetes.md` table layout > - > > Key: SPARK-45934 > URL: https://issues.apache.org/jira/browse/SPARK-45934 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45934) Fix `spark-standalone.md` and `running-on-kubernetes.md` table layout
Dongjoon Hyun created SPARK-45934: - Summary: Fix `spark-standalone.md` and `running-on-kubernetes.md` table layout Key: SPARK-45934 URL: https://issues.apache.org/jira/browse/SPARK-45934 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-42471) Distributed ML <> spark connect
[ https://issues.apache.org/jira/browse/SPARK-42471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786292#comment-17786292 ] Faiz Halde edited comment on SPARK-42471 at 11/15/23 10:22 AM: --- Hello, our use case requires us to use Spark Connect, and some of our jobs use Spark ML (Scala). Is this umbrella tracking the work required to make Spark ML compatible with Spark Connect? Because so far we've been struggling with this. May I know if this is already done and if there are docs on how to make this work? Thanks! was (Author: JIRAUSER300204): Hello, our use-case requires us to use spark-connect and we have some of our jobs that used spark ML. Is this Umbrella tracking the work required to make spark ml compatible with spark-connect? Because so far we've been struggling with this. May I know if this is already done and if there are docs on how to make this work? Thanks! > Distributed ML <> spark connect > --- > > Key: SPARK-42471 > URL: https://issues.apache.org/jira/browse/SPARK-42471 > Project: Spark > Issue Type: Umbrella > Components: Connect, ML >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45933) Runtime filter should infer more application sides
Jiaan Geng created SPARK-45933: -- Summary: Runtime filter should infer more application sides Key: SPARK-45933 URL: https://issues.apache.org/jira/browse/SPARK-45933 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 4.0.0 Reporter: Jiaan Geng Assignee: Jiaan Geng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45924) Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not equivalent with SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-45924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45924: -- Assignee: (was: Apache Spark) > Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not > equivalent with SubqueryBroadcastExec > > > Key: SPARK-45924 > URL: https://issues.apache.org/jira/browse/SPARK-45924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > While writing a bug test for > [SPARK-45866|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-45866], > I found that canonicalization of SubqueryAdaptiveBroadcastExec is broken: > its buildPlan: LogicalPlan is not canonicalized, which causes > batch scans to differ when the reuse-of-exchange check happens in AQE. > Moreover, SubqueryAdaptiveBroadcastExec and > SubqueryBroadcastExec are not treated as equivalent, which further breaks > the reuse of exchanges in AQE. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45924) Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not equivalent with SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-45924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45924: -- Assignee: Apache Spark > Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not > equivalent with SubqueryBroadcastExec > > > Key: SPARK-45924 > URL: https://issues.apache.org/jira/browse/SPARK-45924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > While writing a bug test for > [SPARK-45866|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-45866], > I found that canonicalization of SubqueryAdaptiveBroadcastExec is broken: > its buildPlan: LogicalPlan is not canonicalized, which causes > batch scans to differ when the reuse-of-exchange check happens in AQE. > Moreover, SubqueryAdaptiveBroadcastExec and > SubqueryBroadcastExec are not treated as equivalent, which further breaks > the reuse of exchanges in AQE. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-45861) Add user guide for dataframe creation
[ https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786216#comment-17786216 ] BingKun Pan edited comment on SPARK-45861 at 11/15/23 8:17 AM: --- Unfortunately, I have found similar documents at [https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html], !screenshot-1.png|width=751,height=541! Do we need to move them under the `User Guides` menu? It feels a bit repetitive. Perhaps we should reorganize the menu categories? !screenshot-2.png|width=671,height=43! was (Author: panbingkun): Unfortunately, I have found similar documents at [https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html], !screenshot-1.png|width=751,height=541! so do we need to move them under menu `User Guides`, or ? we feel a bit repetitive? Perhaps we should organize the menu categories? !screenshot-2.png|width=671,height=43! > Add user guide for dataframe creation > - > > Key: SPARK-45861 > URL: https://issues.apache.org/jira/browse/SPARK-45861 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Add a simple user guide for data frame creation. > This user guide should cover the following APIs: > # df.createDataFrame > # spark.read.format(...) (can be csv, json, parquet -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-45861) Add user guide for dataframe creation
[ https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786216#comment-17786216 ] BingKun Pan edited comment on SPARK-45861 at 11/15/23 8:08 AM: --- Unfortunately, I have found similar documents at [https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html], !screenshot-1.png|width=751,height=541! Do we need to move them under the `User Guides` menu? It feels a bit repetitive. Perhaps we should reorganize the menu categories? !screenshot-2.png|width=671,height=43! was (Author: panbingkun): Unfortunately, I have found similar documents at https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html, !screenshot-1.png! so do we need to move them under menu ``, or ? we feel a bit repetitive? Perhaps we should organize the menu categories? !screenshot-2.png! > Add user guide for dataframe creation > - > > Key: SPARK-45861 > URL: https://issues.apache.org/jira/browse/SPARK-45861 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Add a simple user guide for data frame creation. > This user guide should cover the following APIs: > # df.createDataFrame > # spark.read.format(...) (can be csv, json, parquet -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45861) Add user guide for dataframe creation
[ https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786216#comment-17786216 ] BingKun Pan commented on SPARK-45861: - Unfortunately, I have found similar documents at https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html, !screenshot-1.png! Do we need to move them under the `User Guides` menu? It feels a bit repetitive. Perhaps we should reorganize the menu categories? !screenshot-2.png! > Add user guide for dataframe creation > - > > Key: SPARK-45861 > URL: https://issues.apache.org/jira/browse/SPARK-45861 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Add a simple user guide for data frame creation. > This user guide should cover the following APIs: > # df.createDataFrame > # spark.read.format(...) (can be csv, json, parquet -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45861) Add user guide for dataframe creation
[ https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-45861: Attachment: screenshot-2.png > Add user guide for dataframe creation > - > > Key: SPARK-45861 > URL: https://issues.apache.org/jira/browse/SPARK-45861 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Add a simple user guide for data frame creation. > This user guide should cover the following APIs: > # df.createDataFrame > # spark.read.format(...) (can be csv, json, parquet -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45861) Add user guide for dataframe creation
[ https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-45861: Attachment: screenshot-1.png > Add user guide for dataframe creation > - > > Key: SPARK-45861 > URL: https://issues.apache.org/jira/browse/SPARK-45861 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Add a simple user guide for data frame creation. > This user guide should cover the following APIs: > # df.createDataFrame > # spark.read.format(...) (can be csv, json, parquet -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org