[jira] [Assigned] (SPARK-45938) Add `utils` to the dependency list of the `core` module in `module.py`

2023-11-15 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-45938:


Assignee: Yang Jie

> Add `utils` to the dependency list of the `core` module in `module.py`
> --
>
> Key: SPARK-45938
> URL: https://issues.apache.org/jira/browse/SPARK-45938
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45938) Add `utils` to the dependency list of the `core` module in `module.py`

2023-11-15 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-45938.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43818
[https://github.com/apache/spark/pull/43818]

> Add `utils` to the dependency list of the `core` module in `module.py`
> --
>
> Key: SPARK-45938
> URL: https://issues.apache.org/jira/browse/SPARK-45938
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-32246) Have a way to optionally run streaming-kinesis-asl

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-32246:
---
Labels: pull-request-available  (was: )

> Have a way to optionally run streaming-kinesis-asl
> --
>
> Key: SPARK-32246
> URL: https://issues.apache.org/jira/browse/SPARK-32246
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> See https://github.com/HyukjinKwon/spark/pull/4. The Kinesis tests depend on 
> the external Amazon Kinesis service.
> We should have a way to run them optionally. Currently, they are not run 
> in GitHub Actions.






[jira] [Updated] (SPARK-45948) Make single-pod spark jobs respect spark.app.id

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45948:
---
Labels: pull-request-available  (was: )

> Make single-pod spark jobs respect spark.app.id
> ---
>
> Key: SPARK-45948
> URL: https://issues.apache.org/jira/browse/SPARK-45948
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45948) Make single-pod spark jobs respect spark.app.id

2023-11-15 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-45948:
-

 Summary: Make single-pod spark jobs respect spark.app.id
 Key: SPARK-45948
 URL: https://issues.apache.org/jira/browse/SPARK-45948
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Commented] (SPARK-45946) Fix use of deprecated FileUtils write in RocksDBSuite

2023-11-15 Thread Anish Shrigondekar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786629#comment-17786629
 ] 

Anish Shrigondekar commented on SPARK-45946:


[~kabhwan] - PR here - [GitHub Pull Request 
#43832|https://github.com/apache/spark/pull/43832]

 

PTAL, thx

> Fix use of deprecated FileUtils write in RocksDBSuite
> -
>
> Key: SPARK-45946
> URL: https://issues.apache.org/jira/browse/SPARK-45946
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
>
> Fix use of deprecated FileUtils write in RocksDBSuite






[jira] [Updated] (SPARK-45947) Set a human readable description for Dataset api

2023-11-15 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-45947:

Description: 
We should set the view name to 
sparkSession.sparkContext.setJobDescription("xxx")


 !screenshot-1.png! 


  was:
Need to sparkSession.sparkContext.setJobDescription("xxx")
 !screenshot-1.png! 



> Set a human readable description for Dataset api
> 
>
> Key: SPARK-45947
> URL: https://issues.apache.org/jira/browse/SPARK-45947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> We should set the view name to 
> sparkSession.sparkContext.setJobDescription("xxx")
>  !screenshot-1.png! 






[jira] [Updated] (SPARK-45947) Set a human readable description for Dataset api

2023-11-15 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-45947:

Description: 
Need to sparkSession.sparkContext.setJobDescription("xxx")
 !screenshot-1.png! 


> Set a human readable description for Dataset api
> 
>
> Key: SPARK-45947
> URL: https://issues.apache.org/jira/browse/SPARK-45947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Need to sparkSession.sparkContext.setJobDescription("xxx")
>  !screenshot-1.png! 






[jira] [Created] (SPARK-45947) Set a human readable description for Dataset api

2023-11-15 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-45947:
---

 Summary: Set a human readable description for Dataset api
 Key: SPARK-45947
 URL: https://issues.apache.org/jira/browse/SPARK-45947
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang
 Attachments: screenshot-1.png








[jira] [Updated] (SPARK-45947) Set a human readable description for Dataset api

2023-11-15 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-45947:

Attachment: screenshot-1.png

> Set a human readable description for Dataset api
> 
>
> Key: SPARK-45947
> URL: https://issues.apache.org/jira/browse/SPARK-45947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>







[jira] [Updated] (SPARK-45946) Fix use of deprecated FileUtils write in RocksDBSuite

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45946:
---
Labels: pull-request-available  (was: )

> Fix use of deprecated FileUtils write in RocksDBSuite
> -
>
> Key: SPARK-45946
> URL: https://issues.apache.org/jira/browse/SPARK-45946
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
>
> Fix use of deprecated FileUtils write in RocksDBSuite






[jira] [Created] (SPARK-45946) Fix use of deprecated FileUtils write in RocksDBSuite

2023-11-15 Thread Anish Shrigondekar (Jira)
Anish Shrigondekar created SPARK-45946:
--

 Summary: Fix use of deprecated FileUtils write in RocksDBSuite
 Key: SPARK-45946
 URL: https://issues.apache.org/jira/browse/SPARK-45946
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Anish Shrigondekar


Fix use of deprecated FileUtils write in RocksDBSuite






[jira] [Resolved] (SPARK-33393) Support SHOW TABLE EXTENDED in DSv2

2023-11-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33393.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 37588
[https://github.com/apache/spark/pull/37588]

> Support SHOW TABLE EXTENDED in DSv2
> ---
>
> Key: SPARK-33393
> URL: https://issues.apache.org/jira/browse/SPARK-33393
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Current implementation of DSv2 SHOW TABLE doesn't support the EXTENDED mode 
> in:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExec.scala#L33
> which is supported in DSv1:
> https://github.com/apache/spark/blob/7e99fcd64efa425f3c985df4fe957a3be274a49a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L870
> Need to add the same functionality to ShowTablesExec.






[jira] [Updated] (SPARK-45866) Reuse of exchange in AQE does not happen when run time filters are pushed down to the underlying Scan ( like iceberg )

2023-11-15 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-45866:
-
Labels: pull-request-available  (was: )

> Reuse of exchange in AQE does not happen when run time filters are pushed 
> down to the underlying Scan ( like iceberg )
> --
>
> Key: SPARK-45866
> URL: https://issues.apache.org/jira/browse/SPARK-45866
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Priority: Major
>  Labels: pull-request-available
>
> In certain types of queries, e.g. TPCDS Query 14b, the reuse of exchange 
> does not happen in AQE, resulting in performance degradation.
> The Spark TPCDS tests are unable to catch the problem because the 
> InMemoryScan used for testing does not implement equals & hashCode 
> correctly, in the sense that it does not take the pushed-down runtime 
> filters into account.
> In concrete Scan implementations, e.g. iceberg's SparkBatchQueryScan, the 
> equality check, apart from other things, also involves the runtime filters 
> pushed (which is correct).
> In Spark the issue is this:
> For a given stage being materialized, just before materialization starts, 
> the runtime filters are confined to the BatchScanExec level.
> Only when the actual RDD corresponding to the BatchScanExec is being 
> evaluated do the runtime filters get pushed to the underlying Scan.
> Now if a new stage is created and it checks the stageCache using its 
> canonicalized plan to see whether a stage can be reused, it fails to find 
> the reusable stage even if one exists, because the canonicalized spark plan 
> present in the stage cache now has the runtime filters pushed to the Scan, 
> so the incoming canonicalized spark plan does not match the key: their 
> underlying scans differ. That is, the incoming spark plan's scan does not 
> have runtime filters, while the canonicalized spark plan present as the key 
> in the stage cache has the scan with runtime filters pushed.
> The fix I have worked on is to provide two methods in the 
> SupportsRuntimeV2Filtering interface:
> {code:java}
> default boolean equalToIgnoreRuntimeFilters(Scan other) {
>   return this.equals(other);
> }
>
> default int hashCodeIgnoreRuntimeFilters() {
>   return this.hashCode();
> }
> {code}
> In BatchScanExec, if the scan implements SupportsRuntimeV2Filtering, then 
> instead of batch.equals it should call scan.equalToIgnoreRuntimeFilters.
> And the underlying Scan implementations should provide equality that 
> excludes runtime filters.
> Similarly, the hashCode of BatchScanExec should use 
> scan.hashCodeIgnoreRuntimeFilters instead of batch.hashCode.
> Will be creating a PR with a bug test for review.
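The reuse failure and the proposed fix can be sketched in miniature. This is plain Python with made-up names (the `Scan` class here is a toy stand-in, not Spark's interface): full equality fails once runtime filters have been pushed into the cached scan, while an equality that deliberately ignores runtime filters still matches.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scan:
    """Toy stand-in for a DSv2 scan: a table identity plus pushed runtime filters."""
    table: str
    runtime_filters: tuple = ()

    def equal_ignoring_runtime_filters(self, other: "Scan") -> bool:
        # Analogue of the proposed equalToIgnoreRuntimeFilters(Scan other)
        return self.table == other.table

    def hash_ignoring_runtime_filters(self) -> int:
        # Analogue of the proposed hashCodeIgnoreRuntimeFilters()
        return hash(self.table)

# The stage cache holds a canonicalized plan whose scan already has runtime
# filters pushed; the incoming plan's scan does not have them yet.
cached = Scan("t1", runtime_filters=("id IN (1, 2, 3)",))
incoming = Scan("t1")

print(cached == incoming)                               # False: cache lookup misses
print(cached.equal_ignoring_runtime_filters(incoming))  # True: stage can be reused
```

With equality defined this way, the stage-cache key ignores exactly the part of the scan that mutates between canonicalization and materialization.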






[jira] [Resolved] (SPARK-45747) Support session window aggregation in state reader

2023-11-15 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-45747.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43788
[https://github.com/apache/spark/pull/43788]

> Support session window aggregation in state reader
> --
>
> Key: SPARK-45747
> URL: https://issues.apache.org/jira/browse/SPARK-45747
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We are introducing a state reader in SPARK-45511, but currently the session 
> window operator is not supported because the numColPrefixKey is unknown. We 
> can read the operator state metadata introduced in SPARK-45558 to determine 
> the number of prefix columns and load the state of the session window 
> correctly.
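The proposed mechanics can be sketched without Spark. Everything below is illustrative Python (the metadata layout and names are assumptions, not the actual SPARK-45558 format): the reader looks up numColPrefixKey per operator instead of guessing it, then splits each state key into its prefix and remainder.

```python
# Hypothetical per-operator metadata, standing in for the operator state
# metadata that SPARK-45558 introduces.
OPERATOR_METADATA = {
    "sessionWindowAgg": {"numColPrefixKey": 2},  # grouping keys before the session column
    "streamingAgg": {"numColPrefixKey": 0},
}

def split_state_key(operator: str, key: tuple) -> tuple:
    """Split a composite state key into (prefix columns, remaining columns)."""
    n = OPERATOR_METADATA[operator]["numColPrefixKey"]
    return key[:n], key[n:]

prefix, rest = split_state_key("sessionWindowAgg", ("user-1", 42, "session-window"))
print(prefix, rest)  # ('user-1', 42) ('session-window',)
```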






[jira] [Assigned] (SPARK-45747) Support session window aggregation in state reader

2023-11-15 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-45747:


Assignee: Chaoqin Li

> Support session window aggregation in state reader
> --
>
> Key: SPARK-45747
> URL: https://issues.apache.org/jira/browse/SPARK-45747
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
>
> We are introducing a state reader in SPARK-45511, but currently the session 
> window operator is not supported because the numColPrefixKey is unknown. We 
> can read the operator state metadata introduced in SPARK-45558 to determine 
> the number of prefix columns and load the state of the session window 
> correctly.






[jira] [Comment Edited] (SPARK-45943) DataSourceV2Relation.computeStats throws IllegalStateException in test mode

2023-11-15 Thread Zhen Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786584#comment-17786584
 ] 

Zhen Wang edited comment on SPARK-45943 at 11/16/23 3:24 AM:
-

I encountered the same problem. After debugging, I found that the 
RewriteMergeIntoTable rule rewrites MergeIntoTable into ReplaceData, and there 
is a HiveTableRelation without tableStats in ReplaceData.groupFilterCondition. 
Since DetermineTableStats is applied after RewriteMergeIntoTable, it does not 
set tableStats for that HiveTableRelation.
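The rule-ordering problem can be modeled in a few lines. This is a toy model, not Spark code (the class and field names only mirror the ones above): a stats rule that walks only child links never reaches a relation buried in groupFilterCondition, so computing stats on that relation fails the same way the stack trace below does.

```python
class HiveTableRelation:
    def __init__(self):
        self.table_stats = None

    def compute_stats(self):
        if self.table_stats is None:
            raise RuntimeError("Table stats must be specified.")
        return self.table_stats

class ReplaceData:
    def __init__(self, child, group_filter_condition):
        self.child = child
        # Held as a side expression, not a child the stats rule traverses.
        self.group_filter_condition = group_filter_condition

def determine_table_stats(plan):
    """Sets default stats, but only on relations reachable via .child links."""
    node = plan
    while node is not None:
        if isinstance(node, HiveTableRelation):
            node.table_stats = {"sizeInBytes": 1024}
        node = getattr(node, "child", None)

hidden = HiveTableRelation()
plan = ReplaceData(child=HiveTableRelation(), group_filter_condition=hidden)
determine_table_stats(plan)   # runs after the rewrite, never visits `hidden`

plan.child.compute_stats()    # fine: stats were set on the traversed relation
try:
    hidden.compute_stats()    # mirrors the IllegalStateException
except RuntimeError as e:
    print(e)                  # Table stats must be specified.
```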
 
Reproduce:
{code:java}
create table sample.hive_table (id int, name string);
 
create table iceberg_catalog.sample.iceberg_table (
  id int,
  name string)
USING iceberg;
 
MERGE INTO iceberg_catalog.sample.iceberg_table t USING (SELECT * FROM 
sample.hive_table) u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.name = u.name
WHEN NOT MATCHED THEN INSERT *; {code}
error:
{code:java}
ERROR ExecuteStatement: Error operating ExecuteStatement: 
java.lang.IllegalStateException: Table stats must be specified.
at 
org.apache.spark.sql.catalyst.catalog.HiveTableRelation.$anonfun$computeStats$3(interface.scala:845)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.sql.catalyst.catalog.HiveTableRelation.computeStats(interface.scala:845)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:56)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:28)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:49)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:28)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:32)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitUnaryNode(SizeInBytesOnlyStatsPlanVisitor.scala:40)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitFilter(SizeInBytesOnlyStatsPlanVisitor.scala:80)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitFilter(SizeInBytesOnlyStatsPlanVisitor.scala:28)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:28)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:32)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitUnaryNode(SizeInBytesOnlyStatsPlanVisitor.scala:40)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitProject(SizeInBytesOnlyStatsPlanVisitor.scala:149)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitProject(SizeInBytesOnlyStatsPlanVisitor.scala:28)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:38)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:28)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(Lo

[jira] [Updated] (SPARK-45945) Add a helper function for `parser`

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45945:
---
Labels: pull-request-available  (was: )

> Add a helper function for `parser`
> --
>
> Key: SPARK-45945
> URL: https://issues.apache.org/jira/browse/SPARK-45945
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45945) Add a helper function for `parser`

2023-11-15 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-45945:
-

 Summary: Add a helper function for `parser`
 Key: SPARK-45945
 URL: https://issues.apache.org/jira/browse/SPARK-45945
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng









[jira] [Commented] (SPARK-45943) DataSourceV2Relation.computeStats throws IllegalStateException in test mode

2023-11-15 Thread Zhen Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786584#comment-17786584
 ] 

Zhen Wang commented on SPARK-45943:
---

I encountered the same problem. After debugging, I found that the 
RewriteMergeIntoTable rule rewrites MergeIntoTable into ReplaceData, and there 
is a HiveTableRelation without tableStats in ReplaceData.groupFilterCondition. 
Since DetermineTableStats is applied after RewriteMergeIntoTable, it does not 
set tableStats for that HiveTableRelation.
 
Reproduce:
{code:java}
create table sample.hive_table (id int, name string);
 
create table iceberg_catalog.sample.iceberg_table (
  id int,
  name string)
USING iceberg;
 
MERGE INTO iceberg_table t USING (SELECT * FROM hive_table) u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.name = u.name
WHEN NOT MATCHED THEN INSERT *; {code}
error:
{code:java}
ERROR ExecuteStatement: Error operating ExecuteStatement: 
java.lang.IllegalStateException: Table stats must be specified.
at 
org.apache.spark.sql.catalyst.catalog.HiveTableRelation.$anonfun$computeStats$3(interface.scala:845)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.sql.catalyst.catalog.HiveTableRelation.computeStats(interface.scala:845)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:56)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:28)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:49)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:28)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:32)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitUnaryNode(SizeInBytesOnlyStatsPlanVisitor.scala:40)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitFilter(SizeInBytesOnlyStatsPlanVisitor.scala:80)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitFilter(SizeInBytesOnlyStatsPlanVisitor.scala:28)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:28)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:32)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitUnaryNode(SizeInBytesOnlyStatsPlanVisitor.scala:40)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitProject(SizeInBytesOnlyStatsPlanVisitor.scala:149)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitProject(SizeInBytesOnlyStatsPlanVisitor.scala:28)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:38)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:28)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33)
at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPl

[jira] [Commented] (SPARK-45861) Add user guide for dataframe creation

2023-11-15 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786581#comment-17786581
 ] 

BingKun Pan commented on SPARK-45861:
-

Okay, I see. Let me try it.

> Add user guide for dataframe creation
> -
>
> Key: SPARK-45861
> URL: https://issues.apache.org/jira/browse/SPARK-45861
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Add a simple user guide for DataFrame creation.
> This user guide should cover the following APIs:
>  # df.createDataFrame
>  # spark.read.format(...) (can be csv, json, parquet)






[jira] [Resolved] (SPARK-45930) Allow non-deterministic Python UDFs in MapInPandas/MapInArrow

2023-11-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45930.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43810
[https://github.com/apache/spark/pull/43810]

> Allow non-deterministic Python UDFs in MapInPandas/MapInArrow
> -
>
> Key: SPARK-45930
> URL: https://issues.apache.org/jira/browse/SPARK-45930
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, if a Python UDF is non-deterministic, the analyzer will fail with 
> this error: [INVALID_NON_DETERMINISTIC_EXPRESSIONS] The operator expects a 
> deterministic expression, but the actual expression is "pyUDF()", "a". 
> SQLSTATE: 42K0E;






[jira] [Assigned] (SPARK-45930) Allow non-deterministic Python UDFs in MapInPandas/MapInArrow

2023-11-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45930:


Assignee: Allison Wang

> Allow non-deterministic Python UDFs in MapInPandas/MapInArrow
> -
>
> Key: SPARK-45930
> URL: https://issues.apache.org/jira/browse/SPARK-45930
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Currently, if a Python UDF is non-deterministic, the analyzer will fail with 
> this error: [INVALID_NON_DETERMINISTIC_EXPRESSIONS] The operator expects a 
> deterministic expression, but the actual expression is "pyUDF()", "a". 
> SQLSTATE: 42K0E;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45931) Refine docstring of `mapInPandas`

2023-11-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45931.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43811
[https://github.com/apache/spark/pull/43811]

> Refine docstring of `mapInPandas`
> -
>
> Key: SPARK-45931
> URL: https://issues.apache.org/jira/browse/SPARK-45931
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Refine the docstring of the mapInPandas function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45931) Refine docstring of `mapInPandas`

2023-11-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45931:


Assignee: Allison Wang

> Refine docstring of `mapInPandas`
> -
>
> Key: SPARK-45931
> URL: https://issues.apache.org/jira/browse/SPARK-45931
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Refine the docstring of the mapInPandas function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45936) Optimize `Index.symmetric_difference`

2023-11-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45936:


Assignee: Ruifeng Zheng

> Optimize `Index.symmetric_difference`
> -
>
> Key: SPARK-45936
> URL: https://issues.apache.org/jira/browse/SPARK-45936
> Project: Spark
>  Issue Type: Improvement
>  Components: PS
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45936) Optimize `Index.symmetric_difference`

2023-11-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45936.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43816
[https://github.com/apache/spark/pull/43816]

> Optimize `Index.symmetric_difference`
> -
>
> Key: SPARK-45936
> URL: https://issues.apache.org/jira/browse/SPARK-45936
> Project: Spark
>  Issue Type: Improvement
>  Components: PS
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45935) Fix RST files link substitutions error

2023-11-15 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-45935:

Affects Version/s: 3.5.0
   3.4.1
   3.3.3
   (was: 3.4.2)
   (was: 3.5.1)
   (was: 3.3.4)

> Fix RST files link substitutions error
> --
>
> Key: SPARK-45935
> URL: https://issues.apache.org/jira/browse/SPARK-45935
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 3.3.3, 3.4.1, 3.5.0, 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45935) Fix RST files link substitutions error

2023-11-15 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-45935:

Affects Version/s: 3.4.2
   3.5.1
   3.3.4

> Fix RST files link substitutions error
> --
>
> Key: SPARK-45935
> URL: https://issues.apache.org/jira/browse/SPARK-45935
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44685) Remove deprecated Catalog#createExternalTable

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44685:
---
Labels: pull-request-available release-notes  (was: release-notes)

> Remove deprecated Catalog#createExternalTable
> -
>
> Key: SPARK-44685
> URL: https://issues.apache.org/jira/browse/SPARK-44685
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jia Fan
>Priority: Major
>  Labels: pull-request-available, release-notes
>
> We should remove Catalog#createExternalTable because it has been deprecated since 2.2.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44699) Add logging for complete write events to file in EventLogFileWriter.closeWriter

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44699:
---
Labels: pull-request-available  (was: )

> Add logging for complete write events to file in 
> EventLogFileWriter.closeWriter
> ---
>
> Key: SPARK-44699
> URL: https://issues.apache.org/jira/browse/SPARK-44699
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: shuyouZZ
>Priority: Major
>  Labels: pull-request-available
>
> Sometimes we want to know when the logging of events to the event log file 
> has finished; we should add a log message to make this clearer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45944) Leaked file streams in ParquetFileFormat

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45944:
--
Description: 
- [https://github.com/apache/spark/actions/runs/6859020738/job/18650698085]
 - [https://github.com/apache/spark/actions/runs/6872747886/job/18691717269]
{code:java}
Cause: java.lang.IllegalStateException: There are 1 possibly leaked file 
streams.
29975[info]   at 
org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:54)
29976[info]   at 
org.apache.spark.sql.test.SharedSparkSessionBase.$anonfun$afterEach$1(SharedSparkSession.scala:165)
...
29977[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.eventually(ParquetFileFormatSuite.scala:31)
29984[info]   at 
org.apache.spark.sql.test.SharedSparkSessionBase.afterEach(SharedSparkSession.scala:164)
29985[info]   at 
org.apache.spark.sql.test.SharedSparkSessionBase.afterEach$(SharedSparkSession.scala:158)
29986[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.afterEach(ParquetFileFormatSuite.scala:31)
29987[info]   at 
org.scalatest.BeforeAndAfterEach.$anonfun$runTest$1(BeforeAndAfterEach.scala:247)
29988[info]   at 
...
30025[info]   Cause: java.lang.Throwable:
30026[info]   at 
org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35)
30027[info]   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:75)
30028[info]   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:997)
30029[info]   at 
org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
30030[info]   at 
org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:796)
30031[info]   at 
org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
30032[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
30033[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:76)
30034[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:450)
 {code}

  was:
- https://github.com/apache/spark/actions/runs/6872747886/job/18691717269
{code:java}
Cause: java.lang.IllegalStateException: There are 1 possibly leaked file 
streams.
29975[info]   at 
org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:54)
29976[info]   at 
org.apache.spark.sql.test.SharedSparkSessionBase.$anonfun$afterEach$1(SharedSparkSession.scala:165)
...
29977[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.eventually(ParquetFileFormatSuite.scala:31)
29984[info]   at 
org.apache.spark.sql.test.SharedSparkSessionBase.afterEach(SharedSparkSession.scala:164)
29985[info]   at 
org.apache.spark.sql.test.SharedSparkSessionBase.afterEach$(SharedSparkSession.scala:158)
29986[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.afterEach(ParquetFileFormatSuite.scala:31)
29987[info]   at 
org.scalatest.BeforeAndAfterEach.$anonfun$runTest$1(BeforeAndAfterEach.scala:247)
29988[info]   at 
...
30025[info]   Cause: java.lang.Throwable:
30026[info]   at 
org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35)
30027[info]   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:75)
30028[info]   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:997)
30029[info]   at 
org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
30030[info]   at 
org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:796)
30031[info]   at 
org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
30032[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
30033[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:76)
30034[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:450)
 {code}


> Leaked file streams in ParquetFileFormat
> 
>
> Key: SPARK-45944
> URL: https://issues.apache.org/jira/browse/SPARK-45944
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - [https://github.com/apache/spark/actions/runs/6859020738/job/18650698085]
>  - [https://github.com/apache/spark/actions/runs/6872747886/job/18691717269]
> {code:java}
> Cause: java.lang.IllegalStateException: There are 1 possibly leaked file 
> streams.
> 29975[info]   at 
> org.apache.spark.DebugFilesystem$.assertNoOp

[jira] [Updated] (SPARK-45944) Leaked file streams in ParquetFileFormat

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45944:
--
Summary: Leaked file streams in ParquetFileFormat  (was: Leaked file 
streams in ParquetFileFormatV1Suite)

> Leaked file streams in ParquetFileFormat
> 
>
> Key: SPARK-45944
> URL: https://issues.apache.org/jira/browse/SPARK-45944
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - https://github.com/apache/spark/actions/runs/6872747886/job/18691717269
> {code:java}
> Cause: java.lang.IllegalStateException: There are 1 possibly leaked file 
> streams.
> 29975[info]   at 
> org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:54)
> 29976[info]   at 
> org.apache.spark.sql.test.SharedSparkSessionBase.$anonfun$afterEach$1(SharedSparkSession.scala:165)
> ...
> 29977[info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.eventually(ParquetFileFormatSuite.scala:31)
> 29984[info]   at 
> org.apache.spark.sql.test.SharedSparkSessionBase.afterEach(SharedSparkSession.scala:164)
> 29985[info]   at 
> org.apache.spark.sql.test.SharedSparkSessionBase.afterEach$(SharedSparkSession.scala:158)
> 29986[info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.afterEach(ParquetFileFormatSuite.scala:31)
> 29987[info]   at 
> org.scalatest.BeforeAndAfterEach.$anonfun$runTest$1(BeforeAndAfterEach.scala:247)
> 29988[info]   at 
> ...
> 30025[info]   Cause: java.lang.Throwable:
> 30026[info]   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35)
> 30027[info]   at 
> org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:75)
> 30028[info]   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:997)
> 30029[info]   at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
> 30030[info]   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:796)
> 30031[info]   at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
> 30032[info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
> 30033[info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:76)
> 30034[info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:450)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45944) Leaked file streams in ParquetFileFormatV1Suite

2023-11-15 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-45944:
-

 Summary: Leaked file streams in ParquetFileFormatV1Suite
 Key: SPARK-45944
 URL: https://issues.apache.org/jira/browse/SPARK-45944
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun


- https://github.com/apache/spark/actions/runs/6872747886/job/18691717269
{code:java}
Cause: java.lang.IllegalStateException: There are 1 possibly leaked file 
streams.
29975[info]   at 
org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:54)
29976[info]   at 
org.apache.spark.sql.test.SharedSparkSessionBase.$anonfun$afterEach$1(SharedSparkSession.scala:165)
...
29977[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.eventually(ParquetFileFormatSuite.scala:31)
29984[info]   at 
org.apache.spark.sql.test.SharedSparkSessionBase.afterEach(SharedSparkSession.scala:164)
29985[info]   at 
org.apache.spark.sql.test.SharedSparkSessionBase.afterEach$(SharedSparkSession.scala:158)
29986[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatSuite.afterEach(ParquetFileFormatSuite.scala:31)
29987[info]   at 
org.scalatest.BeforeAndAfterEach.$anonfun$runTest$1(BeforeAndAfterEach.scala:247)
29988[info]   at 
...
30025[info]   Cause: java.lang.Throwable:
30026[info]   at 
org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35)
30027[info]   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:75)
30028[info]   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:997)
30029[info]   at 
org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
30030[info]   at 
org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:796)
30031[info]   at 
org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
30032[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
30033[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:76)
30034[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:450)
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45592:
--
Fix Version/s: 3.4.2

> AQE and InMemoryTableScanExec correctness bug
> -
>
> Key: SPARK-45592
> URL: https://issues.apache.org/jira/browse/SPARK-45592
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Emil Ejbyfeldt
>Assignee: Emil Ejbyfeldt
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
>
> The following query should return 100
> {code:java}
> import org.apache.spark.storage.StorageLevel
> val df = spark.range(0, 100, 1, 5).map(l => (l, l))
> val ee = df.select($"_1".as("src"), $"_2".as("dst"))
>   .persist(StorageLevel.MEMORY_AND_DISK)
> ee.count()
> val minNbrs1 = ee
>   .groupBy("src").agg(min(col("dst")).as("min_number"))
>   .persist(StorageLevel.MEMORY_AND_DISK)
> val join = ee.join(minNbrs1, "src")
> join.count(){code}
> but on Spark 3.5.0 there is a correctness bug causing it to return `104800` 
> or some other incorrect value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45934) Fix `Spark Standalone` documentation table layout

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45934:
--
Fix Version/s: 3.5.1

> Fix `Spark Standalone` documentation table layout
> -
>
> Key: SPARK-45934
> URL: https://issues.apache.org/jira/browse/SPARK-45934
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44488) Support deserializing long fields into `Metadata` object

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44488:
---
Labels: pull-request-available  (was: )

> Support deserializing long fields into `Metadata` object
> 
>
> Key: SPARK-44488
> URL: https://issues.apache.org/jira/browse/SPARK-44488
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Richard Chen
>Assignee: Richard Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45943) DataSourceV2Relation.computeStats throws IllegalStateException in test mode

2023-11-15 Thread Asif (Jira)
Asif created SPARK-45943:


 Summary: DataSourceV2Relation.computeStats throws 
IllegalStateException in test mode
 Key: SPARK-45943
 URL: https://issues.apache.org/jira/browse/SPARK-45943
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.1
Reporter: Asif


This issue surfaces when the new unit test of PR 
[SPARK-45866|https://github.com/apache/spark/pull/43824] is added.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45934) Fix `Spark Standalone` documentation table layout

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45934.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43814
[https://github.com/apache/spark/pull/43814]

> Fix `Spark Standalone` documentation table layout
> -
>
> Key: SPARK-45934
> URL: https://issues.apache.org/jira/browse/SPARK-45934
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45719) Upgrade AWS SDK to v2 for Kubernetes integration tests

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45719:
--
Component/s: Tests
 (was: Spark Core)

> Upgrade AWS SDK to v2 for Kubernetes integration tests
> --
>
> Key: SPARK-45719
> URL: https://issues.apache.org/jira/browse/SPARK-45719
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 4.0.0
>Reporter: Lantao Jin
>Assignee: junyuc25
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Sub-task of [SPARK-44124|https://issues.apache.org/jira/browse/SPARK-44124]. 
> In this issue, we will upgrade AWS SDK in Credentials providers, AWS clients 
> and related Kubernetes integration tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45719) Upgrade AWS SDK to v2 for Kubernetes integration tests

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45719:
--
Target Version/s:   (was: 4.0.0)

> Upgrade AWS SDK to v2 for Kubernetes integration tests
> --
>
> Key: SPARK-45719
> URL: https://issues.apache.org/jira/browse/SPARK-45719
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 4.0.0
>Reporter: Lantao Jin
>Assignee: junyuc25
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Sub-task of [SPARK-44124|https://issues.apache.org/jira/browse/SPARK-44124]. 
> In this issue, we will upgrade AWS SDK in Credentials providers, AWS clients 
> and related Kubernetes integration tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45810) Create API to stop consuming rows from the input table

2023-11-15 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-45810.
---
  Assignee: Daniel
Resolution: Fixed

Issue resolved by pull request 43682
https://github.com/apache/spark/pull/43682

> Create API to stop consuming rows from the input table
> --
>
> Key: SPARK-45810
> URL: https://issues.apache.org/jira/browse/SPARK-45810
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45868) Make spark.table use the same parser with vanilla spark

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45868:
-

Assignee: Ruifeng Zheng

> Make spark.table use the same parser with vanilla spark
> ---
>
> Key: SPARK-45868
> URL: https://issues.apache.org/jira/browse/SPARK-45868
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45868) Make spark.table use the same parser with vanilla spark

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45868.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43741
[https://github.com/apache/spark/pull/43741]

> Make spark.table use the same parser with vanilla spark
> ---
>
> Key: SPARK-45868
> URL: https://issues.apache.org/jira/browse/SPARK-45868
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-45924) Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not equivalent with SubqueryBroadcastExec

2023-11-15 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif closed SPARK-45924.


this is not a bug

> Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not 
> equivalent with SubqueryBroadcastExec
> 
>
> Key: SPARK-45924
> URL: https://issues.apache.org/jira/browse/SPARK-45924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Priority: Major
>  Labels: pull-request-available
>
> While writing a bug test for 
> [SPARK-45866|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-45866],
> I found that canonicalization of SubqueryAdaptiveBroadcastExec is broken: its 
> buildPlan (a LogicalPlan) is not canonicalized, which causes batch scans to 
> differ when the reuse-of-exchange check happens in AQE.
> Moreover, SubqueryAdaptiveBroadcastExec and SubqueryBroadcastExec are not 
> treated as equivalent, which further breaks reuse of exchange in AQE.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-45925) SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec causing re-use of exchange not happening in AQE

2023-11-15 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif closed SPARK-45925.


this is not an issue

> SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec 
> causing re-use of exchange not happening in AQE
> --
>
> Key: SPARK-45925
> URL: https://issues.apache.org/jira/browse/SPARK-45925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Priority: Major
>  Labels: pull-request-available
>
> A created stage may contain a SubqueryAdaptiveBroadcastExec while the incoming 
> exchange may contain a SubqueryBroadcastExec; although they are equivalent, 
> the match does not happen because equals/hashCode do not match, resulting in 
> the exchange not being reused.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45924) Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not equivalent with SubqueryBroadcastExec

2023-11-15 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif resolved SPARK-45924.
--
Resolution: Not A Bug

> Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not 
> equivalent with SubqueryBroadcastExec
> 
>
> Key: SPARK-45924
> URL: https://issues.apache.org/jira/browse/SPARK-45924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Priority: Major
>  Labels: pull-request-available
>
> While writing a bug test for 
> [SPARK-45866|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-45866],
> I found that canonicalization of SubqueryAdaptiveBroadcastExec is broken: its 
> buildPlan (a LogicalPlan) is not canonicalized, which causes batch scans to 
> differ when the reuse-of-exchange check happens in AQE.
> Moreover, SubqueryAdaptiveBroadcastExec and SubqueryBroadcastExec are not 
> treated as equivalent, which further breaks reuse of exchange in AQE.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45925) SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec causing re-use of exchange not happening in AQE

2023-11-15 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif resolved SPARK-45925.
--
Resolution: Not A Problem

> SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec 
> causing re-use of exchange not happening in AQE
> --
>
> Key: SPARK-45925
> URL: https://issues.apache.org/jira/browse/SPARK-45925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Priority: Major
>  Labels: pull-request-available
>
> A created stage may contain SubqueryAdaptiveBroadcastExec while the incoming 
> exchange may contain SubqueryBroadcastExec; although they are equivalent, the 
> match does not happen because equals/hashCode differ, resulting in the 
> exchange not being reused.






[jira] [Updated] (SPARK-45942) Only do the thread interruption check for putIterator on executors

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45942:
---
Labels: pull-request-available  (was: )

> Only do the thread interruption check for putIterator on executors
> --
>
> Key: SPARK-45942
> URL: https://issues.apache.org/jira/browse/SPARK-45942
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Huanli Wang
>Priority: Major
>  Labels: pull-request-available
>
> https://issues.apache.org/jira/browse/SPARK-45025 
> introduces graceful thread-interruption handling. However, there is an edge 
> case: when a streaming query is stopped on the driver, it interrupts the 
> stream execution thread. If the streaming query is doing memory store 
> operations on the driver and performs {{doPutIterator}} at the same time, the 
> [unroll process will be 
> broken|https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L224]
>  and will [return the used 
> memory|https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L245-L247].
> This can result in a {{ClosedChannelException}}, as the call falls into this 
> [case 
> clause|https://github.com/apache/spark/blob/aa646d3050028272f7333deaef52f20e6975e0ed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1614-L1622],
>  which opens an I/O channel and persists the data to disk. However, because 
> the thread is interrupted, the channel is closed at the beginning of the 
> operation: 
> [https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/nio/channels/spi/AbstractInterruptibleChannel.java#L172]
>  and {{ClosedChannelException}} is thrown.
> On executors, [the task is killed if the thread is 
> interrupted|https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L374];
>  however, we don't do this on the driver.






[jira] [Created] (SPARK-45942) Only do the thread interruption check for putIterator on executors

2023-11-15 Thread Huanli Wang (Jira)
Huanli Wang created SPARK-45942:
---

 Summary: Only do the thread interruption check for putIterator on 
executors
 Key: SPARK-45942
 URL: https://issues.apache.org/jira/browse/SPARK-45942
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Huanli Wang


https://issues.apache.org/jira/browse/SPARK-45025 

introduces graceful thread-interruption handling. However, there is an edge 
case: when a streaming query is stopped on the driver, it interrupts the stream 
execution thread. If the streaming query is doing memory store operations on 
the driver and performs {{doPutIterator}} at the same time, the [unroll process 
will be 
broken|https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L224]
 and will [return the used 
memory|https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L245-L247].

This can result in a {{ClosedChannelException}}, as the call falls into this 
[case 
clause|https://github.com/apache/spark/blob/aa646d3050028272f7333deaef52f20e6975e0ed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1614-L1622],
 which opens an I/O channel and persists the data to disk. However, because the 
thread is interrupted, the channel is closed at the beginning of the operation: 
[https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/nio/channels/spi/AbstractInterruptibleChannel.java#L172]
 and {{ClosedChannelException}} is thrown.

On executors, [the task is killed if the thread is 
interrupted|https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L374];
 however, we don't do this on the driver.
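The channel-closing behavior described above can be reproduced in isolation: a thread whose interrupt flag is already set performs a write on an interruptible NIO FileChannel; AbstractInterruptibleChannel.begin() sees the flag, closes the channel, and the write surfaces ClosedByInterruptException — the same path the unroll-to-disk fallback hits when the driver interrupts the stream thread. A minimal sketch (the class name and temp-file path are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class InterruptedWriteDemo {
    static String attemptWrite() throws Exception {
        Path path = Files.createTempFile("interrupt-demo", ".bin");
        final String[] outcome = {"write succeeded"};
        Thread t = new Thread(() -> {
            // Stand-in for the driver interrupting the stream execution thread.
            Thread.currentThread().interrupt();
            try (FileChannel ch = FileChannel.open(path, StandardOpenOption.WRITE)) {
                // begin() observes the pending interrupt and closes the channel,
                // so the write fails before any bytes are persisted.
                ch.write(ByteBuffer.wrap(new byte[]{1, 2, 3}));
            } catch (ClosedByInterruptException e) {
                outcome[0] = "closed by interrupt";
            } catch (Exception e) {
                outcome[0] = e.getClass().getSimpleName();
            }
        });
        t.start();
        t.join();
        return outcome[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(attemptWrite()); // prints "closed by interrupt"
    }
}
```

This is why the proposed fix restricts the interruption check: on the driver the interrupt is a stop signal, not a task kill, so letting it abort the disk fallback mid-write is undesirable.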






[jira] [Assigned] (SPARK-45941) Update pandas to 2.1.3

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45941:
-

Assignee: Bjørn Jørgensen

> Update pandas to 2.1.3
> --
>
> Key: SPARK-45941
> URL: https://issues.apache.org/jira/browse/SPARK-45941
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>  Labels: pull-request-available
>
> https://pandas.pydata.org/docs/whatsnew/v2.1.3.html






[jira] [Resolved] (SPARK-45941) Update pandas to 2.1.3

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45941.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43822
[https://github.com/apache/spark/pull/43822]

> Update pandas to 2.1.3
> --
>
> Key: SPARK-45941
> URL: https://issues.apache.org/jira/browse/SPARK-45941
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> https://pandas.pydata.org/docs/whatsnew/v2.1.3.html






[jira] [Updated] (SPARK-45941) Update pandas to 2.1.3

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45941:
---
Labels: pull-request-available  (was: )

> Update pandas to 2.1.3
> --
>
> Key: SPARK-45941
> URL: https://issues.apache.org/jira/browse/SPARK-45941
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>  Labels: pull-request-available
>
> https://pandas.pydata.org/docs/whatsnew/v2.1.3.html






[jira] [Created] (SPARK-45941) Update pandas to 2.1.3

2023-11-15 Thread Jira
Bjørn Jørgensen created SPARK-45941:
---

 Summary: Update pandas to 2.1.3
 Key: SPARK-45941
 URL: https://issues.apache.org/jira/browse/SPARK-45941
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Pandas API on Spark
Affects Versions: 4.0.0
Reporter: Bjørn Jørgensen


https://pandas.pydata.org/docs/whatsnew/v2.1.3.html






[jira] [Created] (SPARK-45940) Add InputPartition to DataSourceReader interface

2023-11-15 Thread Allison Wang (Jira)
Allison Wang created SPARK-45940:


 Summary: Add InputPartition to DataSourceReader interface
 Key: SPARK-45940
 URL: https://issues.apache.org/jira/browse/SPARK-45940
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add InputPartition class and make the partitions method return a list of input 
partitions.






[jira] [Commented] (SPARK-45861) Add user guide for dataframe creation

2023-11-15 Thread Allison Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786487#comment-17786487
 ] 

Allison Wang commented on SPARK-45861:
--

[~panbingkun] again, thanks for working on this. Let me give you more details.

When people search on Google for, say, "spark create dataframe", you can see 
there are many results, one of them being the PySpark documentation for 
createDataFrame.

But there are many other ways to create a dataframe, for example from various 
data sources (CSV, JDBC, Parquet, etc.), from a pandas DataFrame, from 
`spark.sql`, etc.

We want to create a new documentation page under `{*}User Guides{*}` that 
explains all the ways people can create a Spark DataFrame. It differs from the 
quickstart in that the user guide will provide more comprehensive examples.

Feel free to look at the results when you search "spark create dataframe" or 
even "create dataframe" for more inspiration.

cc [~afolting] [~smilegator]

> Add user guide for dataframe creation
> -
>
> Key: SPARK-45861
> URL: https://issues.apache.org/jira/browse/SPARK-45861
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Add a simple user guide for data frame creation.
> This user guide should cover the following APIs:
>  # df.createDataFrame
>  # spark.read.format(...) (can be csv, json, parquet






[jira] [Comment Edited] (SPARK-45390) Remove `distutils` usage

2023-11-15 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786455#comment-17786455
 ] 

Dongjoon Hyun edited comment on SPARK-45390 at 11/15/23 6:00 PM:
-

Your concern is legitimate; we had the same concern with Apache Spark 2.4.x. :) 

For the following, let me ask you this way: do you think Python 3.12 supports 
all of Spark 3.5's minimum Python package requirements? Have you tested your 
branch-3.5 with Python 3.12 + SPARK-45390?
bq. I'm not sure what you mean here.


was (Author: dongjoon):
Your concern is legit like we had the same concern at Apache Spark 2.4.x. :) 

For the following, let me ask you in this way. Do you think Python 3.12 support 
all minimum Python package requirements of Spark 3.5? Have you test your 
branch-3.5 with Python 3.12 + SPARK-45390 ?
> I'm not sure what you mean here.

> Remove `distutils` usage
> 
>
> Key: SPARK-45390
> URL: https://issues.apache.org/jira/browse/SPARK-45390
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [PEP-632|https://peps.python.org/pep-0632] deprecated {{distutils}} module in 
> Python {{3.10}} and dropped in Python {{3.12}} in favor of {{packaging}} 
> package.






[jira] [Commented] (SPARK-45390) Remove `distutils` usage

2023-11-15 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786455#comment-17786455
 ] 

Dongjoon Hyun commented on SPARK-45390:
---

Your concern is legitimate; we had the same concern with Apache Spark 2.4.x. :) 

For the following, let me ask you this way: do you think Python 3.12 supports 
all of Spark 3.5's minimum Python package requirements? Have you tested your 
branch-3.5 with Python 3.12 + SPARK-45390?
> I'm not sure what you mean here.

> Remove `distutils` usage
> 
>
> Key: SPARK-45390
> URL: https://issues.apache.org/jira/browse/SPARK-45390
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [PEP-632|https://peps.python.org/pep-0632] deprecated {{distutils}} module in 
> Python {{3.10}} and dropped in Python {{3.12}} in favor of {{packaging}} 
> package.






[jira] [Updated] (SPARK-45208) Website doesn't have horizontal scrollbar

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45208:
--
Description: 
This was reported in the dev mailing list.

- https://lists.apache.org/thread/cfhqgltx0f4flrtb1p5c40zoopdy5yt9

I find a recent issue with the official Spark documentation on the website. 
Specifically, the Kubernetes configuration lists on the right-hand side are not 
visible and doc doesn't have a horizontal scrollbar.

 
- [https://spark.apache.org/docs/3.5.0/running-on-kubernetes.html#configuration]
- [https://spark.apache.org/docs/3.4.1/running-on-kubernetes.html#configuration]

Wide tables are broken in the same way.

  was:
I find a recent issue with the official Spark documentation on the website. 
Specifically, the Kubernetes configuration lists on the right-hand side are not 
visible and doc doesn't have a horizontal scrollbar.

 
- [https://spark.apache.org/docs/3.5.0/running-on-kubernetes.html#configuration]
- [https://spark.apache.org/docs/3.4.1/running-on-kubernetes.html#configuration]

Wide tables are broken in the same way.

- https://spark.apache.org/docs/latest/spark-standalone.html


> Website doesn't have horizontal scrollbar
> -
>
> Key: SPARK-45208
> URL: https://issues.apache.org/jira/browse/SPARK-45208
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Qian Sun
>Priority: Major
>
> This was reported in the dev mailing list.
> - https://lists.apache.org/thread/cfhqgltx0f4flrtb1p5c40zoopdy5yt9
> I find a recent issue with the official Spark documentation on the website. 
> Specifically, the Kubernetes configuration lists on the right-hand side are 
> not visible and doc doesn't have a horizontal scrollbar.
>  
> - 
> [https://spark.apache.org/docs/3.5.0/running-on-kubernetes.html#configuration]
> - 
> [https://spark.apache.org/docs/3.4.1/running-on-kubernetes.html#configuration]
> Wide tables are broken in the same way.






[jira] [Updated] (SPARK-45934) Fix `Spark Standalone` documentation table layout

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45934:
--
Parent: SPARK-45869
Issue Type: Sub-task  (was: Bug)

> Fix `Spark Standalone` documentation table layout
> -
>
> Key: SPARK-45934
> URL: https://issues.apache.org/jira/browse/SPARK-45934
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-45934) Fix `Spark Standalone` documentation table layout

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45934:
-

Assignee: Dongjoon Hyun

> Fix `Spark Standalone` documentation table layout
> -
>
> Key: SPARK-45934
> URL: https://issues.apache.org/jira/browse/SPARK-45934
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-45934) Fix `Spark Standalone` documentation table layout

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45934:
--
Summary: Fix `Spark Standalone` documentation table layout  (was: Fix 
`spark-standalone.md` table layout)

> Fix `Spark Standalone` documentation table layout
> -
>
> Key: SPARK-45934
> URL: https://issues.apache.org/jira/browse/SPARK-45934
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-45934) Fix `Spark Standalone` documentation table layout

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45934:
--
Affects Version/s: (was: 4.0.0)

> Fix `Spark Standalone` documentation table layout
> -
>
> Key: SPARK-45934
> URL: https://issues.apache.org/jira/browse/SPARK-45934
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-45934) Fix `spark-standalone.md` table layout

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45934:
--
Summary: Fix `spark-standalone.md` table layout  (was: Fix 
`spark-standalone.md` and `running-on-kubernetes.md` table layout)

> Fix `spark-standalone.md` table layout
> --
>
> Key: SPARK-45934
> URL: https://issues.apache.org/jira/browse/SPARK-45934
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Commented] (SPARK-43393) Sequence expression can overflow

2023-11-15 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786438#comment-17786438
 ] 

Dongjoon Hyun commented on SPARK-43393:
---

Due to the compilation failure, this is reverted from branch-3.5 via 
https://github.com/apache/spark/commit/e38310c74e6cae8c8c8489ffcbceb80ed37a7cae 
.

> Sequence expression can overflow
> 
>
> Key: SPARK-43393
> URL: https://issues.apache.org/jira/browse/SPARK-43393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Deepayan Patra
>Assignee: Deepayan Patra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Spark has a (long-standing) overflow bug in the {{sequence}} expression.
>  
> Consider the following operations:
> {{spark.sql("CREATE TABLE foo (l LONG);")}}
> {{spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});")}}
> {{spark.sql("SELECT sequence(0, l) FROM foo;").collect()}}
>  
> The result of these operations will be:
> {{Array[org.apache.spark.sql.Row] = Array([WrappedArray()])}}
> an unintended consequence of overflow.
>  
> The sequence is applied to values {{0}} and {{Long.MaxValue}} with a step 
> size of {{1}} which uses a length computation defined 
> [here|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451].
>  In this calculation, with {{{}start = 0{}}}, {{{}stop = Long.MaxValue{}}}, 
> and {{{}step = 1{}}}, the calculated {{len}} overflows to 
> {{{}Long.MinValue{}}}. The computation, in binary looks like:
> 0111...1 (Long.MaxValue) -
> 0000...0 (start = 0)
> --------
> 0111...1 /
> 0000...1 (step = 1)
> --------
> 0111...1 +
> 0000...1
> --------
> 1000...0 (Long.MinValue)
> The following 
> [check|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454]
>  passes as the negative {{Long.MinValue}} is still {{{}<= 
> MAX_ROUNDED_ARRAY_LENGTH{}}}. The following cast to {{toInt}} uses this 
> representation and [truncates the upper 
> bits|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457]
>  resulting in an empty length of 0.
> Other overflows are similarly problematic.
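The computation quoted above can be reproduced outside Spark with plain 64-bit arithmetic. In the sketch below, MAX_ROUNDED_ARRAY_LENGTH mirroring ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH as Integer.MAX_VALUE - 15 is an assumption for illustration; the overflow and truncation themselves are exact JVM semantics.

```java
public class SequenceOverflowDemo {
    // Assumed value of Spark's ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH.
    static final long MAX_ROUNDED_ARRAY_LENGTH = Integer.MAX_VALUE - 15;

    public static void main(String[] args) {
        long start = 0L, stop = Long.MAX_VALUE, step = 1L;

        // len = (stop - start) / step + 1: Long.MAX_VALUE + 1 wraps around.
        long len = (stop - start) / step + 1;
        System.out.println(len == Long.MIN_VALUE);           // true: overflow to Long.MinValue

        // The bound check passes because the overflowed value is negative.
        System.out.println(len <= MAX_ROUNDED_ARRAY_LENGTH); // true

        // The int cast keeps only the low 32 bits, yielding an empty length.
        System.out.println((int) len);                       // 0
    }
}
```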






[jira] [Updated] (SPARK-43393) Sequence expression can overflow

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43393:
--
Fix Version/s: (was: 3.5.1)

> Sequence expression can overflow
> 
>
> Key: SPARK-43393
> URL: https://issues.apache.org/jira/browse/SPARK-43393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Deepayan Patra
>Assignee: Deepayan Patra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Spark has a (long-standing) overflow bug in the {{sequence}} expression.
>  
> Consider the following operations:
> {{spark.sql("CREATE TABLE foo (l LONG);")}}
> {{spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});")}}
> {{spark.sql("SELECT sequence(0, l) FROM foo;").collect()}}
>  
> The result of these operations will be:
> {{Array[org.apache.spark.sql.Row] = Array([WrappedArray()])}}
> an unintended consequence of overflow.
>  
> The sequence is applied to values {{0}} and {{Long.MaxValue}} with a step 
> size of {{1}} which uses a length computation defined 
> [here|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451].
>  In this calculation, with {{{}start = 0{}}}, {{{}stop = Long.MaxValue{}}}, 
> and {{{}step = 1{}}}, the calculated {{len}} overflows to 
> {{{}Long.MinValue{}}}. The computation, in binary looks like:
> 0111...1 (Long.MaxValue) -
> 0000...0 (start = 0)
> --------
> 0111...1 /
> 0000...1 (step = 1)
> --------
> 0111...1 +
> 0000...1
> --------
> 1000...0 (Long.MinValue)
> The following 
> [check|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454]
>  passes as the negative {{Long.MinValue}} is still {{{}<= 
> MAX_ROUNDED_ARRAY_LENGTH{}}}. The following cast to {{toInt}} uses this 
> representation and [truncates the upper 
> bits|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457]
>  resulting in an empty length of 0.
> Other overflows are similarly problematic.






[jira] [Updated] (SPARK-45934) Fix `spark-standalone.md` and `running-on-kubernetes.md` table layout

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45934:
--
Affects Version/s: 3.5.0

> Fix `spark-standalone.md` and `running-on-kubernetes.md` table layout
> -
>
> Key: SPARK-45934
> URL: https://issues.apache.org/jira/browse/SPARK-45934
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-45938) Add `utils` to the dependency list of the `core` module in `module.py`

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45938:
---
Labels: pull-request-available  (was: )

> Add `utils` to the dependency list of the `core` module in `module.py`
> --
>
> Key: SPARK-45938
> URL: https://issues.apache.org/jira/browse/SPARK-45938
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45938) Add `utils` to the dependency list of the `core` module in `module.py`

2023-11-15 Thread Yang Jie (Jira)
Yang Jie created SPARK-45938:


 Summary: Add `utils` to the dependency list of the `core` module 
in `module.py`
 Key: SPARK-45938
 URL: https://issues.apache.org/jira/browse/SPARK-45938
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Yang Jie









[jira] [Commented] (SPARK-45390) Remove `distutils` usage

2023-11-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786401#comment-17786401
 ] 

Nicholas Chammas commented on SPARK-45390:
--

{quote}We don't promise to support all future unreleased Python versions
{quote}
"all future unreleased versions" is a tall ask that no one is making. :) 

The relevant circumstances here are that a) Python 3.12 is already out and the 
backwards-incompatible changes are known and [very 
limited|https://docs.python.org/3/whatsnew/3.12.html], and b) Spark 4.0 may be 
a disruptive change and so many people may remain on Spark 3.5 for longer than 
usual.

If we expect 3.5 -> 4.0 to be an easy migration, then backporting a fix like 
this to 3.5 is not as important.
{quote}we need much more validation because all Python package ecosystem should 
work there without any issues
{quote}
I'm not sure what you mean here.

Anyway, I suppose we could just wait and see. Maybe I'm wrong, but I suspect 
many users will find it surprising that Spark 3.5 doesn't work on Python 3.12, 
especially if this is the only (or close to the only) fix required.

> Remove `distutils` usage
> 
>
> Key: SPARK-45390
> URL: https://issues.apache.org/jira/browse/SPARK-45390
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [PEP-632|https://peps.python.org/pep-0632] deprecated {{distutils}} module in 
> Python {{3.10}} and dropped in Python {{3.12}} in favor of {{packaging}} 
> package.






[jira] [Commented] (SPARK-45937) Fix documentation of spark.executor.maxNumFailures

2023-11-15 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786395#comment-17786395
 ] 

Thomas Graves commented on SPARK-45937:
---

 

@Cheng Pan  Could you fix this as followup?

> Fix documentation of spark.executor.maxNumFailures
> --
>
> Key: SPARK-45937
> URL: https://issues.apache.org/jira/browse/SPARK-45937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Thomas Graves
>Priority: Critical
>
> https://issues.apache.org/jira/browse/SPARK-41210 added support for 
> spark.executor.maxNumFailures on Kubernetes; it made this config generic and 
> deprecated the YARN version. Neither this config nor its defaults are 
> documented.
>  
> [https://github.com/apache/spark/commit/40872e9a094f8459b0b6f626937ced48a8d98efb]
> It also added spark.executor.failuresValidityInterval.
>  
> Both need default values documented for YARN and K8s, and the YARN 
> documentation for the equivalent spark.yarn.max.executor.failures config 
> needs to be removed.






[jira] [Created] (SPARK-45937) Fix documentation of spark.executor.maxNumFailures

2023-11-15 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-45937:
-

 Summary: Fix documentation of spark.executor.maxNumFailures
 Key: SPARK-45937
 URL: https://issues.apache.org/jira/browse/SPARK-45937
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Thomas Graves


https://issues.apache.org/jira/browse/SPARK-41210 added support for 
spark.executor.maxNumFailures on Kubernetes; it made this config generic and 
deprecated the YARN version. Neither this config nor its default value is 
documented.

[https://github.com/apache/spark/commit/40872e9a094f8459b0b6f626937ced48a8d98efb]

It also added {color:#0a3069}spark.executor.failuresValidityInterval.{color}

{color:#0a3069}Both need to have default values specified for YARN and K8s, and 
the YARN documentation for the equivalent spark.yarn.max.executor.failures 
config needs to be removed.{color}






[jira] [Resolved] (SPARK-45905) least common type between decimal types should retain integral digits first

2023-11-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45905.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43781
[https://github.com/apache/spark/pull/43781]

> least common type between decimal types should retain integral digits first
> ---
>
> Key: SPARK-45905
> URL: https://issues.apache.org/jira/browse/SPARK-45905
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Commented] (SPARK-45390) Remove `distutils` usage

2023-11-15 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786312#comment-17786312
 ] 

Dongjoon Hyun commented on SPARK-45390:
---

Apache Spark 3.5 was released on 2023-09-13, before Python 3.12 was released 
on 2023-10-02. We don't promise to support future, unreleased Python versions, 
[~nchammas].

As you pointed out, this is designed to be an improvement for Apache Spark 
4.0.0 only.

BTW, in order to support Python 3.12 in Apache Spark 3.5.x, we would need much 
more validation, because the entire Python package ecosystem would have to 
work there without any issues.

> Remove `distutils` usage
> 
>
> Key: SPARK-45390
> URL: https://issues.apache.org/jira/browse/SPARK-45390
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [PEP-632|https://peps.python.org/pep-0632] deprecated the {{distutils}} 
> module in Python {{3.10}} and dropped it in Python {{3.12}} in favor of the 
> {{packaging}} package.






[jira] [Resolved] (SPARK-45915) Treat decimal(x, 0) the same as IntegralType in PromoteStrings

2023-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45915.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43812
[https://github.com/apache/spark/pull/43812]

> Treat decimal(x, 0) the same as IntegralType in PromoteStrings
> --
>
> Key: SPARK-45915
> URL: https://issues.apache.org/jira/browse/SPARK-45915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-45935) Fix RST files link substitutions error

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45935:
---
Labels: pull-request-available  (was: )

> Fix RST files link substitutions error
> --
>
> Key: SPARK-45935
> URL: https://issues.apache.org/jira/browse/SPARK-45935
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45935) Fix RST files link substitutions error

2023-11-15 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-45935:
---

 Summary: Fix RST files link substitutions error
 Key: SPARK-45935
 URL: https://issues.apache.org/jira/browse/SPARK-45935
 Project: Spark
  Issue Type: Bug
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: BingKun Pan









[jira] [Updated] (SPARK-45934) Fix `spark-standalone.md` and `running-on-kubernetes.md` table layout

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45934:
---
Labels: pull-request-available  (was: )

> Fix `spark-standalone.md` and `running-on-kubernetes.md` table layout
> -
>
> Key: SPARK-45934
> URL: https://issues.apache.org/jira/browse/SPARK-45934
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45934) Fix `spark-standalone.md` and `running-on-kubernetes.md` table layout

2023-11-15 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-45934:
-

 Summary: Fix `spark-standalone.md` and `running-on-kubernetes.md` 
table layout
 Key: SPARK-45934
 URL: https://issues.apache.org/jira/browse/SPARK-45934
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Comment Edited] (SPARK-42471) Distributed ML <> spark connect

2023-11-15 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786292#comment-17786292
 ] 

Faiz Halde edited comment on SPARK-42471 at 11/15/23 10:22 AM:
---

Hello, our use case requires us to use spark-connect, and some of our jobs 
used Spark ML (Scala). Is this umbrella tracking the work required to make 
Spark ML compatible with spark-connect? So far we've been struggling with 
this. May I know if this is already done, and whether there are docs on how to 
make it work?

 

Thanks!


was (Author: JIRAUSER300204):
Hello, our use-case requires us to use spark-connect and we have some of our 
jobs that used spark ML. Is this Umbrella tracking the work required to make 
spark ml compatible with spark-connect? Because so far we've been struggling 
with this. May I know if this is already done and if there are docs on how to 
make this work?

 

Thanks!

> Distributed ML <> spark connect
> ---
>
> Key: SPARK-42471
> URL: https://issues.apache.org/jira/browse/SPARK-42471
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, ML
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Weichen Xu
>Priority: Major
>







[jira] [Created] (SPARK-45933) Runtime filter should infer more on the application side.

2023-11-15 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45933:
--

 Summary: Runtime filter should infer more on the application side.
 Key: SPARK-45933
 URL: https://issues.apache.org/jira/browse/SPARK-45933
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng









[jira] [Assigned] (SPARK-45924) Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not equivalent with SubqueryBroadcastExec

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45924:
--

Assignee: (was: Apache Spark)

> Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not 
> equivalent with SubqueryBroadcastExec
> 
>
> Key: SPARK-45924
> URL: https://issues.apache.org/jira/browse/SPARK-45924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Priority: Major
>  Labels: pull-request-available
>
> While writing a bug test for 
> [SPARK-45866|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-45866],
>  I found that canonicalization of SubqueryAdaptiveBroadcastExec is broken: 
> its buildPlan: LogicalPlan is not canonicalized, which causes BatchScans to 
> differ when the reuse-of-exchange check happens in AQE.
> Moreover, SubqueryAdaptiveBroadcastExec and SubqueryBroadcastExec are not 
> treated as equivalent, which also breaks reuse of exchange in AQE.






[jira] [Assigned] (SPARK-45924) Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not equivalent with SubqueryBroadcastExec

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45924:
--

Assignee: Apache Spark

> Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not 
> equivalent with SubqueryBroadcastExec
> 
>
> Key: SPARK-45924
> URL: https://issues.apache.org/jira/browse/SPARK-45924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> While writing a bug test for 
> [SPARK-45866|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-45866],
>  I found that canonicalization of SubqueryAdaptiveBroadcastExec is broken: 
> its buildPlan: LogicalPlan is not canonicalized, which causes BatchScans to 
> differ when the reuse-of-exchange check happens in AQE.
> Moreover, SubqueryAdaptiveBroadcastExec and SubqueryBroadcastExec are not 
> treated as equivalent, which also breaks reuse of exchange in AQE.






[jira] [Comment Edited] (SPARK-45861) Add user guide for dataframe creation

2023-11-15 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786216#comment-17786216
 ] 

BingKun Pan edited comment on SPARK-45861 at 11/15/23 8:17 AM:
---

Unfortunately, I have found similar documents at 
[https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html]:
!screenshot-1.png|width=751,height=541! 
Do we need to move them under the `User Guides` menu? It feels a bit 
repetitive; perhaps we should reorganize the menu categories?
!screenshot-2.png|width=671,height=43!


was (Author: panbingkun):
Unfortunately, I have found similar documents at 
[https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html],
 
!screenshot-1.png|width=751,height=541! 
so do we need to move them under menu `User Guides`, or ?
we feel a bit repetitive? Perhaps we should organize the menu categories?
!screenshot-2.png|width=671,height=43!

> Add user guide for dataframe creation
> -
>
> Key: SPARK-45861
> URL: https://issues.apache.org/jira/browse/SPARK-45861
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Add a simple user guide for data frame creation.
> This user guide should cover the following APIs:
>  # df.createDataFrame
>  # spark.read.format(...) (can be csv, json, parquet)






[jira] [Comment Edited] (SPARK-45861) Add user guide for dataframe creation

2023-11-15 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786216#comment-17786216
 ] 

BingKun Pan edited comment on SPARK-45861 at 11/15/23 8:08 AM:
---

Unfortunately, I have found similar documents at 
[https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html],
 
!screenshot-1.png|width=751,height=541! 
so do we need to move them under menu `User Guides`, or ?
we feel a bit repetitive? Perhaps we should organize the menu categories?
!screenshot-2.png|width=671,height=43!


was (Author: panbingkun):
Unfortunately, I have found similar documents at 
https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html,
 
 !screenshot-1.png! 
so do we need to move them under menu ``, or ?
we feel a bit repetitive? Perhaps we should organize the menu categories?
 !screenshot-2.png! 

> Add user guide for dataframe creation
> -
>
> Key: SPARK-45861
> URL: https://issues.apache.org/jira/browse/SPARK-45861
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Add a simple user guide for data frame creation.
> This user guide should cover the following APIs:
>  # df.createDataFrame
>  # spark.read.format(...) (can be csv, json, parquet)






[jira] [Commented] (SPARK-45861) Add user guide for dataframe creation

2023-11-15 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786216#comment-17786216
 ] 

BingKun Pan commented on SPARK-45861:
-

Unfortunately, I have found similar documents at 
https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html,
 
 !screenshot-1.png! 
so do we need to move them under menu ``, or ?
we feel a bit repetitive? Perhaps we should organize the menu categories?
 !screenshot-2.png! 

> Add user guide for dataframe creation
> -
>
> Key: SPARK-45861
> URL: https://issues.apache.org/jira/browse/SPARK-45861
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Add a simple user guide for data frame creation.
> This user guide should cover the following APIs:
>  # df.createDataFrame
>  # spark.read.format(...) (can be csv, json, parquet)






[jira] [Updated] (SPARK-45861) Add user guide for dataframe creation

2023-11-15 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-45861:

Attachment: screenshot-2.png

> Add user guide for dataframe creation
> -
>
> Key: SPARK-45861
> URL: https://issues.apache.org/jira/browse/SPARK-45861
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Add a simple user guide for data frame creation.
> This user guide should cover the following APIs:
>  # df.createDataFrame
>  # spark.read.format(...) (can be csv, json, parquet)






[jira] [Updated] (SPARK-45861) Add user guide for dataframe creation

2023-11-15 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-45861:

Attachment: screenshot-1.png

> Add user guide for dataframe creation
> -
>
> Key: SPARK-45861
> URL: https://issues.apache.org/jira/browse/SPARK-45861
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Add a simple user guide for data frame creation.
> This user guide should cover the following APIs:
>  # df.createDataFrame
>  # spark.read.format(...) (can be csv, json, parquet)


