[jira] [Updated] (SPARK-46522) Block Python data source registration with name conflicts

2023-12-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46522:
---
Labels: pull-request-available  (was: )

> Block Python data source registration with name conflicts
> -
>
> Key: SPARK-46522
> URL: https://issues.apache.org/jira/browse/SPARK-46522
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Users should not be allowed to register Python data sources with names that 
> are the same as built-in or existing Scala/Java data sources.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46522) Block Python data source registration with name conflicts

2023-12-26 Thread Allison Wang (Jira)
Allison Wang created SPARK-46522:


 Summary: Block Python data source registration with name conflicts
 Key: SPARK-46522
 URL: https://issues.apache.org/jira/browse/SPARK-46522
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Users should not be allowed to register Python data sources with names that are 
the same as built-in or existing Scala/Java data sources.
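
A minimal sketch of the kind of registration this should block, assuming the 
Python data source API in pyspark.sql.datasource; the classes below and the 
conflicting name are only an example.

{code:python}
from pyspark.sql.datasource import DataSource, DataSourceReader

class MyJsonSource(DataSource):
    """A user-defined Python data source that reuses a built-in name."""

    @classmethod
    def name(cls):
        return "json"  # conflicts with the built-in JSON data source

    def schema(self):
        return "id INT, value STRING"

    def reader(self, schema):
        return MyJsonReader()

class MyJsonReader(DataSourceReader):
    def read(self, partition):
        yield (1, "a")

# Assuming an active SparkSession `spark`: with this change, registering a
# conflicting name should raise an error instead of silently shadowing the
# built-in source.
spark.dataSource.register(MyJsonSource)
{code}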



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46521) Refine docstring of `array_compact/array_distinct/array_remove`

2023-12-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46521:
---
Labels: pull-request-available  (was: )

> Refine docstring of `array_compact/array_distinct/array_remove`
> ---
>
> Key: SPARK-46521
> URL: https://issues.apache.org/jira/browse/SPARK-46521
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46517) Reorganize `IndexingTest`

2023-12-26 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-46517.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44502
[https://github.com/apache/spark/pull/44502]

> Reorganize `IndexingTest`
> -
>
> Key: SPARK-46517
> URL: https://issues.apache.org/jira/browse/SPARK-46517
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size

2023-12-26 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-46516:
--
Description: 
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
size in bytes for a table that will be broadcasted to all worker nodes when 
performing a join.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

In fact, Spark compares plan.statistics.sizeInBytes for the columns selected in 
the join, not the table size.

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]

A join can select only a few columns, so sizeInBytes will be less than 
autoBroadcastJoinThreshold, but the broadcasted table can be huge and is loaded 
entirely into the driver's memory, which can lead to OOM.

The spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not 
compared to the broadcasted table size.

Related topic on SO: 
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]

  was:
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
size in bytes for a table that will be broadcasted to all worker nodes when 
performing a join.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

In fact Spark compares plan.statistics.sizeInBytes for columns selected in 
join, not a table size.

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]

Join can select only a few columns and sizeInBytes will be lesser than 
autoBroadcastJoinThreshold, but broadcasted table can be huge and leads to OOM 
on driver.

spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not 
compared to  broadcasted table size.

Related topic on SO: 
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]


> autoBroadcastJoinThreshold compared to plan.statistics not a table size
> ---
>
> Key: SPARK-46516
> URL: https://issues.apache.org/jira/browse/SPARK-46516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Guram Savinov
>Priority: Major
>
> From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
> size in bytes for a table that will be broadcasted to all worker nodes when 
> performing a join.
> [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]
> In fact, Spark compares plan.statistics.sizeInBytes for the columns selected 
> in the join, not the table size.
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]
> A join can select only a few columns, so sizeInBytes will be less than 
> autoBroadcastJoinThreshold, but the broadcasted table can be huge and is 
> loaded entirely into the driver's memory, which can lead to OOM.
> The spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is 
> not compared to the broadcasted table size.
> Related topic on SO: 
> [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46521) Refine docstring of `array_compact/array_distinct/array_remove`

2023-12-26 Thread Yang Jie (Jira)
Yang Jie created SPARK-46521:


 Summary: Refine docstring of 
`array_compact/array_distinct/array_remove`
 Key: SPARK-46521
 URL: https://issues.apache.org/jira/browse/SPARK-46521
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46520) Support overwrite mode for Python data source write

2023-12-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46520:
---
Labels: pull-request-available  (was: )

> Support overwrite mode for Python data source write
> ---
>
> Key: SPARK-46520
> URL: https://issues.apache.org/jira/browse/SPARK-46520
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Support the `overwrite` mode for Python data source writes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45917) Statically register Python Data Source

2023-12-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45917:
---
Labels: pull-request-available  (was: )

> Statically register Python Data Source
> --
>
> Key: SPARK-45917
> URL: https://issues.apache.org/jira/browse/SPARK-45917
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> See the inlined comment in {{DataSourceManager}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38388) Repartition + Stage retries could lead to incorrect data

2023-12-26 Thread Wei Lu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800682#comment-17800682
 ] 

Wei Lu commented on SPARK-38388:


We had the same problem (using Spark 3.2.1). Is there any plan to fix it?

> Repartition + Stage retries could lead to incorrect data 
> -
>
> Key: SPARK-38388
> URL: https://issues.apache.org/jira/browse/SPARK-38388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.1.1
> Environment: Spark 2.4 and 3.x
>Reporter: Jason Xu
>Priority: Major
>  Labels: correctness, data-loss
>
> Spark repartition uses RoundRobinPartitioning; the generated result is 
> non-deterministic when the data has some randomness and stage/task retries 
> happen.
> The bug can be triggered when the upstream data has some randomness, a 
> repartition is called on it, and a result stage follows (there could be more 
> stages).
> The pattern looks like this:
> upstream stage (data with randomness) -> (repartition shuffle) -> result stage
> When one executor goes down during the result stage, some tasks of that stage 
> might have finished while others fail, shuffle files on that executor are also 
> lost, and some tasks from the previous stages (upstream data generation, 
> repartition) need to rerun to regenerate the shuffle files they depend on.
> Because the data has some randomness, the data regenerated by the retried 
> upstream tasks is slightly different, the repartition then produces an 
> inconsistent ordering, and the retried tasks of the result stage generate 
> different data.
> This is similar to, but different from, 
> https://issues.apache.org/jira/browse/SPARK-23207: the fix there uses an extra 
> local sort to make the row ordering deterministic, and the sorting algorithm 
> simply compares row/record hashes. But in this case the upstream data has some 
> randomness, so the sort does not preserve the order, and 
> RoundRobinPartitioning introduces a non-deterministic result.
> The following code returns 986415 instead of 1000000:
> {code:java}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> case class TestObject(id: Long, value: Double)
> val ds = spark.range(0, 1000 * 1000, 1).repartition(100, 
> $"id").withColumn("val", rand()).repartition(100).map { 
>   row => if (TaskContext.get.stageAttemptNumber == 0 && 
> TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 97) {
> throw new Exception("pkill -f java".!!)
>   }
>   TestObject(row.getLong(0), row.getDouble(1))
> }
> ds.toDF("id", "value").write.mode("overwrite").saveAsTable("tmp.test_table")
> spark.sql("select count(distinct id) from tmp.test_table").show{code}
> Command: 
> {code:java}
> spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false 
> --conf spark.shuffle.service.enabled=false){code}
> To simulate the issue, the external shuffle service needs to be disabled (if 
> it is enabled by default in your environment); this triggers shuffle file 
> loss and retries of the previous stage.
> In our production environment the external shuffle service is enabled, and 
> this data correctness issue happened when there were node losses.
> Although there is a non-deterministic factor in the upstream data, users 
> would not expect to see an incorrect result.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46520) Support overwrite mode for Python data source write

2023-12-26 Thread Allison Wang (Jira)
Allison Wang created SPARK-46520:


 Summary: Support overwrite mode for Python data source write
 Key: SPARK-46520
 URL: https://issues.apache.org/jira/browse/SPARK-46520
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Support the `overwrite` mode for Python data source writes.
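
A rough sketch of how the overwrite flag could reach a Python data source, 
assuming DataSource.writer(schema, overwrite) from pyspark.sql.datasource; the 
sink name and write logic below are placeholders.

{code:python}
from pyspark.sql.datasource import DataSource, DataSourceWriter, WriterCommitMessage

class SimpleSink(DataSource):
    @classmethod
    def name(cls):
        return "simple_sink"  # placeholder name

    def writer(self, schema, overwrite):
        # `overwrite` is expected to be True for mode("overwrite").
        return SimpleWriter(self.options, overwrite)

class SimpleWriter(DataSourceWriter):
    def __init__(self, options, overwrite):
        self.options = options
        self.overwrite = overwrite

    def write(self, rows):
        # Write one partition; a real sink would truncate first when
        # self.overwrite is True.
        for row in rows:
            pass
        return WriterCommitMessage()

# Assuming an active SparkSession `spark`.
spark.dataSource.register(SimpleSink)
spark.range(3).write.format("simple_sink").mode("overwrite").save()
{code}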



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46519) Clear unused error classes from error-classes.json file

2023-12-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46519:
---
Labels: pull-request-available  (was: )

> Clear unused error classes from error-classes.json file
> ---
>
> Key: SPARK-46519
> URL: https://issues.apache.org/jira/browse/SPARK-46519
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46519) Clear unused error classes from error-classes.json file

2023-12-26 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-46519:
---

 Summary: Clear unused error classes from error-classes.json file
 Key: SPARK-46519
 URL: https://issues.apache.org/jira/browse/SPARK-46519
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43338) Support modify the SESSION_CATALOG_NAME value

2023-12-26 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin updated SPARK-43338:
--
Description: 
{code:java}
private[sql] object CatalogManager {
val SESSION_CATALOG_NAME: String = "spark_catalog"
}{code}
 
The SESSION_CATALOG_NAME value cannot be modified.

If the platform supports both Hive and Spark SQL, a metadata catalog name of 
hive_metastore is more appropriate. Users copy table names directly, and those 
names carry the hive_metastore catalog prefix, so the default Spark catalog 
name needs to be changeable.

 

!image-2023-12-27-09-55-55-693.png!

[~fanjia] 
 

  was:
{code:java}
private[sql] object CatalogManager {
val SESSION_CATALOG_NAME: String = "spark_catalog"
}{code}
 
The SESSION_CATALOG_NAME value cannot be modified。

If multiple Hive Metastores exist, the platform manages multiple hms metadata 
and classifies them by catalogName. A different catalog name is required

[~fanjia] 


> Support  modify the SESSION_CATALOG_NAME value
> --
>
> Key: SPARK-43338
> URL: https://issues.apache.org/jira/browse/SPARK-43338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
> Attachments: image-2023-12-27-09-55-55-693.png
>
>
> {code:java}
> private[sql] object CatalogManager {
> val SESSION_CATALOG_NAME: String = "spark_catalog"
> }{code}
>  
> The SESSION_CATALOG_NAME value cannot be modified.
> If the platform supports both Hive and Spark SQL, a metadata catalog name of 
> hive_metastore is more appropriate. Users copy table names directly, and 
> those names carry the hive_metastore catalog prefix, so the default Spark 
> catalog name needs to be changeable.
>  
> !image-2023-12-27-09-55-55-693.png!
> [~fanjia] 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43338) Support modify the SESSION_CATALOG_NAME value

2023-12-26 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin updated SPARK-43338:
--
Attachment: image-2023-12-27-09-55-55-693.png

> Support  modify the SESSION_CATALOG_NAME value
> --
>
> Key: SPARK-43338
> URL: https://issues.apache.org/jira/browse/SPARK-43338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
> Attachments: image-2023-12-27-09-55-55-693.png
>
>
> {code:java}
> private[sql] object CatalogManager {
> val SESSION_CATALOG_NAME: String = "spark_catalog"
> }{code}
>  
> The SESSION_CATALOG_NAME value cannot be modified.
> If multiple Hive Metastores exist, the platform manages metadata from multiple 
> HMS instances and classifies them by catalog name, so a different catalog name 
> is required.
> [~fanjia] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46518) Support for copy from write compatible postgresql databases (pg, redshift, snowflake, gauss)

2023-12-26 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin updated SPARK-46518:
--
Description: 
Many databases are now compatible with PostgreSQL syntax and support the COPY 
FROM syntax. COPY FROM import performance is about 10 times higher than that of 
JDBC batch inserts.

[https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/CopyHelper.scala]

Supports upsert data import: 
[https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/DataTunnelJdbcRelationProvider.scala]

!image-2023-12-27-09-44-19-292.png!

 

[~yao] 

  was:
Now many databases are compatible with pg syntax and support copy from syntax. 
The copy form import performance is 10 times higher than that of jdbc batch.

[https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/CopyHelper.scala]

Supports upsert data import: 
[https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/DataTunnelJdbcRelationProvider.scala]

!image-2023-12-27-09-43-01-529.png!

 

 


> Support for copy from write compatible postgresql databases (pg, redshift, 
> snowflake, gauss)
> 
>
> Key: SPARK-46518
> URL: https://issues.apache.org/jira/browse/SPARK-46518
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: melin
>Priority: Major
> Attachments: image-2023-12-27-09-44-19-292.png
>
>
> Many databases are now compatible with PostgreSQL syntax and support the COPY 
> FROM syntax. COPY FROM import performance is about 10 times higher than that 
> of JDBC batch inserts.
> [https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/CopyHelper.scala]
> Supports upsert data import: 
> [https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/DataTunnelJdbcRelationProvider.scala]
> !image-2023-12-27-09-44-19-292.png!
>  
> [~yao] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46518) Support for copy from write compatible postgresql databases (pg, redshift, snowflake, gauss)

2023-12-26 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin updated SPARK-46518:
--
Attachment: image-2023-12-27-09-44-19-292.png

> Support for copy from write compatible postgresql databases (pg, redshift, 
> snowflake, gauss)
> 
>
> Key: SPARK-46518
> URL: https://issues.apache.org/jira/browse/SPARK-46518
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: melin
>Priority: Major
> Attachments: image-2023-12-27-09-44-19-292.png
>
>
> Many databases are now compatible with PostgreSQL syntax and support the COPY 
> FROM syntax. COPY FROM import performance is about 10 times higher than that 
> of JDBC batch inserts.
> [https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/CopyHelper.scala]
> Supports upsert data import: 
> [https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/DataTunnelJdbcRelationProvider.scala]
> !image-2023-12-27-09-43-01-529.png!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46518) Support for copy from write compatible postgresql databases (pg, redshift, snowflake, gauss)

2023-12-26 Thread melin (Jira)
melin created SPARK-46518:
-

 Summary: Support for copy from write compatible postgresql 
databases (pg, redshift, snowflake, gauss)
 Key: SPARK-46518
 URL: https://issues.apache.org/jira/browse/SPARK-46518
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 4.0.0
Reporter: melin


Many databases are now compatible with PostgreSQL syntax and support the COPY 
FROM syntax. COPY FROM import performance is about 10 times higher than that of 
JDBC batch inserts.

[https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/CopyHelper.scala]

Supports upsert data import: 
[https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/DataTunnelJdbcRelationProvider.scala]

!image-2023-12-27-09-43-01-529.png!
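
A rough sketch of the kind of COPY-based write this request has in mind, using 
psycopg2 from foreachPartition; the DSN, table, and column names are 
placeholders, and this is not the linked implementation.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("k1", 1), ("k2", 2)], ["k", "v"])  # example data

def copy_partition(rows):
    # Stream one partition into PostgreSQL via COPY FROM STDIN, avoiding
    # per-row JDBC batch overhead.
    import csv
    import io
    import psycopg2  # assumed to be available on the executors

    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow(row)
    buf.seek(0)

    conn = psycopg2.connect("dbname=demo user=demo")  # placeholder DSN
    try:
        with conn.cursor() as cur:
            cur.copy_expert("COPY target_table (k, v) FROM STDIN WITH CSV", buf)
        conn.commit()
    finally:
        conn.close()

df.foreachPartition(copy_partition)
{code}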

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46508) Upgrade Jackson to 2.16.1

2023-12-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46508.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44494
[https://github.com/apache/spark/pull/44494]

> Upgrade Jackson to 2.16.1
> -
>
> Key: SPARK-46508
> URL: https://issues.apache.org/jira/browse/SPARK-46508
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.16.1



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46517) Reorganize `IndexingTest`

2023-12-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46517:
---
Labels: pull-request-available  (was: )

> Reorganize `IndexingTest`
> -
>
> Key: SPARK-46517
> URL: https://issues.apache.org/jira/browse/SPARK-46517
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46517) Reorganize `IndexingTest`

2023-12-26 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-46517:
-

 Summary: Reorganize `IndexingTest`
 Key: SPARK-46517
 URL: https://issues.apache.org/jira/browse/SPARK-46517
 Project: Spark
  Issue Type: Sub-task
  Components: PS, Tests
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46513) Move `BasicIndexingTests` to `pyspark.pandas.tests.indexes.*`

2023-12-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-46513.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44499
[https://github.com/apache/spark/pull/44499]

> Move `BasicIndexingTests` to `pyspark.pandas.tests.indexes.*`
> -
>
> Key: SPARK-46513
> URL: https://issues.apache.org/jira/browse/SPARK-46513
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46513) Move `BasicIndexingTests` to `pyspark.pandas.tests.indexes.*`

2023-12-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-46513:


Assignee: Ruifeng Zheng

> Move `BasicIndexingTests` to `pyspark.pandas.tests.indexes.*`
> -
>
> Key: SPARK-46513
> URL: https://issues.apache.org/jira/browse/SPARK-46513
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size

2023-12-26 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-46516:
--
Description: 
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
size in bytes for a table that will be broadcasted to all worker nodes when 
performing a join.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

In fact, Spark compares plan.statistics.sizeInBytes for the columns selected in 
the join, not the table size.

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]

A join can select only a few columns, so sizeInBytes will be less than 
autoBroadcastJoinThreshold, but the broadcasted table can be huge, which leads 
to OOM on the driver.

The spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not 
compared to the broadcasted table size.

Related topic on SO: 
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]

  was:
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
size in bytes for a table that will be broadcasted to all worker nodes when 
performing a join.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

In fact Spark compares plan.statistics.sizeInBytes for columns selected in 
join, not a table size.

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]

Join can select only a few columns in join and sizeInBytes will be lesser than 
autoBroadcastJoinThreshold, but broadcasted table can be huge and leads to OOM 
on driver.

spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not 
compared to  broadcasted table size.

Related topic on SO: 
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]


> autoBroadcastJoinThreshold compared to plan.statistics not a table size
> ---
>
> Key: SPARK-46516
> URL: https://issues.apache.org/jira/browse/SPARK-46516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Guram Savinov
>Priority: Major
>
> From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
> size in bytes for a table that will be broadcasted to all worker nodes when 
> performing a join.
> [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]
> In fact, Spark compares plan.statistics.sizeInBytes for the columns selected 
> in the join, not the table size.
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]
> A join can select only a few columns, so sizeInBytes will be less than 
> autoBroadcastJoinThreshold, but the broadcasted table can be huge, which leads 
> to OOM on the driver.
> The spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is 
> not compared to the broadcasted table size.
> Related topic on SO: 
> [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size

2023-12-26 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-46516:
--
Description: 
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
size in bytes for a table that will be broadcasted to all worker nodes when 
performing a join.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

In fact, Spark compares plan.statistics.sizeInBytes for the columns selected in 
the join, not the table size.

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]

A join can select only a few columns, so sizeInBytes will be less than 
autoBroadcastJoinThreshold, but the broadcasted table can be huge, which leads 
to OOM on the driver.

The spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not 
compared to the broadcasted table size.

Related topic on SO: 
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]

  was:
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
size in bytes for a table that will be broadcasted to all worker nodes when 
performing a join.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

In fact Spark compares plan.statistics.sizeInBytes for columns selected in 
join, not a table size.

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]

The broadcasted table can be huge and leads to OOM on driver, so 
spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not 
compared to  broadcasted table size.

Related topic on SO: 
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]


> autoBroadcastJoinThreshold compared to plan.statistics not a table size
> ---
>
> Key: SPARK-46516
> URL: https://issues.apache.org/jira/browse/SPARK-46516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Guram Savinov
>Priority: Major
>
> From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
> size in bytes for a table that will be broadcasted to all worker nodes when 
> performing a join.
> [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]
> In fact, Spark compares plan.statistics.sizeInBytes for the columns selected 
> in the join, not the table size.
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]
> A join can select only a few columns, so sizeInBytes will be less than 
> autoBroadcastJoinThreshold, but the broadcasted table can be huge, which leads 
> to OOM on the driver.
> The spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is 
> not compared to the broadcasted table size.
> Related topic on SO: 
> [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size

2023-12-26 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-46516:
--
Description: 
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
size in bytes for a table that will be broadcasted to all worker nodes when 
performing a join.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

In fact, Spark compares plan.statistics.sizeInBytes for the columns selected in 
the join, not the table size.

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]

The broadcasted table can be huge and lead to OOM on the driver, so the 
spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not 
compared to the broadcasted table size.

Related topic on SO: 
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]

  was:
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
size in bytes for a table that will be broadcasted to all worker nodes when 
performing a join.

In fact Spark compares plan.statistics.sizeInBytes for columns selected in 
join, not a table size.

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]The
 broadcasted table can be huge and leads to OOM on driver, so 
spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not 
compared to  broadcasted table size.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

Related topic on SO: 
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]


> autoBroadcastJoinThreshold compared to plan.statistics not a table size
> ---
>
> Key: SPARK-46516
> URL: https://issues.apache.org/jira/browse/SPARK-46516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Guram Savinov
>Priority: Major
>
> From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
> size in bytes for a table that will be broadcasted to all worker nodes when 
> performing a join.
> [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]
> In fact, Spark compares plan.statistics.sizeInBytes for the columns selected 
> in the join, not the table size.
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]
> The broadcasted table can be huge and lead to OOM on the driver, so the 
> spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not 
> compared to the broadcasted table size.
> Related topic on SO: 
> [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size

2023-12-26 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-46516:
--
Description: 
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
size in bytes for a table that will be broadcasted to all worker nodes when 
performing a join.

In fact, Spark compares plan.statistics.sizeInBytes for the columns selected in 
the join, not the table size.

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]

The broadcasted table can be huge and lead to OOM on the driver, so the 
spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not 
compared to the broadcasted table size.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

Related topic on SO: 
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]

  was:
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
size in bytes for a table that will be broadcasted to all worker nodes when 
performing a join.
In fact Spark compares plan.statistics.sizeInBytes for columns selected in 
join, not a table size.
The broadcasted table can be huge and leads to OOM on driver, so 
spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not 
compared to  broadcasted table size.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

Related topic on SO: 
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]


> autoBroadcastJoinThreshold compared to plan.statistics not a table size
> ---
>
> Key: SPARK-46516
> URL: https://issues.apache.org/jira/browse/SPARK-46516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Guram Savinov
>Priority: Major
>
> From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
> size in bytes for a table that will be broadcasted to all worker nodes when 
> performing a join.
> In fact, Spark compares plan.statistics.sizeInBytes for the columns selected 
> in the join, not the table size.
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]
> The broadcasted table can be huge and lead to OOM on the driver, so the 
> spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not 
> compared to the broadcasted table size.
> [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]
> Related topic on SO: 
> [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size

2023-12-26 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-46516:
--
Description: 
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
size in bytes for a table that will be broadcasted to all worker nodes when 
performing a join.
In fact, Spark compares plan.statistics.sizeInBytes for the columns selected in 
the join, not the table size.
The broadcasted table can be huge and lead to OOM on the driver, so the 
spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not 
compared to the broadcasted table size.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

Related topic on SO: 
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]

  was:
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
size in bytes for a table that will be broadcasted to all worker nodes when 
performing a join.
In fact Spark compares plan.statistics.sizeInBytes for columns selected in 
join, not a table size.
The broadcasted table can be huge and leads to OOM on driver, so 
spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not 
compared to  broadcasted table sizes.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

Related topic on SO: 
https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s


> autoBroadcastJoinThreshold compared to plan.statistics not a table size
> ---
>
> Key: SPARK-46516
> URL: https://issues.apache.org/jira/browse/SPARK-46516
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Guram Savinov
>Priority: Major
>
> From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
> size in bytes for a table that will be broadcasted to all worker nodes when 
> performing a join.
> In fact, Spark compares plan.statistics.sizeInBytes for the columns selected 
> in the join, not the table size.
> The broadcasted table can be huge and lead to OOM on the driver, so the 
> spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not 
> compared to the broadcasted table size.
> [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]
> Related topic on SO: 
> [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size

2023-12-26 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-46516:
--
Issue Type: Bug  (was: Documentation)

> autoBroadcastJoinThreshold compared to plan.statistics not a table size
> ---
>
> Key: SPARK-46516
> URL: https://issues.apache.org/jira/browse/SPARK-46516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Guram Savinov
>Priority: Major
>
> From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
> size in bytes for a table that will be broadcasted to all worker nodes when 
> performing a join.
> In fact, Spark compares plan.statistics.sizeInBytes for the columns selected 
> in the join, not the table size.
> The broadcasted table can be huge and lead to OOM on the driver, so the 
> spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not 
> compared to the broadcasted table size.
> [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]
> Related topic on SO: 
> [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size

2023-12-26 Thread Guram Savinov (Jira)
Guram Savinov created SPARK-46516:
-

 Summary: autoBroadcastJoinThreshold compared to plan.statistics 
not a table size
 Key: SPARK-46516
 URL: https://issues.apache.org/jira/browse/SPARK-46516
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.1.1
Reporter: Guram Savinov


From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
size in bytes for a table that will be broadcasted to all worker nodes when 
performing a join.
In fact, Spark compares plan.statistics.sizeInBytes for the columns selected in 
the join, not the table size.
The broadcasted table can be huge and lead to OOM on the driver, so the 
spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not 
compared to the broadcasted table size.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

Related topic on SO: 
https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s
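
A small PySpark sketch of the scenario, assuming hypothetical tables 
small_table and wide_table; after column pruning the plan estimate can fall 
under the threshold even though wide_table is huge on disk.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 10 MB threshold (the default); the comparison uses plan statistics,
# not the on-disk size of the table.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

# Hypothetical tables: wide_table is huge on disk, but the join only needs
# one narrow column, so plan.statistics.sizeInBytes stays small.
small = spark.table("small_table")
wide = spark.table("wide_table").select("id")  # column pruning shrinks the estimate

joined = small.join(wide, "id")
joined.explain()  # may show BroadcastHashJoin although wide_table itself is huge
{code}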



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46506) Refine docstring of `array_intersect/array_union/array_except`

2023-12-26 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-46506:


Assignee: Yang Jie

> Refine docstring of `array_intersect/array_union/array_except`
> --
>
> Key: SPARK-46506
> URL: https://issues.apache.org/jira/browse/SPARK-46506
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46506) Refine docstring of `array_intersect/array_union/array_except`

2023-12-26 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-46506.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44490
[https://github.com/apache/spark/pull/44490]

> Refine docstring of `array_intersect/array_union/array_except`
> --
>
> Key: SPARK-46506
> URL: https://issues.apache.org/jira/browse/SPARK-46506
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46192) failed to insert the table using the default value of union

2023-12-26 Thread zengxl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800471#comment-17800471
 ] 

zengxl commented on SPARK-46192:


{code:java}
create table test_spark_3(k string default null,v int default null,m string 
default null) stored as orc; 

insert into table test_spark_3(k,v) select k,sum(v) v from test_spark_1 group 
by k;

insert into table test_spark_3(k,v) select distinct a.k,a.v from test_spark a 
left join test_spark_1 b on a.k=b.k  limit 2;{code}
The SQL statements above raise the same exception.

 

> failed to insert the table using the default value of union
> ---
>
> Key: SPARK-46192
> URL: https://issues.apache.org/jira/browse/SPARK-46192
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.4.1
>Reporter: zengxl
>Priority: Major
>
>  
> Create the following tables and data:
> {code:java}
> create table test_spark(k string default null,v int default null) stored as 
> orc;
> create table test_spark_1(k string default null,v int default null) stored as 
> orc;
> insert into table test_spark_1 values('k1',1),('k2',2),('k3',3);
> create table test_spark_2(k string default null,v int default null) stored as 
> orc; 
> insert into table test_spark_2 values('k3',3),('k4',4),('k5',5);
> {code}
> Execute the following SQL
> {code:java}
> insert into table test_spark (k) 
> select k from test_spark_1
> union
> select k from test_spark_2 
> {code}
> exception:
> {code:java}
> 23/12/01 10:44:25 INFO HiveSessionStateBuilder$$anon$1: here is 
> CatalogAndIdentifier
> 23/12/01 10:44:25 INFO HiveSessionStateBuilder$$anon$1: here is 
> CatalogAndIdentifier
> 23/12/01 10:44:25 INFO HiveSessionStateBuilder$$anon$1: here is 
> CatalogAndIdentifier
> 23/12/01 10:44:26 INFO Analyzer$ResolveUserSpecifiedColumns: 
> i.userSpecifiedCols.size is 1
> 23/12/01 10:44:26 INFO Analyzer$ResolveUserSpecifiedColumns: 
> i.userSpecifiedCols.size is 1
> 23/12/01 10:44:26 INFO Analyzer$ResolveUserSpecifiedColumns: i.table.output 2 
> ,resolved :1 , i.query 1
> 23/12/01 10:44:26 INFO Analyzer$ResolveUserSpecifiedColumns: here is 
> ResolveUserSpecifiedColumns tableOutoyt: 2---nameToQueryExpr : 1Error in 
> query: `default`.`test_spark` requires that the data to be inserted have the 
> same number of columns as the target table: target table has 2 column(s) but 
> the inserted data has 1 column(s), including 0 partition column(s) having 
> constant value(s). {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46514) Fix HiveMetastoreLazyInitializationSuite

2023-12-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46514:
---
Labels: pull-request-available  (was: )

> Fix HiveMetastoreLazyInitializationSuite
> 
>
> Key: SPARK-46514
> URL: https://issues.apache.org/jira/browse/SPARK-46514
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46513) Move `BasicIndexingTests` to `pyspark.pandas.tests.indexes.*`

2023-12-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46513:
---
Labels: pull-request-available  (was: )

> Move `BasicIndexingTests` to `pyspark.pandas.tests.indexes.*`
> -
>
> Key: SPARK-46513
> URL: https://issues.apache.org/jira/browse/SPARK-46513
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46512) Optimize shuffle reading when both sort and combine are used.

2023-12-26 Thread Chenyu Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenyu Zheng updated SPARK-46512:
-
Description: 
After the shuffle reader obtains the block, it will first perform a combine 
operation, and then perform a sort operation. It is known that both combine and 
sort may generate temporary files, so the performance may be poor when both 
sort and combine are used. In fact, combine operations can be performed during 
the sort process, and we can avoid the combine spill file.

 

I did not find any direct API to construct a shuffle in which both sort and 
combine are used, but I can do it with the following code: it is a word count, 
and the output words are sorted.
{code:java}
sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
reduceByKey(_ + _, 1).
asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
collect().foreach(println) {code}

  was:
After the shuffle reader obtains the block, it will first perform a combine 
operation, and then perform a sort operation. It is known that both combine and 
sort may generate temporary files, so the performance may be poor when both 
sort and combine are used. In fact, combine operations can be performed during 
the sort process, and we can avoid the combine spill file.

 

I did not find any direct api to construct the shuffle which both sort and 
combine is used. But I can do like below code, here is a wordcount, and the 
output words is sorted.

```
sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
reduceByKey(_ + _, 1).
asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
collect().foreach(println)
```


> Optimize shuffle reading when both sort and combine are used.
> -
>
> Key: SPARK-46512
> URL: https://issues.apache.org/jira/browse/SPARK-46512
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 4.0.0
>Reporter: Chenyu Zheng
>Priority: Minor
>
> After the shuffle reader obtains the block, it will first perform a combine 
> operation, and then perform a sort operation. It is known that both combine 
> and sort may generate temporary files, so the performance may be poor when 
> both sort and combine are used. In fact, combine operations can be performed 
> during the sort process, and we can avoid the combine spill file.
>  
> I did not find any direct API to construct a shuffle in which both sort and 
> combine are used, but I can do it with the following code: it is a word 
> count, and the output words are sorted.
> {code:java}
> sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
> reduceByKey(_ + _, 1).
> asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
> collect().foreach(println) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46514) Fix HiveMetastoreLazyInitializationSuite

2023-12-26 Thread Kent Yao (Jira)
Kent Yao created SPARK-46514:


 Summary: Fix HiveMetastoreLazyInitializationSuite
 Key: SPARK-46514
 URL: https://issues.apache.org/jira/browse/SPARK-46514
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46513) Move `BasicIndexingTests` to `pyspark.pandas.tests.indexes.*`

2023-12-26 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-46513:
-

 Summary: Move `BasicIndexingTests` to 
`pyspark.pandas.tests.indexes.*`
 Key: SPARK-46513
 URL: https://issues.apache.org/jira/browse/SPARK-46513
 Project: Spark
  Issue Type: Sub-task
  Components: PS, Tests
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46510) Spark shell log filter should be applied to all AbstractAppender

2023-12-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46510:
--

Assignee: (was: Apache Spark)

> Spark shell log filter should be applied to all AbstractAppender
> 
>
> Key: SPARK-46510
> URL: https://issues.apache.org/jira/browse/SPARK-46510
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.1, 3.4.1, 3.3.4
>Reporter: Yi Zhu
>Priority: Major
>  Labels: pull-request-available
>
> When we set an async appender that refers to the console appender, the 
> Spark shell log filter does not work.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46510) Spark shell log filter should be applied to all AbstractAppender

2023-12-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46510:
--

Assignee: Apache Spark

> Spark shell log filter should be applied to all AbstractAppender
> 
>
> Key: SPARK-46510
> URL: https://issues.apache.org/jira/browse/SPARK-46510
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.1, 3.4.1, 3.3.4
>Reporter: Yi Zhu
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> When we set an async appender that refers to the console appender, the 
> Spark shell log filter does not work.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46460) A partition filter that includes a cast function may disable partition pruning

2023-12-26 Thread Zhou Tong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhou Tong updated SPARK-46460:
--
Attachment: SPARK-46460.patch

> A partition filter that includes a cast function may disable partition 
> pruning
> -
>
> Key: SPARK-46460
> URL: https://issues.apache.org/jira/browse/SPARK-46460
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.2.0
>Reporter: Zhou Tong
>Priority: Minor
>  Labels: pull-request-available
> Attachments: SPARK-46460.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SQL: select * from test_db.test_table where day between 
> date_sub('2023-12-01', 1) and '2023-12-03'
> The physical plan for the SQL above applies a _cast_ to the partition column 
> 'day', like this: _cast(day as date) > 2023-11-30_. In this situation, Spark 
> only passes the filter condition _day < "2023-12-03"_ to the Hive Metastore 
> (HMS), omitting the condition _cast(day as date) > 2023-11-30_, which may 
> degrade HMS performance if the Hive table has a huge number of partitions.
>  
> In this regard, a new optimizer rule could solve the problem: convert the 
> binary comparison _cast(day as date) > 2023-11-30_ into _day > 
> cast(2023-11-30 as string)_. The right-hand node is foldable, so it folds to 
> _day > "2023-11-30"_, and the filter condition passed to HMS becomes _day > 
> "2023-11-30" and day < "2023-12-03"_. A rough sketch of such a rule follows 
> below.
>  
>  
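
For illustration only (this is not the attached SPARK-46460.patch), below is a 
rough Scala sketch of such a rule, assuming Spark 3.2-era Catalyst APIs. A 
real rule would also have to prove that the rewrite preserves semantics for 
the column's string format and cover the other comparison operators.
{code:java}
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.{DateType, StringType}

// Hypothetical rule (illustration only): rewrite `cast(stringCol as date) > dateLit`
// into `stringCol > cast(dateLit as string)` so the comparison stays on the raw
// partition column and can be pushed down to the Hive Metastore.
object UnwrapCastInPartitionFilter extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case GreaterThan(c: Cast, lit: Literal)
        if c.dataType == DateType && lit.dataType == DateType &&
           c.child.isInstanceOf[AttributeReference] &&
           c.child.dataType == StringType =>
      // Move the cast to the foldable literal side; other comparisons are analogous.
      GreaterThan(c.child, Cast(lit, StringType))
  }
}
{code}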



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46511) Optimize Spark JDBC write speed with multi-row inserts

2023-12-26 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin resolved SPARK-46511.
---
Resolution: Fixed

> Optimize Spark JDBC write speed with multi-row inserts
> --
>
> Key: SPARK-46511
> URL: https://issues.apache.org/jira/browse/SPARK-46511
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: melin
>Priority: Major
>
> INSERT INTO table_name (column1, column2, column3)
> VALUES (value1, value2, value3),
> (value4, value5, value6),
> (value7, value8, value9);
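
As an illustration of the idea (not Spark's current JDBC writer, which, as far 
as I know, builds single-row prepared statements and relies on executeBatch), 
below is a rough Scala sketch that groups rows into multi-row INSERT 
statements; the object and method names are hypothetical.
{code:java}
import java.sql.Connection

// Hypothetical helper (illustration only): write rows with multi-row INSERT
// statements of at most `rowsPerStatement` rows each.
object MultiRowInsertSketch {
  def writeBatch(
      conn: Connection,
      table: String,
      columns: Seq[String],
      rows: Seq[Seq[Any]],
      rowsPerStatement: Int = 100): Unit = {
    for (chunk <- rows.grouped(rowsPerStatement)) {
      // One "(?, ?, ...)" group per row in this chunk.
      val rowPlaceholders = "(" + Seq.fill(columns.size)("?").mkString(", ") + ")"
      val sql = s"INSERT INTO $table (${columns.mkString(", ")}) VALUES " +
        Seq.fill(chunk.size)(rowPlaceholders).mkString(", ")
      val stmt = conn.prepareStatement(sql)
      try {
        var i = 1
        for (row <- chunk; value <- row) {
          stmt.setObject(i, value.asInstanceOf[AnyRef]) // bound in row-major order
          i += 1
        }
        stmt.executeUpdate()
      } finally {
        stmt.close()
      }
    }
  }
}
{code}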



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46498) Remove `shuffleServiceEnabled` from `o.a.spark.util.Utils#getConfiguredLocalDirs`

2023-12-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46498:
--
Summary: Remove `shuffleServiceEnabled` from 
`o.a.spark.util.Utils#getConfiguredLocalDirs`  (was: Remove an unused local 
variable from `o.a.spark.util.Utils#getConfiguredLocalDirs`)

> Remove `shuffleServiceEnabled` from 
> `o.a.spark.util.Utils#getConfiguredLocalDirs`
> -
>
> Key: SPARK-46498
> URL: https://issues.apache.org/jira/browse/SPARK-46498
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46498) Remove an unused local variable from `o.a.spark.util.Utils#getConfiguredLocalDirs`

2023-12-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-46498:
-

Assignee: Yang Jie

> Remove an unused local variable from 
> `o.a.spark.util.Utils#getConfiguredLocalDirs`
> ---
>
> Key: SPARK-46498
> URL: https://issues.apache.org/jira/browse/SPARK-46498
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46498) Remove an unused local variable from `o.a.spark.util.Utils#getConfiguredLocalDirs`

2023-12-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46498.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44475
[https://github.com/apache/spark/pull/44475]

> Remove an unused local variable from 
> `o.a.spark.util.Utils#getConfiguredLocalDirs`
> ---
>
> Key: SPARK-46498
> URL: https://issues.apache.org/jira/browse/SPARK-46498
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46371) Clean up outdated items in `.rat-excludes`

2023-12-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46371.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44293
[https://github.com/apache/spark/pull/44293]

> Clean up outdated items in `.rat-excludes`
> --
>
> Key: SPARK-46371
> URL: https://issues.apache.org/jira/browse/SPARK-46371
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46371) Clean up outdated items in `.rat-excludes`

2023-12-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-46371:
-

Assignee: BingKun Pan

> Clean up outdated items in `.rat-excludes`
> --
>
> Key: SPARK-46371
> URL: https://issues.apache.org/jira/browse/SPARK-46371
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45914) Support `commit` and `abort` API for Python data source write

2023-12-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45914:
---
Labels: pull-request-available  (was: )

> Support `commit` and `abort` API for Python data source write
> -
>
> Key: SPARK-45914
> URL: https://issues.apache.org/jira/browse/SPARK-45914
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Support `commit` and `abort` API for Python data source write.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org