[jira] [Updated] (SPARK-46522) Block Python data source registration with name conflicts
[ https://issues.apache.org/jira/browse/SPARK-46522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46522:
-----------------------------------
    Labels: pull-request-available  (was: )

> Block Python data source registration with name conflicts
> ---------------------------------------------------------
> Key: SPARK-46522
> URL: https://issues.apache.org/jira/browse/SPARK-46522
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Allison Wang
> Priority: Major
> Labels: pull-request-available
>
> Users should not be allowed to register Python data sources with names that are the same as builtin or existing Scala/Java data sources.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46522) Block Python data source registration with name conflicts
Allison Wang created SPARK-46522:
---------------------------------
         Summary: Block Python data source registration with name conflicts
             Key: SPARK-46522
             URL: https://issues.apache.org/jira/browse/SPARK-46522
         Project: Spark
      Issue Type: Sub-task
      Components: PySpark
Affects Versions: 4.0.0
        Reporter: Allison Wang

Users should not be allowed to register Python data sources with names that are the same as builtin or existing Scala/Java data sources.
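A minimal sketch of the guard SPARK-46522 asks for, assuming a simple in-process registry. `DataSourceRegistry`, `BUILTIN_SOURCES`, and the method names are hypothetical illustrations, not Spark's actual `DataSourceManager` API:

```python
# Hypothetical registry that rejects Python data source names colliding with
# built-in or already-registered Scala/Java sources (illustrative only).
BUILTIN_SOURCES = {"parquet", "csv", "json", "orc", "text", "jdbc"}

class DataSourceRegistry:
    def __init__(self):
        self._python_sources = {}

    def register(self, name, source_cls):
        normalized = name.lower()  # Spark source names are case-insensitive
        if normalized in BUILTIN_SOURCES or normalized in self._python_sources:
            raise ValueError(
                f"Cannot register data source '{name}': the name conflicts "
                "with a builtin or existing data source.")
        self._python_sources[normalized] = source_cls
```

Registering `"my_source"` succeeds, while registering `"parquet"` raises the conflict error.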
[jira] [Updated] (SPARK-46521) Refine docstring of `array_compact/array_distinct/array_remove`
[ https://issues.apache.org/jira/browse/SPARK-46521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46521:
-----------------------------------
    Labels: pull-request-available  (was: )

> Refine docstring of `array_compact/array_distinct/array_remove`
> ---------------------------------------------------------------
> Key: SPARK-46521
> URL: https://issues.apache.org/jira/browse/SPARK-46521
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, PySpark
> Affects Versions: 4.0.0
> Reporter: Yang Jie
> Priority: Major
> Labels: pull-request-available
[jira] [Resolved] (SPARK-46517) Reorganize `IndexingTest`
[ https://issues.apache.org/jira/browse/SPARK-46517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-46517.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 44502
https://github.com/apache/spark/pull/44502

> Reorganize `IndexingTest`
> -------------------------
> Key: SPARK-46517
> URL: https://issues.apache.org/jira/browse/SPARK-46517
> Project: Spark
> Issue Type: Sub-task
> Components: PS, Tests
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size
[ https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guram Savinov updated SPARK-46516:
----------------------------------
Description:
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join.
https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration
In fact, Spark compares plan.statistics.sizeInBytes for the columns selected in the join, not the table size:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368
A join can select only a few columns, so sizeInBytes can be less than autoBroadcastJoinThreshold, yet the broadcast table can be huge and it is loaded entirely into the driver's memory, which can lead to OOM.
The spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not compared to the broadcast table size.
Related topic on SO: https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s

was:
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join.
https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration
In fact, Spark compares plan.statistics.sizeInBytes for the columns selected in the join, not the table size:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368
A join can select only a few columns, so sizeInBytes can be less than autoBroadcastJoinThreshold, but the broadcast table can be huge and leads to OOM on the driver.
The spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not compared to the broadcast table size.
Related topic on SO: https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s

> autoBroadcastJoinThreshold compared to plan.statistics not a table size
> -----------------------------------------------------------------------
> Key: SPARK-46516
> URL: https://issues.apache.org/jira/browse/SPARK-46516
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Guram Savinov
> Priority: Major
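The size check the issue points at can be paraphrased in a few lines. This is an assumption-laden Python paraphrase of the Scala logic around joins.scala#L368, not Spark code: the broadcast decision looks only at the plan's estimated sizeInBytes after column pruning, so a tiny column-pruned estimate can green-light broadcasting a huge table.

```python
# Paraphrase (not Spark's actual implementation) of the broadcast size check:
# only the pruned plan's estimated size is compared against the threshold.
DEFAULT_THRESHOLD = 10 * 1024 * 1024  # autoBroadcastJoinThreshold default: 10 MB

def can_broadcast_by_size(plan_size_in_bytes: int,
                          threshold: int = DEFAULT_THRESHOLD) -> bool:
    # A negative threshold (-1) disables automatic broadcasting.
    return 0 <= plan_size_in_bytes <= threshold

# A 50 GB table whose two selected join columns are estimated at 2 MB still
# passes the check - the surprise this issue describes.
pruned_plan_estimate = 2 * 1024 * 1024
full_table_size = 50 * 1024 ** 3
```

Here `can_broadcast_by_size(pruned_plan_estimate)` is true even though `can_broadcast_by_size(full_table_size)` would not be.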
[jira] [Created] (SPARK-46521) Refine docstring of `array_compact/array_distinct/array_remove`
Yang Jie created SPARK-46521:
-----------------------------
         Summary: Refine docstring of `array_compact/array_distinct/array_remove`
             Key: SPARK-46521
             URL: https://issues.apache.org/jira/browse/SPARK-46521
         Project: Spark
      Issue Type: Sub-task
      Components: Documentation, PySpark
Affects Versions: 4.0.0
        Reporter: Yang Jie
[jira] [Updated] (SPARK-46520) Support overwrite mode for Python data source write
[ https://issues.apache.org/jira/browse/SPARK-46520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46520:
-----------------------------------
    Labels: pull-request-available  (was: )

> Support overwrite mode for Python data source write
> ---------------------------------------------------
> Key: SPARK-46520
> URL: https://issues.apache.org/jira/browse/SPARK-46520
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Allison Wang
> Priority: Major
> Labels: pull-request-available
>
> Support the `overwrite` mode for Python data source.
[jira] [Updated] (SPARK-45917) Statically register Python Data Source
[ https://issues.apache.org/jira/browse/SPARK-45917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45917:
-----------------------------------
    Labels: pull-request-available  (was: )

> Statically register Python Data Source
> --------------------------------------
> Key: SPARK-45917
> URL: https://issues.apache.org/jira/browse/SPARK-45917
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
>
> See the inlined comment in {{DataSourceManager}}.
[jira] [Commented] (SPARK-38388) Repartition + Stage retries could lead to incorrect data
[ https://issues.apache.org/jira/browse/SPARK-38388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17800682 ]

Wei Lu commented on SPARK-38388:
--------------------------------
We had the same problem (using Spark 3.2.1). Is there any plan to fix it?

> Repartition + Stage retries could lead to incorrect data
> --------------------------------------------------------
> Key: SPARK-38388
> URL: https://issues.apache.org/jira/browse/SPARK-38388
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.1.1
> Environment: Spark 2.4 and 3.x
> Reporter: Jason Xu
> Priority: Major
> Labels: correctness, data-loss
>
> Spark repartition uses RoundRobinPartitioning; the generated result is non-deterministic when the data has some randomness and stage/task retries happen.
> The bug can be triggered when upstream data has some randomness and a repartition is called on it, followed by a result stage (there could be more stages), as the pattern below shows:
> upstream stage (data with randomness) -> (repartition shuffle) -> result stage
> When one executor goes down at the result stage, some tasks of that stage might have finished while others fail; shuffle files on that executor are also lost, so some tasks from the previous stage (upstream data generation, repartition) need to rerun to regenerate the dependent shuffle data files.
> Because the data has some randomness, the regenerated data in the retried upstream tasks is slightly different, repartition then produces an inconsistent ordering, and the retried tasks at the result stage generate different data.
> This is similar to, but different from, https://issues.apache.org/jira/browse/SPARK-23207: the fix there uses an extra local sort to make the row ordering deterministic, and the sorting algorithm it uses simply compares row/record hashes. But in this case the upstream data has some randomness, so the sort doesn't keep the order stable, and RoundRobinPartitioning introduces a non-deterministic result.
> The following code returns 986415 instead of 1000000:
> {code:java}
> import scala.sys.process._
> import org.apache.spark.TaskContext
>
> case class TestObject(id: Long, value: Double)
>
> val ds = spark.range(0, 1000 * 1000, 1)
>   .repartition(100, $"id")
>   .withColumn("val", rand())
>   .repartition(100)
>   .map { row =>
>     if (TaskContext.get.stageAttemptNumber == 0 &&
>         TaskContext.get.attemptNumber == 0 &&
>         TaskContext.get.partitionId > 97) {
>       throw new Exception("pkill -f java".!!)
>     }
>     TestObject(row.getLong(0), row.getDouble(1))
>   }
> ds.toDF("id", "value").write.mode("overwrite").saveAsTable("tmp.test_table")
> spark.sql("select count(distinct id) from tmp.test_table").show
> {code}
> Command:
> {code:java}
> spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false --conf spark.shuffle.service.enabled=false)
> {code}
> To simulate the issue, disabling the external shuffle service is needed (if it is enabled by default in your environment); this triggers shuffle file loss and previous-stage retries.
> In our production environment the external shuffle service is enabled, and this data correctness issue happened when there were node losses.
> Although there is some non-deterministic factor in the upstream data, users wouldn't expect to see an incorrect result.
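The mechanism behind the SPARK-38388 report above can be shown without Spark. This is a pure-Python toy (the partitioner functions are illustrative, not Spark's implementations): round-robin partitioning assigns rows by arrival order, so a retried upstream task that emits the same rows in a different order lands them in different partitions, while partitioning by a deterministic key does not.

```python
# Toy illustration of why RoundRobinPartitioning is retry-sensitive.
def round_robin(rows, n):
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)  # assignment depends on arrival order
    return parts

def hash_partition(rows, n, key):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)  # assignment depends on key only
    return parts

run1 = [("a", 0.1), ("b", 0.7), ("c", 0.3)]
run2 = [("b", 0.7), ("a", 0.1), ("c", 0.3)]  # same rows, retry reorders them

# Round-robin puts the same row into different partitions across the two runs.
assert round_robin(run1, 2) != round_robin(run2, 2)

# Hash partitioning by a deterministic key gives the same partition
# membership regardless of arrival order.
p1 = hash_partition(run1, 2, key=lambda r: r[0])
p2 = hash_partition(run2, 2, key=lambda r: r[0])
assert [sorted(p) for p in p1] == [sorted(p) for p in p2]
```

This is why `repartition(100, $"id")` (hash by key) is stable under retries in the repro above while the bare `repartition(100)` (round-robin) is not.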
[jira] [Created] (SPARK-46520) Support overwrite mode for Python data source write
Allison Wang created SPARK-46520:
---------------------------------
         Summary: Support overwrite mode for Python data source write
             Key: SPARK-46520
             URL: https://issues.apache.org/jira/browse/SPARK-46520
         Project: Spark
      Issue Type: Sub-task
      Components: PySpark
Affects Versions: 4.0.0
        Reporter: Allison Wang

Support the `overwrite` mode for Python data source.
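A toy sketch of what overwrite-mode semantics mean for a writer. The function and its file layout are assumptions for illustration, not the actual PySpark `DataSource` writer API: `overwrite` removes existing output before writing, in contrast to `error`, `ignore`, and `append`.

```python
# Hypothetical writer showing the four classic save modes (illustrative only).
import os
import shutil

def write_rows(path: str, rows, mode: str = "error"):
    if os.path.exists(path):
        if mode == "error":
            raise FileExistsError(f"path {path} already exists")
        if mode == "ignore":
            return  # silently keep existing data
        if mode == "overwrite":
            shutil.rmtree(path)  # drop old data, then write fresh
        # mode == "append" falls through and keeps existing files
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "part-00000.txt"), "a") as out:
        for row in rows:
            out.write(f"{row}\n")
```

Writing twice with `mode="overwrite"` leaves only the second batch; a subsequent `mode="append"` adds to it.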
[jira] [Updated] (SPARK-46519) Clear unused error classes from error-classes.json file
[ https://issues.apache.org/jira/browse/SPARK-46519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46519:
-----------------------------------
    Labels: pull-request-available  (was: )

> Clear unused error classes from error-classes.json file
> -------------------------------------------------------
> Key: SPARK-46519
> URL: https://issues.apache.org/jira/browse/SPARK-46519
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
[jira] [Created] (SPARK-46519) Clear unused error classes from error-classes.json file
BingKun Pan created SPARK-46519:
--------------------------------
         Summary: Clear unused error classes from error-classes.json file
             Key: SPARK-46519
             URL: https://issues.apache.org/jira/browse/SPARK-46519
         Project: Spark
      Issue Type: Improvement
      Components: SQL
Affects Versions: 4.0.0
        Reporter: BingKun Pan
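A rough sketch of the cleanup SPARK-46519 describes, under a simplifying assumption (not Spark's actual build tooling): treat an error class as unused when its name never appears as a quoted string anywhere in the source texts, and diff that against the keys of error-classes.json.

```python
# Hypothetical unused-error-class detector (illustrative heuristic only).
import json
import re

def unused_error_classes(error_classes_json: str, source_texts) -> list:
    declared = set(json.loads(error_classes_json))
    referenced = set()
    for text in source_texts:
        # Error class names are UPPER_SNAKE (optionally dotted sub-classes).
        referenced.update(re.findall(r'"([A-Z][A-Z0-9_.]*)"', text))
    return sorted(declared - referenced)
```

Running it over a tiny example with one referenced and one orphaned class returns only the orphan.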
[jira] [Updated] (SPARK-43338) Support modify the SESSION_CATALOG_NAME value
[ https://issues.apache.org/jira/browse/SPARK-43338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

melin updated SPARK-43338:
--------------------------
Description:
{code:java}
private[sql] object CatalogManager {
  val SESSION_CATALOG_NAME: String = "spark_catalog"
}{code}
The SESSION_CATALOG_NAME value cannot be modified. If the platform supports both Hive and Spark SQL, hive_metastore is the more appropriate metadata catalog name: users copy table names that carry the hive_metastore catalog directly, so the default Spark catalog name needs to be changeable.
!image-2023-12-27-09-55-55-693.png!
[~fanjia]

was:
{code:java}
private[sql] object CatalogManager {
  val SESSION_CATALOG_NAME: String = "spark_catalog"
}{code}
The SESSION_CATALOG_NAME value cannot be modified. If multiple Hive Metastores exist, the platform manages metadata from multiple HMS instances and classifies them by catalog name, so a different catalog name is required.
[~fanjia]

> Support modify the SESSION_CATALOG_NAME value
> ---------------------------------------------
> Key: SPARK-43338
> URL: https://issues.apache.org/jira/browse/SPARK-43338
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: melin
> Priority: Major
> Attachments: image-2023-12-27-09-55-55-693.png
[jira] [Updated] (SPARK-43338) Support modify the SESSION_CATALOG_NAME value
[ https://issues.apache.org/jira/browse/SPARK-43338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

melin updated SPARK-43338:
--------------------------
    Attachment: image-2023-12-27-09-55-55-693.png

> Support modify the SESSION_CATALOG_NAME value
> ---------------------------------------------
> Key: SPARK-43338
> URL: https://issues.apache.org/jira/browse/SPARK-43338
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: melin
> Priority: Major
> Attachments: image-2023-12-27-09-55-55-693.png
>
> {code:java}
> private[sql] object CatalogManager {
>   val SESSION_CATALOG_NAME: String = "spark_catalog"
> }{code}
> The SESSION_CATALOG_NAME value cannot be modified. If multiple Hive Metastores exist, the platform manages metadata from multiple HMS instances and classifies them by catalog name, so a different catalog name is required.
> [~fanjia]
[jira] [Updated] (SPARK-46518) Support for copy from write compatible postgresql databases (pg, redshift, snowflake, gauss)
[ https://issues.apache.org/jira/browse/SPARK-46518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

melin updated SPARK-46518:
--------------------------
Description:
Many databases are now compatible with PostgreSQL syntax and support COPY FROM. COPY FROM import performance is roughly 10x that of JDBC batch writes.
https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/CopyHelper.scala
Upsert data import is also supported:
https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/DataTunnelJdbcRelationProvider.scala
!image-2023-12-27-09-44-19-292.png!
[~yao]

was:
Many databases are now compatible with PostgreSQL syntax and support COPY FROM. COPY FROM import performance is roughly 10x that of JDBC batch writes.
https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/CopyHelper.scala
Upsert data import is also supported:
https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/DataTunnelJdbcRelationProvider.scala
!image-2023-12-27-09-43-01-529.png!

> Support for copy from write compatible postgresql databases (pg, redshift, snowflake, gauss)
> --------------------------------------------------------------------------------------------
> Key: SPARK-46518
> URL: https://issues.apache.org/jira/browse/SPARK-46518
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: melin
> Priority: Major
> Attachments: image-2023-12-27-09-44-19-292.png
[jira] [Updated] (SPARK-46518) Support for copy from write compatible postgresql databases (pg, redshift, snowflake, gauss)
[ https://issues.apache.org/jira/browse/SPARK-46518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

melin updated SPARK-46518:
--------------------------
    Attachment: image-2023-12-27-09-44-19-292.png

> Support for copy from write compatible postgresql databases (pg, redshift, snowflake, gauss)
> --------------------------------------------------------------------------------------------
> Key: SPARK-46518
> URL: https://issues.apache.org/jira/browse/SPARK-46518
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: melin
> Priority: Major
> Attachments: image-2023-12-27-09-44-19-292.png
>
> Many databases are now compatible with PostgreSQL syntax and support COPY FROM. COPY FROM import performance is roughly 10x that of JDBC batch writes.
> https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/CopyHelper.scala
> Upsert data import is also supported:
> https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/DataTunnelJdbcRelationProvider.scala
> !image-2023-12-27-09-43-01-529.png!
[jira] [Created] (SPARK-46518) Support for copy from write compatible postgresql databases (pg, redshift, snowflake, gauss)
melin created SPARK-46518:
--------------------------
         Summary: Support for copy from write compatible postgresql databases (pg, redshift, snowflake, gauss)
             Key: SPARK-46518
             URL: https://issues.apache.org/jira/browse/SPARK-46518
         Project: Spark
      Issue Type: New Feature
      Components: SQL
Affects Versions: 4.0.0
        Reporter: melin

Many databases are now compatible with PostgreSQL syntax and support COPY FROM. COPY FROM import performance is roughly 10x that of JDBC batch writes.
https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/CopyHelper.scala
Upsert data import is also supported:
https://github.com/melin/datatunnel/blob/master/connectors/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/DataTunnelJdbcRelationProvider.scala
!image-2023-12-27-09-43-01-529.png!
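A minimal sketch of the COPY-based fast path the issue proposes: serialize rows to CSV in memory and stream them through PostgreSQL's `COPY ... FROM STDIN`. psycopg2's `cursor.copy_expert` is a real API, but the cursor, table name, and row shape here are placeholder assumptions, and this is not the datatunnel connector's actual code.

```python
# Sketch: stream a batch of rows into a PostgreSQL-compatible database via
# COPY FROM STDIN instead of per-row INSERTs.
import csv
import io

def rows_to_csv_buffer(rows):
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)  # csv module emits \r\n line endings
    buf.seek(0)
    return buf

def copy_into(cursor, table: str, rows):
    # One COPY round-trip per batch, which is where the claimed ~10x
    # speedup over JDBC batch inserts comes from.
    cursor.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)",
                       rows_to_csv_buffer(rows))
```

With a live psycopg2 cursor this would be called as `copy_into(cur, "my_table", [(1, "alice"), (2, "bob")])`; the CSV serialization can be checked on its own.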
[jira] [Resolved] (SPARK-46508) Upgrade Jackson to 2.16.1
[ https://issues.apache.org/jira/browse/SPARK-46508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-46508.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 44494
https://github.com/apache/spark/pull/44494

> Upgrade Jackson to 2.16.1
> -------------------------
> Key: SPARK-46508
> URL: https://issues.apache.org/jira/browse/SPARK-46508
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 4.0.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.16.1
[jira] [Updated] (SPARK-46517) Reorganize `IndexingTest`
[ https://issues.apache.org/jira/browse/SPARK-46517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46517:
-----------------------------------
    Labels: pull-request-available  (was: )

> Reorganize `IndexingTest`
> -------------------------
> Key: SPARK-46517
> URL: https://issues.apache.org/jira/browse/SPARK-46517
> Project: Spark
> Issue Type: Sub-task
> Components: PS, Tests
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Minor
> Labels: pull-request-available
[jira] [Created] (SPARK-46517) Reorganize `IndexingTest`
Ruifeng Zheng created SPARK-46517:
---------------------------------
         Summary: Reorganize `IndexingTest`
             Key: SPARK-46517
             URL: https://issues.apache.org/jira/browse/SPARK-46517
         Project: Spark
      Issue Type: Sub-task
      Components: PS, Tests
Affects Versions: 4.0.0
        Reporter: Ruifeng Zheng
[jira] [Resolved] (SPARK-46513) Move `BasicIndexingTests` to `pyspark.pandas.tests.indexes.*`
[ https://issues.apache.org/jira/browse/SPARK-46513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-46513.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 44499
https://github.com/apache/spark/pull/44499

> Move `BasicIndexingTests` to `pyspark.pandas.tests.indexes.*`
> -------------------------------------------------------------
> Key: SPARK-46513
> URL: https://issues.apache.org/jira/browse/SPARK-46513
> Project: Spark
> Issue Type: Sub-task
> Components: PS, Tests
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-46513) Move `BasicIndexingTests` to `pyspark.pandas.tests.indexes.*`
[ https://issues.apache.org/jira/browse/SPARK-46513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-46513:
------------------------------------
    Assignee: Ruifeng Zheng

> Move `BasicIndexingTests` to `pyspark.pandas.tests.indexes.*`
> -------------------------------------------------------------
> Key: SPARK-46513
> URL: https://issues.apache.org/jira/browse/SPARK-46513
> Project: Spark
> Issue Type: Sub-task
> Components: PS, Tests
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size
[ https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guram Savinov updated SPARK-46516: -- Description: >From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum >size in bytes for a table that will be broadcasted to all worker nodes when >performing a join. [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration] In fact Spark compares plan.statistics.sizeInBytes for columns selected in join, not a table size. [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368] Join can select only a few columns and sizeInBytes will be lesser than autoBroadcastJoinThreshold, but broadcasted table can be huge and leads to OOM on driver. spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not compared to broadcasted table size. Related topic on SO: [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s] was: >From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum >size in bytes for a table that will be broadcasted to all worker nodes when >performing a join. [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration] In fact Spark compares plan.statistics.sizeInBytes for columns selected in join, not a table size. [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368] Join can select only a few columns in join and sizeInBytes will be lesser than autoBroadcastJoinThreshold, but broadcasted table can be huge and leads to OOM on driver. spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not compared to broadcasted table size. 
Related topic on SO: [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s] > autoBroadcastJoinThreshold compared to plan.statistics not a table size > --- > > Key: SPARK-46516 > URL: https://issues.apache.org/jira/browse/SPARK-46516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Guram Savinov >Priority: Major > > From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum > size in bytes for a table that will be broadcasted to all worker nodes when > performing a join. > [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration] > In fact Spark compares plan.statistics.sizeInBytes for columns selected in > join, not a table size. > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368] > Join can select only a few columns and sizeInBytes will be lesser than > autoBroadcastJoinThreshold, but broadcasted table can be huge and leads to > OOM on driver. > spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not > compared to broadcasted table size. > Related topic on SO: > [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size
[ https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guram Savinov updated SPARK-46516:
----------------------------------
Description:
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join.
https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration
In fact, Spark compares plan.statistics.sizeInBytes for the columns selected in the join, not the table size:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368
The join can select only a few columns in the join, so sizeInBytes can be less than autoBroadcastJoinThreshold, but the broadcast table can be huge and leads to OOM on the driver.
The spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not compared to the broadcast table size.
Related topic on SO: https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s

was:
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join.
https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration
In fact, Spark compares plan.statistics.sizeInBytes for the columns selected in the join, not the table size:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368
The broadcast table can be huge and leads to OOM on the driver, so the spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not compared to the broadcast table size.
Related topic on SO: https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s

> autoBroadcastJoinThreshold compared to plan.statistics not a table size
> -----------------------------------------------------------------------
> Key: SPARK-46516
> URL: https://issues.apache.org/jira/browse/SPARK-46516
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Guram Savinov
> Priority: Major
[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size
[ https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guram Savinov updated SPARK-46516: -- Description: >From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum >size in bytes for a table that will be broadcasted to all worker nodes when >performing a join. [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration] In fact Spark compares plan.statistics.sizeInBytes for columns selected in join, not a table size. [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368] The broadcasted table can be huge and leads to OOM on driver, so spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not compared to broadcasted table size. Related topic on SO: [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s] was: >From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum >size in bytes for a table that will be broadcasted to all worker nodes when >performing a join. In fact Spark compares plan.statistics.sizeInBytes for columns selected in join, not a table size. [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]The broadcasted table can be huge and leads to OOM on driver, so spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not compared to broadcasted table size. 
[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration] Related topic on SO: [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s] > autoBroadcastJoinThreshold compared to plan.statistics not a table size > --- > > Key: SPARK-46516 > URL: https://issues.apache.org/jira/browse/SPARK-46516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Guram Savinov >Priority: Major > > From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum > size in bytes for a table that will be broadcasted to all worker nodes when > performing a join. > [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration] > In fact Spark compares plan.statistics.sizeInBytes for columns selected in > join, not a table size. > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368] > The broadcasted table can be huge and leads to OOM on driver, so > spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not > compared to broadcasted table size. > Related topic on SO: > [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size
[ https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guram Savinov updated SPARK-46516: -- Description: >From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum >size in bytes for a table that will be broadcasted to all worker nodes when >performing a join. In fact Spark compares plan.statistics.sizeInBytes for columns selected in join, not a table size. [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]The broadcasted table can be huge and leads to OOM on driver, so spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not compared to broadcasted table size. [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration] Related topic on SO: [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s] was: >From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum >size in bytes for a table that will be broadcasted to all worker nodes when >performing a join. In fact Spark compares plan.statistics.sizeInBytes for columns selected in join, not a table size. The broadcasted table can be huge and leads to OOM on driver, so spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not compared to broadcasted table size. 
[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration] Related topic on SO: [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s] > autoBroadcastJoinThreshold compared to plan.statistics not a table size > --- > > Key: SPARK-46516 > URL: https://issues.apache.org/jira/browse/SPARK-46516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Guram Savinov >Priority: Major > > From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum > size in bytes for a table that will be broadcasted to all worker nodes when > performing a join. > In fact Spark compares plan.statistics.sizeInBytes for columns selected in > join, not a table size. > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]The > broadcasted table can be huge and leads to OOM on driver, so > spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not > compared to broadcasted table size. > [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration] > Related topic on SO: > [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size
[ https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guram Savinov updated SPARK-46516: -- Description: >From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum >size in bytes for a table that will be broadcasted to all worker nodes when >performing a join. In fact Spark compares plan.statistics.sizeInBytes for columns selected in join, not a table size. The broadcasted table can be huge and leads to OOM on driver, so spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not compared to broadcasted table size. [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration] Related topic on SO: [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s] was: >From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum >size in bytes for a table that will be broadcasted to all worker nodes when >performing a join. In fact Spark compares plan.statistics.sizeInBytes for columns selected in join, not a table size. The broadcasted table can be huge and leads to OOM on driver, so spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not compared to broadcasted table sizes. [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration] Related topic on SO: https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s > autoBroadcastJoinThreshold compared to plan.statistics not a table size > --- > > Key: SPARK-46516 > URL: https://issues.apache.org/jira/browse/SPARK-46516 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.1.1 >Reporter: Guram Savinov >Priority: Major > > From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum > size in bytes for a table that will be broadcasted to all worker nodes when > performing a join. 
> In fact Spark compares plan.statistics.sizeInBytes for columns selected in > join, not a table size. > The broadcasted table can be huge and leads to OOM on driver, so > spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not > compared to broadcasted table size. > [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration] > Related topic on SO: > [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size
[ https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guram Savinov updated SPARK-46516: -- Issue Type: Bug (was: Documentation) > autoBroadcastJoinThreshold compared to plan.statistics not a table size > --- > > Key: SPARK-46516 > URL: https://issues.apache.org/jira/browse/SPARK-46516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Guram Savinov >Priority: Major > > From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum > size in bytes for a table that will be broadcasted to all worker nodes when > performing a join. > In fact Spark compares plan.statistics.sizeInBytes for columns selected in > join, not a table size. > The broadcasted table can be huge and leads to OOM on driver, so > spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not > compared to broadcasted table size. > [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration] > Related topic on SO: > [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size
Guram Savinov created SPARK-46516: - Summary: autoBroadcastJoinThreshold compared to plan.statistics not a table size Key: SPARK-46516 URL: https://issues.apache.org/jira/browse/SPARK-46516 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.1.1 Reporter: Guram Savinov >From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum >size in bytes for a table that will be broadcasted to all worker nodes when >performing a join. In fact Spark compares plan.statistics.sizeInBytes for columns selected in join, not a table size. The broadcasted table can be huge and leads to OOM on driver, so spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not compared to broadcasted table sizes. [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration] Related topic on SO: https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
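The mismatch this ticket describes can be sketched in Python. The sizes below are hypothetical, and `can_broadcast` only mirrors the shape of the threshold check referenced in joins.scala; it is not Spark's actual code:

```python
# Illustrative sketch, not Spark source: the broadcast decision compares the
# plan's estimated sizeInBytes (after column pruning) with the threshold,
# never the table's on-disk size.
DEFAULT_THRESHOLD = 10 * 1024 * 1024  # spark.sql.autoBroadcastJoinThreshold defaults to 10 MB

def can_broadcast(plan_size_in_bytes: int, threshold: int = DEFAULT_THRESHOLD) -> bool:
    # Shape of the check: a non-negative estimate at or below the threshold qualifies
    return 0 <= plan_size_in_bytes <= threshold

on_disk_table_size = 5 * 1024**3  # hypothetical 5 GB table
pruned_plan_size = 8 * 1024**2    # hypothetical 8 MB estimate for the single joined column

print(can_broadcast(pruned_plan_size))    # True: broadcast is chosen despite the 5 GB table
print(can_broadcast(on_disk_table_size))  # False: the full table size would not qualify
```

This is why a table far larger than the threshold on disk can still be picked for broadcast, which the reporter argues can OOM the driver.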
[jira] [Assigned] (SPARK-46506) Refine docstring of `array_intersect/array_union/array_except`
[ https://issues.apache.org/jira/browse/SPARK-46506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-46506: Assignee: Yang Jie > Refine docstring of `array_intersect/array_union/array_except` > -- > > Key: SPARK-46506 > URL: https://issues.apache.org/jira/browse/SPARK-46506 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46506) Refine docstring of `array_intersect/array_union/array_except`
[ https://issues.apache.org/jira/browse/SPARK-46506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-46506. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44490 [https://github.com/apache/spark/pull/44490] > Refine docstring of `array_intersect/array_union/array_except` > -- > > Key: SPARK-46506 > URL: https://issues.apache.org/jira/browse/SPARK-46506 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46192) failed to insert the table using the default value of union
[ https://issues.apache.org/jira/browse/SPARK-46192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800471#comment-17800471 ] zengxl commented on SPARK-46192: {code:java} create table test_spark_3(k string default null,v int default null,m string default null) stored as orc; insert into table test_spark_3(k,v) select k,sum(v) v from test_spark_1 group by k; insert into table test_spark_3(k,v) select distinct a.k,a.v from test_spark a left join test_spark_1 b on a.k=b.k limit 2;{code} The above SQL has the same exception > failed to insert the table using the default value of union > --- > > Key: SPARK-46192 > URL: https://issues.apache.org/jira/browse/SPARK-46192 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.4.1 >Reporter: zengxl >Priority: Major > > > Obtain the following tables and data > {code:java} > create table test_spark(k string default null,v int default null) stored as > orc; > create table test_spark_1(k string default null,v int default null) stored as > orc; > insert into table test_spark_1 values('k1',1),('k2',2),('k3',3); > create table test_spark_2(k string default null,v int default null) stored as > orc; > insert into table test_spark_2 values('k3',3),('k4',4),('k5',5); > {code} > Execute the following SQL > {code:java} > insert into table test_spark (k) > select k from test_spark_1 > union > select k from test_spark_2 > {code} > exception: > {code:java} > 23/12/01 10:44:25 INFO HiveSessionStateBuilder$$anon$1: here is > CatalogAndIdentifier > 23/12/01 10:44:25 INFO HiveSessionStateBuilder$$anon$1: here is > CatalogAndIdentifier > 23/12/01 10:44:25 INFO HiveSessionStateBuilder$$anon$1: here is > CatalogAndIdentifier > 23/12/01 10:44:26 INFO Analyzer$ResolveUserSpecifiedColumns: > i.userSpecifiedCols.size is 1 > 23/12/01 10:44:26 INFO Analyzer$ResolveUserSpecifiedColumns: > i.userSpecifiedCols.size is 1 > 23/12/01 10:44:26 INFO Analyzer$ResolveUserSpecifiedColumns: i.table.output 2 > 
,resolved :1 , i.query 1 > 23/12/01 10:44:26 INFO Analyzer$ResolveUserSpecifiedColumns: here is > ResolveUserSpecifiedColumns tableOutoyt: 2---nameToQueryExpr : 1Error in > query: `default`.`test_spark` requires that the data to be inserted have the > same number of columns as the target table: target table has 2 column(s) but > the inserted data has 1 column(s), including 0 partition column(s) having > constant value(s). {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46514) Fix HiveMetastoreLazyInitializationSuite
[ https://issues.apache.org/jira/browse/SPARK-46514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46514: --- Labels: pull-request-available (was: ) > Fix HiveMetastoreLazyInitializationSuite > > > Key: SPARK-46514 > URL: https://issues.apache.org/jira/browse/SPARK-46514 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46513) Move `BasicIndexingTests` to `pyspark.pandas.tests.indexes.*`
[ https://issues.apache.org/jira/browse/SPARK-46513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46513: --- Labels: pull-request-available (was: ) > Move `BasicIndexingTests` to `pyspark.pandas.tests.indexes.*` > - > > Key: SPARK-46513 > URL: https://issues.apache.org/jira/browse/SPARK-46513 > Project: Spark > Issue Type: Sub-task > Components: PS, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46512) Optimize shuffle reading when both sort and combine are used.
[ https://issues.apache.org/jira/browse/SPARK-46512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenyu Zheng updated SPARK-46512: - Description: After the shuffle reader obtains the blocks, it first performs a combine operation and then a sort operation. Both combine and sort may generate temporary files, so performance may be poor when both are used. In fact, the combine can be performed during the sort process, avoiding the combine spill file. I did not find any direct API to construct a shuffle in which both sort and combine are used, but it can be done as in the following code: a word count whose output words are sorted. {code:java} sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)). reduceByKey(_ + _, 1). asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String). collect().foreach(println) {code} was: After the shuffle reader obtains the block, it will first perform a combine operation, and then perform a sort operation. It is known that both combine and sort may generate temporary files, so the performance may be poor when both sort and combine are used. In fact, combine operations can be performed during the sort process, and we can avoid the combine spill file. I did not find any direct api to construct the shuffle which both sort and combine is used. But I can do like below code, here is a wordcount, and the output words is sorted. ``` sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)). reduceByKey(_ + _, 1). asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String). collect().foreach(println) ``` > Optimize shuffle reading when both sort and combine are used. 
> - > > Key: SPARK-46512 > URL: https://issues.apache.org/jira/browse/SPARK-46512 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 4.0.0 >Reporter: Chenyu Zheng >Priority: Minor > > After the shuffle reader obtains the blocks, it first performs a combine > operation and then a sort operation. Both combine > and sort may generate temporary files, so performance may be poor when > both are used. In fact, the combine can be performed > during the sort process, avoiding the combine spill file. > > I did not find any direct API to construct a shuffle in which both sort and > combine are used, but it can be done as in the following code: a word count > whose output words are sorted. > {code:java} > sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)). > reduceByKey(_ + _, 1). > asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String). > collect().foreach(println) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46514) Fix HiveMetastoreLazyInitializationSuite
Kent Yao created SPARK-46514: Summary: Fix HiveMetastoreLazyInitializationSuite Key: SPARK-46514 URL: https://issues.apache.org/jira/browse/SPARK-46514 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46513) Move `BasicIndexingTests` to `pyspark.pandas.tests.indexes.*`
Ruifeng Zheng created SPARK-46513: - Summary: Move `BasicIndexingTests` to `pyspark.pandas.tests.indexes.*` Key: SPARK-46513 URL: https://issues.apache.org/jira/browse/SPARK-46513 Project: Spark Issue Type: Sub-task Components: PS, Tests Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46510) Spark shell log filter should be applied to all AbstractAppender
[ https://issues.apache.org/jira/browse/SPARK-46510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46510: -- Assignee: (was: Apache Spark) > Spark shell log filter should be applied to all AbstractAppender > > > Key: SPARK-46510 > URL: https://issues.apache.org/jira/browse/SPARK-46510 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.1, 3.4.1, 3.3.4 >Reporter: Yi Zhu >Priority: Major > Labels: pull-request-available > > When we set async appender and refer to console, spark shell log filter won't > work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46510) Spark shell log filter should be applied to all AbstractAppender
[ https://issues.apache.org/jira/browse/SPARK-46510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46510: -- Assignee: Apache Spark > Spark shell log filter should be applied to all AbstractAppender > > > Key: SPARK-46510 > URL: https://issues.apache.org/jira/browse/SPARK-46510 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.1, 3.4.1, 3.3.4 >Reporter: Yi Zhu >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > When we set async appender and refer to console, spark shell log filter won't > work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46510) Spark shell log filter should be applied to all AbstractAppender
[ https://issues.apache.org/jira/browse/SPARK-46510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46510: -- Assignee: Apache Spark > Spark shell log filter should be applied to all AbstractAppender > > > Key: SPARK-46510 > URL: https://issues.apache.org/jira/browse/SPARK-46510 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.1, 3.4.1, 3.3.4 >Reporter: Yi Zhu >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > When we set async appender and refer to console, spark shell log filter won't > work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46510) Spark shell log filter should be applied to all AbstractAppender
[ https://issues.apache.org/jira/browse/SPARK-46510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46510: -- Assignee: (was: Apache Spark) > Spark shell log filter should be applied to all AbstractAppender > > > Key: SPARK-46510 > URL: https://issues.apache.org/jira/browse/SPARK-46510 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.1, 3.4.1, 3.3.4 >Reporter: Yi Zhu >Priority: Major > Labels: pull-request-available > > When we set async appender and refer to console, spark shell log filter won't > work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46460) The filter of partition including cast function may lead the partition pruning to disable
[ https://issues.apache.org/jira/browse/SPARK-46460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhou Tong updated SPARK-46460: -- Attachment: SPARK-46460.patch > The filter of partition including cast function may lead the partition > pruning to disable > - > > Key: SPARK-46460 > URL: https://issues.apache.org/jira/browse/SPARK-46460 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 3.2.0 >Reporter: Zhou Tong >Priority: Minor > Labels: pull-request-available > Attachments: SPARK-46460.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > SQL: select * from test_db.test_table where day between > date_sub('2023-12-01',1) and '2023-12-03' > The physical plan of the SQL above will apply a _cast_ to the partition > col 'day', like this: {_}cast(day as date) > 2023-11-30{_}. In this > situation, Spark passes only the filter condition _day < "2023-12-03"_ to > the Hive Metastore, not the condition {_}cast(day as date) > > 2023-11-30{_}, which may degrade HMS performance if the Hive table has > a huge number of partitions. > > A new rule may solve this problem: it can convert the > binary comparison _cast(day as date) > 2023-11-30_ to {_}day > > cast(2023-11-30 as string){_}. The right node is foldable, so the result is > {_}day > "2023-11-30"{_}, and the filter condition passed to HMS will be _day > > "2023-11-30" and_ _day < "2023-12-03"._ > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
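The rewrite this ticket proposes can be sketched in Python. The tuple encoding of predicates and the `rewrite` helper are hypothetical stand-ins for a Catalyst rule, shown only to make the direction of the rewrite concrete:

```python
from datetime import date

def rewrite(pred):
    """Move the cast from the partition column to the foldable literal:
    cast(day as date) > DATE'2023-11-30'  =>  day > '2023-11-30'.
    The raw-column form can then be pushed down to the Hive metastore."""
    op, left, literal = pred
    if isinstance(left, tuple) and left[0] == "cast_to_date":
        _, column = left
        return (op, column, str(literal))  # fold the cast into the literal side
    return pred

before = (">", ("cast_to_date", "day"), date(2023, 11, 30))
print(rewrite(before))  # ('>', 'day', '2023-11-30')
```

After the rewrite, both conjuncts reference the bare partition column, so the metastore receives the full range filter instead of only one side of it.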
[jira] [Resolved] (SPARK-46511) Optimize spark jdbc write speed with Multi-Row Inserts
[ https://issues.apache.org/jira/browse/SPARK-46511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] melin resolved SPARK-46511. --- Resolution: Fixed > Optimize spark jdbc write speed with Multi-Row Inserts > -- > > Key: SPARK-46511 > URL: https://issues.apache.org/jira/browse/SPARK-46511 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: melin >Priority: Major > > INSERT INTO table_name (column1, column2, column3) > VALUES (value1, value2, value3), > (value4, value5, value6), > (value7, value8, value9); -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
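The speedup idea in this ticket is to batch many rows into one INSERT statement instead of issuing one statement per row. A minimal sketch follows; the helper name is illustrative and this is not Spark's JDBC writer, and a real implementation should use prepared statements with bound parameters rather than string interpolation, to avoid SQL injection:

```python
def multi_row_insert(table, columns, rows):
    # Build one multi-row INSERT; repr() quoting is for illustration only --
    # production code must bind parameters via prepared statements.
    cols = ", ".join(columns)
    values = ", ".join("(" + ", ".join(repr(v) for v in row) + ")" for row in rows)
    return f"INSERT INTO {table} ({cols}) VALUES {values}"

sql = multi_row_insert("table_name", ["column1", "column2"], [(1, "a"), (2, "b")])
print(sql)  # INSERT INTO table_name (column1, column2) VALUES (1, 'a'), (2, 'b')
```

Batching amortizes per-statement network round trips and parsing overhead, which is where the write-speed gain comes from.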
[jira] [Updated] (SPARK-46498) Remove `shuffleServiceEnabled` from `o.a.spark.util.Utils#getConfiguredLocalDirs`
[ https://issues.apache.org/jira/browse/SPARK-46498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46498: -- Summary: Remove `shuffleServiceEnabled` from `o.a.spark.util.Utils#getConfiguredLocalDirs` (was: Remove an unused local variables from `o.a.spark.util.Utils#getConfiguredLocalDirs`) > Remove `shuffleServiceEnabled` from > `o.a.spark.util.Utils#getConfiguredLocalDirs` > - > > Key: SPARK-46498 > URL: https://issues.apache.org/jira/browse/SPARK-46498 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46498) Remove an unused local variables from `o.a.spark.util.Utils#getConfiguredLocalDirs`
[ https://issues.apache.org/jira/browse/SPARK-46498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46498: - Assignee: Yang Jie > Remove an unused local variables from > `o.a.spark.util.Utils#getConfiguredLocalDirs` > --- > > Key: SPARK-46498 > URL: https://issues.apache.org/jira/browse/SPARK-46498 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46498) Remove an unused local variables from `o.a.spark.util.Utils#getConfiguredLocalDirs`
[ https://issues.apache.org/jira/browse/SPARK-46498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46498. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44475 [https://github.com/apache/spark/pull/44475] > Remove an unused local variables from > `o.a.spark.util.Utils#getConfiguredLocalDirs` > --- > > Key: SPARK-46498 > URL: https://issues.apache.org/jira/browse/SPARK-46498 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46371) Clean up outdated items in `.rat-excludes`
[ https://issues.apache.org/jira/browse/SPARK-46371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46371. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44293 [https://github.com/apache/spark/pull/44293] > Clean up outdated items in `.rat-excludes` > -- > > Key: SPARK-46371 > URL: https://issues.apache.org/jira/browse/SPARK-46371 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46371) Clean up outdated items in `.rat-excludes`
[ https://issues.apache.org/jira/browse/SPARK-46371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46371: - Assignee: BingKun Pan > Clean up outdated items in `.rat-excludes` > -- > > Key: SPARK-46371 > URL: https://issues.apache.org/jira/browse/SPARK-46371 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45914) Support `commit` and `abort` API for Python data source write
[ https://issues.apache.org/jira/browse/SPARK-45914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45914: --- Labels: pull-request-available (was: ) > Support `commit` and `abort` API for Python data source write > - > > Key: SPARK-45914 > URL: https://issues.apache.org/jira/browse/SPARK-45914 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > > Support `commit` and `abort` API for Python data source write. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org