[
https://issues.apache.org/jira/browse/SPARK-56485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-56485:
-----------------------------------
Labels: pull-request-available (was: )
> Incorrect RowCount Estimation in CBO leads to unintended BroadcastHashJoin
> --------------------------------------------------------------------------
>
> Key: SPARK-56485
> URL: https://issues.apache.org/jira/browse/SPARK-56485
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.1.1
> Reporter: Ke Jia
> Priority: Major
> Labels: pull-request-available
>
> When running TPC-DS Q4 on the database 1TB with CBO enabled and
> spark.sql.autoBroadcastJoinThreshold set to 10MB the query fails with a
> SparkException.
>
> {code:java}
> Py4JJavaError: An error occurred while calling o596.collectToPython.
> : org.apache.spark.SparkException: Cannot broadcast the table that is larger
> than 8.0 GiB: 10.7 GiB
> at
> org.apache.gluten.backendsapi.velox.VeloxSparkPlanExecApi.createBroadcastRelation(VeloxSparkPlanExecApi.scala:850)
> at
> org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$2(ColumnarBroadcastExchangeExec.scala:79)
> at org.apache.gluten.utils.Arm$.withResource(Arm.scala:25)
> at
> org.apache.gluten.metrics.GlutenTimeMetric$.millis(GlutenTimeMetric.scala:37)
> at
> org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$1(ColumnarBroadcastExchangeExec.scala:66)
> at
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$2(SQLExecution.scala:230)
> at
> org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
> at
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:225)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:840){code}
>
> The error indicates that Spark attempted to broadcast a table significantly
> larger than the 8GB limit (actual size ~10.7 GiB). Interestingly, the query
> runs successfully if CBO is disabled.
> The issue stems from a bug in Spark's CBO Filter Estimation. When processing
> equality filters (e.g., d_year = 2001), the CBO incorrectly estimates the
> rowCount as 0.
> Because the estimated row count is 0, the optimizer concludes that the join
> result will be very small and chooses a BroadcastHashJoin . However, the
> actual data size is roughly 10.7 GiB, which exceeds Spark's hard limit for
> broadcasting.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]