[ 
https://issues.apache.org/jira/browse/SPARK-56485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-56485:
-----------------------------------
    Labels: pull-request-available  (was: )

> Incorrect RowCount Estimation in CBO leads to unintended BroadcastHashJoin
> --------------------------------------------------------------------------
>
>                 Key: SPARK-56485
>                 URL: https://issues.apache.org/jira/browse/SPARK-56485
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.1.1
>            Reporter: Ke Jia
>            Priority: Major
>              Labels: pull-request-available
>
> When running TPC-DS Q4 on a 1TB database with CBO enabled and 
> spark.sql.autoBroadcastJoinThreshold set to 10MB, the query fails with a 
> SparkException.
>  
> {code:java}
> Py4JJavaError: An error occurred while calling o596.collectToPython.
> : org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8.0 GiB: 10.7 GiB
> at org.apache.gluten.backendsapi.velox.VeloxSparkPlanExecApi.createBroadcastRelation(VeloxSparkPlanExecApi.scala:850)
> at org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$2(ColumnarBroadcastExchangeExec.scala:79)
> at org.apache.gluten.utils.Arm$.withResource(Arm.scala:25)
> at org.apache.gluten.metrics.GlutenTimeMetric$.millis(GlutenTimeMetric.scala:37)
> at org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$1(ColumnarBroadcastExchangeExec.scala:66)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$2(SQLExecution.scala:230)
> at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:225)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:840)
> {code}
>  
> The error indicates that Spark attempted to broadcast a table of roughly 
> 10.7 GiB, which exceeds the 8 GiB broadcast limit. The query runs 
> successfully if CBO is disabled.
> The issue stems from a bug in Spark's CBO filter estimation. When processing 
> equality filters (e.g., d_year = 2001), the CBO incorrectly estimates the 
> rowCount as 0.
> Because the estimated row count is 0, the optimizer concludes that the join 
> result will be very small and chooses a BroadcastHashJoin. However, the 
> actual data size is roughly 10.7 GiB, which exceeds Spark's hard limit for 
> broadcasting.
>  
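
The failure mode described above can be illustrated with a minimal sketch. This is a toy model, not Spark's actual estimation code: the `Stats` class, `estimate_equality_filter`, `choose_join`, and the table numbers are all hypothetical and only mimic the reported behavior, in which an estimated row count of 0 makes the estimated size fall below the 10MB broadcast threshold.

```python
from dataclasses import dataclass

# 10MB, matching spark.sql.autoBroadcastJoinThreshold in the report.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass
class Stats:
    """Hypothetical plan statistics: estimated rows and average row width."""
    row_count: int
    avg_row_bytes: int

    @property
    def size_in_bytes(self) -> int:
        # Assumption of this sketch: size is derived purely from the row count.
        return self.row_count * self.avg_row_bytes


def estimate_equality_filter(stats: Stats, selectivity: float) -> Stats:
    """Scale the row count by the estimated selectivity of `col = literal`."""
    return Stats(int(stats.row_count * selectivity), stats.avg_row_bytes)


def choose_join(build_side: Stats) -> str:
    """Pick a broadcast join when the estimated build side fits the threshold."""
    if build_side.size_in_bytes <= BROADCAST_THRESHOLD_BYTES:
        return "BroadcastHashJoin"
    return "SortMergeJoin"


# Hypothetical input: 1,000,000 rows of 100 bytes each (about 100 MB).
table = Stats(row_count=1_000_000, avg_row_bytes=100)

# A plausible non-zero selectivity keeps the estimate above the threshold.
reasonable = estimate_equality_filter(table, selectivity=0.5)

# A selectivity wrongly estimated as 0 yields rowCount = 0 and size = 0 bytes.
buggy = estimate_equality_filter(table, selectivity=0.0)

print(choose_join(reasonable))
print(choose_join(buggy))
```

In this model the zero-row estimate always selects the broadcast path regardless of how large the real data is, which is consistent with the behavior reported in this issue.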



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
