[jira] [Updated] (SPARK-56485) Incorrect RowCount Estimation in CBO leads to unintended BroadcastHashJoin

Ke Jia (Jira) Wed, 15 Apr 2026 06:07:50 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-56485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ke Jia updated SPARK-56485:
---------------------------
    Description: 
When running TPC-DS Q4 on the database 1TB  with CBO enabled and 
spark.sql.autoBroadcastJoinThreshold set to 10MB the query fails with a 
SparkException.
 
{code:java}
Py4JJavaError: An error occurred while calling o596.collectToPython.
: org.apache.spark.SparkException: Cannot broadcast the table that is larger 
than 8.0 GiB: 10.7 GiB
at 
org.apache.gluten.backendsapi.velox.VeloxSparkPlanExecApi.createBroadcastRelation(VeloxSparkPlanExecApi.scala:850)
at 
org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$2(ColumnarBroadcastExchangeExec.scala:79)
at org.apache.gluten.utils.Arm$.withResource(Arm.scala:25)
at org.apache.gluten.metrics.GlutenTimeMetric$.millis(GlutenTimeMetric.scala:37)
at 
org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$1(ColumnarBroadcastExchangeExec.scala:66)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$2(SQLExecution.scala:230)
at 
org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:225)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840){code}
 
The error indicates that Spark attempted to broadcast a table significantly 
larger than the 8GB limit (actual size ~10.7 GiB). Interestingly, the query 
runs successfully if CBO is disabled.

The issue stems from a bug in Spark's CBO Filter Estimation. When processing 
equality filters (e.g., d_year = 2001), the CBO incorrectly estimates the 
rowCount as 0.

Because the estimated row count is 0, the optimizer concludes that the join 
result will be very small and chooses a BroadcastHashJoin . However, the actual 
data size is roughly 10.7 GiB, which exceeds Spark's hard limit for 
broadcasting.

 

  was:
When running TPC-DS Q4 on the database 1TB  with CBO enabled and {{}}
{code:java}
spark.sql.autoBroadcastJoinThreshold{code}
{{ set to }}
{code:java}
10MB{code}
{{{}{}}}, the query fails with a SparkException.
 
{code:java}
Py4JJavaError: An error occurred while calling o596.collectToPython.
: org.apache.spark.SparkException: Cannot broadcast the table that is larger 
than 8.0 GiB: 10.7 GiB
at 
org.apache.gluten.backendsapi.velox.VeloxSparkPlanExecApi.createBroadcastRelation(VeloxSparkPlanExecApi.scala:850)
at 
org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$2(ColumnarBroadcastExchangeExec.scala:79)
at org.apache.gluten.utils.Arm$.withResource(Arm.scala:25)
at org.apache.gluten.metrics.GlutenTimeMetric$.millis(GlutenTimeMetric.scala:37)
at 
org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$1(ColumnarBroadcastExchangeExec.scala:66)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$2(SQLExecution.scala:230)
at 
org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:225)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840){code}

 
The error indicates that Spark attempted to broadcast a table significantly 
larger than the 8GB limit (actual size ~10.7 GiB). Interestingly, the query 
runs successfully if CBO is disabled.

The issue stems from a bug in Spark's CBO Filter Estimation. When processing 
equality filters (e.g., d_year = 2001), the CBO incorrectly estimates the 
rowCount as 0.

Because the estimated row count is 0, the optimizer concludes that the join 
result will be very small and chooses a BroadcastHashJoin . However, the actual 
data size is roughly 10.7 GiB, which exceeds Spark's hard limit for 
broadcasting.

 


> Incorrect RowCount Estimation in CBO leads to unintended BroadcastHashJoin
> --------------------------------------------------------------------------
>
>                 Key: SPARK-56485
>                 URL: https://issues.apache.org/jira/browse/SPARK-56485
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.1.1
>            Reporter: Ke Jia
>            Priority: Major
>
> When running TPC-DS Q4 on the database 1TB  with CBO enabled and 
> spark.sql.autoBroadcastJoinThreshold set to 10MB the query fails with a 
> SparkException.
>  
> {code:java}
> Py4JJavaError: An error occurred while calling o596.collectToPython.
> : org.apache.spark.SparkException: Cannot broadcast the table that is larger 
> than 8.0 GiB: 10.7 GiB
> at 
> org.apache.gluten.backendsapi.velox.VeloxSparkPlanExecApi.createBroadcastRelation(VeloxSparkPlanExecApi.scala:850)
> at 
> org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$2(ColumnarBroadcastExchangeExec.scala:79)
> at org.apache.gluten.utils.Arm$.withResource(Arm.scala:25)
> at 
> org.apache.gluten.metrics.GlutenTimeMetric$.millis(GlutenTimeMetric.scala:37)
> at 
> org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$1(ColumnarBroadcastExchangeExec.scala:66)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$2(SQLExecution.scala:230)
> at 
> org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:225)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:840){code}
>  
> The error indicates that Spark attempted to broadcast a table significantly 
> larger than the 8GB limit (actual size ~10.7 GiB). Interestingly, the query 
> runs successfully if CBO is disabled.
> The issue stems from a bug in Spark's CBO Filter Estimation. When processing 
> equality filters (e.g., d_year = 2001), the CBO incorrectly estimates the 
> rowCount as 0.
> Because the estimated row count is 0, the optimizer concludes that the join 
> result will be very small and chooses a BroadcastHashJoin . However, the 
> actual data size is roughly 10.7 GiB, which exceeds Spark's hard limit for 
> broadcasting.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-56485) Incorrect RowCount Estimation in CBO leads to unintended BroadcastHashJoin

Reply via email to