[
https://issues.apache.org/jira/browse/SPARK-56485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ke Jia updated SPARK-56485:
---------------------------
Description:
When running TPC-DS Q4 on the database 1TB with CBO enabled and
spark.sql.autoBroadcastJoinThreshold set to 10MB the query fails with a
SparkException.
{code:java}
Py4JJavaError: An error occurred while calling o596.collectToPython.
: org.apache.spark.SparkException: Cannot broadcast the table that is larger
than 8.0 GiB: 10.7 GiB
at
org.apache.gluten.backendsapi.velox.VeloxSparkPlanExecApi.createBroadcastRelation(VeloxSparkPlanExecApi.scala:850)
at
org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$2(ColumnarBroadcastExchangeExec.scala:79)
at org.apache.gluten.utils.Arm$.withResource(Arm.scala:25)
at org.apache.gluten.metrics.GlutenTimeMetric$.millis(GlutenTimeMetric.scala:37)
at
org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$1(ColumnarBroadcastExchangeExec.scala:66)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$2(SQLExecution.scala:230)
at
org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:225)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840){code}
The error indicates that Spark attempted to broadcast a table significantly
larger than the 8GB limit (actual size ~10.7 GiB). Interestingly, the query
runs successfully if CBO is disabled.
The issue stems from a bug in Spark's CBO Filter Estimation. When processing
equality filters (e.g., d_year = 2001), the CBO incorrectly estimates the
rowCount as 0.
Because the estimated row count is 0, the optimizer concludes that the join
result will be very small and chooses a BroadcastHashJoin . However, the actual
data size is roughly 10.7 GiB, which exceeds Spark's hard limit for
broadcasting.
was:
When running TPC-DS Q4 on the database 1TB with CBO enabled and {{}}
{code:java}
spark.sql.autoBroadcastJoinThreshold{code}
{{ set to }}
{code:java}
10MB{code}
{{{}{}}}, the query fails with a SparkException.
{code:java}
Py4JJavaError: An error occurred while calling o596.collectToPython.
: org.apache.spark.SparkException: Cannot broadcast the table that is larger
than 8.0 GiB: 10.7 GiB
at
org.apache.gluten.backendsapi.velox.VeloxSparkPlanExecApi.createBroadcastRelation(VeloxSparkPlanExecApi.scala:850)
at
org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$2(ColumnarBroadcastExchangeExec.scala:79)
at org.apache.gluten.utils.Arm$.withResource(Arm.scala:25)
at org.apache.gluten.metrics.GlutenTimeMetric$.millis(GlutenTimeMetric.scala:37)
at
org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$1(ColumnarBroadcastExchangeExec.scala:66)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$2(SQLExecution.scala:230)
at
org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:225)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840){code}
The error indicates that Spark attempted to broadcast a table significantly
larger than the 8GB limit (actual size ~10.7 GiB). Interestingly, the query
runs successfully if CBO is disabled.
The issue stems from a bug in Spark's CBO Filter Estimation. When processing
equality filters (e.g., d_year = 2001), the CBO incorrectly estimates the
rowCount as 0.
Because the estimated row count is 0, the optimizer concludes that the join
result will be very small and chooses a BroadcastHashJoin . However, the actual
data size is roughly 10.7 GiB, which exceeds Spark's hard limit for
broadcasting.
> Incorrect RowCount Estimation in CBO leads to unintended BroadcastHashJoin
> --------------------------------------------------------------------------
>
> Key: SPARK-56485
> URL: https://issues.apache.org/jira/browse/SPARK-56485
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.1.1
> Reporter: Ke Jia
> Priority: Major
>
> When running TPC-DS Q4 on the database 1TB with CBO enabled and
> spark.sql.autoBroadcastJoinThreshold set to 10MB the query fails with a
> SparkException.
>
> {code:java}
> Py4JJavaError: An error occurred while calling o596.collectToPython.
> : org.apache.spark.SparkException: Cannot broadcast the table that is larger
> than 8.0 GiB: 10.7 GiB
> at
> org.apache.gluten.backendsapi.velox.VeloxSparkPlanExecApi.createBroadcastRelation(VeloxSparkPlanExecApi.scala:850)
> at
> org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$2(ColumnarBroadcastExchangeExec.scala:79)
> at org.apache.gluten.utils.Arm$.withResource(Arm.scala:25)
> at
> org.apache.gluten.metrics.GlutenTimeMetric$.millis(GlutenTimeMetric.scala:37)
> at
> org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$1(ColumnarBroadcastExchangeExec.scala:66)
> at
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$2(SQLExecution.scala:230)
> at
> org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
> at
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:225)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:840){code}
>
> The error indicates that Spark attempted to broadcast a table significantly
> larger than the 8GB limit (actual size ~10.7 GiB). Interestingly, the query
> runs successfully if CBO is disabled.
> The issue stems from a bug in Spark's CBO Filter Estimation. When processing
> equality filters (e.g., d_year = 2001), the CBO incorrectly estimates the
> rowCount as 0.
> Because the estimated row count is 0, the optimizer concludes that the join
> result will be very small and chooses a BroadcastHashJoin . However, the
> actual data size is roughly 10.7 GiB, which exceeds Spark's hard limit for
> broadcasting.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]