HonglunChen created SPARK-35833:
-----------------------------------

             Summary: The Statistics size of PARQUET table is not estimated 
correctly
                 Key: SPARK-35833
                 URL: https://issues.apache.org/jira/browse/SPARK-35833
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.1.2
            Reporter: HonglunChen


{code:scala}
// Table 'test_txt' and 'test_parquet' have the same data.
scala> val sql="select * from tmp_db.test_txt"
sql: String = select * from tmp_db.test_txt
scala> spark.sql(sql).queryExecution.optimizedPlan.stats.sizeInBytes
res5: BigInt = 92990
scala> val sql = "select * from tmp_db.test_parquet"
sql: String = select * from tmp_db.test_parquet
scala> spark.sql(sql).queryExecution.optimizedPlan.stats.sizeInBytes
res6: BigInt = 37556
{code}
Parquet files are compressed by default, so the statistics reflect the on-disk (compressed) size rather than the in-memory size. This can lead Spark to choose the wrong join strategy, e.g. a broadcast join: the driver may OOM because the actual in-memory size can be much greater than {{spark.sql.autoBroadcastJoinThreshold}}. Can we improve this?
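As a hedged sketch of the underlying arithmetic: Spark exposes {{spark.sql.sources.fileCompressionFactor}} (since 2.3.1), which multiplies the on-disk file size when estimating relation statistics. The snippet below is not Spark internals, just a minimal illustration of how such a factor changes the broadcast decision; the threshold value is hypothetical, and the byte counts are reused from the shell session above.

{code:scala}
// Sketch only: illustrates how a compression factor shifts the
// size estimate past a (hypothetical) broadcast threshold.
object CompressionFactorSketch {
  val autoBroadcastJoinThreshold: Long = 50000L // hypothetical threshold, not Spark's default

  // Mirror the idea of spark.sql.sources.fileCompressionFactor:
  // scale the on-disk size up to approximate the in-memory size.
  def estimatedSize(onDiskBytes: Long, compressionFactor: Double): Long =
    (onDiskBytes * compressionFactor).toLong

  def wouldBroadcast(sizeInBytes: Long): Boolean =
    sizeInBytes <= autoBroadcastJoinThreshold

  def main(args: Array[String]): Unit = {
    val parquetOnDisk = 37556L // compressed Parquet size from the session above
    // Factor 1.0 (the default): the compressed size slips under the
    // threshold and a broadcast join is chosen.
    println(wouldBroadcast(estimatedSize(parquetOnDisk, 1.0)))
    // Factor ~2.5 (roughly 92990 / 37556, the observed text-to-Parquet
    // ratio): the estimate exceeds the threshold, no broadcast.
    println(wouldBroadcast(estimatedSize(parquetOnDisk, 2.5)))
  }
}
{code}
In a real session the workaround would be along the lines of {{spark.conf.set("spark.sql.sources.fileCompressionFactor", "2.5")}}, but the factor is a single global guess, so a per-format or data-driven estimate would still be an improvement.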



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
