HonglunChen created SPARK-35833:
-----------------------------------

             Summary: The Statistics size of PARQUET table is not estimated correctly
                 Key: SPARK-35833
                 URL: https://issues.apache.org/jira/browse/SPARK-35833
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.1.2
            Reporter: HonglunChen
{code:java}
// Table 'test_txt' and 'test_parquet' have the same data.
scala> val sql = "select * from tmp_db.test_txt"
sql: String = select * from tmp_db.test_txt

scala> spark.sql(sql).queryExecution.optimizedPlan.stats.sizeInBytes
res5: BigInt = 92990

scala> val sql = "select * from tmp_db.test_parquet"
sql: String = select * from tmp_db.test_parquet

scala> spark.sql(sql).queryExecution.optimizedPlan.stats.sizeInBytes
res6: BigInt = 37556
{code}

Parquet files are compressed by default, so the file-size-based statistics underestimate the actual in-memory data size. This can lead the optimizer to choose the wrong join type, e.g. a broadcast join: the driver may run out of memory in this case, because the actual (decompressed) size may be much greater than spark.sql.autoBroadcastJoinThreshold. Can we improve this?
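As a possible workaround (not part of the original report), Spark already exposes some knobs that compensate for compressed file sizes: spark.sql.sources.fileCompressionFactor inflates file-based size estimates, ANALYZE TABLE replaces raw file sizes with computed statistics, and spark.sql.autoBroadcastJoinThreshold can disable auto-broadcast outright. A minimal spark-shell sketch, assuming a live SparkSession {{spark}} and the table names from the report:

```scala
// Sketch of mitigations for underestimated Parquet sizes; assumes a
// running SparkSession `spark` and the tables shown above.

// 1. Inflate file-size-based estimates for compressed columnar sources
//    (default factor is 1.0; 3.0 here is an illustrative guess).
spark.conf.set("spark.sql.sources.fileCompressionFactor", "3.0")

// 2. Compute table-level statistics so the optimizer uses analyzed
//    sizes instead of raw (compressed) file sizes.
spark.sql("ANALYZE TABLE tmp_db.test_parquet COMPUTE STATISTICS")

// 3. Last resort: disable automatic broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Re-check the estimate after the changes:
spark.sql("select * from tmp_db.test_parquet")
  .queryExecution.optimizedPlan.stats.sizeInBytes
```

Note that fileCompressionFactor applies a single global multiplier, so it cannot track per-table compression ratios; ANALYZE TABLE gives more accurate per-table sizes but must be kept up to date as data changes.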