Babulal created SPARK-28413:
-------------------------------

             Summary: sizeInByte is Not updated for parquet datasource on Next Insert.
                 Key: SPARK-28413
                 URL: https://issues.apache.org/jira/browse/SPARK-28413
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.1, 2.3.2
            Reporter: Babulal

SPARK-21237 (https://issues.apache.org/jira/browse/SPARK-21237) fixed the statistics update for data appended with write.mode("append"). But when a Parquet table of the same type is created through SQL and data is then inserted into it, the reported statistics are incorrect: sizeInBytes is not updated.

*+Correct Stats Example (SPARK-21237)+*

scala> spark.range(100).write.saveAsTable("tab1")

scala> spark.sql("explain cost select * from tab1").show(false)
+------------------------------------------------------------------------
|plan
+------------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#10L] parquet, Statistics(*sizeInBytes=784.0 B*, hints=none)

== Physical Plan ==
FileScan parquet default.tab1[id#10L] Batched: false, Format: Parquet,

scala> spark.range(100).write.mode("append").saveAsTable("tab1")

scala> spark.sql("explain cost select * from tab1").show(false)
+----------------------------------------------------------------------
|plan
+----------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#23L] parquet, Statistics(*sizeInBytes=1568.0 B*, hints=none)

== Physical Plan ==
FileScan parquet default.tab1[id#23L] Batched: false, Format: Parquet,

*+Incorrect Stats Example+*

scala> spark.sql("create table tab2(id bigint) using parquet")
res6: org.apache.spark.sql.DataFrame = []

scala> spark.sql("explain cost select * from tab2").show(false)
+----------------------------------------------------------------------
|plan
+----------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#30L] parquet, Statistics(*sizeInBytes=374.0 B*, hints=none)

== Physical Plan ==
FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,

scala> spark.sql("insert into tab2 select 1")
res9: org.apache.spark.sql.DataFrame = []

scala> spark.sql("explain cost select * from tab2").show(false)
+----------------------------------------------------------------------
|plan
+----------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#30L] parquet, Statistics(*sizeInBytes={color:#FF0000}374.0 B{color}*, hints=none)

== Physical Plan ==
FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,

Both tables are the same type of table (MANAGED, provider parquet):

scala> spark.sql("desc formatted tab1").show(2000,false)
+----------------------------+--------------------------------------------------------------+
|col_name                    |data_type                                                     |
+----------------------------+--------------------------------------------------------------+
|id                          |bigint                                                        |
|                            |                                                              |
|# Detailed Table Information|                                                              |
|Database                    |default                                                       |
|Table                       |tab1                                                          |
|Owner                       |Administrator                                                 |
|Created Time                |Tue Jul 16 21:08:35 IST 2019                                  |
|Last Access                 |Thu Jan 01 05:30:00 IST 1970                                  |
|Created By                  |Spark 2.3.2                                                   |
|Type                        |MANAGED                                                       |
|Provider                    |parquet                                                       |
|Table Properties            |[transient_lastDdlTime=1563291579]                            |
|Statistics                  |1568 bytes                                                    |
|Location                    |file:/x/2                                                     |
|Serde Library               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe   |
|InputFormat                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |
|OutputFormat                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|

scala> spark.sql("desc formatted tab2").show(2000,false)
+----------------------------+--------------------------------------------------------------
|col_name                    |data_type
+----------------------------+--------------------------------------------------------------
|id                          |bigint
|                            |
|# Detailed Table Information|
|Database                    |default
|Table                       |tab2
|Owner                       |Administrator
|Created Time                |Tue Jul 16 21:10:24 IST 2019
|Last Access                 |Thu Jan 01 05:30:00 IST 1970
|Created By                  |Spark 2.3.2
|Type                        |MANAGED
|Provider                    |parquet
|Table Properties            |[transient_lastDdlTime=1563291624]
|Location                    |file:/x/1
|Serde Library               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
|InputFormat                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
|OutputFormat                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
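
To compare the reported size before and after an insert without reading the plans by eye, the check could be scripted roughly as below. This is only a sketch: `StatsCheck`, `sizeInBytes` and `sizeChanged` are hypothetical names and not part of Spark, and the regular expression assumes the plain `Statistics(sizeInBytes=<number> <unit>, ...)` text that "explain cost" prints in the examples above, without the JIRA emphasis markup.

```scala
// Hypothetical helper, not part of Spark: extracts the sizeInBytes figure
// from the text printed by "explain cost" so two runs can be compared.
object StatsCheck {
  // Matches e.g. "sizeInBytes=784.0 B" inside "Statistics(sizeInBytes=784.0 B, hints=none)".
  private val SizePattern = """sizeInBytes=([0-9.]+)\s*([A-Za-z]+)""".r

  /** Returns (value, unit) for the first sizeInBytes entry in the plan text, if any. */
  def sizeInBytes(planText: String): Option[(Double, String)] =
    SizePattern.findFirstMatchIn(planText).map(m => (m.group(1).toDouble, m.group(2)))

  /** True when the reported size differs between the two plan texts. */
  def sizeChanged(before: String, after: String): Boolean =
    sizeInBytes(before) != sizeInBytes(after)
}
```

In spark-shell the plan text should be obtainable with spark.sql("explain cost select * from tab2").head.getString(0), captured once before and once after the insert. With the outputs shown in this report, the check would report a change for tab1 and no change for tab2.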