[jira] [Updated] (SPARK-28413) sizeInByte is Not updated for parquet datasource on Next Insert.
[ https://issues.apache.org/jira/browse/SPARK-28413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-28413:
    Fix Version/s: 3.0.0

> sizeInByte is Not updated for parquet datasource on Next Insert.
>
>              Key: SPARK-28413
>              URL: https://issues.apache.org/jira/browse/SPARK-28413
>          Project: Spark
>       Issue Type: Bug
>       Components: SQL
> Affects Versions: 2.3.2, 2.4.1
>         Reporter: Babulal
>         Priority: Minor
>          Fix For: 3.0.0
>
> In SPARK-21237 (link SPARK-21237) this was fixed for data appended using
> write.mode("append"). But when the same type of parquet table is created using SQL
> and data is inserted, the stats shown are incorrect (not updated).
>
> *+Correct Stats Example (SPARK-21237)+*
> scala> spark.range(100).write.saveAsTable("tab1")
> scala> spark.sql("explain cost select * from tab1").show(false)
> +
> |plan
> +|
> |== Optimized Logical Plan ==
> Relation[id#10L] parquet, Statistics(*sizeInBytes=784.0 B*, hints=none)|
> == Physical Plan ==
> FileScan parquet default.tab1[id#10L] Batched: false, Format: Parquet,
>
> scala> spark.range(100).write.mode("append").saveAsTable("tab1")
> scala> spark.sql("explain cost select * from tab1").show(false)
> +--
> |plan
> +--|
> |== Optimized Logical Plan ==
> Relation[id#23L] parquet, Statistics(*sizeInBytes=1568.0 B*, hints=none)|
> == Physical Plan ==
> FileScan parquet default.tab1[id#23L] Batched: false, Format: Parquet,
>
> +*Incorrect Stats Example*+
> scala> spark.sql("create table tab2(id bigint) using parquet")
> res6: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("explain cost select * from tab2").show(false)
> +--
> |plan
> +--|
> |== Optimized Logical Plan ==
> Relation[id#30L] parquet, Statistics(*sizeInBytes=374.0 B*, hints=none)|
> == Physical Plan ==
> FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,
>
> scala> spark.sql("insert into tab2 select 1")
> res9: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("explain cost select * from tab2").show(false)
> +--
> |plan
> +--|
> |== Optimized Logical Plan ==
> Relation[id#30L] parquet, Statistics(*sizeInBytes=374.0 B*, hints=none)|
> == Physical Plan ==
> FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,
>
> Both tables are the same type of table:
>
> scala> spark.sql("desc formatted tab1").show(2000,false)
> +-+-+
> |col_name|data_type|
> +-+-+
> |id|bigint|
> | | |
> | # Detailed Table Information| |
> |Database|default|
> |Table|tab1|
> |Owner|Administrator|
> |Created Time|Tue Jul 16 21:08:35 IST 2019|
> |Last Access|Thu Jan 01 05:30:00 IST 1970|
> |Created By|Spark 2.3.2|
> |Type|MANAGED|
> |Provider|parquet|
> |Table Properties|[transient_lastDdlTime=1563291579]|
> |Statistics|1568 bytes|
> |Location|file:/x/2|
> |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe|
> |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat|
> |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|
>
> scala> spark.sql("desc formatted tab2").show(2000,false)
> +-+-+
> |col_name|data_type|
> +-+-+
> |id|bigint|
> | | |
> | # Detailed Table Information| |
> |Database|default|
> |Table|tab2|
> |Owner|Administrator|
> |Created Time|Tue Jul 16 21:10:24 IST 2019|
> |Last Access|Thu Jan 01 05:30:00 IST 1970|
> |Created By|Spark 2.3.2|
> |Type|MANAGED|
> |Provider|parquet|
> |Table Properties|[transient_lastDdlTime=1563291624]|
> |Location|file:/x/1|
> |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe|
> |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat|
> |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|

--
This message was sent by Atlassian Jira (v8.3.4#803005)
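
The before/after comparison in the report is done by reading the `explain cost` output by eye. As a minimal sketch, the sizeInBytes figure could instead be extracted from the plan text and compared programmatically. `StatsCheck`, `sizeInBytesOf` and `sizeChanged` are hypothetical names, not part of Spark, and the regex assumes the `Statistics(sizeInBytes=<number> <unit>, ...)` layout seen in the plans quoted in the report.

```scala
// Hypothetical helper (not a Spark API): pulls the sizeInBytes value out of
// the text of an `explain cost` plan and compares two plans.
object StatsCheck {
  // Assumed layout, e.g. "Statistics(sizeInBytes=784.0 B, hints=none)".
  private val SizePattern = """sizeInBytes=([0-9.]+\s*[A-Za-z]+)""".r

  // Returns the reported size (number plus unit) if the plan text has one.
  def sizeInBytesOf(plan: String): Option[String] =
    SizePattern.findFirstMatchIn(plan).map(_.group(1))

  // True when the reported size differs between the two plan texts.
  def sizeChanged(before: String, after: String): Boolean =
    sizeInBytesOf(before) != sizeInBytesOf(after)
}

// Possible use in spark-shell, assuming the table tab2 from the report:
//   val before = spark.sql("explain cost select * from tab2").head.getString(0)
//   spark.sql("insert into tab2 select 1")
//   val after = spark.sql("explain cost select * from tab2").head.getString(0)
//   println(StatsCheck.sizeChanged(before, after))
```

On the versions listed as affected, the report implies that `sizeChanged` would return false for `tab2` after the insert, since both plans show 374.0 B.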
[jira] [Updated] (SPARK-28413) sizeInByte is Not updated for parquet datasource on Next Insert.
[ https://issues.apache.org/jira/browse/SPARK-28413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Babulal updated SPARK-28413:
    Description:
In SPARK-21237 (link SPARK-21237) this was fixed for data appended using write.mode("append"). But when the same type of parquet table is created using SQL and data is inserted, the stats shown are incorrect (not updated).

*+Correct Stats Example (SPARK-21237)+*
scala> spark.range(100).write.saveAsTable("tab1")
scala> spark.sql("explain cost select * from tab1").show(false)
+
|plan
+|
|== Optimized Logical Plan ==
Relation[id#10L] parquet, Statistics(*sizeInBytes=784.0 B*, hints=none)|
== Physical Plan ==
FileScan parquet default.tab1[id#10L] Batched: false, Format: Parquet,

scala> spark.range(100).write.mode("append").saveAsTable("tab1")
scala> spark.sql("explain cost select * from tab1").show(false)
+--
|plan
+--|
|== Optimized Logical Plan ==
Relation[id#23L] parquet, Statistics(*sizeInBytes=1568.0 B*, hints=none)|
== Physical Plan ==
FileScan parquet default.tab1[id#23L] Batched: false, Format: Parquet,

+*Incorrect Stats Example*+
scala> spark.sql("create table tab2(id bigint) using parquet")
res6: org.apache.spark.sql.DataFrame = []
scala> spark.sql("explain cost select * from tab2").show(false)
+--
|plan
+--|
|== Optimized Logical Plan ==
Relation[id#30L] parquet, Statistics(*sizeInBytes=374.0 B*, hints=none)|
== Physical Plan ==
FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,

scala> spark.sql("insert into tab2 select 1")
res9: org.apache.spark.sql.DataFrame = []
scala> spark.sql("explain cost select * from tab2").show(false)
+--
|plan
+--|
|== Optimized Logical Plan ==
Relation[id#30L] parquet, Statistics(*sizeInBytes=374.0 B*, hints=none)|
== Physical Plan ==
FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,

Both tables are the same type of table:

scala> spark.sql("desc formatted tab1").show(2000,false)
+-+-+
|col_name|data_type|
+-+-+
|id|bigint|
| | |
| # Detailed Table Information| |
|Database|default|
|Table|tab1|
|Owner|Administrator|
|Created Time|Tue Jul 16 21:08:35 IST 2019|
|Last Access|Thu Jan 01 05:30:00 IST 1970|
|Created By|Spark 2.3.2|
|Type|MANAGED|
|Provider|parquet|
|Table Properties|[transient_lastDdlTime=1563291579]|
|Statistics|1568 bytes|
|Location|file:/x/2|
|Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe|
|InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat|
|OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|

scala> spark.sql("desc formatted tab2").show(2000,false)
+-+-+
|col_name|data_type|
+-+-+
|id|bigint|
| | |
| # Detailed Table Information| |
|Database|default|
|Table|tab2|
|Owner|Administrator|
|Created Time|Tue Jul 16 21:10:24 IST 2019|
|Last Access|Thu Jan 01 05:30:00 IST 1970|
|Created By|Spark 2.3.2|
|Type|MANAGED|
|Provider|parquet|
|Table Properties|[transient_lastDdlTime=1563291624]|
|Location|file:/x/1|
|Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe|
|InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat|
|OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|

  was:
In SPARK-21237 ([link SPARK-21237|https://issues.apache.org/jira/browse/SPARK-21237]) this was fixed for data appended using write.mode("append"). But when the same type of parquet table is created using SQL and data is inserted, the stats shown are incorrect (not updated).

*+Correct Stats Example (SPARK-21237)+*
scala> spark.range(100).write.saveAsTable("tab1")
scala> spark.sql("explain cost select * from tab1").show(false)
+
|plan
+
|== Optimized Logical Plan ==
Relation[id#10L] parquet, Statistics(*sizeInBytes=784.0 B*, hints=none)
== Physical Plan ==
FileScan parquet default.tab1[id#10L]