Babulal created SPARK-28413:
-------------------------------

             Summary: sizeInByte is Not updated for parquet datasource on Next Insert.
                 Key: SPARK-28413
                 URL: https://issues.apache.org/jira/browse/SPARK-28413
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.1, 2.3.2
            Reporter: Babulal


SPARK-21237 (https://issues.apache.org/jira/browse/SPARK-21237) fixed the
statistics update when data is appended using write.mode("append"). But when
the same type of parquet table is created using SQL and data is inserted, the
stats shown are incorrect (not updated).

*+Correct Stats  Example (SPARK-21237)+*

scala> spark.range(100).write.saveAsTable("tab1")

scala> spark.sql("explain cost select * from tab1").show(false)
+------------------------------------------------------------------------
|plan
+------------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#10L] parquet, Statistics(*sizeInBytes=784.0 B*, hints=none)

== Physical Plan ==
FileScan parquet default.tab1[id#10L] Batched: false, Format: Parquet, 

scala> spark.range(100).write.mode("append").saveAsTable("tab1")

scala> spark.sql("explain cost select * from tab1").show(false)
+----------------------------------------------------------------------
|plan
+----------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#23L] parquet, Statistics(*sizeInBytes=1568.0 B*, hints=none)

== Physical Plan ==
FileScan parquet default.tab1[id#23L] Batched: false, Format: Parquet,

 

 

+*Incorrect Stats Example*+

scala> spark.sql("create table tab2(id bigint) using parquet")
res6: org.apache.spark.sql.DataFrame = []

scala> spark.sql("explain cost select * from tab2").show(false)
+----------------------------------------------------------------------
|plan
+----------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#30L] parquet, Statistics(*sizeInBytes=374.0 B*, hints=none)

== Physical Plan ==
FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,

 

scala> spark.sql("insert into tab2 select 1")
res9: org.apache.spark.sql.DataFrame = []

scala> spark.sql("explain cost select * from tab2").show(false)
+----------------------------------------------------------------------
|plan
+----------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#30L] parquet, Statistics(*sizeInBytes=374.0 B*, hints=none)

== Physical Plan ==
FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,

 

 

Both tables are of the same type:

scala> spark.sql("desc formatted tab1").show(2000,false)
+----------------------------+--------------------------------------------------------------+
|col_name |data_type |
+----------------------------+--------------------------------------------------------------+
|id |bigint |
| | |
|# Detailed Table Information| |
|Database |default |
|Table |tab1 |
|Owner |Administrator |
|Created Time |Tue Jul 16 21:08:35 IST 2019 |
|Last Access |Thu Jan 01 05:30:00 IST 1970 |
|Created By |Spark 2.3.2 |
|Type |MANAGED |
|Provider |parquet |
|Table Properties |[transient_lastDdlTime=1563291579] |
|Statistics |1568 bytes |
|Location |file:/x/2 |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|

 

scala> spark.sql("desc formatted tab2").show(2000,false)
+----------------------------+--------------------------------------------------------------+
|col_name |data_type |
+----------------------------+--------------------------------------------------------------+
|id |bigint |
| | |
|# Detailed Table Information| |
|Database |default |
|Table |tab2 |
|Owner |Administrator |
|Created Time |Tue Jul 16 21:10:24 IST 2019 |
|Last Access |Thu Jan 01 05:30:00 IST 1970 |
|Created By |Spark 2.3.2 |
|Type |MANAGED |
|Provider |parquet |
|Table Properties |[transient_lastDdlTime=1563291624] |
|Location |file:/x/1 |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|
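
A possible way to narrow this down, sketched here as an assumption rather than a confirmed diagnosis: after the insert into tab2, refresh the table's cached metadata and then recompute its statistics, checking the reported size after each step. The object name StatsCheck, the application name, and the local master setting are illustrative only; in spark-shell the existing `spark` session can be used directly. Whether either step changes the reported size on the affected versions has not been verified here.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative wrapper; the name StatsCheck is hypothetical.
object StatsCheck {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .appName("SPARK-28413-stats-check")
      .master("local[*]")
      .getOrCreate()

    // Same steps as the incorrect-stats example in this report.
    spark.sql("create table if not exists tab2(id bigint) using parquet")
    spark.sql("insert into tab2 select 1")

    // Size as reported by the optimizer before any refresh.
    spark.sql("explain cost select * from tab2").show(false)

    // Invalidate cached metadata for the table, then look again.
    spark.catalog.refreshTable("tab2")
    spark.sql("explain cost select * from tab2").show(false)

    // Recompute table-level statistics and store them in the catalog,
    // then check whether a Statistics row now appears for tab2.
    spark.sql("analyze table tab2 compute statistics")
    spark.sql("desc formatted tab2").show(2000, false)

    spark.stop()
  }
}
```

If the size changes only after the refresh, stale cached metadata would be a likely suspect; if it changes only after ANALYZE TABLE, the missing Statistics row for tab2 would be the more relevant difference from tab1.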


