[jira] [Updated] (SPARK-28413) sizeInByte is Not updated for parquet datasource on Next Insert.

2020-05-17 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28413:

Fix Version/s: 3.0.0

> sizeInByte is Not updated for parquet datasource on Next Insert.
> 
>
> Key: SPARK-28413
> URL: https://issues.apache.org/jira/browse/SPARK-28413
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.1
>Reporter: Babulal
>Priority: Minor
> Fix For: 3.0.0
>
>
> In SPARK-21237 this was fixed for data appended with write.mode("append").
> But when the same type of parquet table is created using SQL and data is
> added with INSERT INTO, the statistics shown are incorrect (not updated).
> *+Correct Stats Example (SPARK-21237)+*
>
> scala> spark.range(100).write.saveAsTable("tab1")
>
> scala> spark.sql("explain cost select * from tab1").show(false)
> == Optimized Logical Plan ==
> Relation[id#10L] parquet, Statistics(*sizeInBytes=784.0 B*, hints=none)
> == Physical Plan ==
> FileScan parquet default.tab1[id#10L] Batched: false, Format: Parquet,
>
> scala> spark.range(100).write.mode("append").saveAsTable("tab1")
>
> scala> spark.sql("explain cost select * from tab1").show(false)
> == Optimized Logical Plan ==
> Relation[id#23L] parquet, Statistics(*sizeInBytes=1568.0 B*, hints=none)
> == Physical Plan ==
> FileScan parquet default.tab1[id#23L] Batched: false, Format: Parquet,
>  
>  
> +*Incorrect Stats Example*+
>
> scala> spark.sql("create table tab2(id bigint) using parquet")
> res6: org.apache.spark.sql.DataFrame = []
>
> scala> spark.sql("explain cost select * from tab2").show(false)
> == Optimized Logical Plan ==
> Relation[id#30L] parquet, Statistics(*sizeInBytes=374.0 B*, hints=none)
> == Physical Plan ==
> FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,
>
> scala> spark.sql("insert into tab2 select 1")
> res9: org.apache.spark.sql.DataFrame = []
>
> scala> spark.sql("explain cost select * from tab2").show(false)
> == Optimized Logical Plan ==
> Relation[id#30L] parquet, Statistics(*sizeInBytes=374.0 B*, hints=none)
> == Physical Plan ==
> FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,
>  
>  
> Both tables are of the same type:
> scala> spark.sql("desc formatted tab1").show(2000,false)
>  
> +--------+---------+
> |col_name|data_type|
> +--------+---------+
> |id|bigint|
> | | |
> | # Detailed Table Information| |
> |Database|default|
> |Table|tab1|
> |Owner|Administrator|
> |Created Time|Tue Jul 16 21:08:35 IST 2019|
> |Last Access|Thu Jan 01 05:30:00 IST 1970|
> |Created By|Spark 2.3.2|
> |Type|MANAGED|
> |Provider|parquet|
> |Table Properties|[transient_lastDdlTime=1563291579]|
> |Statistics|1568 bytes|
> |Location|file:/x/2|
> |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe|
> |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat|
> |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|
>  
> scala> spark.sql("desc formatted tab2").show(2000,false)
>  
> +--------+---------+
> |col_name|data_type|
> +--------+---------+
> |id|bigint|
> | | |
> | # Detailed Table Information| |
> |Database|default|
> |Table|tab2|
> |Owner|Administrator|
> |Created Time|Tue Jul 16 21:10:24 IST 2019|
> |Last Access|Thu Jan 01 05:30:00 IST 1970|
> |Created By|Spark 2.3.2|
> |Type|MANAGED|
> |Provider|parquet|
> |Table Properties|[transient_lastDdlTime=1563291624]|
> |Location|file:/x/1|
> |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe|
> |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat|
> |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|
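A possible workaround sketch for affected versions. `ANALYZE TABLE ... COMPUTE STATISTICS` and the `spark.sql.statistics.size.autoUpdate.enabled` setting are standard Spark SQL, but whether they fully cover this case has not been verified here:

```sql
-- Recompute table-level statistics after the INSERT so that
-- `explain cost` reflects the new data size:
ANALYZE TABLE tab2 COMPUTE STATISTICS;

-- Alternatively, let Spark update sizeInBytes automatically after
-- data-changing commands (at the cost of extra file listing per write):
SET spark.sql.statistics.size.autoUpdate.enabled=true;
```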



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (SPARK-28413) sizeInByte is Not updated for parquet datasource on Next Insert.

2019-07-16 Thread Babulal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Babulal updated SPARK-28413:

Description: 
In SPARK-21237 this was fixed for data appended with write.mode("append").
But when the same type of parquet table is created using SQL and data is
added with INSERT INTO, the statistics shown are incorrect (not updated).
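The estimated size can also be read programmatically instead of parsing `explain cost` output; a sketch, assuming a Spark 2.3+ session (the no-arg `stats` API), untested here:

```scala
// Inspect the optimizer's estimated sizeInBytes for a table directly.
val plan = spark.table("tab2").queryExecution.optimizedPlan
println(plan.stats.sizeInBytes)
```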
