[ 
https://issues.apache.org/jira/browse/SPARK-21969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21969:
------------------------------------
    Description: 
The table relation is cached, so even after its statistics are removed from
the catalog, existing sessions keep using the stale values.


{{
spark.range(100).write.saveAsTable("tab1")
sql("analyze table tab1 compute statistics")
sql("explain cost select distinct * from tab1").show(false)
}}

Produces:
{{
Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none)
}}


{{
spark.range(100).write.mode("append").saveAsTable("tab1")
sql("explain cost select distinct * from tab1").show(false)
}}

After appending more data, the same (now stale) statistics are still used:
{{
Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none)
}}

Manually refreshing the table removes the stale statistics:
{{
import org.apache.spark.sql.catalyst.TableIdentifier
spark.sessionState.catalog.refreshTable(TableIdentifier("tab1"))
sql("explain cost select distinct * from tab1").show(false)
}}

{{
Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)
}}
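
The behaviour above can be illustrated without Spark by a small toy model. This is only a sketch under stated assumptions: `Stats`, `ToyCatalog`, and both `updateStats...` methods are hypothetical names invented for illustration and are not Spark classes or the actual `CommandUtils` code. The model assumes a catalog that serves lookups from a per-session cache, and it simplifies "statistics removed" to storing the post-append size with no row count. It contrasts an update path that leaves the cache alone with one that refreshes the table afterwards, as the issue title proposes.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for illustration only; none of these are Spark classes.
final case class Stats(sizeInBytes: Long, rowCount: Option[Long])

final class ToyCatalog {
  // Source of truth for table statistics.
  private val metastore = mutable.Map.empty[String, Stats]
  // Cached view that lookups reuse once it has been populated.
  private val relationCache = mutable.Map.empty[String, Stats]

  def createTable(name: String, stats: Stats): Unit =
    metastore.update(name, stats)

  // The first lookup fills the cache; later lookups never consult the metastore.
  def lookup(name: String): Stats =
    relationCache.getOrElseUpdate(name, metastore(name))

  // Drops the cached entry so that the next lookup re-reads the metastore.
  def refreshTable(name: String): Unit = {
    relationCache.remove(name)
  }

  // Reported behaviour: the metastore changes, the cached entry does not.
  def updateStatsWithoutRefresh(name: String, stats: Stats): Unit =
    metastore.update(name, stats)

  // Proposed behaviour: invalidate the cached entry after changing the stats.
  def updateStatsWithRefresh(name: String, stats: Stats): Unit = {
    metastore.update(name, stats)
    refreshTable(name)
  }
}

object StaleStatsDemo {
  def main(args: Array[String]): Unit = {
    val catalog = new ToyCatalog
    catalog.createTable("tab1", Stats(784L, Some(100L)))
    catalog.lookup("tab1") // populates the cache with the analyzed stats

    // Append without a refresh: the lookup still returns the old stats.
    catalog.updateStatsWithoutRefresh("tab1", Stats(1568L, None))
    println(catalog.lookup("tab1"))

    // Append with a refresh: the lookup now returns the new stats.
    catalog.updateStatsWithRefresh("tab1", Stats(1568L, None))
    println(catalog.lookup("tab1"))
  }
}
```

In this model the only difference between the two update paths is the `refreshTable` call, which mirrors the manual step shown above.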

  was:
The table is cached so even though statistics are removed, they will still be 
used by the existing sessions.


{{code}}
spark.range(100).write.saveAsTable("tab1")
sql("analyze table tab1 compute statistics")
sql("explain cost select distinct * from tab1").show(false)
{{code}}

Produces:
{{code}}
Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
hints=none)
{{code}}


{{code}}
spark.range(100).write.mode("append").saveAsTable("tab1")
sql("explain cost select distinct * from tab1").show(false)
{{code}}

After append something, the same stats are used
{{code}}
Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
hints=none)
{{code}}

Manually refreshing the table removes the stats
{{code}}
spark.sessionState.catalog.refreshTable(TableIdentifier("tab1"))
sql("explain cost select distinct * from tab1").show(false)
{{code}}

{{code}}
Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)
{{code}}


> CommandUtils.updateTableStats should call refreshTable
> ------------------------------------------------------
>
>                 Key: SPARK-21969
>                 URL: https://issues.apache.org/jira/browse/SPARK-21969
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Bogdan Raducanu
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

