[ https://issues.apache.org/jira/browse/SPARK-21969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bogdan Raducanu updated SPARK-21969:
------------------------------------
    Description:
The table is cached so even though statistics are removed, they will still be used by the existing sessions.

{code}
spark.range(100).write.saveAsTable("tab1")
sql("analyze table tab1 compute statistics")
sql("explain cost select distinct * from tab1").show(false)
{code}

Produces:

{code}
Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none)
{code}

{code}
spark.range(100).write.mode("append").saveAsTable("tab1")
sql("explain cost select distinct * from tab1").show(false)
{code}

After appending data, the same stats are used:

{code}
Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none)
{code}

Manually refreshing the table removes the stats:

{code}
spark.sessionState.catalog.refreshTable(TableIdentifier("tab1"))
sql("explain cost select distinct * from tab1").show(false)
{code}

{code}
Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)
{code}

  was:
The table is cached so even though statistics are removed, they will still be used by the existing sessions.

{{
spark.range(100).write.saveAsTable("tab1")
sql("analyze table tab1 compute statistics")
sql("explain cost select distinct * from tab1").show(false)
}}

Produces:

{{
Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none)
}}

{{
spark.range(100).write.mode("append").saveAsTable("tab1")
sql("explain cost select distinct * from tab1").show(false)
}}

After appending data, the same stats are used:

{{
Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none)
}}

Manually refreshing the table removes the stats:

{{
spark.sessionState.catalog.refreshTable(TableIdentifier("tab1"))
sql("explain cost select distinct * from tab1").show(false)
}}

{{
Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)
}}

> CommandUtils.updateTableStats should call refreshTable
> ------------------------------------------------------
>
>                 Key: SPARK-21969
>                 URL: https://issues.apache.org/jira/browse/SPARK-21969
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Bogdan Raducanu
>
> The table is cached so even though statistics are removed, they will still be used by the existing sessions.
>
> {code}
> spark.range(100).write.saveAsTable("tab1")
> sql("analyze table tab1 compute statistics")
> sql("explain cost select distinct * from tab1").show(false)
> {code}
>
> Produces:
>
> {code}
> Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none)
> {code}
>
> {code}
> spark.range(100).write.mode("append").saveAsTable("tab1")
> sql("explain cost select distinct * from tab1").show(false)
> {code}
>
> After appending data, the same stats are used:
>
> {code}
> Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none)
> {code}
>
> Manually refreshing the table removes the stats:
>
> {code}
> spark.sessionState.catalog.refreshTable(TableIdentifier("tab1"))
> sql("explain cost select distinct * from tab1").show(false)
> {code}
>
> {code}
> Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)
> {code}
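
The stale-statistics behaviour described in this issue can be illustrated with a small toy model in plain Scala. This is only a sketch of the mechanism, not Spark's implementation: `Stats`, `ToyCatalog`, and all of its methods are hypothetical names invented for the illustration. The only assumption taken from the report is that a session keeps a cached copy of the resolved table, so changing the stored statistics has no visible effect until the cached entry is dropped.

```scala
import scala.collection.mutable

// Toy model only: these types are hypothetical and are not Spark classes.
final case class Stats(sizeInBytes: Long, rowCount: Option[Long])

final class ToyCatalog {
  // Stands in for the metastore: the source of truth for table statistics.
  private val stored = mutable.Map.empty[String, Stats]
  // Stands in for the per-session relation cache: a snapshot taken on first lookup.
  private val cached = mutable.Map.empty[String, Stats]

  def analyze(table: String, stats: Stats): Unit = stored.update(table, stats)

  // The first lookup fills the cache; later lookups reuse the cached snapshot.
  def lookup(table: String): Stats = cached.getOrElseUpdate(table, stored(table))

  // Models the reported behaviour: the store changes, the cache is left alone.
  def updateStatsWithoutRefresh(table: String, stats: Stats): Unit =
    stored.update(table, stats)

  // Models the behaviour the issue title asks for: update, then invalidate.
  def updateStatsWithRefresh(table: String, stats: Stats): Unit = {
    stored.update(table, stats)
    refreshTable(table)
  }

  def refreshTable(table: String): Unit = {
    cached.remove(table)
    ()
  }
}

object StaleStatsDemo extends App {
  val catalog = new ToyCatalog
  catalog.analyze("tab1", Stats(784L, Some(100L)))
  println(catalog.lookup("tab1")) // snapshot taken here

  // An append changes the stored stats, but the session still sees the snapshot.
  catalog.updateStatsWithoutRefresh("tab1", Stats(1568L, None))
  println(catalog.lookup("tab1"))

  // Dropping the cached entry makes the next lookup read the stored stats again.
  catalog.refreshTable("tab1")
  println(catalog.lookup("tab1"))
}
```

In this model, `updateStatsWithRefresh` corresponds to the change proposed in the issue title, while the explicit `refreshTable` call in the demo corresponds to the manual workaround shown in the description.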