GitHub user aokolnychyi opened a pull request: https://github.com/apache/spark/pull/19252
[SPARK-21969][SQL] CommandUtils.updateTableStats should call refreshTable

## What changes were proposed in this pull request?

Tables in the catalog cache are not invalidated once their statistics are updated. As a consequence, existing sessions will use the cached information even though it is no longer valid. Consider the example below.

```
// step 1
spark.range(100).write.saveAsTable("tab1")
// step 2
spark.sql("analyze table tab1 compute statistics")
// step 3
spark.sql("explain cost select distinct * from tab1").show(false)
// step 4
spark.range(100).write.mode("append").saveAsTable("tab1")
// step 5
spark.sql("explain cost select distinct * from tab1").show(false)
```

After step 3, the table will be present in the catalog relation cache. Step 4 will correctly update the metadata inside the catalog but will NOT invalidate the cache. Note that running ``spark.sql("analyze table tab1 compute statistics")`` between step 3 and step 4 would also work around the problem.

## How was this patch tested?

Existing and additional unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aokolnychyi/spark spark-21969

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19252.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19252

----
commit ba963b46cd2917315bc2bd0cf237c7d9f79e9d65
Author: aokolnychyi <anton.okolnyc...@sap.com>
Date: 2017-09-16T11:57:52Z

    [SPARK-21969][SQL] CommandUtils.updateTableStats should call refreshTable
----
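Until a fix along these lines is merged, a session can clear the stale cache entry itself by refreshing the table through the public `Catalog` API. A minimal sketch, assuming an active `SparkSession` named `spark` and the `tab1` table from the example above (this workaround is not part of the PR's change itself):

```scala
// Workaround sketch: force Spark to drop the cached relation for "tab1"
// so that the next query re-reads the table's updated statistics from
// the catalog instead of using the stale cached entry.
spark.catalog.refreshTable("tab1")

// Re-running the plan after the refresh should now reflect the
// post-append statistics (i.e. rows written in step 4).
spark.sql("explain cost select distinct * from tab1").show(false)
```

This is the same invalidation the PR proposes to perform automatically inside `CommandUtils.updateTableStats`, so that callers never have to refresh by hand.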