Hi all,

I used to think that the REFRESH statement is just an incremental metadata reload, that it can't detect file deletions or modifications, and that we should use INVALIDATE METADATA in those cases. However, a friend told me that they always use the REFRESH statement in their ETL pipeline, whether adding new files or replacing all of a table's files. They never use INVALIDATE METADATA and haven't encountered any errors.

I realized my understanding was wrong and dug into the code. I found comments in HdfsTable.java saying that in these two cases we should use INVALIDATE METADATA instead of REFRESH:

1. An ALTER TABLE ADD PARTITION or dynamic partition insert is executed through Hive. This does not update the lastDdlTime.
2. The HDFS rebalancer is executed. This changes the block locations but doesn't update the mtime (file modification time).

However, in my experiments, for all manual changes made in Hive or HDFS, triggering a REFRESH statement is enough. For example, modifying or deleting files under an existing partition, or adding partitions in Hive via ALTER TABLE ADD PARTITION. In HdfsTable#refreshFileMetadata, all manual changes (add/delete/modify) to data files are detected and the file descriptors are updated. Thus, the comments above are wrong. There are only two cases in which we should use INVALIDATE METADATA:

1. When new tables are created outside Impala.
2. When block locations are changed by the HDFS balancer. (This one matters for increasing local reads.)

Hope you can correct me if I'm wrong.

Thanks,
Quanlong
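
P.S. To make the experiment concrete, here is a minimal sketch of the kind of sequence I tried (the database, table, and partition values are made up for illustration; this needs a live Hive/Impala setup to actually run):

```
-- In Hive (outside Impala): add a partition manually.
-- Per the HdfsTable.java comments this should require INVALIDATE METADATA,
-- since lastDdlTime is not updated.
ALTER TABLE my_db.events ADD PARTITION (dt='2020-01-01');

-- In Impala: in my experiments a plain REFRESH is enough to pick up
-- the new partition and its data files.
REFRESH my_db.events;

-- Only the remaining cases seem to need the heavier statement, e.g.
-- a table created outside Impala:
INVALIDATE METADATA my_db.new_table;
```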
I used to thought that REFRESH statement is just incremental metadata reload. It can't detect file deletion or modification. So we should use INVALIDATE METADATA for these cases. However, one of my friends told me that they always use REFRESH statement in their ETL pipeline, either adding new files or replacing the whole table files. They never use INVALIDATE METADATA and haven't encounter any errors. I realized my thought is wrong and digged into the codes. I found comments in HdfsTable.java said that in these two cases we should use INVALIDATE METADATA instead of REFRESH: an ALTER TABLE ADD PARTITION or dynamic partition insert is executed through Hive. This does not update the lastDdlTime. Hdfs rebalancer is executed. This changes the block locations but doesn't update the mtime (file modification time). However, in my experiments, for all manual changes made in Hive or HDFS, we just need to trigger REFRESH statement. For example, modifying or deleting files under an existent partition, adding partitions in Hive by ALTER TABLE ADD PARTITION etc. In HdfsTable#refreshFileMetadata, all manual changes (add/delete/modify) of data files can be detected and file descriptors will be updated. Thus, the previous comments are wrong. There're only two cases we should use INVALIDATE METADATA: When new tables are created outside Impala When block locations are changed by HDFS balancer (This one is for increasing local reads) Hope you could correct me if I'm wrong. Thanks, Quanlong