Hi all,

I used to think that the REFRESH statement is just an incremental metadata reload and can't detect file deletions or modifications, so we should use INVALIDATE 
METADATA for those cases.
However, one of my friends told me that they always use the REFRESH statement in 
their ETL pipelines, whether adding new files or replacing all of a table's files. 
They never use INVALIDATE METADATA and have never encountered any errors.
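For reference, a REFRESH-only pattern like theirs might look roughly like this 
(the table and path names below are placeholders I made up, not from their pipeline):

  -- New data files are written by an external job, e.g.
  --   hdfs dfs -put part-00000.parq /user/hive/warehouse/sales/day=20200101/
  -- Then Impala only needs to refresh the table (or just the touched partition):
  REFRESH sales;
  REFRESH sales PARTITION (day='20200101');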


I realized my understanding was wrong and dug into the code. I found comments in 
HdfsTable.java saying that in these two cases we should use INVALIDATE METADATA 
instead of REFRESH:
 - An ALTER TABLE ADD PARTITION or dynamic partition insert is executed through 
   Hive. This does not update the lastDdlTime.
 - The HDFS rebalancer is executed. This changes the block locations but doesn't 
   update the mtime (file modification time).
However, in my experiments, all manual changes made in Hive or HDFS only needed a 
REFRESH statement, e.g. modifying or deleting files under an existing partition, 
or adding partitions in Hive by ALTER TABLE ADD PARTITION.
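To illustrate the partition case (the table name "t1" here is just a hypothetical 
example): after adding a partition in Hive with

  ALTER TABLE t1 ADD PARTITION (day='20200101');

the new partition becomes visible in Impala after a plain

  REFRESH t1;

with no INVALIDATE METADATA needed.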


In HdfsTable#refreshFileMetadata, all manual changes (add/delete/modify) to data 
files can be detected, and the file descriptors will be updated.
Thus, the previous comments are wrong. There are only two cases where we should use 
INVALIDATE METADATA:
 - When new tables are created outside Impala (see the example below)
 - When block locations are changed by the HDFS balancer (this one is for 
   increasing local reads)
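
As a concrete example of the first case (the table name "new_tbl" is just a 
made-up example):

  -- In Hive (or any engine other than Impala):
  CREATE TABLE new_tbl (id INT) STORED AS PARQUET;

  -- In Impala, the catalog doesn't know about this table until we run:
  INVALIDATE METADATA new_tbl;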
Please correct me if I'm wrong. 


Thanks,
Quanlong
