Hi Quanlong,

You're pretty much correct. REFRESH can handle the majority of external metadata modifications (adding/dropping files or partitions, etc.), and INVALIDATE METADATA should be used in the two use cases you mention. I am sorry you had to look at the code to figure that out. I checked our documentation (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_refresh.html) and I see that some parts are not as explicit as they should be. I filed a docs JIRA (https://issues.apache.org/jira/browse/IMPALA-5918).
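
To make the rule of thumb from this thread concrete, here is a minimal sketch in Python. It only illustrates the decision, under the assumption that the two INVALIDATE METADATA cases discussed here are the only ones that matter. The `ExternalChange` enum and the `metadata_statement` helper are hypothetical names made up for this example and are not part of Impala; only the REFRESH and INVALIDATE METADATA statements themselves are real.

```python
from enum import Enum, auto


class ExternalChange(Enum):
    """Kinds of changes made outside Impala (illustrative names only)."""
    FILES_ADDED = auto()
    FILES_DELETED = auto()
    FILES_MODIFIED = auto()
    PARTITION_ADDED_IN_HIVE = auto()
    TABLE_CREATED_OUTSIDE_IMPALA = auto()
    BLOCKS_MOVED_BY_HDFS_BALANCER = auto()


# The two cases from this thread where REFRESH is not enough.
_NEEDS_INVALIDATE = {
    ExternalChange.TABLE_CREATED_OUTSIDE_IMPALA,
    ExternalChange.BLOCKS_MOVED_BY_HDFS_BALANCER,
}


def metadata_statement(change: ExternalChange, table: str) -> str:
    """Return the Impala statement to issue for a given external change.

    Hypothetical helper: it builds the statement text and does not
    connect to Impala or execute anything.
    """
    if change in _NEEDS_INVALIDATE:
        return f"INVALIDATE METADATA {table}"
    return f"REFRESH {table}"


if __name__ == "__main__":
    for change in ExternalChange:
        print(f"{change.name}: {metadata_statement(change, 'mydb.mytable')}")
```

The table name `mydb.mytable` is a placeholder. In an ETL pipeline that only adds or replaces files, every step would map to REFRESH, which matches what was reported in the original question.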

Thanks
Dimitris

On Mon, Sep 11, 2017 at 5:55 AM, Quanlong Huang <huang_quanl...@126.com> wrote:

> Hi all,
>
> I used to think that the REFRESH statement was just an incremental
> metadata reload that couldn't detect file deletion or modification, so
> we should use INVALIDATE METADATA for those cases.
> However, one of my friends told me that they always use the REFRESH
> statement in their ETL pipeline, whether they are adding new files or
> replacing all of a table's files. They never use INVALIDATE METADATA
> and haven't encountered any errors.
>
> I realized my assumption was wrong and dug into the code. I found
> comments in HdfsTable.java saying that in these two cases we should use
> INVALIDATE METADATA instead of REFRESH:
> - An ALTER TABLE ADD PARTITION or dynamic partition insert is executed
>   through Hive. This does not update the lastDdlTime.
> - The HDFS rebalancer is executed. This changes the block locations but
>   doesn't update the mtime (file modification time).
> However, in my experiments, for all manual changes made in Hive or
> HDFS, we just need to issue a REFRESH statement. Examples include
> modifying or deleting files under an existing partition and adding
> partitions in Hive with ALTER TABLE ADD PARTITION.
>
> In HdfsTable#refreshFileMetadata, all manual changes
> (add/delete/modify) to data files can be detected, and the file
> descriptors will be updated.
> Thus, the previous comments are wrong. There are only two cases where
> we should use INVALIDATE METADATA:
> - When new tables are created outside Impala
> - When block locations are changed by the HDFS balancer (this one is
>   for increasing local reads)
> Please correct me if I'm wrong.
>
> Thanks,
> Quanlong