Hi Quanlong,

You're pretty much correct. REFRESH can handle the majority of external
metadata modifications (adding/dropping files/partitions, etc) and
INVALIDATE METADATA should be used in the two use cases you mention. I am
sorry you had to look at the code to figure that out. I checked our
documentation
(https://www.cloudera.com/documentation/enterprise/latest/topics/impala_refresh.html)
and I see that some parts are not as explicit as they should be. I filed a
docs JIRA (https://issues.apache.org/jira/browse/IMPALA-5918).
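
As a rough sketch of the rule of thumb discussed in this thread, the choice
between the two statements could be written as below. This is illustrative
only: the function name and the change-kind labels are hypothetical and not
part of Impala; only the REFRESH and INVALIDATE METADATA statements themselves
come from Impala.

```python
# Hypothetical labels for kinds of external changes (made up for illustration).
REFRESH_KINDS = {
    "files_added",
    "files_deleted",
    "files_modified",
    "partitions_added",
    "partitions_dropped",
}
INVALIDATE_KINDS = {
    "table_created_outside_impala",
    "hdfs_blocks_rebalanced",
}


def metadata_statement(change_kind: str, table: str) -> str:
    """Return the Impala statement suggested for an external change.

    Assumption: the mapping follows the two INVALIDATE METADATA cases
    named in this thread; every other listed change only needs REFRESH.
    """
    if change_kind in INVALIDATE_KINDS:
        return f"INVALIDATE METADATA {table}"
    if change_kind in REFRESH_KINDS:
        return f"REFRESH {table}"
    raise ValueError(f"unknown change kind: {change_kind}")


if __name__ == "__main__":
    # Example: files were replaced under an existing partition.
    print(metadata_statement("files_modified", "sales.orders"))
    # Example: a table was created through Hive.
    print(metadata_statement("table_created_outside_impala", "sales.new_table"))
```

The table names in the example are placeholders.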

Thanks
Dimitris

On Mon, Sep 11, 2017 at 5:55 AM, Quanlong Huang <huang_quanl...@126.com>
wrote:

> Hi all,
>
>
> I used to think that the REFRESH statement is just an incremental metadata
> reload that can't detect file deletion or modification, so we should use
> INVALIDATE METADATA for these cases.
> However, one of my friends told me that they always use the REFRESH statement
> in their ETL pipeline, whether adding new files or replacing all of a table's
> files. They never use INVALIDATE METADATA and haven't encountered any errors.
>
>
> I realized my assumption was wrong and dug into the code. I found comments
> in HdfsTable.java saying that in these two cases we should use INVALIDATE
> METADATA instead of REFRESH:
> - An ALTER TABLE ADD PARTITION or dynamic partition insert is executed
>   through Hive. This does not update the lastDdlTime.
> - The HDFS rebalancer is executed. This changes the block locations but
>   doesn't update the mtime (file modification time).
> However, in my experiments, for all manual changes made in Hive or HDFS,
> we just need to issue a REFRESH statement. For example, modifying or
> deleting files under an existing partition, or adding partitions in Hive
> with ALTER TABLE ADD PARTITION.
>
>
> In HdfsTable#refreshFileMetadata, all manual changes (add/delete/modify)
> to data files are detected and the file descriptors are updated.
> Thus, the previous comments are wrong. There are only two cases where we
> should use INVALIDATE METADATA:
> - When new tables are created outside Impala
> - When block locations are changed by the HDFS balancer (this one is for
>   increasing local reads)
> Please correct me if I'm wrong.
>
>
> Thanks,
> Quanlong
