[ https://issues.apache.org/jira/browse/IMPALA-7047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16480133#comment-16480133 ]
Todd Lipcon commented on IMPALA-7047:
-------------------------------------

On a relatively small table with 374 files, REFRESH spends about a second in this code path (each RPC is 2-3ms due to RTT).

> REFRESH on unpartitioned tables calls getBlockLocations on every file
> ---------------------------------------------------------------------
>
>                 Key: IMPALA-7047
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7047
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 2.13.0
>            Reporter: Todd Lipcon
>            Priority: Major
>              Labels: metadata
>
> In HdfsTable.updateUnpartitionedTableFileMd() the existing default Partition
> object is reset, and a new empty one is created. It then calls
> refreshPartitionFileMetadata with this new partition which has an empty list
> of file descriptors. This ends up listing the directory, and for each file,
> since it doesn't find it in the empty descriptor list, will make a separate
> RPC to HDFS to get the locations.
> This is quite wasteful vs just using the API that returns the located
> statuses for the directory.
> Alternatively, it seems like it should probably keep around the old file
> descriptor list in the new Partition object so that the incremental refresh
> path can work.
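
To make the cost difference concrete, here is a minimal, self-contained sketch that counts simulated RPCs for the three refresh strategies described in this issue. It is not Impala or Hadoop code: `FakeNameNode`, `refreshPerFile`, and `refreshWithLocatedListing` are hypothetical names invented for illustration, and the located listing is simplified to a single RPC (the real Hadoop API, `FileSystem.listLocatedStatus`, returns an iterator that may fetch large directories in batches). The incremental check is also reduced to a lookup by file name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RefreshRpcSketch {
    /** Hypothetical stand-in for the HDFS NameNode that counts simulated RPCs. */
    static class FakeNameNode {
        final int numFiles;
        int rpcCount = 0;

        FakeNameNode(int numFiles) { this.numFiles = numFiles; }

        /** One RPC: returns file names only, without block locations. */
        List<String> listFiles() {
            rpcCount++;
            List<String> names = new ArrayList<>();
            for (int i = 0; i < numFiles; i++) names.add("file_" + i);
            return names;
        }

        /** One RPC per call: block locations for a single file. */
        String getBlockLocations(String file) {
            rpcCount++;
            return "locations-of-" + file;
        }

        /** One RPC (simplified): names and block locations together. */
        Map<String, String> listLocatedFiles() {
            rpcCount++;
            Map<String, String> located = new HashMap<>();
            for (int i = 0; i < numFiles; i++) {
                located.put("file_" + i, "locations-of-file_" + i);
            }
            return located;
        }
    }

    /**
     * Lists the directory, then fetches locations only for files missing from
     * the known descriptors. Returns the number of RPCs issued.
     */
    static int refreshPerFile(FakeNameNode nn, Map<String, String> known) {
        int before = nn.rpcCount;
        for (String file : nn.listFiles()) {
            if (!known.containsKey(file)) {
                known.put(file, nn.getBlockLocations(file));
            }
        }
        return nn.rpcCount - before;
    }

    /** Fetches names and locations in one listing. Returns the number of RPCs issued. */
    static int refreshWithLocatedListing(FakeNameNode nn, Map<String, String> known) {
        int before = nn.rpcCount;
        known.putAll(nn.listLocatedFiles());
        return nn.rpcCount - before;
    }

    public static void main(String[] args) {
        FakeNameNode nn = new FakeNameNode(374);

        // Reported behavior: the new partition starts with an empty descriptor list.
        int perFile = refreshPerFile(nn, new HashMap<>());

        // First suggestion: use the listing that already carries locations.
        int located = refreshWithLocatedListing(nn, new HashMap<>());

        // Second suggestion: keep the old descriptors so unchanged files are skipped.
        Map<String, String> oldDescriptors = new HashMap<>();
        refreshWithLocatedListing(nn, oldDescriptors);
        int incremental = refreshPerFile(nn, oldDescriptors);

        System.out.println("per-file RPCs:       " + perFile);
        System.out.println("located-listing RPCs: " + located);
        System.out.println("incremental RPCs:    " + incremental);
    }
}
```

With 374 files, the empty-descriptor path issues one listing plus 374 location lookups. At 2-3ms per RPC, those lookups alone account for roughly 0.75 to 1.1 seconds, which is consistent with the "about a second" observed above.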