Hello Bharath Vissapragada, Vuk Ercegovac, I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/11227

to review the following change.


Change subject: IMPALA-7047. Refreshing partitions should not make an RPC per file
......................................................................

IMPALA-7047. Refreshing partitions should not make an RPC per file

The code to handle REFRESH of a single partition was incorrectly
ignoring the previously-known file descriptors. This meant that, instead
of only calling 'getFileBlockLocations' on the files that had changed
since the prior load, it was instead calling it on every file.

In addition to refresh of single partitions, this also affected refresh
of unpartitioned tables (which is implemented as a refresh of its single
"default" partition).

This patch fixes the behavior by copying over the existing file
descriptor list into the re-created partition before refreshing it.

A new unit test uses FS statistics to verify the change. The new
assertions act as a regression test and fail if I comment out the fix.

I also tested this by pointing my dev box at a remote filesystem that
was approximately 60ms away. The initial load of an unpartitioned table
with approximately 45000 files takes around 23 seconds in this setup.
Without the patch in place, REFRESH was taking upwards of 35 minutes (I
got tired and gave up at this point). Multiplying the 60ms round trip by
45000 files gives an estimate of 45 minutes. With the fix in place,
REFRESH of the same table took around 4.5 seconds.

Clearly, in typical setups where catalogd and HDFS are on a shared local
network, the gains won't be so dramatic. But, even with a 1ms round trip
(plausible when including fixed RPC overhead and potentially congested
datacenter networks) this would save 45 seconds on this example table
with 45000 files.

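For reviewers who want the idea in isolation, here is a minimal sketch of
reusing previously-known file descriptors during a refresh. It is not the
code in HdfsTable.java: RefreshSketch, FileStat, FileDesc and FakeFs are
hypothetical stand-ins, and treating a file as unchanged when its length
and modification time match is an assumption made for the illustration.
FakeFs.getBlockLocations plays the role of the per-file
'getFileBlockLocations' RPC and counts how often it is called.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these types are Impala classes.
final class RefreshSketch {
  /** One entry of a directory listing: cheap to obtain, no block locations. */
  static final class FileStat {
    final String name;
    final long length;
    final long modTime;
    FileStat(String name, long length, long modTime) {
      this.name = name;
      this.length = length;
      this.modTime = modTime;
    }
  }

  /** Cached descriptor: the stat it was built from plus its block locations. */
  static final class FileDesc {
    final FileStat stat;
    final List<String> blockHosts;
    FileDesc(FileStat stat, List<String> blockHosts) {
      this.stat = stat;
      this.blockHosts = blockHosts;
    }
  }

  /** Stand-in for the filesystem; each call models one round trip. */
  static final class FakeFs {
    int blockLocationCalls = 0;
    List<String> getBlockLocations(FileStat f) {
      blockLocationCalls++;
      return Collections.singletonList("host-for-" + f.name);
    }
  }

  /**
   * Builds the refreshed descriptor list. Files whose length and modTime
   * match a previously-known descriptor reuse it; only new or changed
   * files pay for a block-location lookup.
   */
  static List<FileDesc> refresh(
      List<FileStat> listing, List<FileDesc> oldDescs, FakeFs fs) {
    Map<String, FileDesc> oldByName = new HashMap<>();
    for (FileDesc fd : oldDescs) {
      oldByName.put(fd.stat.name, fd);
    }
    List<FileDesc> result = new ArrayList<>();
    for (FileStat stat : listing) {
      FileDesc old = oldByName.get(stat.name);
      boolean unchanged = old != null
          && old.stat.length == stat.length
          && old.stat.modTime == stat.modTime;
      result.add(unchanged ? old : new FileDesc(stat, fs.getBlockLocations(stat)));
    }
    return result;
  }
}
```

In this sketch, passing an empty oldDescs list models the behavior
described above as the bug: every file in the listing triggers a lookup.
Passing the descriptors from the prior load limits lookups to the files
that are new or changed.
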
Change-Id: I2051b96599206164aaa06ecbdf64374c46eda956
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
2 files changed, 33 insertions(+), 2 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/11227/1
--
To view, visit http://gerrit.cloudera.org:8080/11227
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I2051b96599206164aaa06ecbdf64374c46eda956
Gerrit-Change-Number: 11227
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com>