[ 
https://issues.apache.org/jira/browse/IMPALA-7320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575441#comment-16575441
 ] 

ASF subversion and git services commented on IMPALA-7320:
---------------------------------------------------------

Commit 7f9a74ffcaf1818f1f3c9d427557acca21a627da in impala's branch 
refs/heads/master from [~tlipcon]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=7f9a74f ]

IMPALA-7320. Avoid calling getFileStatus() for each partition when table is 
loaded

Prior to this patch, when a table is first loaded, the catalog iterated
over each of the partition directories and called getFileStatus() on
each, serially, to determine the overall access level of the table.

In some testing, each such call took 1-2ms, so for a table with
thousands of partitions this could add many seconds to the overall
table load time and also add to the NameNode (NN) load.

This patch adds some batch pre-fetching of file status information: for
any parent directory which contains more than one partition, we use the
listStatus() API to fetch the FileStatus objects in bulk.

A new unit test verifies the number of API calls made to the NameNode
during a table load.
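
The batching rule described above can be sketched roughly as follows. This
is only an illustration, not the code from the patch: StatusSource,
CountingSource, parentOf and loadStatuses are hypothetical names, statuses
are plain strings rather than Hadoop FileStatus objects, and the fallback
to a single lookup when a listing omits a path is an assumption of this
sketch. The in-memory fake counts calls, in the spirit of the unit test
mentioned in the commit message.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchStatusSketch {
  /** Hypothetical stand-in for the two filesystem calls named above. */
  interface StatusSource {
    String getFileStatus(String path);              // one call per path
    Map<String, String> listStatus(String parent);  // one call per directory
  }

  /** In-memory fake that counts how many calls of each kind were made. */
  static class CountingSource implements StatusSource {
    private final Map<String, String> statuses = new HashMap<>();
    int getFileStatusCalls = 0;
    int listStatusCalls = 0;

    void add(String path, String status) { statuses.put(path, status); }

    @Override public String getFileStatus(String path) {
      getFileStatusCalls++;
      return statuses.get(path);
    }

    @Override public Map<String, String> listStatus(String parent) {
      listStatusCalls++;
      Map<String, String> children = new HashMap<>();
      for (Map.Entry<String, String> e : statuses.entrySet()) {
        if (parentOf(e.getKey()).equals(parent)) {
          children.put(e.getKey(), e.getValue());
        }
      }
      return children;
    }
  }

  /** Returns the parent directory of a slash-separated path. */
  static String parentOf(String path) {
    int slash = path.lastIndexOf('/');
    return slash <= 0 ? "/" : path.substring(0, slash);
  }

  /** Fetches one status per partition dir, batching by parent directory. */
  static Map<String, String> loadStatuses(List<String> partitionDirs,
                                          StatusSource src) {
    // Group partition directories by their parent directory.
    Map<String, List<String>> byParent = new LinkedHashMap<>();
    for (String dir : partitionDirs) {
      byParent.computeIfAbsent(parentOf(dir), k -> new ArrayList<>()).add(dir);
    }
    Map<String, String> result = new HashMap<>();
    for (Map.Entry<String, List<String>> e : byParent.entrySet()) {
      List<String> dirs = e.getValue();
      if (dirs.size() > 1) {
        // More than one partition under this parent: one bulk listing.
        Map<String, String> listed = src.listStatus(e.getKey());
        for (String dir : dirs) {
          String status = listed.get(dir);
          // Assumption of this sketch: fall back to a single lookup if the
          // listing did not include the path.
          result.put(dir, status != null ? status : src.getFileStatus(dir));
        }
      } else {
        // A lone partition under its parent: a single lookup is enough.
        result.put(dirs.get(0), src.getFileStatus(dirs.get(0)));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    CountingSource src = new CountingSource();
    List<String> parts = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      String p = "/warehouse/t/day=" + i;
      src.add(p, "rwx");
      parts.add(p);
    }
    src.add("/elsewhere/day=extra", "r-x");
    parts.add("/elsewhere/day=extra");

    Map<String, String> statuses = loadStatuses(parts, src);
    System.out.println(statuses.size() + " statuses, "
        + src.listStatusCalls + " listStatus call(s), "
        + src.getFileStatusCalls + " getFileStatus call(s)");
  }
}
```

With 1000 partitions under one parent and one partition elsewhere, the
sketch makes two calls in total instead of 1001. At the 1-2ms per call
quoted above, 1001 serial calls would cost roughly 1-2 seconds.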

Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734
Reviewed-on: http://gerrit.cloudera.org:8080/11027
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Reviewed-by: Todd Lipcon <t...@apache.org>


> Loading HDFS tables calls getFileStatus on each partition serially
> ------------------------------------------------------------------
>
>                 Key: IMPALA-7320
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7320
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>    Affects Versions: Impala 3.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Major
>
> The catalog caches the access level (permissions) of each of the partitions 
> in an HDFS table. These are all fetched when the table is first loaded, by 
> making serial calls to getFileStatus() on each of the partitions. In most 
> cases, all of the partitions are in a single directory and we could get all 
> of the information through a single call to listStatus() on the parent. In 
> my testing, a typical getFileStatus() call took 1-2 milliseconds, so on a 
> large table with tens of thousands of partitions, batching could shave many 
> seconds off the table load time as well as reduce load on the NameNode (NN).



