[ 
https://issues.apache.org/jira/browse/IMPALA-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17254772#comment-17254772
 ] 

ASF subversion and git services commented on IMPALA-10117:
----------------------------------------------------------

Commit 43b6093dc00a746f8617dc6e1a63fef2dd82d76b in impala's branch 
refs/heads/master from Tim Armstrong
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=43b6093 ]

IMPALA-10117: Skip calls to FsPermissionCache for blob stores

This avoids calling precacheChildrenOf() in cases when the
cached values will never be used. This change simply skips
calling precacheChildrenOf() in the cases when getPermissions()
is never called.

There is some opportunity to clean up this permissions
checking further, but I decided to keep this fix limited
in scope.

Change-Id: I2034695a956307309f656d56aa57aa07ae5163d8
Reviewed-on: http://gerrit.cloudera.org:8080/16898
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Skip calls to FsPermissionCache for blob stores
> -----------------------------------------------
>
>                 Key: IMPALA-10117
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10117
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Sahil Takiar
>            Assignee: Tim Armstrong
>            Priority: Major
>              Labels: performance
>
> The {{FsPermissionCache}} is described as:
> {code:java}
> /**
>  * Simple non-thread-safe cache for resolved file permissions. This allows
>  * pre-caching permissions by listing the status of all files within a directory,
>  * and then using that cache to avoid round trips to the FileSystem for later
>  * queries of those paths.
>  */ {code}
> I confirmed that {{FsPermissionCache#precacheChildrenOf}} is called for data 
> stored on S3. However, {{FsPermissionCache#getPermissions}} is called inside 
> {{HdfsTable#getAvailableAccessLevel}}, which is skipped for S3, so none of 
> the cached metadata is ever used. Meanwhile, {{precacheChildrenOf}} calls 
> {{getFileStatus}} for all files, which results in many unnecessary metadata 
> operations against S3 and a pile of cached metadata that is never read.
> {{precacheChildrenOf}} is actually only invoked in the specific scenario 
> described below:
> {code}
>     // Only preload permissions if the number of partitions to be added is
>     // large (3x) relative to the number of existing partitions. This covers
>     // two common cases:
>     //
>     // 1) initial load of a table (no existing partition metadata)
>     // 2) ALTER TABLE RECOVER PARTITIONS after creating a table pointing to
>     // an already-existing partition directory tree
>     //
>     // Without this heuristic, we would end up using a "listStatus" call to
>     // potentially fetch a bunch of irrelevant information about existing
>     // partitions when we only want to know about a small number of newly-added
>     // partitions.
> {code}
> Regardless, skipping the call to {{precacheChildrenOf}} for blob stores 
> should (1) improve table loading time for S3-backed tables, and (2) decrease 
> catalogd memory requirements when loading many tables stored on S3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
