[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded Prior to this patch, when a table is first loaded, the catalog iterated over each of the partition directories and called getFileStatus() on each, serially, to determine the overall access level of the table. In some testing, each such call took 1-2ms, so this could add many seconds to the overall table load time for a table with thousands of partitions and also add to the NN load. This patch adds some batch pre-fetching of file status information: for any parent directory which contains more than one partition, we use the listStatus() API to fetch the FileStatus objects in bulk. A new unit test verifies the number of API calls made to the NameNode during a table load. Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Reviewed-on: http://gerrit.cloudera.org:8080/11027 Tested-by: Impala Public Jenkins Reviewed-by: Todd Lipcon --- M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java A fe/src/main/java/org/apache/impala/util/FsPermissionCache.java M fe/src/main/java/org/apache/impala/util/FsPermissionChecker.java M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java 4 files changed, 218 insertions(+), 59 deletions(-) Approvals: Impala Public Jenkins: Verified Todd Lipcon: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 9 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 8: Code-Review+2 Forwarding +2 -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 8 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 09 Aug 2018 21:39:27 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 8: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 8 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 09 Aug 2018 00:52:46 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 8: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2957/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 8 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 08 Aug 2018 21:39:19 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 6: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2943/ -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 6 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Tue, 07 Aug 2018 20:26:26 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 6: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2943/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 6 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Tue, 07 Aug 2018 17:10:32 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 4: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/143/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 4 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 01 Aug 2018 23:11:46 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Hello Bharath Vissapragada, Tianyi Wang, Impala Public Jenkins, Vuk Ercegovac, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/11027 to look at the new patch set (#4). Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded Prior to this patch, when a table is first loaded, the catalog iterated over each of the partition directories and called getFileStatus() on each, serially, to determine the overall access level of the table. In some testing, each such call took 1-2ms, so this could add many seconds to the overall table load time for a table with thousands of partitions and also add to the NN load. This patch adds some batch pre-fetching of file status information: for any parent directory which contains more than one partition, we use the listStatus() API to fetch the FileStatus objects in bulk. A new unit test verifies the number of API calls made to the NameNode during a table load. Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 --- M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java A fe/src/main/java/org/apache/impala/util/FsPermissionCache.java M fe/src/main/java/org/apache/impala/util/FsPermissionChecker.java M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java 4 files changed, 218 insertions(+), 59 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/11027/4 -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 4 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/11027/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/11027/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1566 PS3, Line 1566: void > nit: Java-style comment: Returning FsPermissionCache (vs) passing the outp Done http://gerrit.cloudera.org:8080/#/c/11027/3/fe/src/main/java/org/apache/impala/util/FsPermissionCache.java File fe/src/main/java/org/apache/impala/util/FsPermissionCache.java: http://gerrit.cloudera.org:8080/#/c/11027/3/fe/src/main/java/org/apache/impala/util/FsPermissionCache.java@45 PS3, Line 45: FsPermissionChecker checker = FsPermissionChecker.getInstance(); > Make it a static class member may be? this is already just an accessor for a static instance in FsPermissionChecker, so there isn't really any benefit to "caching" the result -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 01 Aug 2018 22:20:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 3: Code-Review+2 (3 comments) http://gerrit.cloudera.org:8080/#/c/11027/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/11027/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1557 PS2, Line 1557: } > I added a simple heuristic here that compares size(msPartitions) vs the num That looks reasonable to me. http://gerrit.cloudera.org:8080/#/c/11027/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/11027/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1566 PS3, Line 1566: void nit: Java-style comment: Returning FsPermissionCache (vs) passing the output param by reference. I don't see this discussed in the Google Java-style guide, but I have a feeling that the former is preferred. The latter appears more C++ style to me. http://gerrit.cloudera.org:8080/#/c/11027/3/fe/src/main/java/org/apache/impala/util/FsPermissionCache.java File fe/src/main/java/org/apache/impala/util/FsPermissionCache.java: http://gerrit.cloudera.org:8080/#/c/11027/3/fe/src/main/java/org/apache/impala/util/FsPermissionCache.java@45 PS3, Line 45: FsPermissionChecker checker = FsPermissionChecker.getInstance(); Make it a static class member may be? -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Fri, 27 Jul 2018 06:41:51 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 3: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/69/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 26 Jul 2018 02:22:26 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/11027/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/11027/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1557 PS2, Line 1557: preloadPermissionsCache(msPartitions, permCache); > Same question as above. Could this be an overkill if size(msPartitions) is I added a simple heuristic here that compares size(msPartitions) vs the number of existing partitions, and only uses this trick if the number of new partitions is 3x the number of existing partitions. That should catch RECOVER PARTITIONS use case when basically constructing the entire hierarchy from an existing table path, but not affect "alter table add" or RECOVER when it just adds a few. Let me know what you think -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 2 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 26 Jul 2018 01:42:03 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Hello Bharath Vissapragada, Tianyi Wang, Vuk Ercegovac, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/11027 to look at the new patch set (#3). Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded Prior to this patch, when a table is first loaded, the catalog iterated over each of the partition directories and called getFileStatus() on each, serially, to determine the overall access level of the table. In some testing, each such call took 1-2ms, so this could add many seconds to the overall table load time for a table with thousands of partitions and also add to the NN load. This patch adds some batch pre-fetching of file status information: for any parent directory which contains more than one partition, we use the listStatus() API to fetch the FileStatus objects in bulk. A new unit test verifies the number of API calls made to the NameNode during a table load. Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 --- M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java A fe/src/main/java/org/apache/impala/util/FsPermissionCache.java M fe/src/main/java/org/apache/impala/util/FsPermissionChecker.java M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java 4 files changed, 220 insertions(+), 59 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/11027/3 -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 3: Build Started https://jenkins.impala.io/job/gerrit-code-review-checks/69/ Running initial code review checks. This is experimental - please report any issues to tarmstr...@cloudera.com or on this JIRA: IMPALA-7317 -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 26 Jul 2018 01:40:25 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/11027/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/11027/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@957 PS2, Line 957: preloadPermissionsCache(msPartitions, permCache); > Looking at the consumers of this, alterTableAddPartitions() and alterTableR Good point about the incremental updates. Let me take this back to the drawing board and think about that. -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 2 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 25 Jul 2018 16:32:51 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/11027/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/11027/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@897 PS2, Line 897: this update? http://gerrit.cloudera.org:8080/#/c/11027/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@957 PS2, Line 957: preloadPermissionsCache(msPartitions, permCache); Looking at the consumers of this, alterTableAddPartitions() and alterTableRecoverPartitions(), I have a feeling that this optimization in this codepath is an overkill (for large tables, say 100k partitions and every partition under tableDir). It is probably useful in alterTableRecoverPartitions() where we could add large number of partitions in a single batch, but probably not in alterTableAddPartitions(). Also, both these methods only add partitions in batches of MAX_PARTITION_UPDATES_PER_RPC (500). So we could ideally we repeating RPCs in preloadPermissionsCache for ceil(num_parts/ MAX_PARTITION_UPDATES_PER_RPC) times. Thoughts? http://gerrit.cloudera.org:8080/#/c/11027/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1557 PS2, Line 1557: preloadPermissionsCache(msPartitions, permCache); Same question as above. Could this be an overkill if size(msPartitions) is small? Say we are refreshing 10 partitions in a 100k partition table? Looking at the listStatusIterator() Impl, it seems to be using some RemoteIterator impl that does batching of RPCs. I couldn't figure out what is the basis for batching, but I get a feeling that the number of RPCs could be > 1. Additionally, we create a bunch of temporary objects for each partition and that could affect GC Just curious if I'm getting it right and what your thoughts are. -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 2 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 25 Jul 2018 06:34:16 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 2: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/37/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 2 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Tue, 24 Jul 2018 19:40:03 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 2: Build Started https://jenkins.impala.io/job/gerrit-code-review-checks/37/ Running initial code review checks. This is experimental - please report any issues to tarmstr...@cloudera.com or on this JIRA: IMPALA-7317 -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 2 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Tue, 24 Jul 2018 18:46:26 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Hello Bharath Vissapragada, Tianyi Wang, Vuk Ercegovac, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/11027 to look at the new patch set (#2). Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded Prior to this patch, when a table is first loaded, the catalog iterated over each of the partition directories and called getFileStatus() on each, serially, to determine the overall access level of the table. In some testing, each such call took 1-2ms, so this could add many seconds to the overall table load time for a table with thousands of partitions and also add to the NN load. This patch adds some batch pre-fetching of file status information: for any parent directory which contains more than one partition, we use the listStatus() API to fetch the FileStatus objects in bulk. A new unit test verifies the number of API calls made to the NameNode during a table load. Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 --- M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java A fe/src/main/java/org/apache/impala/util/FsPermissionCache.java M fe/src/main/java/org/apache/impala/util/FsPermissionChecker.java M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java 4 files changed, 205 insertions(+), 59 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/11027/2 -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 2 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 1: (8 comments) http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@857 PS1, Line 857: HdfsPartition partition = createPartition(msPartition.getSd(), msPartition, permCache); > longline Done http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1005 PS1, Line 1005: ); > Can you add the table name too. Done http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1015 PS1, Line 1015:* Throws CatalogException if one of the supplied storage descriptors contains metadata that > nit: longline Done http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1026 PS1, Line 1026: HdfsPartition hdfsPartition = createPartition(partition.getSd(), partition, permCache); > longline Done http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1625 PS1, Line 1625: HdfsPartition partition = createPartition(msPartition.getSd(), msPartition, permCache); > longline Done http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1675 PS1, Line 1675: ); > , ioe); Done http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/util/FsPermissionCache.java File fe/src/main/java/org/apache/impala/util/FsPermissionCache.java: http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/util/FsPermissionCache.java@40 PS1, Line 40: FileSystem > Technically not all paths need to be fully-qualified. I'll see if we can re I looked through the source for FileStatus.getPath() and it seems to "do the right thing" here -- if the path that we listed is qualified, it returns a similarly-qualified path on the resulting children, so things should match up appropriately without having to include the FileSystem here. http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/util/FsPermissionCache.java@45 PS1, Line 45: if (perms != null) { : return perms; : } > single line. Done -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 1 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Tue, 24 Jul 2018 18:38:56 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/util/FsPermissionCache.java File fe/src/main/java/org/apache/impala/util/FsPermissionCache.java: http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/util/FsPermissionCache.java@40 PS1, Line 40: FileSystem > qq. What is the use of having this as a part of map's key? Isn't it already Technically not all paths need to be fully-qualified. I'll see if we can restrict this function to only take fully-qualified paths, or ensure canonicalization, etc. -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 1 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Tue, 24 Jul 2018 16:08:48 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 1: (8 comments) Patch lgtm overall. Bunch of nits and one clarification. http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@857 PS1, Line 857: HdfsPartition partition = createPartition(msPartition.getSd(), msPartition, permCache); longline http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1005 PS1, Line 1005: ); Can you add the table name too. http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1015 PS1, Line 1015:* Throws CatalogException if one of the supplied storage descriptors contains metadata that nit: longline http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1026 PS1, Line 1026: HdfsPartition hdfsPartition = createPartition(partition.getSd(), partition, permCache); longline http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1625 PS1, Line 1625: HdfsPartition partition = createPartition(msPartition.getSd(), msPartition, permCache); longline http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1675 PS1, Line 1675: ); , ioe); http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/util/FsPermissionCache.java File fe/src/main/java/org/apache/impala/util/FsPermissionCache.java: http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/util/FsPermissionCache.java@40 PS1, Line 40: FileSystem qq. What is the use of having this as a part of map's key? Isn't it already a part of Path's URI? http://gerrit.cloudera.org:8080/#/c/11027/1/fe/src/main/java/org/apache/impala/util/FsPermissionCache.java@45 PS1, Line 45: if (perms != null) { : return perms; : } single line. -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 1 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Tue, 24 Jul 2018 06:54:21 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 1: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/25/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 1 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Tue, 24 Jul 2018 00:53:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11027 ) Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. Patch Set 1: Build Started https://jenkins.impala.io/job/gerrit-code-review-checks/25/ Running initial code review checks. This is experimental - please report any issues to tarmstr...@cloudera.com or on this JIRA: IMPALA-7317 -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 1 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Mon, 23 Jul 2018 23:58:07 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded
Hello Bharath Vissapragada, Tianyi Wang, Vuk Ercegovac, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/11027 to review the following change. Change subject: IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded .. IMPALA-7320. Avoid calling getFileStatus() for each partition when table is loaded Prior to this patch, when a table is first loaded, the catalog iterated over each of the partition directories and called getFileStatus() on each, serially, to determine the overall access level of the table. In some testing, each such call took 1-2ms, so this could add many seconds to the overall table load time for a table with thousands of partitions and also add to the NN load. This patch adds some batch pre-fetching of file status information: for any parent directory which contains more than one partition, we use the listStatus() API to fetch the FileStatus objects in bulk. A new unit test verifies the number of API calls made to the NameNode during a table load. Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 --- M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java A fe/src/main/java/org/apache/impala/util/FsPermissionCache.java M fe/src/main/java/org/apache/impala/util/FsPermissionChecker.java M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java 4 files changed, 200 insertions(+), 54 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/11027/1 -- To view, visit http://gerrit.cloudera.org:8080/11027 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I83e5ebc214d6620d165e13f8cc80f8fdda100734 Gerrit-Change-Number: 11027 Gerrit-PatchSet: 1 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac