[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. IMPALA-9433: Improved caching of HdfsFileHandles Seperated LRU caching functionality to a templated LruMultiCache class. Replaced std::multimap with std::unordered_map with std::list for O(1) lookups and less memory overhead, as it stores each key one time. Added boost::intrusive::list to handle LRU relations with less overhead. Added O(1) release method, instead of O(n) with minimal memory overhead. Implemented RAII Accessor to remove the responsibility of releasing the objects from the user. Wrapped cache accessor and related DiskIOManager metrics to a FileHandleCache::Accessor. Removed Release*() call trees from FileHandleCache and DiskIOManager, removed scoped exit from HdfsFileReader as they are handled automatically. Testing: Implemented extensive unit testing of the class, including forced rehashes, collisions, capacity overshoot, explicit/automatic release and destroy. Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify FileHandleCache::Accessor behaviour through metrics. Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables, no visible change in performance: TPCH scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49% TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00% Tested some manual queries on functional_parquet.widetable_1000_cols with 64 threads but did not notice significant changes in scan times. Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Reviewed-on: http://gerrit.cloudera.org:8080/18191 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins --- M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/handle-cache.h M be/src/runtime/io/handle-cache.inline.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/util/CMakeLists.txt A be/src/util/lru-multi-cache-test.cc A be/src/util/lru-multi-cache.h A be/src/util/lru-multi-cache.inline.h 9 files changed, 1,175 insertions(+), 274 deletions(-) Approvals: Impala Public Jenkins: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 29 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 28: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 28 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 04 Mar 2022 16:57:00 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 27: Code-Review+2 LGTM! -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 27 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 04 Mar 2022 11:59:17 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 28: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7903/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 28 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 04 Mar 2022 11:59:39 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 28: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 28 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 04 Mar 2022 11:59:38 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 27: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/10238/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 27 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 01 Mar 2022 10:52:22 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Gergely Fürnstáhl has uploaded a new patch set (#27). ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. IMPALA-9433: Improved caching of HdfsFileHandles Seperated LRU caching functionality to a templated LruMultiCache class. Replaced std::multimap with std::unordered_map with std::list for O(1) lookups and less memory overhead, as it stores each key one time. Added boost::intrusive::list to handle LRU relations with less overhead. Added O(1) release method, instead of O(n) with minimal memory overhead. Implemented RAII Accessor to remove the responsibility of releasing the objects from the user. Wrapped cache accessor and related DiskIOManager metrics to a FileHandleCache::Accessor. Removed Release*() call trees from FileHandleCache and DiskIOManager, removed scoped exit from HdfsFileReader as they are handled automatically. Testing: Implemented extensive unit testing of the class, including forced rehashes, collisions, capacity overshoot, explicit/automatic release and destroy. Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify FileHandleCache::Accessor behaviour through metrics. Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables, no visible change in performance: TPCH scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49% TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00% Tested some manual queries on functional_parquet.widetable_1000_cols with 64 threads but did not notice significant changes in scan times. Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a --- M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/handle-cache.h M be/src/runtime/io/handle-cache.inline.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/util/CMakeLists.txt A be/src/util/lru-multi-cache-test.cc A be/src/util/lru-multi-cache.h A be/src/util/lru-multi-cache.inline.h 9 files changed, 1,175 insertions(+), 274 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/27 -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 27 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 26: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/10235/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 26 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 28 Feb 2022 14:34:10 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Gergely Fürnstáhl has uploaded a new patch set (#26). ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. IMPALA-9433: Improved caching of HdfsFileHandles Seperated LRU caching functionality to a templated LruMultiCache class. Replaced std::multimap with std::unordered_map with std::list for O(1) lookups and less memory overhead, as it stores each key one time. Added boost::intrusive::list to handle LRU relations with less overhead. Added O(1) release method, instead of O(n) with minimal memory overhead. Implemented RAII Accessor to remove the responsibility of releasing the objects from the user. Wrapped cache accessor and related DiskIOManager metrics to a FileHandleCache::Accessor. Removed Release*() call trees from FileHandleCache and DiskIOManager, removed scoped exit from HdfsFileReader as they are handled automatically. Testing: Implemented extensive unit testing of the class, including forced rehashes, collisions, capacity overshoot, explicit/automatic release and destroy. Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify FileHandleCache::Accessor behaviour through metrics. Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables, no visible change in performance: TPCH scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49% TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00% Tested some manual queries on functional_parquet.widetable_1000_cols with 64 threads but did not notice significant changes in scan times. Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a --- M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/handle-cache.h M be/src/runtime/io/handle-cache.inline.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/util/CMakeLists.txt A be/src/util/lru-multi-cache-test.cc A be/src/util/lru-multi-cache.h A be/src/util/lru-multi-cache.inline.h 9 files changed, 1,175 insertions(+), 274 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/26 -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 26 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 25: Code-Review+1 (5 comments) The code looks awesome! http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/runtime/io/handle-cache.inline.h File be/src/runtime/io/handle-cache.inline.h: http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/runtime/io/handle-cache.inline.h@80 PS25, Line 80: if (cache_accessor_.Get()) : ImpaladMetrics::IO_MGR_NUM_FILE_HANDLES_OUTSTANDING->Increment(1L); nit: multi-line if stmt needs braces http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/util/lru-multi-cache.h File be/src/util/lru-multi-cache.h: http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/util/lru-multi-cache.h@53 PS25, Line 53: deigned nit: designed http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/util/lru-multi-cache.h@61 PS25, Line 61: nit: extra space http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/util/lru-multi-cache.h@74 PS25, Line 74: LRU order Please mention that least recently used elements are at the front. http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/util/lru-multi-cache.inline.h File be/src/util/lru-multi-cache.inline.h: http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/util/lru-multi-cache.inline.h@214 PS25, Line 214: in_use Do we need 'in_use'? Can't we just use 'member_hook.is_linked()' instead? Maybe we just need a member function InUse(): return !member_hook.is_linked(); -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 25 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 28 Feb 2022 11:21:41 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 25: Code-Review+2 (1 comment) This is really good, thanks for all the hard work! http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h File be/src/util/lru-multi-cache.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@170 PS21, Line 170: typedef std::list Container; : > Started to implement with boost::optional, could work for Release/Destroy, Ok, now I get it. The current approach is good. -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 25 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 25 Feb 2022 17:32:15 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 25: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/10228/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 25 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 25 Feb 2022 11:48:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Gergely Fürnstáhl has uploaded a new patch set (#25). ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. IMPALA-9433: Improved caching of HdfsFileHandles Seperated LRU caching functionality to a templated LruMultiCache class. Replaced std::multimap with std::unordered_map with std::list for O(1) lookups and less memory overhead, as it stores each key one time. Added boost::intrusive::list to handle LRU relations with less overhead. Added O(1) release method, instead of O(n) with minimal memory overhead. Implemented RAII Accessor to remove the responsibility of releasing the objects from the user. Wrapped cache accessor and related DiskIOManager metrics to a FileHandleCache::Accessor. Removed Release*() call trees from FileHandleCache and DiskIOManager, removed scoped exit from HdfsFileReader as they are handled automatically. Testing: Implemented extensive unit testing of the class, including forced rehashes, collisions, capacity overshoot, explicit/automatic release and destroy. Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify FileHandleCache::Accessor behaviour through metrics. Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables, no visible change in performance: TPCH scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49% TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00% Tested some manual queries on functional_parquet.widetable_1000_cols with 64 threads but did not notice significant changes in scan times. Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a --- M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/handle-cache.h M be/src/runtime/io/handle-cache.inline.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/util/CMakeLists.txt A be/src/util/lru-multi-cache-test.cc A be/src/util/lru-multi-cache.h A be/src/util/lru-multi-cache.inline.h 9 files changed, 1,178 insertions(+), 274 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/25 -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 25 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Gergely Fürnstáhl has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 24: Yes, I did some tests on functional_parquet.widetable_1000_cols, with increased number of threads (I could tweak up to 64 I think) to maybe catch some improvement in lock contention , but did not notice any significant change in performance. -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 24 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 25 Feb 2022 11:25:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Gergely Fürnstáhl has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 21: (1 comment) http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h File be/src/util/lru-multi-cache.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@170 PS21, Line 170: Container_internal& container; : typename Container_internal::iterator it; > I haven't tried this, but here is something that might work: Started to implement with boost::optional, could work for Release/Destroy, but not for eviction. EvictOne only sees the ValueType_internal object through the LRU list, that's why I put the self referencing iterator there in the beginning. I could add another layer of indirection, not using the ValueType_internal as the LruList's inner type, rather a seperate inner class which has a Container::iterator member and the intrusive list hook, but that somewhat loses the point of the intrusive list, I don't really like it -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 21 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 25 Feb 2022 11:24:23 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 24: Code-Review+1 (1 comment) This is looking good to me. Did we do a targeted performance test? One example would be to make a Parquet table with a large number of files with lots of files. In past work, I took the Parquet file from functional_parquet.widetable_1000_cols and created a table with 1000 copies of that file (copy the table schema by using "create table like"). A select * on that table would require a large number of file handle cache accesses. http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h File be/src/util/lru-multi-cache.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@170 PS21, Line 170: typedef std::list Container; : > I've considered moving this to the accessor, but decided to keep it here to I haven't tried this, but here is something that might work: 1. Add a boolean variable indicating whether the Accessor is valid 2. Add an Accessor constructor that takes no arguments. It initializes valid to false and doesn't do anything to the iterator. The iterator is in an undefined state, so we can't use it. For the valid=false case, we shouldn't need to use it. 3. Modify the other constructor to always set valid=true and set the iterator. 4. Modify code to treat valid=false the same as the null case today. -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 24 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 25 Feb 2022 03:10:22 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 24: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/10222/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 24 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 24 Feb 2022 13:32:42 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 22: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/10221/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 22 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 24 Feb 2022 13:26:21 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Gergely Fürnstáhl has uploaded a new patch set (#24). ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. IMPALA-9433: Improved caching of HdfsFileHandles Seperated LRU caching functionality to a templated LruMultiCache class. Replaced std::multimap with std::unordered_map with std::list for O(1) lookups and less memory overhead, as it stores each key one time. Added boost::intrusive::list to handle LRU relations with less overhead. Added O(1) release method, instead of O(n) with minimal memory overhead. Implemented RAII Accessor to remove the responsibility of releasing the objects from the user. Wrapped cache accessor and related DiskIOManager metrics to a FileHandleCache::Accessor. Removed Release*() call trees from FileHandleCache and DiskIOManager, removed scoped exit from HdfsFileReader as they are handled automatically. Testing: Implemented extensive unit testing of the class, including forced rehashes, collisions, capacity overshoot, explicit/automatic release and destroy. Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify FileHandleCache::Accessor behaviour through metrics. Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables, no visible change in performance: TPCH scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49% TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00% Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a --- M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/handle-cache.h M be/src/runtime/io/handle-cache.inline.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/util/CMakeLists.txt A be/src/util/lru-multi-cache-test.cc A be/src/util/lru-multi-cache.h A be/src/util/lru-multi-cache.inline.h 9 files changed, 1,178 insertions(+), 274 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/24 -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 24 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Gergely Fürnstáhl has uploaded a new patch set (#22). ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. IMPALA-9433: Improved caching of HdfsFileHandles Seperated LRU caching functionality to a templated LruMultiCache class. Replaced std::multimap with std::unordered_map with std::list for O(1) lookups and less memory overhead, as it stores each key one time. Added boost::intrusive::list to handle LRU relations with less overhead. Added O(1) release method, instead of O(n) with minimal memory overhead. Implemented RAII Accessor to remove the responsibility of releasing the objects from the user. Wrapped cache accessor and related DiskIOManager metrics to a FileHandleCache::Accessor. Removed Release*() call trees from FileHandleCache and DiskIOManager, removed scoped exit from HdfsFileReader as they are handled automatically. Testing: Implemented extensive unit testing of the class, including forced rehashes, collisions, capacity overshoot, explicit/automatic release and destroy. Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify FileHandleCache::Accessor behaviour through metrics. Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables, no visible change in performance: TPCH scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49% TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00% Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a --- M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/handle-cache.h M be/src/runtime/io/handle-cache.inline.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/util/CMakeLists.txt A be/src/util/lru-multi-cache-test.cc A be/src/util/lru-multi-cache.h A be/src/util/lru-multi-cache.inline.h 9 files changed, 1,182 insertions(+), 274 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/22 -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 22 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Gergely Fürnstáhl has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 21: (9 comments) http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h File be/src/runtime/io/handle-cache.inline.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@167 PS21, Line 167: // Opening a file handle requires talking to the NameNode so it can take some time. : RETURN_IF_ERROR(accessor_tmp.Get()->Init(hdfs_monitor_)); > Ok, great! Done http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@167 PS21, Line 167: // Opening a file handle requires talking to the NameNode so it can take some time. : RETURN_IF_ERROR(accessor_tmp.Get()->Init(hdfs_monitor_)); > Ok, great! Done http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h File be/src/util/lru-multi-cache.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@170 PS21, Line 170: Container_internal& container; : typename Container_internal::iterator it; > I was thinking about whether there is a way to avoid having the element kee I've considered moving this to the accessor, but decided to keep it here to spare constructions every time an accessor is created. I like the idea of having iterator instead of pointer in the accessor, it would make it clear it's not something to delete. However, I can't see a clear solution for empty Accessor yet: - if we want to default initialize the iterator, we would need access to the owning container too - creating a dummy empty Container in the cache is possible, but feels more hacky than this way In an earlier patchset, I used a unique_ptr with custom deleter (cache->Release()). That's still a possibility, making the default destructor good enough, however FileHandleCache::Accessor calls explicit release/destroy anyways to handle metrics, so I decided to remove it. http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@252 PS21, Line 252: ValueType_internal* p_value_internal_; > Consider making this an iterator to the std::list element I did, explained it on line 171 http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h File be/src/util/lru-multi-cache.inline.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@147 PS21, Line 147: // Move the object to the back of the owning list as it is no longer available : container.splice(container.end(), container, container_it); > Ok, that makes sense. Done http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@175 PS21, Line 175: container.emplace_back((*this), stored_key, container, std::forward(args)...); > If you wanted to get an iterator back, the regular emplace() call returns a Thanks, a bit cleaner this way http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@201 PS21, Line 201: // Move the object to the front, keep LRU relation in owning list too to : // be able to age out unused objects : container.splice(container.begin(), container, p_value_internal->it); > Basically, once you decide to move the in_use=true to the end of the list i Yes, exactly http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@222 PS21, Line 222: if (container.size() == 1) { : // Last element, owning list can be removed to prevent aging : hash_table_.erase(p_value_internal->key); : } else { : // Remove from owning list : container.erase(p_value_internal->it); : } > Ok, so can we add a DCHECK that this is in use? Maybe also add a comment th Done http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@251 PS21, Line 251: // Remove from LRU cache : value_internal.member_hook.unlink(); > Nit: Can we add a DCHECK that this is not in use? Done -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 21 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 24 Feb 2022 13:03:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 22: (4 comments) http://gerrit.cloudera.org:8080/#/c/18191/22/be/src/runtime/io/handle-cache.h File be/src/runtime/io/handle-cache.h: http://gerrit.cloudera.org:8080/#/c/18191/22/be/src/runtime/io/handle-cache.h@117 PS22, Line 117: typedef LruMultiCache, CachedHdfsFileHandle> CacheType; line too long (91 > 90) http://gerrit.cloudera.org:8080/#/c/18191/22/be/src/util/lru-multi-cache.inline.h File be/src/util/lru-multi-cache.inline.h: http://gerrit.cloudera.org:8080/#/c/18191/22/be/src/util/lru-multi-cache.inline.h@132 PS22, Line 132: line has trailing whitespace http://gerrit.cloudera.org:8080/#/c/18191/22/be/src/util/lru-multi-cache.inline.h@165 PS22, Line 165: auto container_it = container.emplace(container.end(), (*this), stored_key, container, std::forward(args)...); line too long (118 > 90) http://gerrit.cloudera.org:8080/#/c/18191/22/be/src/util/lru-multi-cache.inline.h@252 PS22, Line 252: line has trailing whitespace -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 22 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 24 Feb 2022 13:04:08 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 21: (8 comments) > (12 comments) > > > > > (8 comments) > > > > > > > > Thank you for taking this on. There is a lot of history here. > > > > Originally, the file handle cache used a generic structure > like > > > > this: > > > > https://github.com/apache/impala/blob/branch-2.10.0/be/src/util/lru-cache.h > > > > > > > > In my rewrite, I switched it to remove the generic structure. > > > This > > > > heads back in the other direction. > > > > > > > > I like that you have backend tests for the generic data > > > structure, > > > > which is definitely one advantage of that approach. One > > question > > > I > > > > have about moving back to a generic structure is whether we > > would > > > > be able to add new customization to the file handle cache > case. > > I > > > > had been thinking about adding a file structure that could > > > contain > > > > additional per-file data and/or stats. Is that possible with > > the > > > > new generic structure? > > > > > > "had been thinking about adding a file structure that could > > contain > > > additional per-file data and/or stat" > > > > > > I was thinking about similar things (e.g. caching processed > > > Parquet/ORC headers), but this seems a somewhat different > feature > > > to me - while we want to cache more than one file handle per > file > > > and apply LRU logic per handle, we want to cache data for a > file > > > only once and apply LRU logic per file. > > > > Yeah, it's unclear whether we would ever want to extend the file > > handle cache > > to deal with other things. Separate data structures may be > cleaner > > even > > if it means duplicating filename strings or other things. The > file > > handle > > cache is pretty unusual in structure and historically we haven't > > extended > > it. I don't have any strong objection to a generic structure. I > > just wanted > > to think through whether there are any extensions that would end > up > > getting more complicated. > > > > For more ordinary caches that don't need duplication, we should > be > > using the > > cache implementations in be/src/util/cache, because that also > gets > > us different > > cache eviction policies. > > Moving the caching to a separate class definitely helps with > testing. > > I think making it generic helps with encapsulation too, as the > caching algorithm (even if it is somewhat specialized to a use > case) has nothing to do with the stored data. Moreover, making it > generic helped with unit testing too, as I didn't have to juggle > around file handles to test out the caching feature. > > As I understand, the requirements were simple enough to fit it in a > generic structure. This would help in the future to decide - if we > ever want to use something similar to this - whether if it's easier > to extend with some configurability or not. We can specialize this > more towards file handle caching, although the arguments for > templated key/value are still applicable in that case, maybe we can > change the name and place of the code for something less generic. > > Regarding to per-file data/stats, we could squeeze it in (e.g. > unordered_map> but I don't think > that is a good direction, it breaks the encapsulation of the cache. > I would rather put a container in FileHandleCache class next to the > cache and do handle based operations/metrics during creation > (GetFileHandle() new entry branch) and do access based > operation/metrics in FileHandleCache::Accessor This generic design is good, and we don't need to overthink this. My feeling is that the odds of us reusing the generic structure for something else are pretty low. (We'll find out! Maybe I'm wrong.) If you believe that, then I think an intermediate point is a non-generic structure that allows mocking the FileHandle. In other words, it allows for writing backend tests without creating real file handles, but it also isn't dealing with arbitrary types. Either way, I think if we need to extend this, then we will find a way to make it work. I'm not concerned. http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h File be/src/runtime/io/handle-cache.inline.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@167 PS21, Line 167: // Opening a file handle requires talking to the NameNode so it can take some time. : RETURN_IF_ERROR(accessor_tmp.Get()->Init(hdfs_monitor_)); > Yes, it will be in_use, does not lock the cache and not available for other Ok, great! http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h File be/src/util/lru-multi-cache.h: http://gerrit.cloudera.org:8080/#/c/18191/21/
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Gergely Fürnstáhl has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 21: (12 comments) > > > (8 comments) > > > > > > Thank you for taking this on. There is a lot of history here. > > > Originally, the file handle cache used a generic structure like > > > this: > > > https://github.com/apache/impala/blob/branch-2.10.0/be/src/util/lru-cache.h > > > > > > In my rewrite, I switched it to remove the generic structure. > > This > > > heads back in the other direction. > > > > > > I like that you have backend tests for the generic data > > structure, > > > which is definitely one advantage of that approach. One > question > > I > > > have about moving back to a generic structure is whether we > would > > > be able to add new customization to the file handle cache case. > I > > > had been thinking about adding a file structure that could > > contain > > > additional per-file data and/or stats. Is that possible with > the > > > new generic structure? > > > > "had been thinking about adding a file structure that could > contain > > additional per-file data and/or stat" > > > > I was thinking about similar things (e.g. caching processed > > Parquet/ORC headers), but this seems a somewhat different feature > > to me - while we want to cache more than one file handle per file > > and apply LRU logic per handle, we want to cache data for a file > > only once and apply LRU logic per file. > > Yeah, it's unclear whether we would ever want to extend the file > handle cache > to deal with other things. Separate data structures may be cleaner > even > if it means duplicating filename strings or other things. The file > handle > cache is pretty unusual in structure and historically we haven't > extended > it. I don't have any strong objection to a generic structure. I > just wanted > to think through whether there are any extensions that would end up > getting more complicated. > > For more ordinary caches that don't need duplication, we should be > using the > cache implementations in be/src/util/cache, because that also gets > us different > cache eviction policies. Moving the caching to a separate class definitely helps with testing. I think making it generic helps with encapsulation too, as the caching algorithm (even if it is somewhat specialized to a use case) has nothing to do with the stored data. Moreover, making it generic helped with unit testing too, as I didn't have to juggle around file handles to test out the caching feature. As I understand, the requirements were simple enough to fit it in a generic structure. This would help in the future to decide - if we ever want to use something similar to this - whether if it's easier to extend with some configurability or not. We can specialize this more towards file handle caching, although the arguments for templated key/value are still applicable in that case, maybe we can change the name and place of the code for something less generic. Regarding to per-file data/stats, we could squeeze it in (e.g. unordered_map> but I don't think that is a good direction, it breaks the encapsulation of the cache. I would rather put a container in FileHandleCache class next to the cache and do handle based operations/metrics during creation (GetFileHandle() new entry branch) and do access based operation/metrics in FileHandleCache::Accessor http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.h File be/src/runtime/io/handle-cache.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.h@118 PS21, Line 118: return file_handle.mtime() == mtime; > The semantics of mtime is unclear to me - why do we handle it separately a Very good point, I think I got stuck on earlier design :) http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h File be/src/runtime/io/handle-cache.inline.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@167 PS21, Line 167: // Opening a file handle requires talking to the NameNode so it can take some time. : RETURN_IF_ERROR(accessor_tmp.Get()->Init(hdfs_monitor_)); > Let me double-check my understanding of the threading here: Yes, it will be in_use, does not lock the cache and not available for other threads. http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@167 PS21, Line 167: // Opening a file handle requires talking to the NameNode so it can take some time. : RETURN_IF_ERROR(accessor_tmp.Get()->Init(hdfs_monitor_)); > Another question: How does this work if this call fails? Does the entry get Very good point, now it will be released back to the cache, I will add a destroy http://gerrit.cloudera.org:8080/#/c/18191/21
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 21: (1 comment) http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h File be/src/runtime/io/handle-cache.inline.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@167 PS21, Line 167: // Opening a file handle requires talking to the NameNode so it can take some time. : RETURN_IF_ERROR(accessor_tmp.Get()->Init(hdfs_monitor_)); > Let me double-check my understanding of the threading here: Another question: How does this work if this call fails? Does the entry get removed from the cache? -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 21 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 22 Feb 2022 18:47:22 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 21: > > (8 comments) > > > > Thank you for taking this on. There is a lot of history here. > > Originally, the file handle cache used a generic structure like > > this: > > https://github.com/apache/impala/blob/branch-2.10.0/be/src/util/lru-cache.h > > > > In my rewrite, I switched it to remove the generic structure. > This > > heads back in the other direction. > > > > I like that you have backend tests for the generic data > structure, > > which is definitely one advantage of that approach. One question > I > > have about moving back to a generic structure is whether we would > > be able to add new customization to the file handle cache case. I > > had been thinking about adding a file structure that could > contain > > additional per-file data and/or stats. Is that possible with the > > new generic structure? > > "had been thinking about adding a file structure that could contain > additional per-file data and/or stat" > > I was thinking about similar things (e.g. caching processed > Parquet/ORC headers), but this seems a somewhat different feature > to me - while we want to cache more than one file handle per file > and apply LRU logic per handle, we want to cache data for a file > only once and apply LRU logic per file. Yeah, it's unclear whether we would ever want to extend the file handle cache to deal with other things. Separate data structures may be cleaner even if it means duplicating filename strings or other things. The file handle cache is pretty unusual in structure and historically we haven't extended it. I don't have any strong objection to a generic structure. I just wanted to think through whether there are any extensions that would end up getting more complicated. For more ordinary caches that don't need duplication, we should be using the cache implementations in be/src/util/cache, because that also gets us different cache eviction policies. -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 21 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 22 Feb 2022 18:46:25 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 21: > (8 comments) > > Thank you for taking this on. There is a lot of history here. > Originally, the file handle cache used a generic structure like > this: > https://github.com/apache/impala/blob/branch-2.10.0/be/src/util/lru-cache.h > > In my rewrite, I switched it to remove the generic structure. This > heads back in the other direction. > > I like that you have backend tests for the generic data structure, > which is definitely one advantage of that approach. One question I > have about moving back to a generic structure is whether we would > be able to add new customization to the file handle cache case. I > had been thinking about adding a file structure that could contain > additional per-file data and/or stats. Is that possible with the > new generic structure? "had been thinking about adding a file structure that could contain additional per-file data and/or stat" I was thinking about similar things (e.g. caching processed Parquet/ORC headers), but this seems a somewhat different feature to me - while we want to cache more than one file handle per file and apply LRU logic per handle, we want to cache data for a file only once and apply LRU logic per file. -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 21 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 22 Feb 2022 11:22:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 21: (8 comments) Thank you for taking this on. There is a lot of history here. Originally, the file handle cache used a generic structure like this: https://github.com/apache/impala/blob/branch-2.10.0/be/src/util/lru-cache.h In my rewrite, I switched it to remove the generic structure. This heads back in the other direction. I like that you have backend tests for the generic data structure, which is definitely one advantage of that approach. One question I have about moving back to a generic structure is whether we would be able to add new customization to the file handle cache case. I had been thinking about adding a file structure that could contain additional per-file data and/or stats. Is that possible with the new generic structure? http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h File be/src/runtime/io/handle-cache.inline.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@167 PS21, Line 167: // Opening a file handle requires talking to the NameNode so it can take some time. : RETURN_IF_ERROR(accessor_tmp.Get()->Init(hdfs_monitor_)); Let me double-check my understanding of the threading here: The EmplaceAndGet adds an entry, but that entry is in_use=true. If this function takes a while, other threads will ignore the entry because it is in use. http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@170 PS21, Line 170: it/out Nit: in/out http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h File be/src/util/lru-multi-cache.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@125 PS21, Line 125: resf ot Nit: rest of http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@170 PS21, Line 170: Container_internal& container; : typename Container_internal::iterator it; Did we consider using an intrusive list for this? http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h File be/src/util/lru-multi-cache.inline.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@147 PS21, Line 147: // Move the object to the back of the owning list as it is no longer available : container.splice(container.end(), container, container_it); Why move it to the back? It could stay where it is and Get() would skip over it because in_use=true (i.e. change the break above to a continue). I don't think we get a meaningful speedup from avoiding a couple link list hops and shuffling the list around. http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@197 PS21, Line 197: p_value_internal->timestamp_seconds = MonotonicSeconds() > I think that we didn't refresh timestamp_second in the old version, so soon The old behavior does reset the timestamp each time we use it. The code is subtle, but what is happening is that each get/release cycle the entry is removed from the LRU list and then added back. The timestamp is set each time it is added to the LRU list. If something is in continuous use, we would keep it open. One nice thing about the old behavior is that the time-based eviction is O(1), because the ordering matches the LRU list ordering. The eviction thread can walk from oldest to newest and then stop when it sees the first timestamp that isn't eligible for eviction. http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@201 PS21, Line 201: // Move the object to the front, keep LRU relation in owning list too to : // be able to age out unused objects : container.splice(container.begin(), container, p_value_internal->it); Do we need to be moving entries? I'm not sure this gets us anything. Even if we leave the objects in the order they were created, the fact that we prefer the earlier entries will cause the later entries to be less frequently used and age out faster. http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@222 PS21, Line 222: if (container.size() == 1) { : // Last element, owning list can be removed to prevent aging : hash_table_.erase(p_value_internal->key); : } else { : // Remove from owning list : container.erase(p_value_internal->it); : } When/how does the element get removed from the LRU list? (I'm guessing it has something to do with auto unlink) -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerr
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 21: (3 comments) http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.h File be/src/runtime/io/handle-cache.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.h@118 PS21, Line 118: return file_handle.mtime() == mtime; The semantics of mtime is unclear to me - why do we handle it separately and not as the part of the key? My understanding is that we always do lookups with exact file name and mtime and they remain constant for the lifetime of the handle. http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h File be/src/util/lru-multi-cache.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@78 PS21, Line 78: /// this is a intrusive (as known as not owning) list, used for eviction nit: an http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h File be/src/util/lru-multi-cache.inline.h: http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@197 PS21, Line 197: p_value_internal->timestamp_seconds = MonotonicSeconds() I think that we didn't refresh timestamp_second in the old version, so sooner or later we closed handles, even if it was used again and again. This can influence what happens if the file was deleted - keeping a file handle always open means the replica on the given data node cannot be deleted. https://www.quora.com/What-happens-if-you-attempt-to-delete-a-file-in-HDFS-while-its-being-used-for-a-job -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 21 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 17 Feb 2022 19:03:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 21: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/10159/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 21 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 15 Feb 2022 09:21:19 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Gergely Fürnstáhl has uploaded a new patch set (#21). ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. IMPALA-9433: Improved caching of HdfsFileHandles Seperated LRU caching functionality to a templated LruMultiCache class. Replaced std::multimap with std::unordered_map with std::list for O(1) lookups and less memory overhead, as it stores each key one time. Added boost::intrusive::list to handle LRU relations with less overhead. Added O(1) release method, instead of O(n) with minimal memory overhead. Implemented RAII Accessor to remove the responsibility of releasing the objects from the user. Wrapped cache accessor and related DiskIOManager metrics to a FileHandleCache::Accessor. Removed Release*() call trees from FileHandleCache and DiskIOManager, removed scoped exit from HdfsFileReader as they are handled automatically. Testing: Implemented extensive unit testing of the class, including forced rehashes, collisions, capacity overshoot, explicit/automatic release and destroy. Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify FileHandleCache::Accessor behaviour through metrics. Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables, no visible change in performance: TPCH scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49% TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00% Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a --- M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/handle-cache.h M be/src/runtime/io/handle-cache.inline.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/util/CMakeLists.txt A be/src/util/lru-multi-cache-test.cc A be/src/util/lru-multi-cache.h A be/src/util/lru-multi-cache.inline.h 9 files changed, 1,190 insertions(+), 274 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/21 -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 21 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 20: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/10152/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 20 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 14 Feb 2022 13:03:00 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Gergely Fürnstáhl has uploaded a new patch set (#20). ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. IMPALA-9433: Improved caching of HdfsFileHandles Seperated LRU caching functionality to a templated LruMultiCache class. Replaced std::multimap with std::unordered_map with std::list for O(1) lookups and less memory overhead, as it stores each key one time. Added boost::intrusive::list to handle LRU relations with less overhead. Added O(1) release method, instead of O(n) with minimal memory overhead. Implemented RAII Accessor to remove the responsibility of releasing the objects from the user. Wrapped cache accessor and related DiskIOManager metrics to a FileHandleCache::Accessor. Removed Release*() call trees from FileHandleCache and DiskIOManager, removed scoped exit from HdfsFileReader as they are handled automatically. Testing: Implemented extensive unit testing of the class, including forced rehashes, collisions, capacity overshoot, explicit/automatic release and destroy. Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify FileHandleCache::Accessor behaviour through metrics. Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables, no visible change in performance: TPCH scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49% TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00% Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a --- M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/handle-cache.h M be/src/runtime/io/handle-cache.inline.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/util/CMakeLists.txt A be/src/util/lru-multi-cache-test.cc A be/src/util/lru-multi-cache.h A be/src/util/lru-multi-cache.inline.h 9 files changed, 1,183 insertions(+), 274 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/20 -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 20 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 19: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/10150/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 19 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 14 Feb 2022 11:47:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. Patch Set 18: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/10149/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 18 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 14 Feb 2022 11:39:21 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles
Gergely Fürnstáhl has uploaded a new patch set (#19). ( http://gerrit.cloudera.org:8080/18191 ) Change subject: IMPALA-9433: Improved caching of HdfsFileHandles .. IMPALA-9433: Improved caching of HdfsFileHandles Seperated LRU caching functionality to a templated LruMultiCache class. Replaced std::multimap with std::unordered_map with std::list for O(1) lookups and less memory overhead, as it stores each key one time. Added boost::intrusive::list to handle LRU relations with less overhead. Added O(1) release method, instead of O(n) with minimal memory overhead. Implemented RAII Accessor to remove the responsibility of releasing the objects from the user. Wrapped cache accessor and related DiskIOManager metrics to a FileHandleCache::Accessor. Removed Release*() call trees from FileHandleCache and DiskIOManager, removed scoped exit from HdfsFileReader as they are handled automatically. Testing: Implemented extensive unit testing of the class, including forced rehashes, collisions, capacity overshoot, explicit/automatic release and destroy. Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify FileHandleCache::Accessor behaviour through metrics. Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables, no visible change in performance: TPCH scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49% TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00% Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a --- M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/handle-cache.h M be/src/runtime/io/handle-cache.inline.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/util/CMakeLists.txt A be/src/util/lru-multi-cache-test.cc A be/src/util/lru-multi-cache.h A be/src/util/lru-multi-cache.inline.h 9 files changed, 1,188 insertions(+), 274 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/19 -- To view, visit http://gerrit.cloudera.org:8080/18191 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a Gerrit-Change-Number: 18191 Gerrit-PatchSet: 19 Gerrit-Owner: Gergely Fürnstáhl Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gergely Fürnstáhl Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Zoltan Borok-Nagy