[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-03-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..

IMPALA-9433: Improved caching of HdfsFileHandles

Seperated LRU caching functionality to a templated LruMultiCache class.

Replaced std::multimap with std::unordered_map with std::list for O(1)
lookups and less memory overhead, as it stores each key one time. Added
boost::intrusive::list to handle LRU relations with less overhead.
Added O(1) release method, instead of O(n) with minimal memory overhead.
Implemented RAII Accessor to remove the responsibility of releasing
the objects from the user.

Wrapped cache accessor and related DiskIOManager metrics to a
FileHandleCache::Accessor. Removed Release*() call trees from
FileHandleCache and DiskIOManager, removed scoped exit from
HdfsFileReader as they are handled automatically.

Testing:

Implemented extensive unit testing of the class, including forced
rehashes, collisions, capacity overshoot, explicit/automatic release
and destroy.

Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify
FileHandleCache::Accessor behaviour through metrics.

Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables,
no visible change in performance:
TPCH   scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49%
TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00%

Tested some manual queries on functional_parquet.widetable_1000_cols
with 64 threads but did not notice significant changes in scan times.

Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Reviewed-on: http://gerrit.cloudera.org:8080/18191
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/handle-cache.h
M be/src/runtime/io/handle-cache.inline.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/util/CMakeLists.txt
A be/src/util/lru-multi-cache-test.cc
A be/src/util/lru-multi-cache.h
A be/src/util/lru-multi-cache.inline.h
9 files changed, 1,175 insertions(+), 274 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 29
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-03-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 28: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 28
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 04 Mar 2022 16:57:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-03-04 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 27: Code-Review+2

LGTM!


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 27
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 04 Mar 2022 11:59:17 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-03-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 28:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7903/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 28
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 04 Mar 2022 11:59:39 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-03-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 28: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 28
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 04 Mar 2022 11:59:38 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-03-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 27:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10238/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 27
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 01 Mar 2022 10:52:22 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-03-01 Thread Code Review
Gergely Fürnstáhl has uploaded a new patch set (#27). ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..

IMPALA-9433: Improved caching of HdfsFileHandles

Seperated LRU caching functionality to a templated LruMultiCache class.

Replaced std::multimap with std::unordered_map with std::list for O(1)
lookups and less memory overhead, as it stores each key one time. Added
boost::intrusive::list to handle LRU relations with less overhead.
Added O(1) release method, instead of O(n) with minimal memory overhead.
Implemented RAII Accessor to remove the responsibility of releasing
the objects from the user.

Wrapped cache accessor and related DiskIOManager metrics to a
FileHandleCache::Accessor. Removed Release*() call trees from
FileHandleCache and DiskIOManager, removed scoped exit from
HdfsFileReader as they are handled automatically.

Testing:

Implemented extensive unit testing of the class, including forced
rehashes, collisions, capacity overshoot, explicit/automatic release
and destroy.

Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify
FileHandleCache::Accessor behaviour through metrics.

Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables,
no visible change in performance:
TPCH   scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49%
TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00%

Tested some manual queries on functional_parquet.widetable_1000_cols
with 64 threads but did not notice significant changes in scan times.

Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
---
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/handle-cache.h
M be/src/runtime/io/handle-cache.inline.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/util/CMakeLists.txt
A be/src/util/lru-multi-cache-test.cc
A be/src/util/lru-multi-cache.h
A be/src/util/lru-multi-cache.inline.h
9 files changed, 1,175 insertions(+), 274 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/27
--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 27
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-28 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 26:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10235/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 26
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 28 Feb 2022 14:34:10 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-28 Thread Code Review
Gergely Fürnstáhl has uploaded a new patch set (#26). ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..

IMPALA-9433: Improved caching of HdfsFileHandles

Seperated LRU caching functionality to a templated LruMultiCache class.

Replaced std::multimap with std::unordered_map with std::list for O(1)
lookups and less memory overhead, as it stores each key one time. Added
boost::intrusive::list to handle LRU relations with less overhead.
Added O(1) release method, instead of O(n) with minimal memory overhead.
Implemented RAII Accessor to remove the responsibility of releasing
the objects from the user.

Wrapped cache accessor and related DiskIOManager metrics to a
FileHandleCache::Accessor. Removed Release*() call trees from
FileHandleCache and DiskIOManager, removed scoped exit from
HdfsFileReader as they are handled automatically.

Testing:

Implemented extensive unit testing of the class, including forced
rehashes, collisions, capacity overshoot, explicit/automatic release
and destroy.

Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify
FileHandleCache::Accessor behaviour through metrics.

Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables,
no visible change in performance:
TPCH   scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49%
TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00%

Tested some manual queries on functional_parquet.widetable_1000_cols
with 64 threads but did not notice significant changes in scan times.

Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
---
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/handle-cache.h
M be/src/runtime/io/handle-cache.inline.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/util/CMakeLists.txt
A be/src/util/lru-multi-cache-test.cc
A be/src/util/lru-multi-cache.h
A be/src/util/lru-multi-cache.inline.h
9 files changed, 1,175 insertions(+), 274 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/26
--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 26
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-28 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 25: Code-Review+1

(5 comments)

The code looks awesome!

http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/runtime/io/handle-cache.inline.h
File be/src/runtime/io/handle-cache.inline.h:

http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/runtime/io/handle-cache.inline.h@80
PS25, Line 80:   if (cache_accessor_.Get())
 : 
ImpaladMetrics::IO_MGR_NUM_FILE_HANDLES_OUTSTANDING->Increment(1L);
nit: multi-line if stmt needs braces


http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/util/lru-multi-cache.h
File be/src/util/lru-multi-cache.h:

http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/util/lru-multi-cache.h@53
PS25, Line 53: deigned
nit: designed


http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/util/lru-multi-cache.h@61
PS25, Line 61:
nit: extra space


http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/util/lru-multi-cache.h@74
PS25, Line 74: LRU order
Please mention that least recently used elements are at the front.


http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/util/lru-multi-cache.inline.h
File be/src/util/lru-multi-cache.inline.h:

http://gerrit.cloudera.org:8080/#/c/18191/25/be/src/util/lru-multi-cache.inline.h@214
PS25, Line 214: in_use
Do we need 'in_use'? Can't we just use 'member_hook.is_linked()' instead?

Maybe we just need a member function InUse(): return !member_hook.is_linked();



--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 25
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 28 Feb 2022 11:21:41 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-25 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 25: Code-Review+2

(1 comment)

This is really good, thanks for all the hard work!

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h
File be/src/util/lru-multi-cache.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@170
PS21, Line 170:   typedef std::list Container;
  :
> Started to implement with boost::optional, could work for Release/Destroy,
Ok, now I get it. The current approach is good.



--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 25
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 25 Feb 2022 17:32:15 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-25 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 25:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10228/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 25
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 25 Feb 2022 11:48:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-25 Thread Code Review
Gergely Fürnstáhl has uploaded a new patch set (#25). ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..

IMPALA-9433: Improved caching of HdfsFileHandles

Seperated LRU caching functionality to a templated LruMultiCache class.

Replaced std::multimap with std::unordered_map with std::list for O(1)
lookups and less memory overhead, as it stores each key one time. Added
boost::intrusive::list to handle LRU relations with less overhead.
Added O(1) release method, instead of O(n) with minimal memory overhead.
Implemented RAII Accessor to remove the responsibility of releasing
the objects from the user.

Wrapped cache accessor and related DiskIOManager metrics to a
FileHandleCache::Accessor. Removed Release*() call trees from
FileHandleCache and DiskIOManager, removed scoped exit from
HdfsFileReader as they are handled automatically.

Testing:

Implemented extensive unit testing of the class, including forced
rehashes, collisions, capacity overshoot, explicit/automatic release
and destroy.

Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify
FileHandleCache::Accessor behaviour through metrics.

Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables,
no visible change in performance:
TPCH   scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49%
TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00%

Tested some manual queries on functional_parquet.widetable_1000_cols
with 64 threads but did not notice significant changes in scan times.

Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
---
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/handle-cache.h
M be/src/runtime/io/handle-cache.inline.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/util/CMakeLists.txt
A be/src/util/lru-multi-cache-test.cc
A be/src/util/lru-multi-cache.h
A be/src/util/lru-multi-cache.inline.h
9 files changed, 1,178 insertions(+), 274 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/25
--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 25
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-25 Thread Code Review
Gergely Fürnstáhl has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 24:

Yes, I did some tests on functional_parquet.widetable_1000_cols, with increased 
number of threads (I could tweak up to 64 I think) to maybe catch some 
improvement in lock contention , but did not notice any significant change in 
performance.


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 24
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 25 Feb 2022 11:25:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-25 Thread Code Review
Gergely Fürnstáhl has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 21:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h
File be/src/util/lru-multi-cache.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@170
PS21, Line 170: Container_internal& container;
  : typename Container_internal::iterator it;
> I haven't tried this, but here is something that might work:
Started to implement with boost::optional, could work for Release/Destroy, but 
not for eviction. EvictOne only sees the ValueType_internal object through the 
LRU list, that's why I put the self referencing iterator there in the beginning.

I could add another layer of indirection, not using the ValueType_internal as 
the LruList's inner type, rather a seperate inner class which has a 
Container::iterator member and the intrusive list hook, but that somewhat loses 
the point of the intrusive list, I don't really like it



--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 21
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 25 Feb 2022 11:24:23 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-24 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 24: Code-Review+1

(1 comment)

This is looking good to me.

Did we do a targeted performance test? One example would be to make a Parquet 
table with a large number of files with lots of files. In past work, I took the 
Parquet file from functional_parquet.widetable_1000_cols and created a table 
with 1000 copies of that file (copy the table schema by using "create table 
like"). A select * on that table would require a large number of file handle 
cache accesses.

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h
File be/src/util/lru-multi-cache.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@170
PS21, Line 170:   typedef std::list Container;
  :
> I've considered moving this to the accessor, but decided to keep it here to
I haven't tried this, but here is something that might work:
1. Add a boolean variable indicating whether the Accessor is valid
2. Add an Accessor constructor that takes no arguments. It initializes valid to 
false and doesn't do anything to the iterator. The iterator is in an undefined 
state, so we can't use it. For the valid=false case, we shouldn't need to use 
it.
3. Modify the other constructor to always set valid=true and set the iterator.
4. Modify code to treat valid=false the same as the null case today.



--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 24
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 25 Feb 2022 03:10:22 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 24:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10222/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 24
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 24 Feb 2022 13:32:42 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 22:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10221/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 22
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 24 Feb 2022 13:26:21 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-24 Thread Code Review
Gergely Fürnstáhl has uploaded a new patch set (#24). ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..

IMPALA-9433: Improved caching of HdfsFileHandles

Seperated LRU caching functionality to a templated LruMultiCache class.

Replaced std::multimap with std::unordered_map with std::list for O(1)
lookups and less memory overhead, as it stores each key one time. Added
boost::intrusive::list to handle LRU relations with less overhead.
Added O(1) release method, instead of O(n) with minimal memory overhead.
Implemented RAII Accessor to remove the responsibility of releasing
the objects from the user.

Wrapped cache accessor and related DiskIOManager metrics to a
FileHandleCache::Accessor. Removed Release*() call trees from
FileHandleCache and DiskIOManager, removed scoped exit from
HdfsFileReader as they are handled automatically.

Testing:

Implemented extensive unit testing of the class, including forced
rehashes, collisions, capacity overshoot, explicit/automatic release
and destroy.

Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify
FileHandleCache::Accessor behaviour through metrics.

Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables,
no visible change in performance:
TPCH   scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49%
TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00%

Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
---
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/handle-cache.h
M be/src/runtime/io/handle-cache.inline.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/util/CMakeLists.txt
A be/src/util/lru-multi-cache-test.cc
A be/src/util/lru-multi-cache.h
A be/src/util/lru-multi-cache.inline.h
9 files changed, 1,178 insertions(+), 274 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/24
--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 24
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-24 Thread Code Review
Gergely Fürnstáhl has uploaded a new patch set (#22). ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..

IMPALA-9433: Improved caching of HdfsFileHandles

Seperated LRU caching functionality to a templated LruMultiCache class.

Replaced std::multimap with std::unordered_map with std::list for O(1)
lookups and less memory overhead, as it stores each key one time. Added
boost::intrusive::list to handle LRU relations with less overhead.
Added O(1) release method, instead of O(n) with minimal memory overhead.
Implemented RAII Accessor to remove the responsibility of releasing
the objects from the user.

Wrapped cache accessor and related DiskIOManager metrics to a
FileHandleCache::Accessor. Removed Release*() call trees from
FileHandleCache and DiskIOManager, removed scoped exit from
HdfsFileReader as they are handled automatically.

Testing:

Implemented extensive unit testing of the class, including forced
rehashes, collisions, capacity overshoot, explicit/automatic release
and destroy.

Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify
FileHandleCache::Accessor behaviour through metrics.

Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables,
no visible change in performance:
TPCH   scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49%
TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00%

Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
---
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/handle-cache.h
M be/src/runtime/io/handle-cache.inline.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/util/CMakeLists.txt
A be/src/util/lru-multi-cache-test.cc
A be/src/util/lru-multi-cache.h
A be/src/util/lru-multi-cache.inline.h
9 files changed, 1,182 insertions(+), 274 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/22
--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 22
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-24 Thread Code Review
Gergely Fürnstáhl has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 21:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h
File be/src/runtime/io/handle-cache.inline.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@167
PS21, Line 167:   // Opening a file handle requires talking to the NameNode so 
it can take some time.
  :   RETURN_IF_ERROR(accessor_tmp.Get()->Init(hdfs_monitor_));
> Ok, great!
Done


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@167
PS21, Line 167:   // Opening a file handle requires talking to the NameNode so 
it can take some time.
  :   RETURN_IF_ERROR(accessor_tmp.Get()->Init(hdfs_monitor_));
> Ok, great!
Done


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h
File be/src/util/lru-multi-cache.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@170
PS21, Line 170: Container_internal& container;
  : typename Container_internal::iterator it;
> I was thinking about whether there is a way to avoid having the element kee
I've considered moving this to the accessor, but decided to keep it here to 
spare constructions every time an accessor is created.

I like the idea of having iterator instead of pointer in the accessor, it would 
make it clear it's not something to delete. However, I can't see a clear 
solution for empty Accessor yet:
  - if we want to default initialize the iterator, we would need access to the 
owning container too
  - creating a dummy empty Container in the cache is possible, but feels more 
hacky than this way

In an earlier patchset, I used a unique_ptr with custom deleter 
(cache->Release()). That's still a possibility, making the default destructor 
good enough, however FileHandleCache::Accessor calls explicit release/destroy 
anyways to handle metrics, so I decided to remove it.


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@252
PS21, Line 252: ValueType_internal* p_value_internal_;
> Consider making this an iterator to the std::list element
I did, explained it on line 171


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h
File be/src/util/lru-multi-cache.inline.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@147
PS21, Line 147:   // Move the object to the back of the owning list as it 
is no longer available
  :   container.splice(container.end(), container, 
container_it);
> Ok, that makes sense.
Done


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@175
PS21, Line 175: container.emplace_back((*this), stored_key, container, 
std::forward(args)...);
> If you wanted to get an iterator back, the regular emplace() call returns a
Thanks, a bit cleaner this way


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@201
PS21, Line 201:   // Move the object to the front, keep LRU relation in owning 
list too to
  :   // be able to age out unused objects
  :   container.splice(container.begin(), container, 
p_value_internal->it);
> Basically, once you decide to move the in_use=true to the end of the list i
Yes, exactly


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@222
PS21, Line 222:   if (container.size() == 1) {
  : // Last element, owning list can be removed to prevent aging
  : hash_table_.erase(p_value_internal->key);
  :   } else {
  : // Remove from owning list
  : container.erase(p_value_internal->it);
  :   }
> Ok, so can we add a DCHECK that this is in use? Maybe also add a comment th
Done


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@251
PS21, Line 251:   // Remove from LRU cache
  :   value_internal.member_hook.unlink();
> Nit: Can we add a DCHECK that this is not in use?
Done



--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 21
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 24 Feb 2022 13:03:02 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 22:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/18191/22/be/src/runtime/io/handle-cache.h
File be/src/runtime/io/handle-cache.h:

http://gerrit.cloudera.org:8080/#/c/18191/22/be/src/runtime/io/handle-cache.h@117
PS22, Line 117: typedef LruMultiCache, 
CachedHdfsFileHandle> CacheType;
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/18191/22/be/src/util/lru-multi-cache.inline.h
File be/src/util/lru-multi-cache.inline.h:

http://gerrit.cloudera.org:8080/#/c/18191/22/be/src/util/lru-multi-cache.inline.h@132
PS22, Line 132:
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/18191/22/be/src/util/lru-multi-cache.inline.h@165
PS22, Line 165:   auto container_it = container.emplace(container.end(), 
(*this), stored_key, container, std::forward(args)...);
line too long (118 > 90)


http://gerrit.cloudera.org:8080/#/c/18191/22/be/src/util/lru-multi-cache.inline.h@252
PS22, Line 252:  
line has trailing whitespace



--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 22
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 24 Feb 2022 13:04:08 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-23 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 21:

(8 comments)

> (12 comments)
 >
 > > > > (8 comments)
 > > > >
 > > > > Thank you for taking this on. There is a lot of history here.
 > > > > Originally, the file handle cache used a generic structure
 > like
 > > > > this:
 > > > > https://github.com/apache/impala/blob/branch-2.10.0/be/src/util/lru-cache.h
 > > > >
 > > > > In my rewrite, I switched it to remove the generic structure.
 > > > This
 > > > > heads back in the other direction.
 > > > >
 > > > > I like that you have backend tests for the generic data
 > > > structure,
 > > > > which is definitely one advantage of that approach. One
 > > question
 > > > I
 > > > > have about moving back to a generic structure is whether we
 > > would
 > > > > be able to add new customization to the file handle cache
 > case.
 > > I
 > > > > had been thinking about adding a file structure that could
 > > > contain
 > > > > additional per-file data and/or stats. Is that possible with
 > > the
 > > > > new generic structure?
 > > >
 > > > "had been thinking about adding a file structure that could
 > > contain
 > > > additional per-file data and/or stat"
 > > >
 > > > I was thinking about similar things (e.g. caching processed
 > > > Parquet/ORC headers), but this seems a somewhat different
 > feature
 > > > to me - while we want to cache more than one file handle per
 > file
 > > > and apply LRU logic per handle, we want to cache data for a
 > file
 > > > only once and apply LRU logic per file.
 > >
 > > Yeah, it's unclear whether we would ever want to extend the file
 > > handle cache
 > > to deal with other things. Separate data structures may be
 > cleaner
 > > even
 > > if it means duplicating filename strings or other things. The
 > file
 > > handle
 > > cache is pretty unusual in structure and historically we haven't
 > > extended
 > > it. I don't have any strong objection to a generic structure. I
 > > just wanted
 > > to think through whether there are any extensions that would end
 > up
 > > getting more complicated.
 > >
 > > For more ordinary caches that don't need duplication, we should
 > be
 > > using the
 > > cache implementations in be/src/util/cache, because that also
 > gets
 > > us different
 > > cache eviction policies.
 >
 > Moving the caching to a separate class definitely helps with
 > testing.
 >
 > I think making it generic helps with encapsulation too, as the
 > caching algorithm (even if it is somewhat specialized to a use
 > case) has nothing to do with the stored data. Moreover, making it
 > generic helped with unit testing too, as I didn't have to juggle
 > around file handles to test out the caching feature.
 >
 > As I understand, the requirements were simple enough to fit it in a
 > generic structure. This would help in the future to decide - if we
 > ever want to use something similar to this - whether if it's easier
 > to extend with some configurability or not. We can specialize this
 > more towards file handle caching, although the arguments for
 > templated key/value are still applicable in that case, maybe we can
 > change the name and place of the code for something less generic.
 >
 > Regarding to per-file data/stats, we could squeeze it in (e.g.
 > unordered_map> but I don't think
 > that is a good direction, it breaks the encapsulation of the cache.
 > I would rather put a container in FileHandleCache class next to the
 > cache and do handle based operations/metrics during creation
 > (GetFileHandle() new entry branch) and do access based
 > operation/metrics in FileHandleCache::Accessor

This generic design is good, and we don't need to overthink this.
My feeling is that the odds of us reusing the generic structure for
something else are pretty low. ​(We'll find out! Maybe I'm wrong.)
If you believe that, then I think an intermediate point is a
non-generic structure that allows mocking the FileHandle. In other
words, it allows for writing backend tests without creating real
file handles, but it also isn't dealing with arbitrary types.
Either way, I think if we need to extend this, then we will find
a way to make it work. I'm not concerned.

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h
File be/src/runtime/io/handle-cache.inline.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@167
PS21, Line 167:   // Opening a file handle requires talking to the NameNode so 
it can take some time.
  :   RETURN_IF_ERROR(accessor_tmp.Get()->Init(hdfs_monitor_));
> Yes, it will be in_use, does not lock the cache and not available for other
Ok, great!


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h
File be/src/util/lru-multi-cache.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/

[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-23 Thread Code Review
Gergely Fürnstáhl has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 21:

(12 comments)

> > > (8 comments)
 > > >
 > > > Thank you for taking this on. There is a lot of history here.
 > > > Originally, the file handle cache used a generic structure like
 > > > this:
 > > > https://github.com/apache/impala/blob/branch-2.10.0/be/src/util/lru-cache.h
 > > >
 > > > In my rewrite, I switched it to remove the generic structure.
 > > This
 > > > heads back in the other direction.
 > > >
 > > > I like that you have backend tests for the generic data
 > > structure,
 > > > which is definitely one advantage of that approach. One
 > question
 > > I
 > > > have about moving back to a generic structure is whether we
 > would
 > > > be able to add new customization to the file handle cache case.
 > I
 > > > had been thinking about adding a file structure that could
 > > contain
 > > > additional per-file data and/or stats. Is that possible with
 > the
 > > > new generic structure?
 > >
 > > "had been thinking about adding a file structure that could
 > contain
 > > additional per-file data and/or stat"
 > >
 > > I was thinking about similar things (e.g. caching processed
 > > Parquet/ORC headers), but this seems a somewhat different feature
 > > to me - while we want to cache more than one file handle per file
 > > and apply LRU logic per handle, we want to cache data for a file
 > > only once and apply LRU logic per file.
 >
 > Yeah, it's unclear whether we would ever want to extend the file
 > handle cache
 > to deal with other things. Separate data structures may be cleaner
 > even
 > if it means duplicating filename strings or other things. The file
 > handle
 > cache is pretty unusual in structure and historically we haven't
 > extended
 > it. I don't have any strong objection to a generic structure. I
 > just wanted
 > to think through whether there are any extensions that would end up
 > getting more complicated.
 >
 > For more ordinary caches that don't need duplication, we should be
 > using the
 > cache implementations in be/src/util/cache, because that also gets
 > us different
 > cache eviction policies.

Moving the caching to a separate class definitely helps with testing.

I think making it generic helps with encapsulation too, as the caching 
algorithm (even if it is somewhat specialized to a use case) has nothing to do 
with the stored data. Moreover, making it generic helped with unit testing too, 
as I didn't have to juggle around file handles to test out the caching feature.

As I understand, the requirements were simple enough to fit it in a generic 
structure. This would help in the future to decide - if we ever want to use 
something similar to this - whether if it's easier to extend with some 
configurability or not. We can specialize this more towards file handle 
caching, although the arguments for templated key/value are still applicable in 
that case, maybe we can change the name and place of the code for something 
less generic.

Regarding to per-file data/stats, we could squeeze it in (e.g. 
unordered_map> but I don't think that is a 
good direction, it breaks the encapsulation of the cache. I would rather put a 
container in FileHandleCache class next to the cache and do handle based 
operations/metrics during creation (GetFileHandle() new entry branch) and do 
access based operation/metrics in FileHandleCache::Accessor

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.h
File be/src/runtime/io/handle-cache.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.h@118
PS21, Line 118: return file_handle.mtime() == mtime;
> The semantics of mtime  is unclear to me - why do we handle it separately a
Very good point, I think I got stuck on earlier design :)


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h
File be/src/runtime/io/handle-cache.inline.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@167
PS21, Line 167:   // Opening a file handle requires talking to the NameNode so 
it can take some time.
  :   RETURN_IF_ERROR(accessor_tmp.Get()->Init(hdfs_monitor_));
> Let me double-check my understanding of the threading here:
Yes, it will be in_use, does not lock the cache and not available for other 
threads.


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@167
PS21, Line 167:   // Opening a file handle requires talking to the NameNode so 
it can take some time.
  :   RETURN_IF_ERROR(accessor_tmp.Get()->Init(hdfs_monitor_));
> Another question: How does this work if this call fails? Does the entry get
Very good point, now it will be released back to the cache, I will add a destroy


http://gerrit.cloudera.org:8080/#/c/18191/21

[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-22 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 21:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h
File be/src/runtime/io/handle-cache.inline.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@167
PS21, Line 167:   // Opening a file handle requires talking to the NameNode so 
it can take some time.
  :   RETURN_IF_ERROR(accessor_tmp.Get()->Init(hdfs_monitor_));
> Let me double-check my understanding of the threading here:
Another question: How does this work if this call fails? Does the entry get 
removed from the cache?



--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 21
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 22 Feb 2022 18:47:22 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-22 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 21:

> > (8 comments)
 > >
 > > Thank you for taking this on. There is a lot of history here.
 > > Originally, the file handle cache used a generic structure like
 > > this:
 > > https://github.com/apache/impala/blob/branch-2.10.0/be/src/util/lru-cache.h
 > >
 > > In my rewrite, I switched it to remove the generic structure.
 > This
 > > heads back in the other direction.
 > >
 > > I like that you have backend tests for the generic data
 > structure,
 > > which is definitely one advantage of that approach. One question
 > I
 > > have about moving back to a generic structure is whether we would
 > > be able to add new customization to the file handle cache case. I
 > > had been thinking about adding a file structure that could
 > contain
 > > additional per-file data and/or stats. Is that possible with the
 > > new generic structure?
 >
 > "had been thinking about adding a file structure that could contain
 > additional per-file data and/or stat"
 >
 > I was thinking about similar things (e.g. caching processed
 > Parquet/ORC headers), but this seems a somewhat different feature
 > to me - while we want to cache more than one file handle per file
 > and apply LRU logic per handle, we want to cache data for a file
 > only once and apply LRU logic per file.

Yeah, it's unclear whether we would ever want to extend the file handle cache
to deal with other things. Separate data structures may be cleaner even
if it means duplicating filename strings or other things. The file handle
cache is pretty unusual in structure and historically we haven't extended
it. I don't have any strong objection to a generic structure. I just wanted
to think through whether there are any extensions that would end up
getting more complicated.

For more ordinary caches that don't need duplication, we should be using the
cache implementations in be/src/util/cache, because that also gets us different
cache eviction policies.


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 21
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 22 Feb 2022 18:46:25 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-22 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 21:

> (8 comments)
 >
 > Thank you for taking this on. There is a lot of history here.
 > Originally, the file handle cache used a generic structure like
 > this:
 > https://github.com/apache/impala/blob/branch-2.10.0/be/src/util/lru-cache.h
 >
 > In my rewrite, I switched it to remove the generic structure. This
 > heads back in the other direction.
 >
 > I like that you have backend tests for the generic data structure,
 > which is definitely one advantage of that approach. One question I
 > have about moving back to a generic structure is whether we would
 > be able to add new customization to the file handle cache case. I
 > had been thinking about adding a file structure that could contain
 > additional per-file data and/or stats. Is that possible with the
 > new generic structure?

"had been thinking about adding a file structure that could contain additional 
per-file data and/or stat"

I was thinking about similar things (e.g. caching processed Parquet/ORC 
headers), but this seems a somewhat different feature to me - while we want to 
cache more than one file handle per file and apply LRU logic per handle, we 
want to cache data for a file only once and apply LRU logic per file.


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 21
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 22 Feb 2022 11:22:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-21 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 21:

(8 comments)

Thank you for taking this on. There is a lot of history here. Originally, the 
file handle cache used a generic structure like this:
https://github.com/apache/impala/blob/branch-2.10.0/be/src/util/lru-cache.h

In my rewrite, I switched it to remove the generic structure. This heads back 
in the other direction.

I like that you have backend tests for the generic data structure, which is 
definitely one advantage of that approach. One question I have about moving 
back to a generic structure is whether we would be able to add new 
customization to the file handle cache case. I had been thinking about adding a 
file structure that could contain additional per-file data and/or stats. Is 
that possible with the new generic structure?

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h
File be/src/runtime/io/handle-cache.inline.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@167
PS21, Line 167:   // Opening a file handle requires talking to the NameNode so 
it can take some time.
  :   RETURN_IF_ERROR(accessor_tmp.Get()->Init(hdfs_monitor_));
Let me double-check my understanding of the threading here:
The EmplaceAndGet adds an entry, but that entry is in_use=true. If this 
function takes a while, other threads will ignore the entry because it is in 
use.


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.inline.h@170
PS21, Line 170: it/out
Nit: in/out


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h
File be/src/util/lru-multi-cache.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@125
PS21, Line 125: resf ot
Nit: rest of


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@170
PS21, Line 170: Container_internal& container;
  : typename Container_internal::iterator it;
Did we consider using an intrusive list for this?


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h
File be/src/util/lru-multi-cache.inline.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@147
PS21, Line 147:   // Move the object to the back of the owning list as it 
is no longer available
  :   container.splice(container.end(), container, 
container_it);
Why move it to the back? It could stay where it is and Get() would skip over it 
because in_use=true (i.e. change the break above to a continue). I don't think 
we get a meaningful speedup from avoiding a couple link list hops and shuffling 
the list around.


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@197
PS21, Line 197:   p_value_internal->timestamp_seconds = MonotonicSeconds()
> I think that we didn't refresh timestamp_second in the old version, so soon
The old behavior does reset the timestamp each time we use it. The code is 
subtle, but what is happening is that each get/release cycle the entry is 
removed from the LRU list and then added back. The timestamp is set each time 
it is added to the LRU list. If something is in continuous use, we would keep 
it open.

One nice thing about the old behavior is that the time-based eviction is O(1), 
because the ordering matches the LRU list ordering. The eviction thread can 
walk from oldest to newest and then stop when it sees the first timestamp that 
isn't eligible for eviction.


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@201
PS21, Line 201:   // Move the object to the front, keep LRU relation in owning 
list too to
  :   // be able to age out unused objects
  :   container.splice(container.begin(), container, 
p_value_internal->it);
Do we need to be moving entries? I'm not sure this gets us anything.

Even if we leave the objects in the order they were created, the fact that we 
prefer the earlier entries will cause the later entries to be less frequently 
used and age out faster.


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@222
PS21, Line 222:   if (container.size() == 1) {
  : // Last element, owning list can be removed to prevent aging
  : hash_table_.erase(p_value_internal->key);
  :   } else {
  : // Remove from owning list
  : container.erase(p_value_internal->it);
  :   }
When/how does the element get removed from the LRU list? (I'm guessing it has 
something to do with auto unlink)



--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerr

[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-17 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 21:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.h
File be/src/runtime/io/handle-cache.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/runtime/io/handle-cache.h@118
PS21, Line 118: return file_handle.mtime() == mtime;
The semantics of mtime  is unclear to me - why do we handle it separately and 
not as the part of the key? My understanding is that we always do lookups with 
exact file name and mtime and they remain constant for the lifetime of the 
handle.


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h
File be/src/util/lru-multi-cache.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.h@78
PS21, Line 78: /// this is a intrusive (as known as not owning) list, used 
for eviction
nit: an


http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h
File be/src/util/lru-multi-cache.inline.h:

http://gerrit.cloudera.org:8080/#/c/18191/21/be/src/util/lru-multi-cache.inline.h@197
PS21, Line 197:   p_value_internal->timestamp_seconds = MonotonicSeconds()
I think that we didn't refresh timestamp_second in the old version, so sooner 
or later we closed handles, even if it was used again and again. This can 
influence what happens if the file was deleted - keeping a file handle always 
open means the replica on the given data node cannot be deleted.

https://www.quora.com/What-happens-if-you-attempt-to-delete-a-file-in-HDFS-while-its-being-used-for-a-job



--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 21
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 17 Feb 2022 19:03:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 21:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10159/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 21
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 15 Feb 2022 09:21:19 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-15 Thread Code Review
Gergely Fürnstáhl has uploaded a new patch set (#21). ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..

IMPALA-9433: Improved caching of HdfsFileHandles

Seperated LRU caching functionality to a templated LruMultiCache class.

Replaced std::multimap with std::unordered_map with std::list for O(1)
lookups and less memory overhead, as it stores each key one time. Added
boost::intrusive::list to handle LRU relations with less overhead.
Added O(1) release method, instead of O(n) with minimal memory overhead.
Implemented RAII Accessor to remove the responsibility of releasing
the objects from the user.

Wrapped cache accessor and related DiskIOManager metrics to a
FileHandleCache::Accessor. Removed Release*() call trees from
FileHandleCache and DiskIOManager, removed scoped exit from
HdfsFileReader as they are handled automatically.

Testing:

Implemented extensive unit testing of the class, including forced
rehashes, collisions, capacity overshoot, explicit/automatic release
and destroy.

Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify
FileHandleCache::Accessor behaviour through metrics.

Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables,
no visible change in performance:
TPCH   scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49%
TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00%

Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
---
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/handle-cache.h
M be/src/runtime/io/handle-cache.inline.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/util/CMakeLists.txt
A be/src/util/lru-multi-cache-test.cc
A be/src/util/lru-multi-cache.h
A be/src/util/lru-multi-cache.inline.h
9 files changed, 1,190 insertions(+), 274 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/21
--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 21
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 20:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10152/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 20
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 14 Feb 2022 13:03:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-14 Thread Code Review
Gergely Fürnstáhl has uploaded a new patch set (#20). ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..

IMPALA-9433: Improved caching of HdfsFileHandles

Seperated LRU caching functionality to a templated LruMultiCache class.

Replaced std::multimap with std::unordered_map with std::list for O(1)
lookups and less memory overhead, as it stores each key one time. Added
boost::intrusive::list to handle LRU relations with less overhead.
Added O(1) release method, instead of O(n) with minimal memory overhead.
Implemented RAII Accessor to remove the responsibility of releasing
the objects from the user.

Wrapped cache accessor and related DiskIOManager metrics to a
FileHandleCache::Accessor. Removed Release*() call trees from
FileHandleCache and DiskIOManager, removed scoped exit from
HdfsFileReader as they are handled automatically.

Testing:

Implemented extensive unit testing of the class, including forced
rehashes, collisions, capacity overshoot, explicit/automatic release
and destroy.

Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify
FileHandleCache::Accessor behaviour through metrics.

Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables,
no visible change in performance:
TPCH   scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49%
TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00%

Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
---
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/handle-cache.h
M be/src/runtime/io/handle-cache.inline.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/util/CMakeLists.txt
A be/src/util/lru-multi-cache-test.cc
A be/src/util/lru-multi-cache.h
A be/src/util/lru-multi-cache.inline.h
9 files changed, 1,183 insertions(+), 274 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/20
--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 20
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 19:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/10150/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 19
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 14 Feb 2022 11:47:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..


Patch Set 18:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/10149/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 18
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 14 Feb 2022 11:39:21 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9433: Improved caching of HdfsFileHandles

2022-02-14 Thread Code Review
Gergely Fürnstáhl has uploaded a new patch set (#19). ( 
http://gerrit.cloudera.org:8080/18191 )

Change subject: IMPALA-9433: Improved caching of HdfsFileHandles
..

IMPALA-9433: Improved caching of HdfsFileHandles

Seperated LRU caching functionality to a templated LruMultiCache class.

Replaced std::multimap with std::unordered_map with std::list for O(1)
lookups and less memory overhead, as it stores each key one time. Added
boost::intrusive::list to handle LRU relations with less overhead.
Added O(1) release method, instead of O(n) with minimal memory overhead.
Implemented RAII Accessor to remove the responsibility of releasing
the objects from the user.

Wrapped cache accessor and related DiskIOManager metrics to a
FileHandleCache::Accessor. Removed Release*() call trees from
FileHandleCache and DiskIOManager, removed scoped exit from
HdfsFileReader as they are handled automatically.

Testing:

Implemented extensive unit testing of the class, including forced
rehashes, collisions, capacity overshoot, explicit/automatic release
and destroy.

Ran tests/custom_cluster/test_hdfs_fd_caching.py to verify
FileHandleCache::Accessor behaviour through metrics.

Ran bin/single_node_perf_run.py with TPCH and TPC-DS on parquet tables,
no visible change in performance:
TPCH   scale=10 iterations=100: Delta(Avg)=-0.67% Delta(GeoMean)=-0.49%
TPC-DS scale=10 iterations= 50: Delta(Avg)=-0.02% Delta(GeoMean)= 0.00%

Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
---
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/handle-cache.h
M be/src/runtime/io/handle-cache.inline.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/util/CMakeLists.txt
A be/src/util/lru-multi-cache-test.cc
A be/src/util/lru-multi-cache.h
A be/src/util/lru-multi-cache.inline.h
9 files changed, 1,188 insertions(+), 274 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/91/18191/19
--
To view, visit http://gerrit.cloudera.org:8080/18191
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6b5c5e9e2b5db2847ab88c41f667c9ca1b03d51a
Gerrit-Change-Number: 18191
Gerrit-PatchSet: 19
Gerrit-Owner: Gergely Fürnstáhl 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gergely Fürnstáhl 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Zoltan Borok-Nagy