Wellington Chevreuil created HBASE-29727:
--------------------------------------------
Summary: Introduce a String pool for repeating filename, region
and cf string fields in BlockCacheKey
Key: HBASE-29727
URL: https://issues.apache.org/jira/browse/HBASE-29727
Project: HBase
Issue Type: Improvement
Reporter: Wellington Chevreuil
Assignee: Wellington Chevreuil
For every block added to BucketCache, we create and keep a BlockCacheKey object
with a String attribute for the name of the file the block belongs to, plus the
Path containing the entire path for the given file. HFiles will normally contain
many blocks, and for all blocks from the same file, these attributes will have
the very same value, yet we create different instances for each of the blocks.
When using file based bucket cache, where the bucket cache size is in the TB
magnitude, the total block count in the cache can grow very large, and so does
the heap used by the BucketCache object, due to the high count of BlockCacheKey
instances it has to keep.
For a few years now, the reference architecture at my employer for HBase
clusters on the cloud has been to deploy the HBase root dir on cloud storage,
then use the ephemeral SSD disks shipped with the RS node VMs for a file based
BucketCache. At the moment, the standard VM profile used allows for as much as
1.6TB of BucketCache capacity. For a cache of such size, with the default block
size of 64KB, we see on average 30M blocks, with a minimal heap usage around
12GB.
With cloud providers now offering different VM profiles with more ephemeral SSD
disk capacity, we are looking for alternatives to optimise the heap usage by
BucketCache. The approach proposed here is to define a "string pool" for
mapping the String attributes in the BlockCacheKey class to integer ids, so
that we can save some bytes for blocks from the same file.
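
A minimal sketch of the string pool idea follows. It is an illustration under
stated assumptions, not the actual patch: the names StringPoolSketch, PooledKey,
encode and decode are hypothetical and do not exist in HBase, and only the file
name is pooled here (region and cf could be handled the same way). Each distinct
string is stored once and keys hold a 4-byte int id in place of a String
reference.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a string pool; not the HBASE-29727 implementation. */
final class StringPoolSketch {
  // Lookup from string value to its assigned id.
  private final Map<String, Integer> idsByValue = new HashMap<>();
  // Reverse lookup: the list index is the id.
  private final List<String> valuesById = new ArrayList<>();

  /** Returns the id for value, assigning the next free id the first time it is seen. */
  synchronized int encode(String value) {
    Integer id = idsByValue.get(value);
    if (id == null) {
      id = valuesById.size();
      valuesById.add(value);
      idsByValue.put(value, id);
    }
    return id;
  }

  /** Returns the string previously registered under id. */
  synchronized String decode(int id) {
    return valuesById.get(id);
  }

  /** Number of distinct strings currently held by the pool. */
  synchronized int size() {
    return valuesById.size();
  }

  /** Hypothetical stand-in for a cache key that stores an id, not a String. */
  static final class PooledKey {
    private final int fileNameId;
    private final long offset;

    PooledKey(int fileNameId, long offset) {
      this.fileNameId = fileNameId;
      this.offset = offset;
    }

    /** Resolves the file name through the pool that issued the id. */
    String fileName(StringPoolSketch pool) {
      return pool.decode(fileNameId);
    }

    // Comparing ids is only meaningful for keys built from the same pool.
    @Override
    public boolean equals(Object o) {
      if (!(o instanceof PooledKey)) {
        return false;
      }
      PooledKey other = (PooledKey) o;
      return fileNameId == other.fileNameId && offset == other.offset;
    }

    @Override
    public int hashCode() {
      return 31 * fileNameId + Long.hashCode(offset);
    }
  }

  public static void main(String[] args) {
    StringPoolSketch pool = new StringPoolSketch();
    // Two blocks of the same (made-up) file share one pooled string.
    PooledKey first = new PooledKey(pool.encode("example-hfile"), 0L);
    PooledKey second = new PooledKey(pool.encode("example-hfile"), 65536L);
    System.out.println(pool.size());
    System.out.println(first.fileName(pool).equals(second.fileName(pool)));
  }
}
```

This sketch never releases ids, so the pool grows with every distinct file name
it sees. A real implementation would likely need some way to drop entries once
no cached block refers to a file anymore (for example, reference counting),
which is left out here.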
--
This message was sent by Atlassian Jira
(v8.20.10#820010)