+1, the new feature looks good with table-level setup, especially when the
cache is exhausted.


On Wed, Jul 30, 2025 at 1:41 PM Wellington Chevreuil <
[email protected]> wrote:

> Attaching the ycsb runs mentioned previously:
>
>
> https://docs.google.com/document/d/1juZoND9ju4daOUkHnlcYzU3Nr7wUQGKEORRHll3KPRQ/edit?tab=t.0
>
> On Wed, Jul 30, 2025 at 17:38, Wellington Chevreuil <
> [email protected]> wrote:
>
> > Greetings everyone! As previously shared in this email
> > <https://lists.apache.org/thread/jr1cljrdct01xtqsrgp4fpb301j9h72k>, we
> > have been working on this functionality at Cloudera for some time, and as
> > we prepare to make it GA for our broader customer base, we thought it
> > could be a nice addition to the Apache HBase distribution too.
> >
> > The most relevant use case for this functionality is deploying the HBase
> > root dir on cloud object storage, such as S3, and relying on a file
> > based bucket cache for optimal performance. For datasets where records
> > have a concept of date and an access pattern based on such date values,
> > i.e., the most accessed data are those with the most recent date value,
> > time based priority can be configured so that only these recent data
> > need to be kept in the cache.
> >
> > The current Time Based Priority for BucketCache implementation allows for
> > defining an "age" threshold for blocks to be kept in the BucketCache.
> > Blocks "older" than this threshold bypass the BucketCache when read
> > (even when cacheOnRead is enabled), and blocks that are already cached
> > and then age past the threshold are picked first by eviction runs.
> >
> > It has been developed in two stages:
> > 1) Time Based Priority for BucketCache: the initial framework for
> > extracting block age and the block priority logic in BucketCache. This
> > relies on the builtin cell timestamps for determining the block age, and
> > the existing DateTieredCompaction for grouping blocks of similar age
> > within the same file. The related design doc
> > <https://docs.google.com/document/d/1Qd3kvZodBDxHTFCIRtoePgMbvyuUSxeydi2SEWQFQro/edit?tab=t.0#heading=h.gjdgxs>
> > has been shared in the parent Jira and in the discussion email mentioned
> > above.
> > 2) Custom Time Based BucketCache Priority: an enhancement over the
> > initial development, it extends DateTieredCompaction to allow custom
> > values to be used for grouping cells into separate files. Custom value
> > providers can be plugged into the framework, so that user schema specific
> > values can now be used for defining cache priority. The original cell
> > timestamp based priority has been wrapped into a builtin provider
> > implementation, and a qualifier based provider has also been defined.
> > This second phase design doc
> > <https://docs.google.com/document/d/1uBGIO9IQ-FbSrE5dnUMRtQS23NbCbAmRVDkAOADcU_E/edit?tab=t.0#heading=h.jxvnkznuj997>
> > has also been shared in the related Jira.
> >
> > The feature requires a global flag (disabled by default) to be turned on
> > in order to even perform age checks. It also requires extra configuration
> > on individual column families, as only blocks for the configured column
> > families would have the age checked. Blocks from column families not
> > defining any time based priority settings would simply be treated as high
> > priority ones and have preference to be cached.
> >
> > Our suggestion is to have this merged into the master, branch-3 and
> > branch-2 branches. We have executed some YCSB runs to compare different
> > setups for the feature (all using S3 as the root dir storage), as well as
> > a build not containing this code as a baseline on the same hardware.
> > While we see relevant impacts in the scenarios where the dataset
> > doesn't fit into the cache capacity, we see little deviation otherwise.
> >
> > Best Regards,
> > Wellington
> >
>
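
For readers following the thread, the gating and age check described in the quoted proposal can be sketched roughly as below. This is an illustrative sketch only, not the HBase implementation: the class, enum, method and parameter names are all hypothetical, and it assumes a block's age is simply the current time minus one representative timestamp for the block.

```java
import java.time.Duration;
import java.util.OptionalLong;

/** Hypothetical sketch; none of these names come from the HBase code base. */
final class TimeBasedPrioritySketch {

    /** HOT blocks are preferred for caching; COLD blocks bypass the cache and are evicted first. */
    enum Priority { HOT, COLD }

    /**
     * @param featureEnabled       global flag (disabled by default, per the proposal)
     * @param maxAgeMillis         per column family age threshold; empty when the family
     *                             defines no time based priority settings
     * @param blockTimestampMillis assumed representative timestamp for the block's data
     * @param nowMillis            current time
     */
    static Priority classify(boolean featureEnabled,
                             OptionalLong maxAgeMillis,
                             long blockTimestampMillis,
                             long nowMillis) {
        // No age check at all unless the global flag is on AND the family is configured;
        // such blocks are treated as high priority.
        if (!featureEnabled || !maxAgeMillis.isPresent()) {
            return Priority.HOT;
        }
        long ageMillis = nowMillis - blockTimestampMillis;
        return ageMillis > maxAgeMillis.getAsLong() ? Priority.COLD : Priority.HOT;
    }

    /** Cold blocks skip the cache on read, even when cache-on-read is enabled. */
    static boolean shouldCacheOnRead(Priority priority) {
        return priority == Priority.HOT;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        OptionalLong sevenDays = OptionalLong.of(Duration.ofDays(7).toMillis());
        long thirtyDaysAgo = now - Duration.ofDays(30).toMillis();

        System.out.println(classify(true, sevenDays, thirtyDaysAgo, now));            // COLD
        System.out.println(classify(true, OptionalLong.empty(), thirtyDaysAgo, now)); // HOT
        System.out.println(classify(false, sevenDays, thirtyDaysAgo, now));           // HOT
    }
}
```

The sketch leaves out eviction ordering; per the proposal, blocks that turn cold while cached would be the first candidates in an eviction run.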
