(trafficserver) branch master updated: Add S3-FIFO RAM cache eviction algorithm (ram_cache.algorithm = 2) (#13255)

masaori Tue, 23 Jun 2026 15:37:58 -0700

This is an automated email from the ASF dual-hosted git repository.

masaori335 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/trafficserver.git



The following commit(s) were added to refs/heads/master by this push:
     new edc35b31ca Add S3-FIFO RAM cache eviction algorithm 
(ram_cache.algorithm = 2) (#13255)
edc35b31ca is described below

commit edc35b31ca3888483f0475d4139c8a6917ebd9fa
Author: Phong Nguyen <[email protected]>
AuthorDate: Tue Jun 23 15:36:48 2026 -0700

    Add S3-FIFO RAM cache eviction algorithm (ram_cache.algorithm = 2) (#13255)
    
    * Add S3-FIFO RAM cache eviction algorithm (ram_cache.algorithm = 2)
    
    S3-FIFO (Yang et al., SOSP 2023) is a FIFO-based eviction policy: a
    small admission queue and a main queue (a 2-bit clock), plus a ghost
    queue of recently evicted keys. The small queue and ghost filter
    one-hit-wonders, giving scan resistance and strong hit rates on CDN
    and key-value workloads at low cost -- a hit needs no list reordering.
    
    Selectable as ram_cache.algorithm = 2 alongside CLFUS (0) and LRU (1).
    The policy is byte-budgeted; its eviction metadata (the ghost included,
    bounded by object size and an entry-count cap) is accounted within
    ram_cache.size so total memory stays within the configured budget. Like
    LRU and CLFUS it enforces one resident copy per key (a put with a new
    aux key discards the stale one) and allocates entries from a per-thread
    ProxyAllocator (Thread.h).
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
    
    * Fix ghost capacity calculation truncation order
    
    Co-authored-by: Copilot Autofix powered by AI 
<[email protected]>
    
    * Make S3-FIFO RAM cache tunables configurable via records.yaml
    
    The S3-FIFO queue split, ghost bounds, and promotion threshold were
    compile-time constants. Expose them as
    proxy.config.cache.ram_cache.s3fifo.{main_percent,ghost_size_percent,
    ghost_mem_percent,promote_threshold} so operators can tune the policy
    without a rebuild; the defaults match the paper and the prior behavior.
    
    Each setting carries a RECC_INT range in RecordsConfig.cc, so an
    out-of-range value is rejected at config load with a warning and the
    documented default is used in its place -- the same guard every other
    RAM cache record relies on, rather than a bespoke clamp. The admin
    guide and the ram_cache regression test (a non-default "tuned" pass)
    are updated to cover the new settings.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
    
    * Add entry for S3-FIFO into NOTICE
    
    ---------
    
    Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
    Co-authored-by: Copilot Autofix powered by AI 
<[email protected]>
---
 NOTICE                                    |   9 +
 doc/admin-guide/files/records.yaml.en.rst |  63 ++++-
 doc/admin-guide/storage/index.en.rst      |  14 +-
 include/iocore/cache/Cache.h              |   5 +-
 include/iocore/eventsystem/Thread.h       |   1 +
 src/iocore/cache/CMakeLists.txt           |   1 +
 src/iocore/cache/Cache.cc                 |  70 +++--
 src/iocore/cache/CacheProcessor.cc        |   7 +
 src/iocore/cache/CacheTest.cc             |  28 +-
 src/iocore/cache/P_CacheInternal.h        |   4 +
 src/iocore/cache/P_RamCache.h             |   1 +
 src/iocore/cache/RamCacheS3FIFO.cc        | 437 ++++++++++++++++++++++++++++++
 src/records/RecordsConfig.cc              |  11 +-
 13 files changed, 609 insertions(+), 42 deletions(-)

diff --git a/NOTICE b/NOTICE
index 31787fecbf..ad0c5f11b5 100644
--- a/NOTICE
+++ b/NOTICE
@@ -118,3 +118,12 @@ LS-HPACK provides functionality to encode and decode HTTP 
headers using
 HPACK compression mechanism specified in RFC 7541.
 Copyright (c) 2018 - 2023 LiteSpeed Technologies Inc, (MIT License)
 https://github.com/litespeedtech/ls-hpack.git
+
+~~
+
+S3-FIFO: A simple, scalable FIFO-based algorithm with three static queues
+Copyright 2023, Carnegie Mellon University (Apache-2.0)
+
+Website: https://s3fifo.com/
+Paper: http://dx.doi.org/10.1145/3600006.3613147
+Reference implementation: https://github.com/Thesys-lab/sosp23-s3fifo
diff --git a/doc/admin-guide/files/records.yaml.en.rst 
b/doc/admin-guide/files/records.yaml.en.rst
index 8d9debca24..a554c554a2 100644
--- a/doc/admin-guide/files/records.yaml.en.rst
+++ b/doc/admin-guide/files/records.yaml.en.rst
@@ -2905,10 +2905,65 @@ RAM Cache
 
 .. ts:cv:: CONFIG proxy.config.cache.ram_cache.algorithm INT 1
 
-   Two distinct RAM caches are supported, the default (1) being the simpler
-   **LRU** (*Least Recently Used*) cache. As an alternative, the **CLFUS**
-   (*Clocked Least Frequently Used by Size*) is also available, by changing 
this
-   configuration to 0.
+   Three RAM cache eviction algorithms are supported, selected by this value:
+
+   ``1``
+       **LRU** (*Least Recently Used*), the default -- the simplest policy,
+       favoring recency. Pairs with
+       :ts:cv:`proxy.config.cache.ram_cache.use_seen_filter` for scan
+       resistance.
+
+   ``0``
+       **CLFUS** (*Clocked Least Frequently Used by Size*), which balances
+       recency, frequency, and object size. It is the only algorithm that
+       supports in-RAM compression
+       (:ts:cv:`proxy.config.cache.ram_cache.compress`).
+
+   ``2``
+       **S3-FIFO** (*Simple Scalable Static FIFO*): a small admission queue and
+       a main queue (both FIFO), plus a ghost queue of recently evicted keys,
+       which together filter one-hit-wonders. Scan-resistant and inexpensive
+       (no per-hit reordering); strong hit rates on CDN and key-value
+       workloads. Its eviction metadata (including the ghost) is accounted
+       within :ts:cv:`proxy.config.cache.ram_cache.size`. Experimental; it does
+       not use the seen filter or support in-RAM compression. Its queue split,
+       ghost bounds, and promotion threshold can be tuned with the
+       ``proxy.config.cache.ram_cache.s3fifo.*`` settings below; the defaults
+       follow the original paper and suit most workloads.
+
+.. ts:cv:: CONFIG proxy.config.cache.ram_cache.s3fifo.main_percent INT 90
+
+   Only applies when :ts:cv:`proxy.config.cache.ram_cache.algorithm` is ``2``
+   (S3-FIFO). The target size of the main queue as a percentage of the resident
+   budget; the remainder is the small admission queue. The default ``90`` gives
+   the ~10% small / ~90% main split from the paper. Valid range is ``1`` to
+   ``99`` (an out-of-range value is rejected with a warning and the default is
+   used); a larger value grows the main queue at the expense of the admission
+   queue.
+
+.. ts:cv:: CONFIG proxy.config.cache.ram_cache.s3fifo.ghost_size_percent INT 90
+
+   Only applies when :ts:cv:`proxy.config.cache.ram_cache.algorithm` is ``2``
+   (S3-FIFO). The ghost queue remembers the keys of recently evicted objects 
for
+   up to this percentage of :ts:cv:`proxy.config.cache.ram_cache.size` worth of
+   object bytes. Valid range is ``0`` to ``100``; ``0`` disables this bound.
+
+.. ts:cv:: CONFIG proxy.config.cache.ram_cache.s3fifo.ghost_mem_percent INT 25
+
+   Only applies when :ts:cv:`proxy.config.cache.ram_cache.algorithm` is ``2``
+   (S3-FIFO). Caps the memory the ghost queue's per-key metadata may consume at
+   this percentage of :ts:cv:`proxy.config.cache.ram_cache.size`. This metadata
+   is counted against the configured RAM cache size, so total memory stays
+   within the budget regardless of object cardinality. Valid range is ``0`` to
+   ``100``; ``0`` disables the ghost queue.
+
+.. ts:cv:: CONFIG proxy.config.cache.ram_cache.s3fifo.promote_threshold INT 2
+
+   Only applies when :ts:cv:`proxy.config.cache.ram_cache.algorithm` is ``2``
+   (S3-FIFO). The number of times an object in the small admission queue must 
be
+   reused before it is promoted to the main queue instead of being demoted to
+   the ghost. The default ``2`` admits objects seen at least twice. Valid range
+   is ``1`` to ``3``; a higher value makes admission to the main queue 
stricter.
 
 .. ts:cv:: CONFIG proxy.config.cache.ram_cache.use_seen_filter INT 1
 
diff --git a/doc/admin-guide/storage/index.en.rst 
b/doc/admin-guide/storage/index.en.rst
index f68808b101..aa3754df5e 100644
--- a/doc/admin-guide/storage/index.en.rst
+++ b/doc/admin-guide/storage/index.en.rst
@@ -75,18 +75,22 @@ and reduces load on disks, especially during temporary 
traffic peaks.
 You can configure the RAM cache size to suit your needs, as described in
 :ref:`changing-the-size-of-the-ram-cache` below.
 
-The RAM cache supports two cache eviction algorithms, a regular *LRU*
-(Least Recently Used) and the more advanced *CLFUS* (Clocked Least
+The RAM cache supports three cache eviction algorithms: a regular *LRU*
+(Least Recently Used); the more advanced *CLFUS* (Clocked Least
 Frequently Used by Size; which balances recentness, frequency, and size
-to maximize hit rate, similar to a most frequently used algorithm).
-The default is to use *LRU*, and this is controlled via
+to maximize hit rate, similar to a most frequently used algorithm); and
+*S3-FIFO* (Simple Scalable Static FIFO), a FIFO-based policy whose small
+admission queue and ghost list filter one-hit-wonders, giving strong hit
+rates on CDN and key-value workloads at low cost. The default is to use
+*LRU*, and this is controlled via
 :ts:cv:`proxy.config.cache.ram_cache.algorithm`.
 
 Both the *LRU* and *CLFUS* RAM caches support a configuration to increase
 scan resistance. In a typical *LRU*, if you request all possible objects in
 sequence, you will effectively churn the cache on every request. The option
 :ts:cv:`proxy.config.cache.ram_cache.use_seen_filter` can be set to add some
-resistance against this problem.
+resistance against this problem. *S3-FIFO* is scan-resistant by design,
+through its admission queue, and does not use the seen filter.
 
 In addition, *CLFUS* also supports compressing in the RAM cache itself.
 This can be useful for content which is not compressed by itself (e.g.
diff --git a/include/iocore/cache/Cache.h b/include/iocore/cache/Cache.h
index 3a7da523b5..9a0fe5d430 100644
--- a/include/iocore/cache/Cache.h
+++ b/include/iocore/cache/Cache.h
@@ -36,8 +36,9 @@ static constexpr ts::ModuleVersion CACHE_MODULE_VERSION(1, 0);
 
 #define SCAN_KB_PER_SECOND 8192 // 1TB/8MB = 131072 = 36 HOURS to scan a TB
 
-#define RAM_CACHE_ALGORITHM_CLFUS 0
-#define RAM_CACHE_ALGORITHM_LRU   1
+#define RAM_CACHE_ALGORITHM_CLFUS  0
+#define RAM_CACHE_ALGORITHM_LRU    1
+#define RAM_CACHE_ALGORITHM_S3FIFO 2
 
 #define CACHE_COMPRESSION_NONE    0
 #define CACHE_COMPRESSION_FASTLZ  1
diff --git a/include/iocore/eventsystem/Thread.h 
b/include/iocore/eventsystem/Thread.h
index 0a34dd633c..435762e71e 100644
--- a/include/iocore/eventsystem/Thread.h
+++ b/include/iocore/eventsystem/Thread.h
@@ -134,6 +134,7 @@ public:
   ProxyAllocator openDirEntryAllocator;
   ProxyAllocator ramCacheCLFUSEntryAllocator;
   ProxyAllocator ramCacheLRUEntryAllocator;
+  ProxyAllocator ramCacheS3FIFOEntryAllocator;
   ProxyAllocator evacuationBlockAllocator;
   ProxyAllocator ioDataAllocator;
   ProxyAllocator ioAllocator;
diff --git a/src/iocore/cache/CMakeLists.txt b/src/iocore/cache/CMakeLists.txt
index f8b3817f72..f8a252b430 100644
--- a/src/iocore/cache/CMakeLists.txt
+++ b/src/iocore/cache/CMakeLists.txt
@@ -33,6 +33,7 @@ add_library(
   PreservationTable.cc
   RamCacheCLFUS.cc
   RamCacheLRU.cc
+  RamCacheS3FIFO.cc
   Store.cc
   Stripe.cc
   StripeSM.cc
diff --git a/src/iocore/cache/Cache.cc b/src/iocore/cache/Cache.cc
index 69692d890d..d5dcf3df76 100644
--- a/src/iocore/cache/Cache.cc
+++ b/src/iocore/cache/Cache.cc
@@ -56,35 +56,39 @@ constexpr ts::VersionNumber 
CACHE_DB_VERSION(CACHE_DB_MAJOR_VERSION, CACHE_DB_MI
 
 // Configuration
 
-int64_t cache_config_ram_cache_size                = AUTO_SIZE_RAM_CACHE;
-int     cache_config_ram_cache_algorithm           = 1;
-int     cache_config_ram_cache_compress            = 0;
-int     cache_config_ram_cache_compress_percent    = 90;
-int     cache_config_ram_cache_use_seen_filter     = 1;
-int     cache_config_http_max_alts                 = 3;
-int     cache_config_log_alternate_eviction        = 0;
-int     cache_config_dir_sync_frequency            = 60;
-int     cache_config_dir_sync_delay                = 500;
-int     cache_config_dir_sync_max_write            = (2 * 1024 * 1024);
-int     cache_config_dir_sync_parallel_tasks       = 1;
-int     cache_config_permit_pinning                = 0;
-int     cache_config_select_alternate              = 1;
-int     cache_config_max_doc_size                  = 0;
-int     cache_config_min_average_object_size       = ESTIMATED_OBJECT_SIZE;
-int64_t cache_config_ram_cache_cutoff              = AGG_SIZE;
-int     cache_config_max_disk_errors               = 5;
-int     cache_config_hit_evacuate_percent          = 10;
-int     cache_config_hit_evacuate_size_limit       = 0;
-int     cache_config_force_sector_size             = 0;
-int     cache_config_target_fragment_size          = 
DEFAULT_TARGET_FRAGMENT_SIZE;
-int     cache_config_agg_write_backlog             = AGG_SIZE * 2;
-int     cache_config_enable_checksum               = 0;
-int     cache_config_alt_rewrite_max_size          = 4096;
-int     cache_config_read_while_writer             = 0;
-int     cache_config_mutex_retry_delay             = 2;
-int     cache_read_while_writer_retry_delay        = 50;
-int     cache_config_read_while_writer_max_retries = 10;
-int     cache_config_persist_bad_disks             = false;
+int64_t cache_config_ram_cache_size                      = AUTO_SIZE_RAM_CACHE;
+int     cache_config_ram_cache_algorithm                 = 1;
+int     cache_config_ram_cache_compress                  = 0;
+int     cache_config_ram_cache_compress_percent          = 90;
+int     cache_config_ram_cache_use_seen_filter           = 1;
+int     cache_config_ram_cache_s3fifo_main_percent       = 90;
+int     cache_config_ram_cache_s3fifo_ghost_size_percent = 90;
+int     cache_config_ram_cache_s3fifo_ghost_mem_percent  = 25;
+int     cache_config_ram_cache_s3fifo_promote_threshold  = 2;
+int     cache_config_http_max_alts                       = 3;
+int     cache_config_log_alternate_eviction              = 0;
+int     cache_config_dir_sync_frequency                  = 60;
+int     cache_config_dir_sync_delay                      = 500;
+int     cache_config_dir_sync_max_write                  = (2 * 1024 * 1024);
+int     cache_config_dir_sync_parallel_tasks             = 1;
+int     cache_config_permit_pinning                      = 0;
+int     cache_config_select_alternate                    = 1;
+int     cache_config_max_doc_size                        = 0;
+int     cache_config_min_average_object_size             = 
ESTIMATED_OBJECT_SIZE;
+int64_t cache_config_ram_cache_cutoff                    = AGG_SIZE;
+int     cache_config_max_disk_errors                     = 5;
+int     cache_config_hit_evacuate_percent                = 10;
+int     cache_config_hit_evacuate_size_limit             = 0;
+int     cache_config_force_sector_size                   = 0;
+int     cache_config_target_fragment_size                = 
DEFAULT_TARGET_FRAGMENT_SIZE;
+int     cache_config_agg_write_backlog                   = AGG_SIZE * 2;
+int     cache_config_enable_checksum                     = 0;
+int     cache_config_alt_rewrite_max_size                = 4096;
+int     cache_config_read_while_writer                   = 0;
+int     cache_config_mutex_retry_delay                   = 2;
+int     cache_read_while_writer_retry_delay              = 50;
+int     cache_config_read_while_writer_max_retries       = 10;
+int     cache_config_persist_bad_disks                   = false;
 
 // Globals
 
@@ -865,6 +869,14 @@ ink_cache_init(ts::ModuleVersion v)
   RecEstablishStaticConfigInt32(cache_config_ram_cache_compress_percent, 
"proxy.config.cache.ram_cache.compress_percent");
   cache_config_ram_cache_use_seen_filter = 
RecGetRecordInt("proxy.config.cache.ram_cache.use_seen_filter").value_or(0);
 
+  RecEstablishStaticConfigInt32(cache_config_ram_cache_s3fifo_main_percent, 
"proxy.config.cache.ram_cache.s3fifo.main_percent");
+  
RecEstablishStaticConfigInt32(cache_config_ram_cache_s3fifo_ghost_size_percent,
+                                
"proxy.config.cache.ram_cache.s3fifo.ghost_size_percent");
+  
RecEstablishStaticConfigInt32(cache_config_ram_cache_s3fifo_ghost_mem_percent,
+                                
"proxy.config.cache.ram_cache.s3fifo.ghost_mem_percent");
+  
RecEstablishStaticConfigInt32(cache_config_ram_cache_s3fifo_promote_threshold,
+                                
"proxy.config.cache.ram_cache.s3fifo.promote_threshold");
+
   RecEstablishStaticConfigInt32(cache_config_http_max_alts, 
"proxy.config.cache.limits.http.max_alts");
   Dbg(dbg_ctl_cache_init, "proxy.config.cache.limits.http.max_alts = %d", 
cache_config_http_max_alts);
 
diff --git a/src/iocore/cache/CacheProcessor.cc 
b/src/iocore/cache/CacheProcessor.cc
index b2e874b8dc..1fa158d988 100644
--- a/src/iocore/cache/CacheProcessor.cc
+++ b/src/iocore/cache/CacheProcessor.cc
@@ -1501,6 +1501,10 @@ CacheProcessor::cacheInitialized()
 
     if (gnstripes) {
       // new ram_caches, with algorithm from the config
+      static const char *const ram_cache_algorithm_name[] = {"CLFUS", "LRU", 
"S3-FIFO"};
+      int const                ram_alg                    = 
cache_config_ram_cache_algorithm;
+      Dbg(dbg_ctl_cache_init, "ram_cache algorithm = %d (%s)", ram_alg,
+          (ram_alg >= 0 && ram_alg <= RAM_CACHE_ALGORITHM_S3FIFO) ? 
ram_cache_algorithm_name[ram_alg] : "unknown");
       for (int i = 0; i < gnstripes; i++) {
         switch (cache_config_ram_cache_algorithm) {
         default:
@@ -1510,6 +1514,9 @@ CacheProcessor::cacheInitialized()
         case RAM_CACHE_ALGORITHM_LRU:
           gstripes[i]->ram_cache = new_RamCacheLRU();
           break;
+        case RAM_CACHE_ALGORITHM_S3FIFO:
+          gstripes[i]->ram_cache = new_RamCacheS3FIFO();
+          break;
         }
       }
 
diff --git a/src/iocore/cache/CacheTest.cc b/src/iocore/cache/CacheTest.cc
index cd86df6051..3d39abe8cc 100644
--- a/src/iocore/cache/CacheTest.cc
+++ b/src/iocore/cache/CacheTest.cc
@@ -679,8 +679,34 @@ REGRESSION_TEST(ram_cache)(RegressionTest *t, int level, 
int *pstatus)
   for (int s = 20; s <= 24; s += 4) {
     int64_t cache_size = 1LL << s;
     *pstatus           = REGRESSION_TEST_PASSED;
-    if (!test_RamCache(t, new_RamCacheLRU(), "LRU", cache_size) || 
!test_RamCache(t, new_RamCacheCLFUS(), "CLFUS", cache_size)) {
+    if (!test_RamCache(t, new_RamCacheLRU(), "LRU", cache_size) || 
!test_RamCache(t, new_RamCacheCLFUS(), "CLFUS", cache_size) ||
+        !test_RamCache(t, new_RamCacheS3FIFO(), "S3-FIFO", cache_size)) {
       *pstatus = REGRESSION_TEST_FAILED;
     }
   }
+
+  // Exercise the S3-FIFO tunables with valid non-default values: the policy 
must still pass the
+  // hit-rate floor and the size invariant, proving the records -> init() 
config plumbing is wired
+  // and that a non-default queue split, ghost bound, and promotion threshold 
remain correct.
+  // (Out-of-range values are rejected by the records.yaml RECC_INT 
validation, not here.)
+  {
+    int const     saved_main    = cache_config_ram_cache_s3fifo_main_percent;
+    int const     saved_gsize   = 
cache_config_ram_cache_s3fifo_ghost_size_percent;
+    int const     saved_gmem    = 
cache_config_ram_cache_s3fifo_ghost_mem_percent;
+    int const     saved_promote = 
cache_config_ram_cache_s3fifo_promote_threshold;
+    int64_t const cache_size    = 1LL << 24;
+
+    cache_config_ram_cache_s3fifo_main_percent       = 80; // valid, 
non-default
+    cache_config_ram_cache_s3fifo_ghost_size_percent = 50;
+    cache_config_ram_cache_s3fifo_ghost_mem_percent  = 15;
+    cache_config_ram_cache_s3fifo_promote_threshold  = 1;
+    if (!test_RamCache(t, new_RamCacheS3FIFO(), "S3-FIFO tuned", cache_size)) {
+      *pstatus = REGRESSION_TEST_FAILED;
+    }
+
+    cache_config_ram_cache_s3fifo_main_percent       = saved_main;
+    cache_config_ram_cache_s3fifo_ghost_size_percent = saved_gsize;
+    cache_config_ram_cache_s3fifo_ghost_mem_percent  = saved_gmem;
+    cache_config_ram_cache_s3fifo_promote_threshold  = saved_promote;
+  }
 }
diff --git a/src/iocore/cache/P_CacheInternal.h 
b/src/iocore/cache/P_CacheInternal.h
index 08e77291bb..c030fe211d 100644
--- a/src/iocore/cache/P_CacheInternal.h
+++ b/src/iocore/cache/P_CacheInternal.h
@@ -130,6 +130,10 @@ extern int cache_config_agg_write_backlog;
 extern int cache_config_ram_cache_compress;
 extern int cache_config_ram_cache_compress_percent;
 extern int cache_config_ram_cache_use_seen_filter;
+extern int cache_config_ram_cache_s3fifo_main_percent;
+extern int cache_config_ram_cache_s3fifo_ghost_size_percent;
+extern int cache_config_ram_cache_s3fifo_ghost_mem_percent;
+extern int cache_config_ram_cache_s3fifo_promote_threshold;
 extern int cache_config_hit_evacuate_percent;
 extern int cache_config_hit_evacuate_size_limit;
 extern int cache_config_force_sector_size;
diff --git a/src/iocore/cache/P_RamCache.h b/src/iocore/cache/P_RamCache.h
index b4833e6d96..1afe103751 100644
--- a/src/iocore/cache/P_RamCache.h
+++ b/src/iocore/cache/P_RamCache.h
@@ -45,3 +45,4 @@ public:
 
 RamCache *new_RamCacheLRU();
 RamCache *new_RamCacheCLFUS();
+RamCache *new_RamCacheS3FIFO();
diff --git a/src/iocore/cache/RamCacheS3FIFO.cc 
b/src/iocore/cache/RamCacheS3FIFO.cc
new file mode 100644
index 0000000000..7471165107
--- /dev/null
+++ b/src/iocore/cache/RamCacheS3FIFO.cc
@@ -0,0 +1,437 @@
+/** @file
+
+  A brief file description
+
+  @section license License
+
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+ */
+
+// S3-FIFO RAM cache replacement policy.
+//
+// The design follows Yang, Zhang, Qiu, Yue & Rashmi, "FIFO queues are all you 
need for cache
+// eviction" (SOSP 2023) and https://s3fifo.com, mirroring the libCacheSim 
reference (S3FIFO.c).
+//
+// EXPERIMENTAL: three FIFO queues -- a small admission queue S (~10% of 
capacity), a main queue M
+// (~90%, a 2-bit CLOCK), and a ghost queue G that remembers the keys of 
recently evicted objects
+// (no data). New objects enter S; on eviction from S an object that was 
reused (frequency >= 2) is
+// promoted to M, otherwise it is demoted to G. A subsequent miss whose key is 
in G enters M
+// directly. The small queue + ghost give S3-FIFO admission control that 
filters one-hit-wonders --
+// exactly the CDN "quick demotion" property -- at FIFO cost (no per-hit 
reordering). The ghost
+// holds keys, so it costs per-entry metadata at high object cardinality; that 
metadata is bounded
+// and counted against ram_cache.size so total memory stays within the 
configured budget. Selectable
+// as proxy.config.cache.ram_cache.algorithm = 2; the queue split, ghost 
bounds, and promotion
+// threshold are tunable via proxy.config.cache.ram_cache.s3fifo.* (see init() 
and the admin guide,
+// records.yaml.en.rst).
+
+#include "P_RamCache.h"
+#include "P_CacheInternal.h"
+#include "StripeSM.h"
+#include "iocore/eventsystem/IOBuffer.h"
+#include "tscore/CryptoHash.h"
+#include "tscore/List.h"
+
+#define ENTRY_OVERHEAD 128 // per-entry metadata counted against ram_cache.size
+#define FREQ_MAX       3   // 2-bit saturating frequency counter; keep 
ram_cache.s3fifo.promote_threshold's range in sync
+
+namespace
+{
+DbgCtl dbg_ctl_ram_cache{"ram_cache"};
+} // namespace
+
+enum { SEG_SMALL = 0, SEG_MAIN = 1, SEG_GHOST = 2 };
+
+struct RamCacheS3FIFOEntry {
+  CryptoHash key;
+  uint64_t   auxkey;
+  uint32_t   size; // object bytes; resident entries account size + 
ENTRY_OVERHEAD against the budget
+  uint8_t    seg;  // SEG_SMALL / SEG_MAIN / SEG_GHOST
+  uint8_t    freq; // 0..FREQ_MAX
+  LINK(RamCacheS3FIFOEntry, lru_link);
+  LINK(RamCacheS3FIFOEntry, hash_link);
+  Ptr<IOBufferData> data; // null for ghost entries
+};
+
+struct RamCacheS3FIFO : public RamCache {
+  int     get(CryptoHash *key, Ptr<IOBufferData> *ret_data, uint64_t auxkey = 
0) override;
+  int     put(CryptoHash *key, IOBufferData *data, uint32_t len, bool copy = 
false, uint64_t auxkey = 0) override;
+  int     fixup(const CryptoHash *key, uint64_t old_auxkey, uint64_t 
new_auxkey) override;
+  int64_t size() const override;
+  void    init(int64_t max_bytes, StripeSM *stripe) override;
+
+private:
+  int     _main_percent      = 90; // main queue target size, percent of the 
resident budget (configurable)
+  int     _promote_threshold = 2;  // reuse count in the small queue that 
promotes an object to main (configurable)
+  int64_t _max_bytes         = 0;
+  int64_t _ghost_size_limit  = 0; // ghost object-size bound (keeps it 
proportional for big objects)
+  int64_t _ghost_max         = 0; // ghost entry-count bound (caps metadata to 
ghost_mem_percent)
+  int64_t _s_bytes           = 0; // resident bytes in the small queue (incl. 
ENTRY_OVERHEAD per entry)
+  int64_t _m_bytes           = 0; // resident bytes in the main queue
+  int64_t _g_bytes           = 0; // ghost object-size sum (vs 
_ghost_size_limit)
+  int64_t _g_count           = 0; // ghost entries; each costs 
~ENTRY_OVERHEAD, counted in the budget
+  int64_t _objects           = 0; // resident objects (S + M)
+  int64_t _nentries          = 0; // hash entries (resident + ghost), for 
sizing the hash table
+
+  Que(RamCacheS3FIFOEntry, lru_link) _seg[3]; // head = oldest, tail = newest, 
per segment
+  DList(RamCacheS3FIFOEntry, hash_link) *_bucket = nullptr;
+  int       _nbuckets                            = 0;
+  int       _ibuckets                            = 0;
+  StripeSM *_stripe                              = nullptr;
+
+  void                 _resize_hashtable();
+  RamCacheS3FIFOEntry *_lookup(const CryptoHash *key, uint64_t auxkey);
+  void                 _unlink_hash(RamCacheS3FIFOEntry *e);
+  void                 _remove(RamCacheS3FIFOEntry *e);
+  void                 _to_main(RamCacheS3FIFOEntry *e);
+  void                 _to_ghost(RamCacheS3FIFOEntry *e);
+  bool                 _evict_ghost_one();
+  void                 _enforce_ghost();
+  void                 _evict_small();
+  void                 _evict_main();
+  void                 _evict();
+};
+
+ClassAllocator<RamCacheS3FIFOEntry, false> 
ramCacheS3FIFOEntryAllocator("RamCacheS3FIFOEntry");
+
+static const int bucket_sizes[] = {8191,    16381,   32749,    65521,    
131071,   262139,    524287,    1048573,   2097143,
+                                   4194301, 8388593, 16777213, 33554393, 
67108859, 134217689, 268435399, 536870909, 1073741827};
+
+void
+RamCacheS3FIFO::_resize_hashtable()
+{
+  ink_release_assert(_ibuckets < static_cast<int>(sizeof(bucket_sizes) / 
sizeof(bucket_sizes[0])));
+  int     anbuckets = bucket_sizes[_ibuckets];
+  int64_t s         = anbuckets * sizeof(DList(RamCacheS3FIFOEntry, 
hash_link));
+
+  DList(RamCacheS3FIFOEntry, hash_link) *new_bucket = 
static_cast<DList(RamCacheS3FIFOEntry, hash_link) *>(ats_malloc(s));
+  memset(static_cast<void *>(new_bucket), 0, s);
+  if (_bucket) {
+    for (int64_t i = 0; i < _nbuckets; i++) {
+      RamCacheS3FIFOEntry *e = nullptr;
+      while ((e = _bucket[i].pop())) {
+        new_bucket[e->key.slice32(3) % anbuckets].push(e);
+      }
+    }
+    ats_free(_bucket);
+  }
+  _bucket   = new_bucket;
+  _nbuckets = anbuckets;
+}
+
+void
+RamCacheS3FIFO::init(int64_t abytes, StripeSM *astripe)
+{
+  _stripe    = astripe;
+  _max_bytes = abytes;
+  if (!_max_bytes) {
+    return;
+  }
+
+  // The tunables are range-validated in records.yaml (RECC_INT): an 
out-of-range value is rejected
+  // there with a warning and the documented default is used in its place, so 
the values read here
+  // are already within their safe ranges -- no clamping needed.
+  _main_percent          = cache_config_ram_cache_s3fifo_main_percent;
+  _promote_threshold     = cache_config_ram_cache_s3fifo_promote_threshold;
+  int ghost_size_percent = cache_config_ram_cache_s3fifo_ghost_size_percent;
+  int ghost_mem_percent  = cache_config_ram_cache_s3fifo_ghost_mem_percent;
+
+  _ghost_size_limit = (_max_bytes * ghost_size_percent) / 100;
+  // The ghost stores keys only, but each key still costs ~ENTRY_OVERHEAD of 
real memory. Bound the
+  // ghost by both its object-size sum (keeps it proportional for large 
objects) and an entry count
+  // that caps its metadata at ghost_mem_percent of the configured size; the 
metadata is counted
+  // against the budget (see put) so total memory never exceeds ram_cache.size.
+  _ghost_max = ((_max_bytes * ghost_mem_percent) / 100) / ENTRY_OVERHEAD;
+
+  Dbg(dbg_ctl_ram_cache,
+      "S3-FIFO init %" PRId64 " bytes: main_percent=%d ghost_size_percent=%d 
ghost_mem_percent=%d promote_threshold=%d", _max_bytes,
+      _main_percent, ghost_size_percent, ghost_mem_percent, 
_promote_threshold);
+
+  _resize_hashtable();
+}
+
+int64_t
+RamCacheS3FIFO::size() const
+{
+  // Memory accounted against ram_cache.size: resident data plus per-entry 
overhead, plus the
+  // ghost's per-key overhead. This matches the budget enforced in put(), and 
is O(1).
+  return _s_bytes + _m_bytes + _g_count * ENTRY_OVERHEAD;
+}
+
+RamCacheS3FIFOEntry *
+RamCacheS3FIFO::_lookup(const CryptoHash *key, uint64_t auxkey)
+{
+  uint32_t             i = key->slice32(3) % _nbuckets;
+  RamCacheS3FIFOEntry *e = _bucket[i].head;
+  while (e) {
+    if (e->key == *key && e->auxkey == auxkey) {
+      return e;
+    }
+    e = e->hash_link.next;
+  }
+  return nullptr;
+}
+
+void
+RamCacheS3FIFO::_unlink_hash(RamCacheS3FIFOEntry *e)
+{
+  uint32_t b = e->key.slice32(3) % _nbuckets;
+  _bucket[b].remove(e);
+}
+
+// Remove an entry from whichever segment holds it, update all accounting, and 
free it. The single
+// removal point used by eviction, ghost trimming, and stale-aux-key discard 
in put().
+void
+RamCacheS3FIFO::_remove(RamCacheS3FIFOEntry *e)
+{
+  _seg[e->seg].remove(e);
+  if (e->seg == SEG_GHOST) {
+    _g_bytes -= e->size;
+    _g_count--;
+  } else {
+    int64_t resident = ENTRY_OVERHEAD + e->size;
+    if (e->seg == SEG_SMALL) {
+      _s_bytes -= resident;
+    } else {
+      _m_bytes -= resident;
+    }
+    ts::Metrics::Gauge::decrement(cache_rsb.ram_cache_bytes, resident);
+    ts::Metrics::Gauge::decrement(_stripe->cache_vol->vol_rsb.ram_cache_bytes, 
resident);
+    _objects--;
+  }
+  _unlink_hash(e);
+  _nentries--;
+  e->data = nullptr;
+  THREAD_FREE(e, ramCacheS3FIFOEntryAllocator, this_thread());
+}
+
+// Promote a small-queue object that has proven popular into the main queue 
(resident, no data
+// change), resetting its frequency for the main-queue CLOCK.
+void
+RamCacheS3FIFO::_to_main(RamCacheS3FIFOEntry *e)
+{
+  _seg[SEG_SMALL].remove(e);
+  _s_bytes -= ENTRY_OVERHEAD + e->size;
+  e->seg    = SEG_MAIN;
+  e->freq   = 0;
+  _seg[SEG_MAIN].enqueue(e);
+  _m_bytes += ENTRY_OVERHEAD + e->size;
+}
+
+// Demote a small-queue object to the ghost queue: drop its data (freeing 
resident bytes) but keep
+// its key so a quick re-reference can be admitted straight to the main queue.
+void
+RamCacheS3FIFO::_to_ghost(RamCacheS3FIFOEntry *e)
+{
+  _seg[SEG_SMALL].remove(e);
+  _s_bytes -= ENTRY_OVERHEAD + e->size;
+  ts::Metrics::Gauge::decrement(cache_rsb.ram_cache_bytes, ENTRY_OVERHEAD + 
e->size);
+  ts::Metrics::Gauge::decrement(_stripe->cache_vol->vol_rsb.ram_cache_bytes, 
ENTRY_OVERHEAD + e->size);
+  e->data = nullptr;
+  e->seg  = SEG_GHOST;
+  e->freq = 0;
+  _seg[SEG_GHOST].enqueue(e);
+  _g_bytes += e->size;
+  _g_count++;
+  _objects--;
+  _enforce_ghost();
+}
+
+// Drop the oldest ghost key; returns false if the ghost was already empty.
+bool
+RamCacheS3FIFO::_evict_ghost_one()
+{
+  RamCacheS3FIFOEntry *g = _seg[SEG_GHOST].head;
+  if (!g) {
+    return false;
+  }
+  _remove(g);
+  return true;
+}
+
+void
+RamCacheS3FIFO::_enforce_ghost()
+{
+  while ((_g_bytes > _ghost_size_limit || _g_count > _ghost_max) && 
_evict_ghost_one()) {}
+}
+
+// Evict from the small queue: promote reused objects to main, demote the 
first un-reused object to
+// the ghost (which is what actually frees resident bytes). If everything got 
promoted the small
+// queue empties and the caller falls through to evicting main.
+void
+RamCacheS3FIFO::_evict_small()
+{
+  while (_seg[SEG_SMALL].head) {
+    RamCacheS3FIFOEntry *c = _seg[SEG_SMALL].head;
+    if (c->freq >= _promote_threshold) {
+      _to_main(c);
+    } else {
+      _to_ghost(c);
+      return;
+    }
+  }
+}
+
+// Evict from the main queue (2-bit CLOCK): a reused object is reinserted at 
the tail with its
+// frequency decremented; the first object with frequency 0 is evicted.
+void
+RamCacheS3FIFO::_evict_main()
+{
+  while (_seg[SEG_MAIN].head) {
+    RamCacheS3FIFOEntry *c = _seg[SEG_MAIN].head;
+    if (c->freq >= 1) {
+      _seg[SEG_MAIN].remove(c);
+      c->freq -= 1; // 2-bit clock: a reused object gets another pass (freq is 
already <= FREQ_MAX)
+      _seg[SEG_MAIN].enqueue(c);
+    } else {
+      _remove(c);
+      return;
+    }
+  }
+}
+
+void
+RamCacheS3FIFO::_evict()
+{
+  // Main targets MAIN_PERCENT of the budget actually available to resident 
data -- the ghost's
+  // metadata is reserved out of the configured size, so basing the split on 
raw _max_bytes would
+  // (when the ghost is full) keep _m_bytes below the limit forever and starve 
the small queue.
+  int64_t resident_budget = _max_bytes - _g_count * ENTRY_OVERHEAD;
+  if (_m_bytes > resident_budget * _main_percent / 100 || _s_bytes == 0) {
+    _evict_main();
+  } else {
+    _evict_small();
+  }
+}
+
+int
+RamCacheS3FIFO::get(CryptoHash *key, Ptr<IOBufferData> *ret_data, uint64_t 
auxkey)
+{
+  if (!_max_bytes) {
+    return 0;
+  }
+  RamCacheS3FIFOEntry *e = _lookup(key, auxkey);
+  if (e && e->seg != SEG_GHOST) {
+    if (e->freq < FREQ_MAX) {
+      e->freq++;
+    }
+    (*ret_data) = e->data;
+    ts::Metrics::Counter::increment(cache_rsb.ram_cache_hits);
+    
ts::Metrics::Counter::increment(_stripe->cache_vol->vol_rsb.ram_cache_hits);
+    return 1;
+  }
+  ts::Metrics::Counter::increment(cache_rsb.ram_cache_misses);
+  
ts::Metrics::Counter::increment(_stripe->cache_vol->vol_rsb.ram_cache_misses);
+  return 0;
+}
+
+int
+RamCacheS3FIFO::put(CryptoHash *key, IOBufferData *data, [[maybe_unused]] 
uint32_t len, bool, uint64_t auxkey)
+{
+  if (!_max_bytes) {
+    return 0;
+  }
+  uint32_t size = data->block_size();
+  uint32_t i    = key->slice32(3) % _nbuckets;
+
+  // Walk the hash chain. A resident hit just counts the reference; a ghost 
hit is admitted fresh to
+  // the main queue; an entry with the same key but a different aux key is 
stale and is discarded --
+  // the same one-resident-copy-per-key contract LRU and CLFUS enforce.
+  bool                 ghost_hit = false;
+  RamCacheS3FIFOEntry *e         = _bucket[i].head;
+  while (e) {
+    if (e->key == *key) {
+      if (e->auxkey == auxkey) {
+        if (e->seg != SEG_GHOST) { // already resident: count the reference 
(matches LRU's bump)
+          if (e->freq < FREQ_MAX) {
+            e->freq++;
+          }
+          return 1;
+        }
+        RamCacheS3FIFOEntry *next = e->hash_link.next; // ghost hit: admit a 
fresh copy to main
+        _remove(e);
+        ghost_hit = true;
+        e         = next;
+        continue;
+      }
+      RamCacheS3FIFOEntry *next = e->hash_link.next; // stale aux key: discard 
the old copy
+      _remove(e);
+      e = next;
+      continue;
+    }
+    e = e->hash_link.next;
+  }
+
+  // Keep total memory within the configured size: resident data+overhead plus 
the ghost's
+  // metadata (ENTRY_OVERHEAD per remembered key) must fit. Evict resident 
first; if that is
+  // exhausted but ghost metadata still pushes over, drop ghost keys too.
+  int64_t need = ENTRY_OVERHEAD + size;
+  while (_s_bytes + _m_bytes + _g_count * ENTRY_OVERHEAD + need > _max_bytes) {
+    if (_s_bytes + _m_bytes > 0) {
+      _evict();
+    } else if (!_evict_ghost_one()) {
+      break;
+    }
+  }
+  if (_s_bytes + _m_bytes + _g_count * ENTRY_OVERHEAD + need > _max_bytes) {
+    return 0; // object larger than the whole cache: do not admit, so the 
budget is never exceeded
+  }
+
+  RamCacheS3FIFOEntry *ne = THREAD_ALLOC(ramCacheS3FIFOEntryAllocator, 
this_ethread());
+  ne->key                 = *key;
+  ne->auxkey              = auxkey;
+  ne->size                = size;
+  ne->freq                = 0;
+  ne->seg                 = ghost_hit ? SEG_MAIN : SEG_SMALL;
+  ne->data                = data;
+  _bucket[i].push(ne);
+  _seg[ne->seg].enqueue(ne);
+  if (ghost_hit) {
+    _m_bytes += need;
+  } else {
+    _s_bytes += need;
+  }
+  _objects++;
+  _nentries++;
+  ts::Metrics::Gauge::increment(cache_rsb.ram_cache_bytes, need);
+  ts::Metrics::Gauge::increment(_stripe->cache_vol->vol_rsb.ram_cache_bytes, 
need);
+
+  if (_nentries > _nbuckets * 0.75) {
+    ++_ibuckets;
+    _resize_hashtable();
+  }
+  return 1;
+}
+
+int
+RamCacheS3FIFO::fixup(const CryptoHash *key, uint64_t old_auxkey, uint64_t 
new_auxkey)
+{
+  if (!_max_bytes) {
+    return 0;
+  }
+  RamCacheS3FIFOEntry *e = _lookup(key, old_auxkey);
+  if (e && e->seg != SEG_GHOST) {
+    e->auxkey = new_auxkey;
+    return 1;
+  }
+  return 0;
+}
+
+RamCache *
+new_RamCacheS3FIFO()
+{
+  return new RamCacheS3FIFO;
+}
diff --git a/src/records/RecordsConfig.cc b/src/records/RecordsConfig.cc
index 7588e5108d..7890f427ed 100644
--- a/src/records/RecordsConfig.cc
+++ b/src/records/RecordsConfig.cc
@@ -860,7 +860,16 @@ static constexpr RecordElement RecordsConfig[] =
   //  # alternatively: 20971520 (20MB)
   {RECT_CONFIG, "proxy.config.cache.ram_cache.size", RECD_INT, "-1", 
RECU_RESTART_TS, RR_NULL, RECC_STR, "^-?[0-9]+[A-Za-z]{0,}$", RECA_NULL}
   ,
-  {RECT_CONFIG, "proxy.config.cache.ram_cache.algorithm", RECD_INT, "1", 
RECU_RESTART_TS, RR_NULL, RECC_INT, "[0-1]", RECA_NULL}
+  {RECT_CONFIG, "proxy.config.cache.ram_cache.algorithm", RECD_INT, "1", 
RECU_RESTART_TS, RR_NULL, RECC_INT, "[0-2]", RECA_NULL}
+  ,
+  //  # S3-FIFO (ram_cache.algorithm = 2) tunables
+  {RECT_CONFIG, "proxy.config.cache.ram_cache.s3fifo.main_percent", RECD_INT, 
"90", RECU_RESTART_TS, RR_NULL, RECC_INT, "[1-99]", RECA_NULL}
+  ,
+  {RECT_CONFIG, "proxy.config.cache.ram_cache.s3fifo.ghost_size_percent", 
RECD_INT, "90", RECU_RESTART_TS, RR_NULL, RECC_INT, "[0-100]", RECA_NULL}
+  ,
+  {RECT_CONFIG, "proxy.config.cache.ram_cache.s3fifo.ghost_mem_percent", 
RECD_INT, "25", RECU_RESTART_TS, RR_NULL, RECC_INT, "[0-100]", RECA_NULL}
+  ,
+  {RECT_CONFIG, "proxy.config.cache.ram_cache.s3fifo.promote_threshold", 
RECD_INT, "2", RECU_RESTART_TS, RR_NULL, RECC_INT, "[1-3]", RECA_NULL}
   ,
   {RECT_CONFIG, "proxy.config.cache.ram_cache.use_seen_filter", RECD_INT, "1", 
RECU_RESTART_TS, RR_NULL, RECC_INT, "[0-9]", RECA_NULL}
   ,

(trafficserver) branch master updated: Add S3-FIFO RAM cache eviction algorithm (ram_cache.algorithm = 2) (#13255)

Reply via email to