[https://issues.apache.org/jira/browse/HBASE-15560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710972#comment-16710972]

Ben Manes commented on HBASE-15560:
-----------------------------------

Sorry that this dropped off my radar. I can summarize a few things stated 
before, with minor updates.
h5. Current (SLRU)
 * Design:
 ** Reads perform a ConcurrentHashMap lookup, increment a global AtomicLong 
counter to record the access time, and mark the block as frequent (see the 
sketch after this list)
 ** Writes perform a ConcurrentHashMap update and notify the eviction thread 
if the cache overflows
 ** The eviction thread wakes up every 10s or when notified, performs an 
O(n lg n) sort, and evicts from the recency/frequency segments down to the 
watermarks
 * Benefits
 ** Provides scan resistance and captures simple frequency workloads. Is 
optimal for a Zipf distribution.
 ** Has minimal latencies at low/modest concurrency, as it does very little 
work on the requesting threads
 * Costs
 ** At high concurrency, the AtomicLong would be a synchronization bottleneck 
(~10M op/sec if I recall correctly). This probably does not matter, since 
disk I/O, network I/O, etc. leave only modest thrashing on this counter.
 ** No back-pressure on writes if the cache cannot evict fast enough. However, 
the I/O involved may make this moot. 
 ** Expected lower hit rates in real-world traces, based on the variety of 
workloads we have examined (non-HBase, various frequency/recency mixtures)
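
A minimal sketch of the read-path bookkeeping described above (hypothetical 
names, not the actual LruBlockCache code):
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class SlruBookkeepingSketch<K> {
  private final ConcurrentHashMap<K, Entry> map = new ConcurrentHashMap<>();
  private final AtomicLong ticker = new AtomicLong(); // global counter: the contention point

  static final class Entry {
    volatile long accessTime;  // logical time used by the O(n lg n) sort
    volatile boolean frequent; // promotes the block into the frequent segment
  }

  Entry read(K key) {
    Entry entry = map.get(key); // lock-free hash table lookup
    if (entry != null) {
      entry.accessTime = ticker.incrementAndGet(); // every reader contends here
      entry.frequent = true;
    }
    return entry;
  }
}
{code}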

h5. Caffeine (Proposed, TinyLFU)
 * Design:
 ** Reads perform a ConcurrentHashMap lookup, hash to a ring buffer (a 
growable array of buffers), and try to add the event (up to 3 times, possibly 
rehashing). If the ring buffer is full or a state machine flag is set, the 
reader tryLocks to schedule a drain task on an executor (see the sketch after 
this list).
 ** Writes perform a ConcurrentHashMap update, add the event to a ring buffer 
(blocking if full), set a state machine flag, and tryLock to schedule a drain 
task on an executor.
 ** The executor (default: ForkJoinPool.commonPool()) drains the ring buffers, 
replays the events on the eviction policy, and evicts if the cache has 
overflowed.
 * Benefits
 ** Allows a higher degree of read concurrency by avoiding a single point of 
contention (striped ring buffers)
 ** Offers back-pressure on writes if the eviction thread cannot keep up 
(writers are descheduled by taking the global lock when the buffer is full)
 ** Spreads out small chunks of O(1) work
 ** Allows more advanced policies / data-structures (TinyLFU, Hierarchical 
TimerWheel) => higher hit rates & more features
 * Costs
 ** Slightly higher penalties on read / write (no free lunch)
 ** Is more biased towards frequency (a negative in recency-skewed workloads)
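
A simplified sketch of the striped read buffers and drain scheduling described 
above (hypothetical names; Caffeine's actual implementation grows the stripes 
under contention, while the tryLock and state machine flag are collapsed into 
a single AtomicBoolean here):
{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.atomic.AtomicBoolean;

class ReadBufferSketch<E> {
  private static final int STRIPES = 4;      // power of two; the real cache grows this
  private static final int BUFFER_SIZE = 16;

  private final ArrayBlockingQueue<E>[] buffers;
  private final AtomicBoolean drainScheduled = new AtomicBoolean();
  private final Executor executor = ForkJoinPool.commonPool();

  @SuppressWarnings("unchecked")
  ReadBufferSketch() {
    buffers = new ArrayBlockingQueue[STRIPES];
    for (int i = 0; i < STRIPES; i++) {
      buffers[i] = new ArrayBlockingQueue<>(BUFFER_SIZE);
    }
  }

  /** Called on every read; contention is spread across the stripes. */
  void recordRead(E event) {
    int index = Thread.currentThread().hashCode() & (STRIPES - 1);
    boolean added = buffers[index].offer(event); // lossy: dropping a read event is tolerable
    if (!added && drainScheduled.compareAndSet(false, true)) {
      executor.execute(this::drain); // schedule replay on the executor
    }
  }

  private void drain() {
    try {
      for (ArrayBlockingQueue<E> buffer : buffers) {
        E event;
        while ((event = buffer.poll()) != null) {
          // replay the event on the eviction policy; evict if overflowed (omitted)
        }
      }
    } finally {
      drainScheduled.set(false);
    }
  }
}
{code}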

h5. Synopsis

The SLRU is the cheapest (latency) and optimal (hit rate) for synthetic Zipf 
testing; it was designed with those considerations in mind. Any other solution 
will trade higher latency for better hit rates and system behavior. 
The question is then whether the latency difference is small enough 
(effectively noise) and whether the higher hit rate improves overall 
performance. *This can only be 
definitively answered by someone willing to canary an instance in a live 
environment.* My belief, from analyzing hit rates and their impacts on other 
applications, is that there will be a benefit.
h5. TinyLFU improvements

We have been exploring ways to improve TinyLFU-based policies in adversarial 
(recency-biased) workloads. In those cases, items are brought in, operated on 
repeatedly, and then never touched again. A good example is a transaction log 
or a distributed compilation cache (with a local cache). In those workloads 
frequency is a negative signal: by the time the score is high enough for 
retention, the item is no longer worth retaining.

We have been working on adaptive schemes by sampling the workload and adjusting 
based on its characteristics 
([paper|https://drive.google.com/open?id=1CT2ASkfuG9qVya9Sn8ZUCZjrFSSyjRA_]). 
Both a naive hill climber and a statistics-based model correct the policy to 
the optimal hit rate. I hope to try [adaptive moment 
estimation|https://arxiv.org/abs/1412.6980], an advanced hill climber, which I 
believe will be the most robust and inexpensive mechanism (as proven by the ML 
community). This work will allow the cache to offer the best hit rate 
regardless of workload, which no other policy has been able to do so far.
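
To illustrate, a naive hill climber can be as simple as the following (a 
hypothetical sketch of the idea, not the paper's exact algorithm): sample the 
hit rate over a period and nudge the recency window in whichever direction 
last improved it.
{code:java}
class HillClimberSketch {
  private double windowFraction = 0.01; // share of capacity devoted to recency
  private double step = 0.05;           // signed step; its sign is the climb direction
  private double previousHitRate;

  /** Called once per sample period with the observed hit rate. */
  void adapt(double hitRate) {
    if (hitRate < previousHitRate) {
      step = -step; // the last move hurt, so walk the other way
    }
    windowFraction = Math.max(0.0, Math.min(1.0, windowFraction + step));
    previousHitRate = hitRate;
  }
}
{code}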
h5. Next Steps

I don't think there is anything meaningful that I can offer to this ticket. If 
this were to go in, it would take either a leap of faith (making it an option) 
or someone in the community proving the benefit. Without an environment or 
trace, we can't do more than discuss minor details from synthetic testing.

> TinyLFU-based BlockCache
> ------------------------
>
>                 Key: HBASE-15560
>                 URL: https://issues.apache.org/jira/browse/HBASE-15560
>             Project: HBase
>          Issue Type: Improvement
>          Components: BlockCache
>    Affects Versions: 2.0.0
>            Reporter: Ben Manes
>            Assignee: Ben Manes
>            Priority: Major
>         Attachments: HBASE-15560.patch, HBASE-15560.patch, HBASE-15560.patch, 
> HBASE-15560.patch, HBASE-15560.patch, HBASE-15560.patch, HBASE-15560.patch, 
> bc.hit.count, bc.miss.count, branch-1.tinylfu.txt, gets, run_ycsb_c.sh, 
> run_ycsb_loading.sh, tinylfu.patch
>
>
> LruBlockCache uses the Segmented LRU (SLRU) policy to capture frequency and 
> recency of the working set. It achieves concurrency by using an O(n) 
> background thread to prioritize the entries and evict. Accessing an entry is 
> O(1) by a hash table lookup, recording its logical access time, and setting a 
> frequency flag. A write is performed in O(1) time by updating the hash table 
> and triggering an async eviction thread. This provides ideal concurrency and 
> minimizes the latencies by penalizing the thread instead of the caller. 
> However, the policy does not age the frequencies and may not be resilient to 
> various workload patterns.
> W-TinyLFU ([research paper|http://arxiv.org/pdf/1512.00727.pdf]) records the 
> frequency in a counting sketch, ages periodically by halving the counters, 
> and orders entries by SLRU. An entry is discarded by comparing the frequency 
> of the new arrival (candidate) to the SLRU's victim, and keeping the one with 
> the highest frequency. This allows the operations to be performed in O(1) 
> time and, through the use of a compact sketch, a much larger history is 
> retained beyond the current working set. In a variety of real world traces 
> the policy had [near optimal hit 
> rates|https://github.com/ben-manes/caffeine/wiki/Efficiency].
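> A minimal sketch of that admission decision (hypothetical names; the 
> interface stands in for the counting sketch):
> {code:java}
> class AdmissionSketch {
>   /** Hypothetical view of the counting sketch's read side. */
>   interface FrequencySketch {
>     int frequency(long keyHash); // estimated accesses, halved periodically
>   }
>
>   /** Keep whichever of the candidate and the SLRU victim was used more often. */
>   static boolean admit(long candidateHash, long victimHash, FrequencySketch sketch) {
>     return sketch.frequency(candidateHash) > sketch.frequency(victimHash);
>   }
> }
> {code}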
> Concurrency is achieved by buffering and replaying the operations, similar to 
> a write-ahead log. A read is recorded into a striped ring buffer and a write 
> into a queue. The operations are applied in batches under a try-lock by an 
> asynchronous thread, thereby tracking the usage pattern without incurring high 
> latencies 
> ([benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks#server-class]).
> In YCSB benchmarks the results were inconclusive. For a large cache (99% hit 
> rates) the two caches have near identical throughput and latencies with 
> LruBlockCache narrowly winning. At medium and small caches, TinyLFU had a 
> 1-4% hit rate improvement and therefore lower latencies. The lackluster 
> result is because a synthetic Zipfian distribution is used, on which SLRU 
> performs optimally. In a more varied, real-world workload we'd expect to see 
> improvements by being able to make smarter predictions.
> The provided patch implements BlockCache using the 
> [Caffeine|https://github.com/ben-manes/caffeine] caching library (see 
> HighScalability 
> [article|http://highscalability.com/blog/2016/1/25/design-of-a-modern-cache.html]).
> Edward Bortnikov and Eshcar Hillel have graciously provided guidance for 
> evaluating this patch ([github 
> branch|https://github.com/ben-manes/hbase/tree/tinylfu]).


