[ https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149000#comment-17149000 ]

Caleb Rackliffe commented on CASSANDRA-15907:
---------------------------------------------

At this point, we're sitting on what appears to be 4 distinct approaches to 
addressing the problems in the current implementation. Before trying to 
contrast them all, I want to think through the kinds of usage we expect and the 
consequences of that. Future indexing implementations aside, neither filtering 
queries nor secondary index queries are currently meant to be used at scale 
(especially at CL > ONE/LOCAL_ONE) without partition restrictions. Optimizing 
for that case seems reasonable. The other big axis is how common out-of-sync 
replicas actually are, and how well we have to respond across the range from "rare" to "entire 
replica datasets are out of sync". What's currently in trunk does just fine if 
there is very little out-of-sync data, especially in the common case that we're 
limited to a partition. (i.e. The actual number of protection queries is very 
low, because we group by partition.) Its weakness is the edge case.

bq. Issue blocking RFP read immediately at {{MergeListener#onMergedRows}} when 
detecting potentially outdated rows

This single-pass solution would excel in situations where there are very few 
silent replicas and put very little stress on the heap, given it could simply 
forgo caching merged rows that don't satisfy the query filter. It also appears 
to be a fairly simple change to the existing logic. The downside of this 
approach is that it would start to issue a pretty high volume of individual row 
protection queries as it came across more silent replicas, without even the 
mitigating benefit of partition grouping. It wouldn't require any new 
guardrails around memory usage, and the worst that could happen is a query 
timeout.
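
A minimal sketch of the shape I have in mind (simplified; the names here are 
placeholders, not the actual {{DataResolver}}/{{ReplicaFilteringProtection}} 
plumbing):

{code:java}
// Hypothetical, simplified merge listener for the single-pass approach: issue
// a protection read per merged row whenever a replica was silent for it, and
// never cache merged rows that fail the query filter.
public Row onMergedRows(Row merged, Row[] versions)
{
    for (int i = 0; i < versions.length; i++)
    {
        if (versions[i] == null) // replica i returned nothing for this row
        {
            // Single-row completion read to replica i before filtering.
            // fetchRowFrom() is a placeholder; heap stays small, but the
            // number of these queries grows with the number of silent rows.
            fetchRowFrom(replicas.get(i), partitionKey, merged.clustering());
        }
    }
    return merged;
}
{code}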

bq. We could try to not cache all the results but advance in blocks of a 
certain fixed number of cached results, so we limit the number of cached 
results while we can still group keys to do fewer queries. That is, we could 
have that pessimistic SRP read prefetching and caching N rows completed with 
extra queries to the silent replicas, plugged into another group of 
unmerged-merged counters to prefetch more results if (probably) needed

This seems to retain all the nice characteristics of the current trunk 
implementation (most importantly partition grouping for RFP queries), with the 
added benefit that it should only use heap proportional to the actual user 
limit (although not precisely, given the difference between the batch size and 
the limit). It wouldn't really require any new guardrails around memory usage, 
given the tighter coupling to the limit or page size, and the worst case is 
also a timeout. The stumbling block feels like complexity, but that might just 
be my lack of creativity. [~adelapena] Wouldn't we have to avoid SRP in the 
first phase of the query to limit the size of the result cache during batches?
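
Roughly, the batched flow I'm imagining (hypothetical names, hand-waving over 
paging and counting details):

{code:java}
// Hypothetical outline of the batched approach: cache at most BATCH_SIZE
// merged-but-possibly-incomplete rows at a time, complete just that batch with
// grouped-by-partition protection queries, then emit it and decide whether to
// fetch more.
while (mergedSource.hasNext() && counter.hasUnfilledSlots())
{
    List<CachedPartition> batch = cacheUpTo(mergedSource, BATCH_SIZE); // bounded heap
    completeFromSilentReplicas(batch); // one query per (silent replica, partition)
    emit(applyFilter(batch), counter); // counter drives whether another batch is needed
}
{code}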

I've been trying to figure out a way to merge these two ideas, i.e. to batch 
partition/completion reads in the RFP {{MergeListener}}. Combined w/ filtering, 
also in the {{MergeListener}}, we could discard (i.e. avoid caching) the rows 
that don't pass the filter. The problem is that the return value of 
{{onMergedRows()}} is what presently informs SRP/controls the counter.
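
To make that conflict concrete (again, simplified and hypothetical):

{code:java}
// The row returned here is what flows downstream to the SRP/limit counter.
// Returning null for rows that fail the filter would keep them out of the
// cache, but then the counter never sees them, and that count is exactly what
// short read protection uses to decide whether to fetch more.
public Row onMergedRows(Row merged, Row[] versions)
{
    if (!filter.isSatisfiedBy(merged)) // placeholder filter check
        return null;                   // not cached, but also invisible to SRP
    return cacheForCompletion(merged, versions); // placeholder
}
{code}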

> Operational Improvements & Hardening for Replica Filtering Protection
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-15907
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15907
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Coordination, Feature/2i Index
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>              Labels: 2i, memory
>             Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>
> CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i 
> and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a 
> few things we should follow up on, however, to make life a bit easier for 
> operators and generally de-risk usage:
> (Note: Line numbers are based on {{trunk}} as of 
> {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.)
> *Minor Optimizations*
> * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be 
> able to use simple arrays instead of lists for {{rowsToFetch}} and 
> {{originalPartitions}}. Alternatively (or also), we may be able to null out 
> references in these two collections more aggressively. (ex. Using 
> {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}}, 
> assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}.)
> * {{ReplicaFilteringProtection:323}} - We may be able to use 
> {{EncodingStats.merge()}} and remove the custom {{stats()}} method.
> * {{DataResolver:111 & 228}} - Cache an instance of 
> {{UnaryOperator#identity()}} instead of creating one on the fly. (See the 
> sketch after this list.)
> * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather 
> rather than serially querying every row that needs to be completed. This 
> isn't a clear win perhaps, given it targets the latency of single queries and 
> adds some complexity. (Certainly a decent candidate even to kick out of this 
> issue.)
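> A tiny illustration of the identity-operator point above (illustrative only; 
> the element type is a stand-in, not the actual {{DataResolver}} code):
> {code:java}
> // Keep one shared identity operator instead of building one at each of the
> // two call sites (DataResolver:111 & 228).
> private static final UnaryOperator<PartitionIterator> IDENTITY = UnaryOperator.identity();
> // ...and pass IDENTITY wherever a no-op transformation is expected.
> {code}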
> *Documentation and Intelligibility*
> * There are a few places (CHANGES.txt, tracing output in 
> {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side 
> filtering protection" (which makes it seem like the coordinator doesn't 
> filter) rather than "replica filtering protection" (which sounds more like 
> what we actually do, which is protect ourselves against incorrect replica 
> filtering results). It's a minor fix, but would avoid confusion.
> * The method call chain in {{DataResolver}} might be a bit simpler if we put 
> the {{repairedDataTracker}} in {{ResolveContext}}.
> *Guardrails*
> * As it stands, we don't have a way to enforce an upper bound on the memory 
> usage of {{ReplicaFilteringProtection}}, which caches row responses from the 
> first round of requests. (Remember, these are later merged with the 
> second round of results to complete the data for filtering.) Operators will 
> likely need a way to protect themselves, i.e. simply fail queries if they hit 
> a particular threshold rather than GC nodes into oblivion. (Having control 
> over limits and page sizes doesn't quite get us there, because stale results 
> _expand_ the number of incomplete results we must cache.) The fun question is 
> how we do this, with the primary axes being scope (per-query, global, etc.) 
> and granularity (per-partition, per-row, per-cell, actual heap usage, etc.). 
> My starting disposition on the right trade-off between 
> performance/complexity and accuracy is something along the lines of a 
> threshold on cached rows per query. Prior art suggests this probably makes sense alongside 
> things like {{tombstone_failure_threshold}} in {{cassandra.yaml}}.
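> A rough sketch of the per-query, per-row flavor (every name here is invented 
> for illustration; the real knob and plumbing are exactly what's up for 
> discussion):
> {code:java}
> // Hypothetical guardrail: count rows cached by ReplicaFilteringProtection for
> // the current query and fail fast once a cassandra.yaml threshold (analogous
> // to tombstone_failure_threshold) is crossed, instead of GCing the node into
> // oblivion.
> void onRowCached()
> {
>     cachedRowsForQuery++;
>     if (cachedRowsForQuery > cachedRowsFailThreshold) // hypothetical setting
>         throw new InvalidRequestException("Replica filtering protection cached more than " + cachedRowsFailThreshold + " rows; aborting query");
> }
> {code}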



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
