[ https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148685#comment-17148685 ]

Andres de la Peña commented on CASSANDRA-15907:
-----------------------------------------------

{quote}
the second approach will execute RFP requests in two places:
 # at the beginning of 2nd phase, based on the collected outdated rows from 1st 
phase. These RFP requests can run in parallel and the number can be large.
 # at merge-listener, for additional rows requested by SRP. These RFP requests 
have to run in serial, but the number is usually small.
{quote}
I understand that this would limit the number of cached results at the expense 
of producing more queries during the second phase. As for parallelizing, it 
would help a bit, but I don't think it will save us from the degenerate cases 
that worry us: those where everything is so out of sync that we have to read 
the entire database.

Perhaps we might consider a more sophisticated way of balancing the number of 
cached rows against the number of grouped queries. Rather than caching all the 
results, we could advance in blocks of a fixed number of cached results, so 
the cache stays bounded while we can still group keys to issue fewer queries. 
That is, the pessimistic SRP read would prefetch and cache N rows, complete 
them with extra queries to the silent replicas, and plug into another group of 
unmerged/merged counters to decide whether more results (probably) need to be 
prefetched, if that makes sense.
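
For illustration only, a minimal sketch of that block-based prefetching; every 
name here ({{BlockPrefetcher}}, {{fetch}}, etc.) is hypothetical and not part 
of the actual {{ReplicaFilteringProtection}} code:
{code:java}
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.function.Function;

final class BlockPrefetcher<K, R>
{
    private final int blockSize;                    // N: bound on cached rows
    private final Function<List<K>, List<R>> fetch; // one grouped query to the silent replicas
    private final Queue<R> cached = new ArrayDeque<>();

    BlockPrefetcher(int blockSize, Function<List<K>, List<R>> fetch)
    {
        this.blockSize = blockSize;
        this.fetch = fetch;
    }

    // Complete the next block of keys with a single grouped query, so we
    // never hold more than blockSize fetched rows in memory at once.
    private void prefetch(List<K> pendingKeys)
    {
        List<K> block = pendingKeys.subList(0, Math.min(blockSize, pendingKeys.size()));
        cached.addAll(fetch.apply(block));
        block.clear(); // consume the keys we just fetched (requires a mutable list)
    }

    // The unmerged/merged counters would call this whenever another completed
    // row is (probably) needed; returns null once everything is exhausted.
    R next(List<K> pendingKeys)
    {
        if (cached.isEmpty() && !pendingKeys.isEmpty())
            prefetch(pendingKeys);
        return cached.poll();
    }
}
{code}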

Regarding the guardrails, a very reasonable threshold for in-memory cached 
results, for example 100 rows, can still produce 100 internal queries if they 
are all in different partitions, which is definitely too many. Thus, we could 
also consider having another guardrail to limit the number of additional 
SRP/RFP internal queries per user query, so we can fail before getting to a 
timeout. That guardrail could, however, become obsolete for RFP if we 
implement multi-key queries and can do the current second phase with a single 
query per replica.
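
A hedged sketch of such a per-user-query guardrail on internal queries; the 
class, threshold, and exception are illustrative assumptions rather than 
existing code or configuration:
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

final class InternalQueryGuardrail
{
    private final int maxInternalQueries;              // hypothetical operator-set limit
    private final AtomicInteger issued = new AtomicInteger();

    InternalQueryGuardrail(int maxInternalQueries)
    {
        this.maxInternalQueries = maxInternalQueries;
    }

    // Called before each additional SRP/RFP request issued for one user query;
    // fails fast instead of letting the coordinator run into a timeout.
    void onInternalQuery()
    {
        if (issued.incrementAndGet() > maxInternalQueries)
            throw new IllegalStateException(
                "Query aborted: more than " + maxInternalQueries +
                " internal SRP/RFP queries issued for a single user query");
    }
}
{code}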

> Operational Improvements & Hardening for Replica Filtering Protection
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-15907
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15907
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Coordination, Feature/2i Index
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>              Labels: 2i, memory
>             Fix For: 4.0-beta
>
>
> CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i 
> and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a 
> few things we should follow up on, however, to make life a bit easier for 
> operators and generally de-risk usage:
> (Note: Line numbers are based on {{trunk}} as of 
> {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.)
> *Minor Optimizations*
> * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be 
> able to use simple arrays instead of lists for {{rowsToFetch}} and 
> {{originalPartitions}}. Alternatively (or also), we may be able to null out 
> references in these two collections more aggressively (e.g. using 
> {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}}, 
> assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}; see 
> the sketch after this list).
> * {{ReplicaFilteringProtection:323}} - We may be able to use 
> {{EncodingStats.merge()}} and remove the custom {{stats()}} method.
> * {{DataResolver:111 & 228}} - Cache an instance of 
> {{UnaryOperator#identity()}} instead of creating one on the fly.
> * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather 
> rather than serially querying every row that needs to be completed. This 
> isn't clearly a win, given that it targets the latency of single queries and 
> adds some complexity. (Certainly a decent candidate to kick out of this 
> issue entirely.)
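> For the first bullet, a minimal sketch of the {{set()}}-based nulling; the 
> surrounding class shape is hypothetical, not the actual 
> {{ReplicaFilteringProtection}} code:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> class RowCompletion
> {
>     // sized up-front, so a plain array could work just as well
>     private final List<Object> rowsToFetch = new ArrayList<>();
>
>     void queryProtectedPartitions()
>     {
>         for (int i = 0; i < rowsToFetch.size(); i++)
>         {
>             // set(i, null) releases the cached reference as soon as the row
>             // is handed off, where get(i) would keep it alive until the end
>             Object toFetch = rowsToFetch.set(i, null);
>             if (toFetch != null)
>                 querySourceOnKey(toFetch);
>         }
>     }
>
>     private void querySourceOnKey(Object toFetch) { /* issue the read */ }
> }
> {code}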
> *Documentation and Intelligibility*
> * There are a few places (CHANGES.txt, tracing output in 
> {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side 
> filtering protection" (which makes it seem like the coordinator doesn't 
> filter) rather than "replica filtering protection" (which sounds more like 
> what we actually do, which is protect ourselves against incorrect replica 
> filtering results). It's a minor fix, but would avoid confusion.
> * The method call chain in {{DataResolver}} might be a bit simpler if we put 
> the {{repairedDataTracker}} in {{ResolveContext}}.
> *Guardrails*
> * As it stands, we don't have a way to enforce an upper bound on the memory 
> usage of {{ReplicaFilteringProtection}}, which caches row responses from the 
> first round of requests. (Remember, these are later merged with the second 
> round of results to complete the data for filtering.) Operators will likely 
> need a way to protect themselves, i.e. simply fail queries if they hit a 
> particular threshold rather than GC nodes into oblivion. (Having control 
> over limits and page sizes doesn't quite get us there, because stale results 
> _expand_ the number of incomplete results we must cache.) The fun question is 
> how we do this, with the primary axes being scope (per-query, global, etc.) 
> and granularity (per-partition, per-row, per-cell, actual heap usage, etc.). 
> My starting disposition on the right trade-off between performance/complexity 
> and accuracy is having something along the lines of cached rows per query. 
> Prior art suggests this probably makes sense alongside things like 
> {{tombstone_failure_threshold}} in {{cassandra.yaml}}.
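> For illustration, a hedged sketch of a per-query cached-row guardrail in the 
> style of {{tombstone_failure_threshold}}; the counter shape and the 
> {{cached_rows_failure_threshold}} option name are made-up assumptions, not 
> existing configuration:
> {code:java}
> final class CachedRowsGuardrail
> {
>     // hypothetical cassandra.yaml option: cached_rows_failure_threshold
>     private final int failureThreshold;
>     private int cachedRows;
>
>     CachedRowsGuardrail(int failureThreshold)
>     {
>         this.failureThreshold = failureThreshold;
>     }
>
>     // Called each time ReplicaFilteringProtection caches a first-round row.
>     void onRowCached()
>     {
>         if (++cachedRows > failureThreshold)
>             throw new IllegalStateException("Rejecting query: " + cachedRows +
>                 " rows cached for replica filtering protection exceeds threshold");
>     }
> }
> {code}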


