[ https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149000#comment-17149000 ]
Caleb Rackliffe commented on CASSANDRA-15907:
---------------------------------------------

At this point, we're sitting on what appear to be four distinct approaches to addressing the problems in the current implementation. Before trying to contrast them all, I want to think through the kinds of usage we expect and the consequences of that.

Future indexing implementations aside, neither filtering queries nor secondary index queries are currently meant to be used at scale (especially at CL > ONE/LOCAL_ONE) without partition restrictions. Optimizing for that case seems reasonable. The other big axis is how common out-of-sync replicas actually are, and how responsive we have to be across the spectrum from "rare" to "entire replica datasets are out of sync". What's currently in trunk does just fine if there is very little out-of-sync data, especially in the common case where we're limited to a partition. (i.e. the actual number of protection queries is very low, because we group by partition.) Its weakness is the edge case.

bq. Issue blocking RFP read immediately at {{MergeListener#onMergedRows}} when detecting potential outdated rows

This single-pass solution would excel in situations where there are very few silent replicas, and it would put very little stress on the heap, given it could simply forgo caching merged rows that don't satisfy the query filter. It also appears to be a fairly simple change to the existing logic. The downside of this approach is that it would start to issue a pretty high volume of individual row protection queries as it came across more silent replicas, without even the benefit of mitigating partition grouping. It wouldn't require any new guardrails around memory usage, though, and the worst that could happen is a query timeout.

bq. We could try to not cache all the results but advance in blocks of a certain fixed number of cached results, so we limit the number of cached results while we can still group keys to do fewer queries.
bq. That is, we could have that pessimistic SRP read prefetching and caching N rows, completed with extra queries to the silent replicas, plugged into another group of unmerged/merged counters to prefetch more results if (probably) needed

This seems to retain all the nice characteristics of the current trunk implementation (most importantly, partition grouping for RFP queries), with the added benefit that it should only use heap proportional to the actual user limit (although not precisely, given the difference between the batch size and the limit). It wouldn't really require any new guardrails around memory usage, given the tighter coupling to the limit or page size, and the worst case is also a timeout. The stumbling block feels like complexity, but that might just be my lack of creativity.

[~adelapena] Wouldn't we have to avoid SRP in the first phase of the query to limit the size of the result cache during batches?

I've been trying to figure out a way to merge these two ideas, i.e. to batch partition/completion reads in the RFP {{MergeListener}}. Combined w/ filtering, also in the {{MergeListener}}, we could discard (i.e. avoid caching) the rows that don't pass the filter. The problem is that the return value of {{onMergedRows()}} is what presently informs SRP/controls the counter.

> Operational Improvements & Hardening for Replica Filtering Protection
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-15907
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15907
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Coordination, Feature/2i Index
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>              Labels: 2i, memory
>             Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i and filtering queries at consistency levels above ONE/LOCAL_ONE.
> There are a few things we should follow up on, however, to make life a bit easier for operators and generally de-risk usage:
>
> (Note: Line numbers are based on {{trunk}} as of {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.)
>
> *Minor Optimizations*
> * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be able to use simple arrays instead of lists for {{rowsToFetch}} and {{originalPartitions}}. Alternatively (or also), we may be able to null out references in these two collections more aggressively. (ex. Using {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}}, assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}.)
> * {{ReplicaFilteringProtection:323}} - We may be able to use {{EncodingStats.merge()}} and remove the custom {{stats()}} method.
> * {{DataResolver:111 & 228}} - Cache an instance of {{UnaryOperator#identity()}} instead of creating one on the fly.
> * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather rather than serially querying every row that needs to be completed. This isn't a clear win, perhaps, given it targets the latency of single queries and adds some complexity. (Certainly a decent candidate to kick even out of this issue.)
>
> *Documentation and Intelligibility*
> * There are a few places (CHANGES.txt, tracing output in {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side filtering protection" (which makes it seem like the coordinator doesn't filter) rather than "replica filtering protection" (which sounds more like what we actually do, which is protect ourselves against incorrect replica filtering results). It's a minor fix, but it would avoid confusion.
> * The method call chain in {{DataResolver}} might be a bit simpler if we put the {{repairedDataTracker}} in {{ResolveContext}}.
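As a rough illustration of the scatter/gather bullet above: instead of querying each incomplete row serially, we could launch all of the per-row completion queries at once and then wait for the batch. This is only a sketch using plain {{CompletableFuture}}; {{querySourceOnKey}} here is a hypothetical stand-in for the real method in {{ReplicaFilteringProtection}}, which operates on Cassandra's internal partition/row types.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ScatterGatherSketch
{
    // Hypothetical stand-in for the serial per-row completion query made in
    // queryProtectedPartitions(); here it just simulates fetching the
    // completed row for a key.
    static String querySourceOnKey(String key)
    {
        return "completed-" + key;
    }

    // Scatter: issue every protection query concurrently.
    // Gather: block once for the whole batch instead of once per row.
    static Map<String, String> fetchAll(List<String> keysToFetch)
    {
        Map<String, CompletableFuture<String>> inFlight =
            keysToFetch.stream()
                       .collect(Collectors.toMap(k -> k,
                                                 k -> CompletableFuture.supplyAsync(() -> querySourceOnKey(k))));

        return inFlight.entrySet().stream()
                       .collect(Collectors.toMap(Map.Entry::getKey,
                                                 e -> e.getValue().join()));
    }

    public static void main(String[] args)
    {
        Map<String, String> completed = fetchAll(List.of("k1", "k2", "k3"));
        System.out.println(completed.get("k1")); // completed-k1
        System.out.println(completed.size());    // 3
    }
}
```

As noted in the list, this targets the latency of a single query rather than throughput, so whether the extra concurrency is worth the complexity is an open question.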
> *Guardrails*
> * As it stands, we don't have a way to enforce an upper bound on the memory usage of {{ReplicaFilteringProtection}}, which caches row responses from the first round of requests. (Remember, these are later merged with the second round of results to complete the data for filtering.) Operators will likely need a way to protect themselves, i.e. to simply fail queries if they hit a particular threshold rather than GC nodes into oblivion. (Having control over limits and page sizes doesn't quite get us there, because stale results _expand_ the number of incomplete results we must cache.) The fun question is how we do this, with the primary axes being scope (per-query, global, etc.) and granularity (per-partition, per-row, per-cell, actual heap usage, etc.). My starting disposition on the right trade-off between performance/complexity and accuracy is something along the lines of cached rows per query. Prior art suggests this probably makes sense alongside things like {{tombstone_failure_threshold}} in {{cassandra.yaml}}.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
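For what it's worth, the "cached rows per query" option from the Guardrails section could be as simple as a counter checked every time RFP caches a row from the first round of responses. A minimal sketch only: the threshold name and the generic exception are illustrative, not the actual Cassandra configuration or failure machinery.

```java
// Per-query guardrail sketch: fail the query once the number of cached rows
// crosses a configured threshold (analogous in spirit to
// tombstone_failure_threshold), rather than letting the cache grow unbounded.
public class CachedRowsGuardrail
{
    private final int failureThreshold; // hypothetical cached_rows_failure_threshold
    private int cachedRows = 0;

    public CachedRowsGuardrail(int failureThreshold)
    {
        this.failureThreshold = failureThreshold;
    }

    // Called for each first-round row cached by replica filtering protection.
    public void onRowCached()
    {
        if (++cachedRows > failureThreshold)
            throw new IllegalStateException("Query failed: cached " + cachedRows
                                            + " rows for replica filtering protection,"
                                            + " above threshold " + failureThreshold);
    }

    public int cachedRows()
    {
        return cachedRows;
    }
}
```

Scoping the counter per query keeps the accounting cheap; the trade-off is that it bounds row count rather than actual heap usage, so wide rows are under-counted relative to narrow ones.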