[ https://issues.apache.org/jira/browse/CASSANDRA-18796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794117#comment-17794117 ]
Stefan Miklosovic edited comment on CASSANDRA-18796 at 12/7/23 10:34 AM: ------------------------------------------------------------------------- Thank you all for the guidance, I reworked the patch to reflect this new approach. I opened the patch for the review (1) Couple points though: 1) It is done only for SAI, I think that it would be cool to do this for all indexes, "old" secondary indexes too. Tell me if this is a must in this patch or not. 2) I am not completely sure how to know if underlying query is without partition keys in QueryController.maybeTriggerGuardrails method. QueryController just knows what is the first and the last PrimaryKey of such query. The trick I did is that it is a range query (hence without partition keys) if the firstPrimaryKey is equal to lastPrimaryKey. In that case it is basically whole ring, that's how I understand it. Maybe there is better way but I am not sure about that. Keep in mind that this is also happening on replicas only. This is not an issue in SelectStatement where we have more rich API to work with. We may completely get rid of this check on replicas if we think that check in SelectStatement is enough. 3) There is a test case checking warning path for 3 nodes. Failure path has to be still finished. 4) it would be cool to know exactly what the violations were, per violated replica. The way Warnings and its Counter is done for now is that as it merges responses from replicas, it will always merge it in such a way that we only know maximum. So if there are three nodes, one read from 5 referenced indexes, the second one from 10 and the third one from 15, we will see all replicas here but the maximum showed will be 15 - and we do not know what node it is. This would require further refactoring, I think counters would need to be more like a meter, per node. 5) Lastly, we debugged this quite in depth with [~maedhroz] yesterday but I noticed upon testing that warnings and failures do not produce same message to client. Warning looks like this: {code} LogResult{mark=346824, result=[WARN [node1_isolatedExecutor:1] node1 2023-12-07 11:18:46,539 All nodes [/127.0.0.1:7012, /127.0.0.2:7012, /127.0.0.3:7012] referenced more than allowed number of indexes without restrictions on partition key, maximum on a node being 3 indexes, and issued warnings for query SELECT * FROM distributed_test_keyspace.t_0 WHERE v1 = 1701944326319 ALLOW FILTERING (see maximum_sstables_with_non_paritition_key_restricted_query_warn_threshold)]} {code} However, failure looks like this: {code} stefan@cqlsh> select * from cycling.cyclist_semi_pro2 where country = 'ITA'; ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 1 failures: READ_TOO_MANY_INDEXES from localhost/127.0.0.1:7000" info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0, 'failures': 1, 'error_code_map': {'127.0.0.1': '0x0007'}} {code} The reason why this is happening is not unknown yet, I think it is a bug but we do not know where yet. For starters, when this is called upon failure (2), it will not trigger this (3) which will not update ClientWarnings and it will go with that raw message only which is quite a bummer. I would love if [~dcapwell] took a look to have a fresh set of eyes on this as we are not sure what is happening there. (1) https://github.com/apache/cassandra/pull/2961 (2) https://github.com/apache/cassandra/blob/cassandra-5.0/src/java/org/apache/cassandra/service/reads/ReadCallback.java#L181 (3) https://github.com/apache/cassandra/blob/cassandra-5.0/src/java/org/apache/cassandra/service/reads/ReadCallback.java#L130-L141 was (Author: smiklosovic): Thank you all for the guidance, I reworked the patch to reflect this new approach. I opened the patch for the review (1) Couple points though: 1) It is done only for SAI, I think that it would be cool to do this for all indexes, "old" secondary indexes too. Tell me if this is a must in this patch or not. 2) I am not completely sure how to know if underlying query is without partition keys in QueryController.maybeTriggerGuardrails method. QueryController just knows what is the first and the last PrimaryKey of such query. The trick I did is that it is a range query (hence without partition keys) if the firstPrimaryKey is equal to lastPrimaryKey. In that case it is basically whole ring, that's how I understand it. Maybe there is better way but I am not sure about that. Keep in mind that this is also happening on replicas only. This is not an issue in SelectStatement where we have more rich API to work with. We may completely get rid of this check on replicas if we think that check in SelectStatement is enough. 3) There is a test case checking warning path for 3 nodes. Failure path has to be still finished. 4) it would be cool to know exactly what the violations were, per violated replica. The way Warnings and its Counter is done for now is that as it merges responses from replicas, it will always merge it in such a way that we only know maximum. So if there are three nodes, one read from 5 referenced indexes, the second one from 10 and the third one from 15, we will see all replicas here but the maximum showed will be 15 - and we do not know what node it is. This would require further refactoring, I think counters would need to be more like a meter, per node. 5) Lastly, we debugged this quite in depth with [~maedhroz] yesterday but I noticed upon testing that warnings and failures do not produce same message to client. Warning looks like this: {code} LogResult{mark=346824, result=[WARN [node1_isolatedExecutor:1] node1 2023-12-07 11:18:46,539 All nodes [/127.0.0.1:7012, /127.0.0.2:7012, /127.0.0.3:7012] referenced more than allowed number of indexes without restrictions on partition key, maximum on a node being 3 indexes, and issued warnings for query SELECT * FROM distributed_test_keyspace.t_0 WHERE v1 = 1701944326319 ALLOW FILTERING (see maximum_sstables_with_non_paritition_key_restricted_query_warn_threshold)]} {code} However, failure looks like this: {code} stefan@cqlsh> select * from cycling.cyclist_semi_pro2 where country = 'ITA'; ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 1 failures: READ_TOO_MANY_INDEXES from localhost/127.0.0.1:7000" info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0, 'failures': 1, 'error_code_map': {'127.0.0.1': '0x0007'}} {code} The reason why this is happening is not unknown yet, I think it is a bug but we do not know where yet. For starters, when this is called upon failure (1), it will not trigger this (2) which will not update ClientWarnings and it will go with that raw message only which is quite a bummer. I would love if [~dcapwell] took a look to have a fresh set of eyes on this as we are not sure what is happening there. https://github.com/apache/cassandra/pull/2961 (1) https://github.com/apache/cassandra/blob/cassandra-5.0/src/java/org/apache/cassandra/service/reads/ReadCallback.java#L181 (2) https://github.com/apache/cassandra/blob/cassandra-5.0/src/java/org/apache/cassandra/service/reads/ReadCallback.java#L130-L141 > Optionally fail when a non-partition-restricted query is issued against a > storage-attached index with a backing table using LCS > ------------------------------------------------------------------------------------------------------------------------------- > > Key: CASSANDRA-18796 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18796 > Project: Cassandra > Issue Type: Improvement > Components: Feature/2i Index, Feature/SAI, Local/Compaction/LCS > Reporter: Caleb Rackliffe > Assignee: Stefan Miklosovic > Priority: Normal > Fix For: 5.0-rc, 5.x > > Time Spent: 2h 20m > Remaining Estimate: 0h > > With LCS, we will have potentially thousands of SSTables for a given user > table. Storage-attached also means SSTable-attached, and searching thousands > of attached indexes is not going to scale well at all locally, due to the > sheer number of searches and amount of postings list merging involved. We > should have a guardrail to prohibit this by default. > Partition-restricted queries, the use-case SAI is broadly designed for, > should be very efficient. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org