Re: Question about PartitionUpdate.singleRowUpdate()
Thanks for the feedback. I've opened:
https://issues.apache.org/jira/browse/CASSANDRA-14941

SK

On 2018-12-19 17:03, Jeff Jirsa wrote:
> Definitely worth a JIRA. Suspect it may be slow to get a response this
> close to the holidays, but a JIRA will be a bit more durable than the
> mailing list post.
>
> On Wed, Dec 19, 2018 at 1:58 PM Sam Klock wrote:
>
>> Cassandra devs,
>>
>> I have a question about the implementation of
>> PartitionUpdate.singleRowUpdate(), in particular the choice to use
>> EncodingStats.NO_STATS when building the resulting PartitionUpdate.
>>
>> [...]
Question about PartitionUpdate.singleRowUpdate()
Cassandra devs,

I have a question about the implementation of PartitionUpdate.singleRowUpdate(), in particular the choice to use EncodingStats.NO_STATS when building the resulting PartitionUpdate. Is there a functional reason for that -- i.e., is it safe to modify it to use an EncodingStats built from deletionInfo, row, and staticRow?

Context: under 3.0.17, we have a table using TWCS and a secondary index. We've been having a problem with the sstables for the index lingering essentially forever, despite the correlated sstables for the parent table being removed pretty much when we expect them to. We traced the problem to the use of EncodingStats.NO_STATS in singleRowUpdate(), which is being used to create the index updates when we write to the parent table. It appears that NO_STATS is making Cassandra think the memtables for the index have data from September 2015 in them, which in turn prevents it from dropping expired sstables (all of which are much more recent than that) for the index.

Experimentally, modifying singleRowUpdate() to build an EncodingStats from its inputs (plus the MutableDeletionInfo it creates) seems to fix the problem. We don't have any insight into why the existing logic uses NO_STATS, however, so we don't know if this change is really safe. Does it sound like we're on the right track? (Also: I'm sure we'd be happy to open an issue and submit a patch if this sounds like it would be useful generally.)

Thanks,
SK
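For reference, the "September 2015" symptom follows from EncodingStats itself: its epoch constants are anchored at 2015-09-22 (chosen so that real timestamps delta-encode compactly), and NO_STATS simply reports those epochs, so a memtable carrying NO_STATS looks like it holds data from that date. A minimal sketch of the fix being discussed, assuming the 3.0-era EncodingStats.Collector.collect(Row, Iterator<Row>, DeletionInfo) helper (exact signatures may differ):

    // Sketch: inside PartitionUpdate.singleRowUpdate(), after the
    // MutableDeletionInfo is created (3.0-era names; Iterators is Guava's).
    // Instead of passing EncodingStats.NO_STATS to the Holder, collect
    // stats from the update's own components:
    EncodingStats stats = EncodingStats.Collector.collect(
            staticRow == null ? Rows.EMPTY_STATIC_ROW : staticRow,
            Iterators.singletonIterator(row),   // the single regular row
            deletionInfo);                      // the deletion info created above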
Re: Optimizing queries for partition keys
Can someone please take a look at CASSANDRA-14415 when you have a chance? Getting a fix into a Cassandra release is not especially urgent for us, but in lieu of that we would like to know whether it's safe to include in our local build of Cassandra before attempting to deploy it.

Thanks,
SK

On 2018-04-24 14:16, Sam Klock wrote:
> Thanks. For those interested: opened CASSANDRA-14415.
>
> SK
>
> [...]
Re: Optimizing queries for partition keys
Thanks. For those interested: opened CASSANDRA-14415.

SK

On 2018-04-19 06:04, Benjamin Lerer wrote:
> Hi Sam,
>
> Your finding is interesting. Effectively, if the number of bytes to skip
> is larger than the remaining bytes in the buffer plus the buffer size, it
> could be faster to use seek.
> Feel free to open a JIRA ticket and attach your patch. It would be great
> if you could add your table schema to the ticket, as well as some
> information on your environment (e.g. disk type).
>
> On Tue, Apr 17, 2018 at 8:53 PM, Sam Klock <skl...@akamai.com> wrote:
>
>> Thanks (and apologies for the delayed response); that was the kind of
>> feedback we were looking for.
>>
>> [...]
Re: Optimizing queries for partition keys
Thanks (and apologies for the delayed response); that was the kind of feedback we were looking for.

We backported the fix for CASSANDRA-10657 to 3.0.16, and it partially addresses our problem in the sense that it does limit the data sent on the wire. The performance is still extremely poor, however, because Cassandra continues to read large volumes of data from disk. (We've also confirmed this behavior in 3.11.2.)

With a bit more investigation, we now believe the problem (after CASSANDRA-10657 is applied) is in RebufferingInputStream.skipBytes(), which appears to read bytes in order to skip them. The subclass used in our case, RandomAccessReader, exposes a seek(), so we overrode skipBytes() in it to make use of seek(), and that seems to resolve the problem.

This change is intuitively much safer than the one we'd originally identified, but we'd still like to confirm with you folks whether it's likely safe and, if so, whether it's also potentially worth contributing.

Thanks,
SK

On 2018-03-22 18:16, Benjamin Lerer wrote:
> You should check the 3.x release. CASSANDRA-10657 could have fixed your
> problem.
>
> On Thu, Mar 22, 2018 at 9:15 PM, Benjamin Lerer
> <benjamin.le...@datastax.com> wrote:
>
>> Sylvain explained the problem in CASSANDRA-4536: "Let me note that in
>> CQL3 a row that have no live column don't exist, so we can't really
>> implement this with a range slice having an empty columns list. Instead
>> we should do a range slice with a full-row slice predicate with a count
>> of 1, to make sure we do have a live column before including the
>> partition key."
>>
>> By using ColumnFilter.selectionBuilder() you do not select all the
>> columns. As a consequence, some partitions might be returned when they
>> should not be.
>>
>> On Thu, Mar 22, 2018 at 6:24 PM, Sam Klock <skl...@akamai.com> wrote:
>>
>>> Cassandra devs,
>>>
>>> [...]
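A minimal sketch of the override described above, assuming the 3.0-era RandomAccessReader API (getPosition(), length(), and seek(long); exact names may differ):

    // Hypothetical override in RandomAccessReader: reposition the reader
    // instead of reading-and-discarding n bytes the way the inherited
    // RebufferingInputStream.skipBytes() does.
    @Override
    public int skipBytes(int n) throws IOException
    {
        if (n <= 0)
            return 0;
        long current = getPosition();
        // Clamp to the file length so we never seek past EOF; per the
        // java.io.DataInput contract, return the bytes actually skipped.
        long newPosition = Math.min(current + n, length());
        seek(newPosition);
        return (int) (newPosition - current);
    }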
Optimizing queries for partition keys
Cassandra devs,

We use workflows in some of our clusters (running 3.0.15) that involve "SELECT DISTINCT key FROM..."-style queries. For some tables, we observed extremely poor performance under light load (i.e., a small number of rows per second and frequent timeouts), which we eventually traced to replicas shipping entire rows (which in some cases could store on the order of MBs of data) to service the query. That surprised us (partly because 2.1 doesn't seem to behave this way), so we did some digging, and we eventually came up with a patch that modifies SelectStatement.java in the following way: if the selection in the query only includes the partition key, then when building a ColumnFilter for the query, use:

    builder = ColumnFilter.selectionBuilder();

instead of:

    builder = ColumnFilter.allColumnsBuilder();

to initialize the ColumnFilter.Builder in gatherQueriedColumns(). That seems to repair the performance regression, and it doesn't appear to break any functionality (based on the unit tests and some smoke tests we ran involving insertions and deletions).

We'd like to contribute this patch back to the project, but we're not convinced that there aren't subtle correctness issues we're missing, judging both from comments in the code and the existence of CASSANDRA-5912, which suggests optimizing this kind of query is nontrivial. So: does this change sound safe to make, or are there corner cases we need to account for? If there are corner cases, are there plausible ways of addressing them at the SelectStatement level, or will we need to look deeper?

Thanks,
SK
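For concreteness, the shape of the change described above, assuming the 3.0-era ColumnFilter API (allColumnsBuilder() takes the table metadata; the guard condition is a hypothetical helper). Note Benjamin Lerer's caveat upthread: with only the partition key fetched, a replica can no longer verify that a partition has a live row, so fully deleted partitions could be returned when they should not be.

    // Sketch: inside SelectStatement.gatherQueriedColumns(). The helper
    // selectsOnlyPartitionKey is hypothetical; cfm is the table metadata.
    ColumnFilter.Builder builder = selectsOnlyPartitionKey
            ? ColumnFilter.selectionBuilder()       // proposed: fetch only the queried columns
            : ColumnFilter.allColumnsBuilder(cfm);  // existing behavior: fetch all columns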
Re: Reconciling expiring cells and tombstones
Thanks Tyler. Sounds like the risk of making this change is tolerable for us.

Josef: thanks for the links. We're thinking carefully about our NTP configuration, but we also would like to account for unusual failure modes (e.g., nodes with fast clocks experiencing partitions from NTP servers/peers). Making sure Cassandra's internals are arranged to minimize the damage when these things happen is part of our strategy.

Thanks,
SK

On 06/17/2015 11:17 AM, Tyler Hobbs wrote:
>> Why does Cassandra consistently prefer tombstones to other kinds of
>> cells?
>
> It's primarily to have deterministic conflict resolution. I don't recall
> any specific conversations about preferring tombstones over expiring
> cells, and the original ticket
> (https://issues.apache.org/jira/browse/CASSANDRA-699) doesn't mention
> anything.
>
>> By modifying this behavior in this particular case, do we risk hitting
>> bizarre corner cases?
>
> I think this would be safe. The only problem I can think of would happen
> while your cluster has some patched nodes and some unpatched nodes. If
> you had any nodes that had both an expiring cell and a tombstone with the
> same timestamp, then patched replicas would return different
> results/digests than the unpatched nodes. However, unless you're mixing
> TTLs with deletes, that's not too likely to happen. Maybe repair combined
> with clock skew could result in that, but not much else.
>
> On Wed, Jun 17, 2015 at 10:05 AM, Josef Lindman Hörnlund
> <jo...@appdata.biz> wrote:
>
>> Hello Sam,
>>
>> This is not answering your direct question, but if you worry about clock
>> skew, take a look at this great two-part blog post:
>>
>> https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/
>> https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-2-solutions/
>>
>> Josef Lindman Hörnlund
>> Chief Data Scientist, AppData
>> jo...@appdata.biz
>>
>>> On 16 Jun 2015, at 20:45, Sam Klock <skl...@akamai.com> wrote:
>>>
>>> Hi folks,
>>>
>>> I have a question about a design choice on how expiring cells are
>>> reconciled with tombstones. [...]
Reconciling expiring cells and tombstones
Hi folks,

I have a question about a design choice on how expiring cells are reconciled with tombstones. For two cells with the same timestamp, if one is expiring and one is a tombstone, Cassandra *always* prefers the tombstone. This matches its behavior for normal/non-expiring cells, but the folks in my organization worry about what it may imply for nodes experiencing clock skew. Specifically, we're concerned about scenarios like the following:

1) An expiring cell is committed via some node with a non-skewed clock.
2) Another replica for that cell experiences forward clock skew and decides that the cell is expired. It eventually runs a compaction that converts the cell to a tombstone.
3) The tombstone propagates to other nodes via, e.g., node repair.
4) The other nodes all eventually run their own compactions. Because of the reconciliation logic, the expiring cell is purged on all of the replicas, leaving behind only the tombstone.

If the cell should have still been live at (4), the reconciliation logic will result in it being prematurely purged. We have confirmed this behavior experimentally.

My organization may be more concerned about clock skew than the larger community, so I don't think we're inclined to propose a patch at this time. But to account for this kind of scenario, we would like to patch our internal version of Cassandra to conditionally prefer expiring cells to tombstones if the node believes they should still be live; i.e., in reconcile() in *ExpiringCell.java, instead of:

    if (cell instanceof DeletedCell)
        return cell;

use:

    if (cell instanceof DeletedCell)
        return isLive() ? this : cell;

Before we do so, however, we'd like to understand the rationale for the existing behavior and the risks of making changes to it. Why does Cassandra consistently prefer tombstones to other kinds of cells? By modifying this behavior in this particular case, do we risk hitting bizarre corner cases?

Thanks,
SK
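To place the two-line diff above in context, a sketch of the tie-break against the 2.1-era cell API (surrounding logic paraphrased from memory; isLive() compares the cell's local expiration time against this node's clock):

    // Inside reconcile(Cell cell) in the *ExpiringCell implementations, at
    // the point where both cells carry the same timestamp (a differing
    // timestamp has already been resolved in favor of the higher one).

    // Stock behavior: the tombstone always wins the tie.
    //     if (cell instanceof DeletedCell)
    //         return cell;

    // Proposed behavior: keep the expiring cell while this node still
    // considers it live; otherwise defer to the tombstone as before.
    if (cell instanceof DeletedCell)
        return isLive() ? this : cell;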