[ https://issues.apache.org/jira/browse/CASSANDRA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233742#comment-17233742 ]
Caleb Rackliffe edited comment on CASSANDRA-16226 at 11/17/20, 10:37 PM: ------------------------------------------------------------------------- I've posted (the 3.0 version of) an alternative solution here, which should merge up cleanly, given {{TableMetadata#isCQLTable()}} is still present on trunk after CASSANDRA-16217: https://github.com/apache/cassandra/pull/823 (CircleCI is having difficulty with the unit tests, but there is an ex-unit test run [here|https://app.circleci.com/pipelines/github/maedhroz/cassandra/155/workflows/cc54c092-cac1-4144-b9be-577ea4c8a78e] and a default config unit test run [here|https://app.circleci.com/pipelines/github/maedhroz/cassandra/158/workflows/12cd707a-2fdd-4b9f-b512-b5f060a63dbf].) My goals w/ this patch were the following: 1.) Maintain the empty row semantics of tables that have either always been compact or always been non-compact (see [previous|https://issues.apache.org/jira/browse/CASSANDRA-16226?focusedCommentId=17233168&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17233168] comment), and preserve as much of them as possible after dropping compact storage, given the historical limitations of the compact format. 2.) Either improve/reduce or leave unchanged the number of SSTables read in all cases for tables that have never been compact. 3.) Fix the regression around the number of SSTables read in the original description of this issue for compact tables, and avoid having the regression re-introduced when and if compact storage is dropped. 4.) Avoid creating a new SSTable format or burdening operators with running {{upgradesstables}} again to make all of this above work. Although there are some issues w/ the unit tests on CI, the relevant local tests are looking clean, including {{operations.DeleteTest}} and {{operations.UpgradeTest}}, which verify item #1 above, and the other points are by and large satisfied. For point #2, there was actually a case where pure non-compact tables appeared to be reading too many SSTables w/ updates present (although that [needs to be reviewed|https://github.com/apache/cassandra/pull/823/files?file-filters%5B%5D=.java#r524809272]). One drawback of this approach is that when a compact table with existing SSTables has compact storage dropped, those SSTables still do not contain primary key liveness info, and therefore may continue to return zero rows rather than rows w/ primary keys and null regular columns. (Data written after the drop will, of course, have liveness info, and so mixed behavior is possible.) I've discussed a few ideas w/ [~jwest] and [~ifesdjeen] around this, and I'm not convinced any of them would be valuable. (Keep in mind that applications written against compact storage tables will already be handling this case, and good documentation will help even further.) The first is having enough information, either via a sequence number or a new piece of metadata, to identify whether an SSTable was created while a table was still compact. Having this data might be useful in some other way, but it still does not insert primary key liveness info into pre-drop SSTables. DELETE and UPDATE operations never have primary key liveness information attached to them, even for pure non-compact tables, so an insert pre-drop in one SSTable, followed by a delete in a post-drop SSTable, will still translate to the absence of a row rather than a row with null regular cells. The second broad idea is to somehow insert primary key liveness information into pre-drop SSTables, but this is problematic for at least a couple reasons. If we make it a prerequisite for running DROP COMPACT STORAGE, it means physically rewriting SSTables. (To be fair, this may not be a huge additional burden for those still on 2.x, who would have to do this anyway.) The other oddity is that there doesn't appear to be any existing procedure for synthesizing primary key liveness info. In the non-compact world, it is carried only by INSERT operations, not UPDATE or DELETE operations. It's not clear to me how we would determine where to add it during a {{runsstableupgrade}} run, particularly how we would know the difference between INSERT and UPDATE starting with just a compact SSTable on disk. To summarize, I think tables that are purely compact or non-compact should be handled pretty well w/ this patch (and will strictly git all the goals above, where applicable). The interesting case is the mixed compact/non-compact SSTable set caused by dropping compact storage. In this case, I think it makes reasonable trade-offs and avoids all the major pain points for operators. Whatever direction we go, the official docs around compact storage are going to need updating. was (Author: maedhroz): I've posted (the 3.0 version of) an alternative solution here, which should merge up cleanly, given {{TableMetadata#isCQLTable()}} is still present on trunk after CASSANDRA-16217: https://github.com/apache/cassandra/pull/823 (CircleCI is having difficulty with the unit tests, but there is an ex-unit test run [here|https://app.circleci.com/pipelines/github/maedhroz/cassandra/155/workflows/cc54c092-cac1-4144-b9be-577ea4c8a78e] and a default config unit test [here|https://app.circleci.com/pipelines/github/maedhroz/cassandra/158/workflows/12cd707a-2fdd-4b9f-b512-b5f060a63dbf].) My goals w/ this patch were the following: 1.) Maintain the empty row semantics of tables that have either always been compact or always been non-compact (see [previous|https://issues.apache.org/jira/browse/CASSANDRA-16226?focusedCommentId=17233168&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17233168] comment), and preserve as much of them as possible after dropping compact storage, given the historical limitations of the compact format. 2.) Either improve/reduce or leave unchanged the number of SSTables read in all cases for tables that have never been compact. 3.) Fix the regression around the number of SSTables read in the original description of this issue for compact tables, and avoid having the regression re-introduced when and if compact storage is dropped. 4.) Avoid creating a new SSTable format or burdening operators with running {{upgradesstables}} again to make all of this above work. Although there are some issues w/ the unit tests on CI, the relevant local tests are looking clean, including {{operations.DeleteTest}} and {{operations.UpgradeTest}}, which verify item #1 above, and the other points are by and large satisfied. For point #2, there was actually a case where pure non-compact tables appeared to be reading too many SSTables w/ updates present (although that [needs to be reviewed|https://github.com/apache/cassandra/pull/823/files?file-filters%5B%5D=.java#r524809272]). One drawback of this approach is that when a compact table with existing SSTables has compact storage dropped, those SSTables still do not contain primary key liveness info, and therefore may continue to return zero rows rather than rows w/ primary keys and null regular columns. (Data written after the drop will, of course, have liveness info, and so mixed behavior is possible.) I've discussed a few ideas w/ [~jwest] and [~ifesdjeen] around this, and I'm not convinced any of them would be valuable. (Keep in mind that applications written against compact storage tables will already be handling this case, and good documentation will help even further.) The first is having enough information, either via a sequence number or a new piece of metadata, to identify whether an SSTable was created while a table was still compact. Having this data might be useful in some other way, but it still does not insert primary key liveness info into pre-drop SSTables. DELETE and UPDATE operations never have primary key liveness information attached to them, even for pure non-compact tables, so an insert pre-drop in one SSTable, followed by a delete in a post-drop SSTable, will still translate to the absence of a row rather than a row with null regular cells. The second broad idea is to somehow insert primary key liveness information into pre-drop SSTables, but this is problematic for at least a couple reasons. If we make it a prerequisite for running DROP COMPACT STORAGE, it means physically rewriting SSTables. (To be fair, this may not be a huge additional burden for those still on 2.x, who would have to do this anyway.) The other oddity is that there doesn't appear to be any existing procedure for synthesizing primary key liveness info. In the non-compact world, it is carried only by INSERT operations, not UPDATE or DELETE operations. It's not clear to me how we would determine where to add it during a {{runsstableupgrade}} run, particularly how we would know the difference between INSERT and UPDATE starting with just a compact SSTable on disk. To summarize, I think tables that are purely compact or non-compact should be handled pretty well w/ this patch (and will strictly git all the goals above, where applicable). The interesting case is the mixed compact/non-compact SSTable set caused by dropping compact storage. In this case, I think it makes reasonable trade-offs and avoids all the major pain points for operators. Whatever direction we go, the official docs around compact storage are going to need updating. > COMPACT STORAGE SSTables created before 3.0 are not correctly skipped by > timestamp due to missing primary key liveness info > --------------------------------------------------------------------------------------------------------------------------- > > Key: CASSANDRA-16226 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16226 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths > Reporter: Caleb Rackliffe > Assignee: Caleb Rackliffe > Priority: Normal > Labels: perfomance, upgrade > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Time Spent: 2h > Remaining Estimate: 0h > > This was discovered while tracking down a spike in the number of SSTables > per read for a COMPACT STORAGE table after a 2.1 -> 3.0 upgrade. Before 3.0, > there is no direct analog of 3.0's primary key liveness info. When we upgrade > 2.1 COMPACT STORAGE SSTables to the mf format, we simply don't write row > timestamps, even if the original mutations were INSERTs. On read, when we > look at SSTables in order from newest to oldest max timestamp, we expect to > have this primary key liveness information to determine whether we can skip > older SSTables after finding completely populated rows. > ex. I have three SSTables in a COMPACT STORAGE table with max timestamps > 1000, 2000, and 3000. There are many rows in a particular partition, making > filtering on the min and max clustering effectively a no-op. All data is > inserted, and there are no partial updates. A fully specified row with > timestamp 2500 exists in the SSTable with a max timestamp of 3000. With a > proper row timestamp in hand, we can easily ignore the SSTables w/ max > timestamps of 1000 and 2000. Without it, we read 3 SSTables instead of 1, > which likely means a significant performance regression. > The following test illustrates this difference in behavior between 2.1 and > 3.0: > https://github.com/maedhroz/cassandra/commit/84ce9242bedd735ca79d4f06007d127de6a82800 > A solution here might be as simple as having > {{SinglePartitionReadCommand#canRemoveRow()}} only inspect primary key > liveness information for non-compact/CQL tables. Tombstones seem to be > handled at a level above that anyway. (One potential problem with that is > whether or not the distinction will continue to exist in 4.0, and dropping > compact storage from a table doesn't magically make pk liveness information > appear.) -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org