[ 
https://issues.apache.org/jira/browse/CASSANDRA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233742#comment-17233742
 ] 

Caleb Rackliffe edited comment on CASSANDRA-16226 at 11/17/20, 10:37 PM:
-------------------------------------------------------------------------

I've posted (the 3.0 version of) an alternative solution here, which should 
merge up cleanly, given {{TableMetadata#isCQLTable()}} is still present on 
trunk after CASSANDRA-16217: https://github.com/apache/cassandra/pull/823

(CircleCI is having difficulty with the unit tests, but there is an ex-unit 
test run 
[here|https://app.circleci.com/pipelines/github/maedhroz/cassandra/155/workflows/cc54c092-cac1-4144-b9be-577ea4c8a78e]
 and a default config unit test run 
[here|https://app.circleci.com/pipelines/github/maedhroz/cassandra/158/workflows/12cd707a-2fdd-4b9f-b512-b5f060a63dbf].)

My goals w/ this patch were the following:

1.) Maintain the empty row semantics of tables that have either always been 
compact or always been non-compact (see 
[previous|https://issues.apache.org/jira/browse/CASSANDRA-16226?focusedCommentId=17233168&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17233168]
 comment), and preserve as much of them as possible after dropping compact 
storage, given the historical limitations of the compact format.

2.) Either improve/reduce or leave unchanged the number of SSTables read in all 
cases for tables that have never been compact.

3.) Fix the regression around the number of SSTables read in the original 
description of this issue for compact tables, and avoid having the regression 
re-introduced when and if compact storage is dropped.

4.) Avoid creating a new SSTable format or burdening operators with running 
{{upgradesstables}} again to make all of this above work.

Although there are some issues w/ the unit tests on CI, the relevant local 
tests are looking clean, including {{operations.DeleteTest}} and 
{{operations.UpgradeTest}}, which verify item #1 above, and the other points 
are by and large satisfied. For point #2, there was actually a case where pure 
non-compact tables appeared to be reading too many SSTables w/ updates present 
(although that [needs to be 
reviewed|https://github.com/apache/cassandra/pull/823/files?file-filters%5B%5D=.java#r524809272]).

One drawback of this approach is that when a compact table with existing 
SSTables has compact storage dropped, those SSTables still do not contain 
primary key liveness info, and therefore may continue to return zero rows 
rather than rows w/ primary keys and null regular columns. (Data written after 
the drop will, of course, have liveness info, and so mixed behavior is 
possible.) I've discussed a few ideas w/ [~jwest] and [~ifesdjeen] around this, 
and I'm not convinced any of them would be valuable. (Keep in mind that 
applications written against compact storage tables will already be handling 
this case, and good documentation will help even further.)

The first is having enough information, either via a sequence number or a new 
piece of metadata, to identify whether an SSTable was created while a table was 
still compact. Having this data might be useful in some other way, but it still 
does not insert primary key liveness info into pre-drop SSTables. DELETE and 
UPDATE operations never have primary key liveness information attached to them, 
even for pure non-compact tables, so an insert pre-drop in one SSTable, 
followed by a delete in a post-drop SSTable, will still translate to the 
absence of a row rather than a row with null regular cells.

The second broad idea is to somehow insert primary key liveness information 
into pre-drop SSTables, but this is problematic for at least a couple reasons. 
If we make it a prerequisite for running DROP COMPACT STORAGE, it means 
physically rewriting SSTables. (To be fair, this may not be a huge additional 
burden for those still on 2.x, who would have to do this anyway.) The other 
oddity is that there doesn't appear to be any existing procedure for 
synthesizing primary key liveness info. In the non-compact world, it is carried 
only by INSERT operations, not UPDATE or DELETE operations. It's not clear to 
me how we would determine where to add it during a {{runsstableupgrade}} run, 
particularly how we would know the difference between INSERT and UPDATE 
starting with just a compact SSTable on disk.

To summarize, I think tables that are purely compact or non-compact should be 
handled pretty well w/ this patch (and will strictly git all the goals above, 
where applicable). The interesting case is the mixed compact/non-compact 
SSTable set caused by dropping compact storage. In this case, I think it makes 
reasonable trade-offs and avoids all the major pain points for operators. 
Whatever direction we go, the official docs around compact storage are going to 
need updating.


was (Author: maedhroz):
I've posted (the 3.0 version of) an alternative solution here, which should 
merge up cleanly, given {{TableMetadata#isCQLTable()}} is still present on 
trunk after CASSANDRA-16217: https://github.com/apache/cassandra/pull/823

(CircleCI is having difficulty with the unit tests, but there is an ex-unit 
test run 
[here|https://app.circleci.com/pipelines/github/maedhroz/cassandra/155/workflows/cc54c092-cac1-4144-b9be-577ea4c8a78e]
 and a default config unit test 
[here|https://app.circleci.com/pipelines/github/maedhroz/cassandra/158/workflows/12cd707a-2fdd-4b9f-b512-b5f060a63dbf].)

My goals w/ this patch were the following:

1.) Maintain the empty row semantics of tables that have either always been 
compact or always been non-compact (see 
[previous|https://issues.apache.org/jira/browse/CASSANDRA-16226?focusedCommentId=17233168&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17233168]
 comment), and preserve as much of them as possible after dropping compact 
storage, given the historical limitations of the compact format.

2.) Either improve/reduce or leave unchanged the number of SSTables read in all 
cases for tables that have never been compact.

3.) Fix the regression around the number of SSTables read in the original 
description of this issue for compact tables, and avoid having the regression 
re-introduced when and if compact storage is dropped.

4.) Avoid creating a new SSTable format or burdening operators with running 
{{upgradesstables}} again to make all of this above work.

Although there are some issues w/ the unit tests on CI, the relevant local 
tests are looking clean, including {{operations.DeleteTest}} and 
{{operations.UpgradeTest}}, which verify item #1 above, and the other points 
are by and large satisfied. For point #2, there was actually a case where pure 
non-compact tables appeared to be reading too many SSTables w/ updates present 
(although that [needs to be 
reviewed|https://github.com/apache/cassandra/pull/823/files?file-filters%5B%5D=.java#r524809272]).

One drawback of this approach is that when a compact table with existing 
SSTables has compact storage dropped, those SSTables still do not contain 
primary key liveness info, and therefore may continue to return zero rows 
rather than rows w/ primary keys and null regular columns. (Data written after 
the drop will, of course, have liveness info, and so mixed behavior is 
possible.) I've discussed a few ideas w/ [~jwest] and [~ifesdjeen] around this, 
and I'm not convinced any of them would be valuable. (Keep in mind that 
applications written against compact storage tables will already be handling 
this case, and good documentation will help even further.)

The first is having enough information, either via a sequence number or a new 
piece of metadata, to identify whether an SSTable was created while a table was 
still compact. Having this data might be useful in some other way, but it still 
does not insert primary key liveness info into pre-drop SSTables. DELETE and 
UPDATE operations never have primary key liveness information attached to them, 
even for pure non-compact tables, so an insert pre-drop in one SSTable, 
followed by a delete in a post-drop SSTable, will still translate to the 
absence of a row rather than a row with null regular cells.

The second broad idea is to somehow insert primary key liveness information 
into pre-drop SSTables, but this is problematic for at least a couple reasons. 
If we make it a prerequisite for running DROP COMPACT STORAGE, it means 
physically rewriting SSTables. (To be fair, this may not be a huge additional 
burden for those still on 2.x, who would have to do this anyway.) The other 
oddity is that there doesn't appear to be any existing procedure for 
synthesizing primary key liveness info. In the non-compact world, it is carried 
only by INSERT operations, not UPDATE or DELETE operations. It's not clear to 
me how we would determine where to add it during a {{runsstableupgrade}} run, 
particularly how we would know the difference between INSERT and UPDATE 
starting with just a compact SSTable on disk.

To summarize, I think tables that are purely compact or non-compact should be 
handled pretty well w/ this patch (and will strictly git all the goals above, 
where applicable). The interesting case is the mixed compact/non-compact 
SSTable set caused by dropping compact storage. In this case, I think it makes 
reasonable trade-offs and avoids all the major pain points for operators. 
Whatever direction we go, the official docs around compact storage are going to 
need updating.

> COMPACT STORAGE SSTables created before 3.0 are not correctly skipped by 
> timestamp due to missing primary key liveness info
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16226
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16226
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>              Labels: perfomance, upgrade
>             Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> This was discovered while tracking down a spike in the number of  SSTables 
> per read for a COMPACT STORAGE table after a 2.1 -> 3.0 upgrade. Before 3.0, 
> there is no direct analog of 3.0's primary key liveness info. When we upgrade 
> 2.1 COMPACT STORAGE SSTables to the mf format, we simply don't write row 
> timestamps, even if the original mutations were INSERTs. On read, when we 
> look at SSTables in order from newest to oldest max timestamp, we expect to 
> have this primary key liveness information to determine whether we can skip 
> older SSTables after finding completely populated rows.
> ex. I have three SSTables in a COMPACT STORAGE table with max timestamps 
> 1000, 2000, and 3000. There are many rows in a particular partition, making 
> filtering on the min and max clustering effectively a no-op. All data is 
> inserted, and there are no partial updates. A fully specified row with 
> timestamp 2500 exists in the SSTable with a max timestamp of 3000. With a 
> proper row timestamp in hand, we can easily ignore the SSTables w/ max 
> timestamps of 1000 and 2000. Without it, we read 3 SSTables instead of 1, 
> which likely means a significant performance regression. 
> The following test illustrates this difference in behavior between 2.1 and 
> 3.0:
> https://github.com/maedhroz/cassandra/commit/84ce9242bedd735ca79d4f06007d127de6a82800
> A solution here might be as simple as having 
> {{SinglePartitionReadCommand#canRemoveRow()}} only inspect primary key 
> liveness information for non-compact/CQL tables. Tombstones seem to be 
> handled at a level above that anyway. (One potential problem with that is 
> whether or not the distinction will continue to exist in 4.0, and dropping 
> compact storage from a table doesn't magically make pk liveness information 
> appear.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to