[ 
https://issues.apache.org/jira/browse/CASSANDRA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-16226:
----------------------------------------
    Description: 
This was discovered while tracking down a spike in the number of SSTables per 
read for a COMPACT STORAGE table after a 2.1 -> 3.0 upgrade. Before 3.0, there 
is no direct analog of 3.0's primary key liveness info. When we upgrade 2.1 
COMPACT STORAGE SSTables to the mf format, we simply don't write row 
timestamps, even if the original mutations were INSERTs. On read, when we look 
at SSTables in order from newest to oldest max timestamp, we expect to have 
this primary key liveness information to determine whether we can skip older 
SSTables after finding completely populated rows.

For example, suppose a COMPACT STORAGE table has three SSTables with max 
timestamps 1000, 2000, and 3000. There are many rows in a particular partition, 
making filtering on the min and max clustering effectively a no-op. All data is 
inserted, and there are no partial updates. A fully specified row with 
timestamp 2500 exists in the SSTable with a max timestamp of 3000. With a 
proper row timestamp in hand, we can easily ignore the SSTables with max 
timestamps of 1000 and 2000. Without it, we read 3 SSTables instead of 1, which 
likely means a significant performance regression.
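
The example above can be sketched as a toy model. This is only an illustration 
under simplifying assumptions, not Cassandra's read path: the class 
SSTableSkipSketch, its nested Row and SSTable types, and the methods 
canSkipOlder(), sstablesRead(), and example() are all hypothetical names. Each 
SSTable holds at most one row for the queried key, and merging of rows across 
SSTables is not modeled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified, hypothetical model; none of these types are Cassandra classes.
class SSTableSkipSketch {
    // A fully populated row. rowTimestamp is null when no primary key
    // liveness info was written (the upgraded COMPACT STORAGE case).
    static final class Row {
        final long cellTimestamp;
        final Long rowTimestamp;
        Row(long cellTimestamp, Long rowTimestamp) {
            this.cellTimestamp = cellTimestamp;
            this.rowTimestamp = rowTimestamp;
        }
    }

    // An SSTable reduced to its max timestamp and its row for the queried key.
    static final class SSTable {
        final long maxTimestamp;
        final Row row;
        SSTable(long maxTimestamp, Row row) {
            this.maxTimestamp = maxTimestamp;
            this.row = row;
        }
    }

    // Older SSTables can be skipped only if both the row timestamp and the
    // cell timestamp are newer than anything the older SSTable can contain.
    static boolean canSkipOlder(Row row, long olderMaxTimestamp) {
        if (row.rowTimestamp == null || row.rowTimestamp <= olderMaxTimestamp)
            return false;
        return row.cellTimestamp > olderMaxTimestamp;
    }

    // Visits SSTables from newest to oldest max timestamp and counts how many
    // are read before the remaining ones can be skipped.
    static int sstablesRead(List<SSTable> sstables) {
        List<SSTable> ordered = new ArrayList<>(sstables);
        ordered.sort(Comparator.comparingLong((SSTable s) -> s.maxTimestamp).reversed());
        Row found = null;
        int read = 0;
        for (SSTable sstable : ordered) {
            if (found != null && canSkipOlder(found, sstable.maxTimestamp))
                break; // every remaining SSTable is older still
            read++;
            if (found == null && sstable.row != null)
                found = sstable.row;
        }
        return read;
    }

    // Builds the three-SSTable scenario from the description.
    static int example(boolean withRowTimestamp) {
        Long rowTs = withRowTimestamp ? Long.valueOf(2500L) : null;
        List<SSTable> sstables = new ArrayList<>();
        sstables.add(new SSTable(1000L, new Row(900L, null)));
        sstables.add(new SSTable(2000L, new Row(1900L, null)));
        sstables.add(new SSTable(3000L, new Row(2500L, rowTs)));
        return sstablesRead(sstables);
    }

    public static void main(String[] args) {
        System.out.println("with row timestamp: " + example(true));
        System.out.println("without row timestamp: " + example(false));
    }
}
```

In this model, example(true) reads 1 SSTable and example(false) reads 3, which 
matches the counts given above.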

The following test illustrates this difference in behavior between 2.1 and 3.0:
https://github.com/maedhroz/cassandra/commit/84ce9242bedd735ca79d4f06007d127de6a82800

A solution here might be as simple as having 
{{SinglePartitionReadCommand#canRemoveRow()}} only inspect primary key liveness 
information for non-compact/CQL tables. Tombstones seem to be handled at a 
level above that anyway. (One potential problem with that is whether or not the 
distinction will continue to exist in 4.0, and dropping compact storage from a 
table doesn't magically make pk liveness information appear.)
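
A rough sketch of that idea follows. It is not the signature or body of 
{{SinglePartitionReadCommand#canRemoveRow()}}; the class CanRemoveRowSketch, 
the isCompactTable flag, the NO_TIMESTAMP sentinel, and the flattened 
parameters are assumptions made for illustration only.

```java
// Hypothetical stand-in for the proposed check; not Cassandra's actual method.
class CanRemoveRowSketch {
    // Assumed sentinel meaning "no primary key liveness timestamp was written".
    static final long NO_TIMESTAMP = Long.MIN_VALUE;

    // Returns true when a row already found can be dropped from the query
    // filter, meaning SSTables with max timestamp <= sstableMaxTimestamp
    // cannot contribute newer data for it. Assumes cellTimestamps holds one
    // entry per requested column, i.e. the row is fully populated.
    static boolean canRemoveRow(boolean isCompactTable,
                                long rowTimestamp,
                                long[] cellTimestamps,
                                long sstableMaxTimestamp) {
        // Proposed change: only inspect primary key liveness info for
        // non-compact (CQL) tables, since compact tables never wrote it.
        if (!isCompactTable
            && (rowTimestamp == NO_TIMESTAMP || rowTimestamp <= sstableMaxTimestamp))
            return false;

        // Every requested cell must be newer than the older SSTable's max.
        for (long ts : cellTimestamps)
            if (ts <= sstableMaxTimestamp)
                return false;
        return true;
    }

    public static void main(String[] args) {
        long[] cells = { 2500L };
        System.out.println(canRemoveRow(true, NO_TIMESTAMP, cells, 2000L));
        System.out.println(canRemoveRow(false, NO_TIMESTAMP, cells, 2000L));
    }
}
```

The sketch keys the decision off a table-level flag, so it inherits the caveat 
above: a table that has had compact storage dropped would take the non-compact 
branch while its old rows still carry no pk liveness information.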

  was:
This was discovered while tracking down a spike in the number of  SSTables per 
read for a COMPACT STORAGE table after a 2.1 -> 3.0 upgrade. Before 3.0, there 
is no direct analog of 3.0's primary key liveness info. When we upgrade 2.1 
COMPACT STORAGE SSTables to the mf format, we simply don't write row 
timestamps, even if the original mutations were INSERTs. On read, when we look 
at SSTables in order from newest to oldest max timestamp, we expect to have 
this primary key liveness information to determine whether we can skip older 
SSTables after finding completely populated rows.

ex. I have three SSTables in a COMPACT STORAGE table with max timestamps 1000, 
2000, and 3000. There are many rows in a particular partition, making filtering 
on the min and max clustering effectively a no-op. All data is inserted, and 
there are no partial updates. A fully specified row with timestamp 2500 exists 
in the SSTable with a max timestamp of 3000. With a proper row timestamp in 
hand, we can easily ignore the SSTables w/ max timestamps of 1000 and 2000. 
Without it, we read 3 SSTables instead of 1, which likely means a significant 
performance regression. 

The following test illustrates this difference in behavior between 2.1 and 3.0:
https://github.com/maedhroz/cassandra/commit/84ce9242bedd735ca79d4f06007d127de6a82800

A solution here might be as simple as having 
{{SinglePartitionReadCommand#canRemoveRow()}} only inspect primary key liveness 
information for non-compact/CQL tables. Tombstones seem to be handled at a 
level above that anyway. (One potential problem with that is whether or not the 
distinction will continue to exist in 4.0.)


> COMPACT STORAGE SSTables created before 3.0 are not correctly skipped by 
> timestamp due to missing primary key liveness info
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16226
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16226
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Caleb Rackliffe
>            Priority: Normal
>             Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>
> This was discovered while tracking down a spike in the number of SSTables 
> per read for a COMPACT STORAGE table after a 2.1 -> 3.0 upgrade. Before 3.0, 
> there is no direct analog of 3.0's primary key liveness info. When we upgrade 
> 2.1 COMPACT STORAGE SSTables to the mf format, we simply don't write row 
> timestamps, even if the original mutations were INSERTs. On read, when we 
> look at SSTables in order from newest to oldest max timestamp, we expect to 
> have this primary key liveness information to determine whether we can skip 
> older SSTables after finding completely populated rows.
> For example, suppose a COMPACT STORAGE table has three SSTables with max 
> timestamps 1000, 2000, and 3000. There are many rows in a particular 
> partition, making filtering on the min and max clustering effectively a 
> no-op. All data is inserted, and there are no partial updates. A fully 
> specified row with timestamp 2500 exists in the SSTable with a max timestamp 
> of 3000. With a proper row timestamp in hand, we can easily ignore the 
> SSTables with max timestamps of 1000 and 2000. Without it, we read 3 
> SSTables instead of 1, which likely means a significant performance 
> regression.
> The following test illustrates this difference in behavior between 2.1 and 
> 3.0:
> https://github.com/maedhroz/cassandra/commit/84ce9242bedd735ca79d4f06007d127de6a82800
> A solution here might be as simple as having 
> {{SinglePartitionReadCommand#canRemoveRow()}} only inspect primary key 
> liveness information for non-compact/CQL tables. Tombstones seem to be 
> handled at a level above that anyway. (One potential problem with that is 
> whether or not the distinction will continue to exist in 4.0, and dropping 
> compact storage from a table doesn't magically make pk liveness information 
> appear.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
