[ 
https://issues.apache.org/jira/browse/CASSANDRA-6563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992809#comment-13992809
 ] 

Paulo Ricardo Motta Gomes commented on CASSANDRA-6563:
------------------------------------------------------

I've also hit this bug and instrumented the code to identify what was 
happening. Below is a detailed explanation of the issues that led to this:

* CASSANDRA-3442 introduced the automatic tombstone removal feature, which was 
one of the main new features of C* 1.2 
(http://www.datastax.com/dev/blog/tombstone-removal-improvement-in-1-2).

* In CASSANDRA-4022 the hints column family entered a compaction loop: the 
single-sstable compaction introduced by CASSANDRA-3442 was not clearing 
tombstones because rows with a lower timestamp were present in other sstables, 
so the same SSTable was being compacted over and over due to a large droppable 
tombstone ratio that never changed.
To prevent CASSANDRA-4022, a new heuristic was introduced to estimate the 
number of keys in the candidate SSTable that overlap with other SSTables. 
However, this heuristic uses the token range to estimate the key overlap, which 
can be accurate when an OrderedPartitioner is used, but not with a 
RandomPartitioner, since most SSTables will then cover the node's whole range. 
The result is that this heuristic not only prevents CASSANDRA-4022 from 
happening, but in my opinion it also prevents any tombstone compaction from 
happening at all, since the token ranges of all SSTables overlap most of the 
time. Jonathan Ellis even noted in the ticket discussion, "I'm worried that 
with SizeTiered compaction we'll have overlap in a lot of cases where we could 
still compact if we looked closer", but this worry was not taken further and 
the fix was integrated.

* Even with the previous fix, the compaction loop reappeared in CASSANDRA-4781. 
Sylvain Lebresne noted that even with the estimation "we could always end up in 
a case where the estimate thinks there is enough droppable tombstones, but in 
practice all the droppable tombstones are in overlapping ranges", and suggested 
skipping the worthDroppingTombstones check for SSTables that were compacted 
less than tombstone_compaction_interval seconds ago. The estimation was updated 
because it contained a minor bug, and the tombstone_compaction_interval check 
was added. The resulting decision is sketched below.
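
To make the interaction of these three tickets concrete, here is a simplified, 
self-contained sketch of the resulting per-sstable decision. This is NOT the 
actual Cassandra code: SSTableStats, its fields and the uniform-token overlap 
estimate are hypothetical stand-ins for the real SSTableReader statistics, and 
the defaults are only assumptions.

{noformat}
import java.util.Collection;

// Hypothetical, simplified per-sstable statistics; names are illustrative
// and do not match Cassandra's actual classes.
class SSTableStats
{
    long createdAtMillis;           // when the sstable was written
    long minToken, maxToken;        // token range covered by the sstable
    long estimatedKeys;             // estimated number of partition keys
    double droppableTombstoneRatio; // "Estimated droppable tombstones"

    // Crude estimate of how many of this sstable's keys fall inside the token
    // ranges of overlapping sstables, assuming keys are spread uniformly over
    // the token range. For simplicity only the single largest overlap is kept,
    // rather than the union of all overlapping ranges.
    long estimatedKeysInOverlap(Collection<SSTableStats> overlapping)
    {
        if (maxToken <= minToken)
            return estimatedKeys; // degenerate range: treat everything as overlapped
        long worst = 0;
        for (SSTableStats other : overlapping)
        {
            long lo = Math.max(minToken, other.minToken);
            long hi = Math.min(maxToken, other.maxToken);
            if (lo > hi)
                continue; // token ranges do not intersect
            double fraction = (double) (hi - lo) / (double) (maxToken - minToken);
            worst = Math.max(worst, (long) (fraction * estimatedKeys));
        }
        return Math.min(worst, estimatedKeys);
    }
}

class TombstoneCompactionCheck
{
    static final double TOMBSTONE_THRESHOLD = 0.2;                    // default tombstone_threshold
    static final long TOMBSTONE_COMPACTION_INTERVAL_MS = 86_400_000L; // default: 1 day

    // Sketch of the "is this sstable worth tombstone-compacting on its own?"
    // decision as it stands after CASSANDRA-4781.
    static boolean worthDroppingTombstones(SSTableStats sstable,
                                           Collection<SSTableStats> overlapping,
                                           long nowMillis)
    {
        // 1. CASSANDRA-4781: skip sstables written less than
        //    tombstone_compaction_interval ago.
        if (nowMillis < sstable.createdAtMillis + TOMBSTONE_COMPACTION_INTERVAL_MS)
            return false;

        // 2. CASSANDRA-3442: require a minimum estimated droppable tombstone ratio.
        if (sstable.droppableTombstoneRatio <= TOMBSTONE_THRESHOLD)
            return false;

        // 3. CASSANDRA-4022: discount keys whose token range overlaps other sstables.
        //    With RandomPartitioner most sstables cover (nearly) the whole ring, so
        //    remainingKeys is usually close to zero and the check almost never passes.
        long remainingKeys = sstable.estimatedKeys - sstable.estimatedKeysInOverlap(overlapping);
        double remainingRatio = sstable.estimatedKeys == 0
                ? 0.0
                : (double) remainingKeys / (double) sstable.estimatedKeys;
        return remainingRatio * sstable.droppableTombstoneRatio > TOMBSTONE_THRESHOLD;
    }
}
{noformat}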

Since that change the code has remained pretty much untouched since 1.2.0 and 
hasn't caused any problems, in my opinion because the tombstone compaction was 
never being triggered due to the conservative estimation, as illustrated in 
this JIRA ticket and the following mailing list threads:

* https://www.mail-archive.com/user@cassandra.apache.org/msg29979.html
* http://www.mail-archive.com/user@cassandra.apache.org/msg36144.html
* https://www.mail-archive.com/user@cassandra.apache.org/msg31760.html
* https://www.mail-archive.com/user@cassandra.apache.org/msg35793.html

In my opinion, the tombstone_compaction_interval check is sufficient to prevent 
CASSANDRA-4022 and CASSANDRA-4781, since it gives time for rows to be merged 
and the tombstone compaction to execute successfully, reducing the droppable 
tombstone ratio and avoiding the compaction loop. So, I propose removing the 
key-overlap heuristic altogether when deciding whether an sstable should be 
tombstone compacted, keeping only the droppable tombstone threshold and 
tombstone_compaction_interval for that decision, as sketched below.
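
For comparison, a minimal sketch of the proposed decision, reusing the 
hypothetical SSTableStats and constants from the earlier sketch. This 
illustrates the idea only; it is not the attached patch.

{noformat}
// Proposed decision (sketch): keep only tombstone_compaction_interval and the
// droppable tombstone threshold; drop the key-overlap estimation entirely.
class ProposedTombstoneCompactionCheck
{
    static boolean worthDroppingTombstones(SSTableStats sstable, long nowMillis)
    {
        boolean oldEnough = nowMillis >= sstable.createdAtMillis
                            + TombstoneCompactionCheck.TOMBSTONE_COMPACTION_INTERVAL_MS;
        return oldEnough
            && sstable.droppableTombstoneRatio > TombstoneCompactionCheck.TOMBSTONE_THRESHOLD;
    }
}
{noformat}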

I'm attaching the patch for 1.2, which I'm testing on one of our production 
servers, and will provide a patch for 2.0.x soon. 

We have already gained about 10% disk space from tombstone compactions that 
were not being triggered before the patch. I will post graphs soon showing the 
space savings and the decrease in droppable tombstone ratios. So far we haven't 
noticed any compaction loop.

> TTL histogram compactions not triggered at high "Estimated droppable 
> tombstones" rate
> -------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6563
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6563
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 1.2.12ish
>            Reporter: Chris Burroughs
>             Fix For: 1.2.17
>
>         Attachments: 1.2.16-CASSANDRA-6563.txt
>
>
> I have several column families in a largish cluster where virtually all 
> columns are written with a (usually the same) TTL.  My understanding of 
> CASSANDRA-3442 is that sstables that have a high ( > 20%) estimated 
> percentage of droppable tombstones should be individually compacted.  This 
> does not appear to be occurring with size tiered compaction.
> Example from one node:
> {noformat}
> $ ll /data/sstables/data/ks/Cf/*Data.db
> -rw-rw-r-- 31 cassandra cassandra 26651211757 Nov 26 22:59 
> /data/sstables/data/ks/Cf/ks-Cf-ic-295562-Data.db
> -rw-rw-r-- 31 cassandra cassandra  6272641818 Nov 27 02:51 
> /data/sstables/data/ks/Cf/ks-Cf-ic-296121-Data.db
> -rw-rw-r-- 31 cassandra cassandra  1814691996 Dec  4 21:50 
> /data/sstables/data/ks/Cf/ks-Cf-ic-320449-Data.db
> -rw-rw-r-- 30 cassandra cassandra 10909061157 Dec 11 17:31 
> /data/sstables/data/ks/Cf/ks-Cf-ic-340318-Data.db
> -rw-rw-r-- 29 cassandra cassandra   459508942 Dec 12 10:37 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342259-Data.db
> -rw-rw-r--  1 cassandra cassandra      336908 Dec 12 12:03 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342307-Data.db
> -rw-rw-r--  1 cassandra cassandra     2063935 Dec 12 12:03 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342309-Data.db
> -rw-rw-r--  1 cassandra cassandra         409 Dec 12 12:03 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342314-Data.db
> -rw-rw-r--  1 cassandra cassandra    31180007 Dec 12 12:03 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342319-Data.db
> -rw-rw-r--  1 cassandra cassandra     2398345 Dec 12 12:03 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342322-Data.db
> -rw-rw-r--  1 cassandra cassandra       21095 Dec 12 12:03 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342331-Data.db
> -rw-rw-r--  1 cassandra cassandra       81454 Dec 12 12:03 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342335-Data.db
> -rw-rw-r--  1 cassandra cassandra     1063718 Dec 12 12:03 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342339-Data.db
> -rw-rw-r--  1 cassandra cassandra      127004 Dec 12 12:03 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342344-Data.db
> -rw-rw-r--  1 cassandra cassandra      146785 Dec 12 12:03 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342346-Data.db
> -rw-rw-r--  1 cassandra cassandra      697338 Dec 12 12:03 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342351-Data.db
> -rw-rw-r--  1 cassandra cassandra     3921428 Dec 12 12:03 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342367-Data.db
> -rw-rw-r--  1 cassandra cassandra      240332 Dec 12 12:03 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342370-Data.db
> -rw-rw-r--  1 cassandra cassandra       45669 Dec 12 12:03 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342374-Data.db
> -rw-rw-r--  1 cassandra cassandra    53127549 Dec 12 12:03 
> /data/sstables/data/ks/Cf/ks-Cf-ic-342375-Data.db
> -rw-rw-r-- 16 cassandra cassandra 12466853166 Dec 25 22:40 
> /data/sstables/data/ks/Cf/ks-Cf-ic-396473-Data.db
> -rw-rw-r-- 12 cassandra cassandra  3903237198 Dec 29 19:42 
> /data/sstables/data/ks/Cf/ks-Cf-ic-408926-Data.db
> -rw-rw-r--  7 cassandra cassandra  3692260987 Jan  3 08:25 
> /data/sstables/data/ks/Cf/ks-Cf-ic-427733-Data.db
> -rw-rw-r--  4 cassandra cassandra  3971403602 Jan  6 20:50 
> /data/sstables/data/ks/Cf/ks-Cf-ic-437537-Data.db
> -rw-rw-r--  3 cassandra cassandra  1007832224 Jan  7 15:19 
> /data/sstables/data/ks/Cf/ks-Cf-ic-440331-Data.db
> -rw-rw-r--  2 cassandra cassandra   896132537 Jan  8 11:05 
> /data/sstables/data/ks/Cf/ks-Cf-ic-447740-Data.db
> -rw-rw-r--  1 cassandra cassandra   963039096 Jan  9 04:59 
> /data/sstables/data/ks/Cf/ks-Cf-ic-449425-Data.db
> -rw-rw-r--  1 cassandra cassandra   232168351 Jan  9 10:14 
> /data/sstables/data/ks/Cf/ks-Cf-ic-450287-Data.db
> -rw-rw-r--  1 cassandra cassandra    73126319 Jan  9 11:28 
> /data/sstables/data/ks/Cf/ks-Cf-ic-450307-Data.db
> -rw-rw-r--  1 cassandra cassandra    40921916 Jan  9 12:08 
> /data/sstables/data/ks/Cf/ks-Cf-ic-450336-Data.db
> -rw-rw-r--  1 cassandra cassandra    60881193 Jan  9 12:23 
> /data/sstables/data/ks/Cf/ks-Cf-ic-450341-Data.db
> -rw-rw-r--  1 cassandra cassandra        4746 Jan  9 12:23 
> /data/sstables/data/ks/Cf/ks-Cf-ic-450350-Data.db
> -rw-rw-r--  1 cassandra cassandra        5769 Jan  9 12:23 
> /data/sstables/data/ks/Cf/ks-Cf-ic-450352-Data.db
> {noformat}
> {noformat}
> 295562: Estimated droppable tombstones: 0.899035828535183
> 296121: Estimated droppable tombstones: 0.9135080937806197
> 320449: Estimated droppable tombstones: 0.8916766879896414
> {noformat}
> I've checked in on this example node several times and compactionstats has 
> not shown any other activity that would be blocking the tombstone based 
> compaction.  The TTL is in the 15-20 day range so an sstable from November 
> should have had ample opportunities by January.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
