JVM settings:

```
-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
-XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003
-XX:+AlwaysPreTouch -XX:-UseBiasedLocking -XX:+UseTLAB -XX:+ResizeTLAB
-XX:+UseNUMA -XX:+PerfDisableSharedMem -Djava.net.preferIPv4Stack=true
-XX:+UseG1GC -XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=500
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure -XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M
```

Total memory (free):

```
             total       used       free     shared    buffers     cached
Mem:      16434004   16125340     308664         60     172872    5565184
-/+ buffers/cache:   10387284    6046720
Swap:            0          0          0
```

Heap settings in cassandra-env.sh:

```
MAX_HEAP_SIZE="8192M"
HEAP_NEWSIZE="800M"
```

On Sat, Feb 23, 2019, 10:15 PM Rahul Reddy <rahulreddy1...@gmail.com> wrote:

> Thanks Jeff,
>
> Since we have low writes and high reads, most of the time the data is in
> memtables only. When I initially noticed the issue there were no sstables
> on disk; everything was in the memtable only.
>
> On Sat, Feb 23, 2019, 10:01 PM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> Also, given your short ttl and low write rate, you may want to think about
>> how you can keep more in memory - this may mean larger memtables and
>> higher flush thresholds (reading from the memtable), or perhaps the
>> partition cache (if you are likely to read the same key multiple times).
>> You'll also probably win some with basic perf and GC tuning, but can't
>> really do that via email. CASSANDRA-8150 has some pointers.
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Feb 23, 2019, at 6:52 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>
>> You'll only ever have one tombstone per read, so your load is based on the
>> normal read rate, not tombstones. The metric isn't wrong, but it's not
>> indicative of a problem here given your data model.
>>
>> You're using STCS, so you may be reading from more than one sstable if you
>> update column2 for a given column1; otherwise you're probably just seeing
>> normal read load. Consider dropping your compression chunk size a bit
>> (given the sizes in your cfstats I'd probably go to 4K instead of 64K),
>> and maybe consider LCS or TWCS instead of STCS (which is appropriate
>> depends on a lot of factors, but STCS is probably causing a fair bit of
>> unnecessary compaction and is probably very slow to expire data).
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Feb 23, 2019, at 6:31 PM, Rahul Reddy <rahulreddy1...@gmail.com>
>> wrote:
>>
>> Do you see anything wrong with this metric?
>>
>> Metric to scan tombstones:
>>
>> increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
>>
>> And at the same time CPU spikes to 50% whenever I see the high tombstone
>> alert.
>>
>> On Sat, Feb 23, 2019, 9:25 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> Your schema is such that you'll never read more than one tombstone per
>>> select (unless you're also doing range reads / table scans that you
>>> didn't mention) - I'm not quite sure what you're alerting on, but you're
>>> not going to have tombstone problems with that table / that select.
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> On Feb 23, 2019, at 5:55 PM, Rahul Reddy <rahulreddy1...@gmail.com>
>>> wrote:
>>>
>>> Changing gcgs didn't help.
>>>
>>> CREATE KEYSPACE ksname WITH replication = {'class':
>>> 'NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'} AND durable_writes =
>>> true;
>>>
>>> ```
>>> CREATE TABLE keyspace."table" (
>>>     "column1" text PRIMARY KEY,
>>>     "column2" text
>>> ) WITH bloom_filter_fp_chance = 0.01
>>>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>>>     AND comment = ''
>>>     AND compaction = {'class':
>>> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
>>> 'max_threshold': '32', 'min_threshold': '4'}
>>>     AND compression = {'chunk_length_in_kb': '64', 'class':
>>> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>>     AND crc_check_chance = 1.0
>>>     AND dclocal_read_repair_chance = 0.1
>>>     AND default_time_to_live = 18000
>>>     AND gc_grace_seconds = 60
>>>     AND max_index_interval = 2048
>>>     AND memtable_flush_period_in_ms = 0
>>>     AND min_index_interval = 128
>>>     AND read_repair_chance = 0.0
>>>     AND speculative_retry = '99PERCENTILE';
>>>
>>> Flushed the table and took an sstabledump:
>>>
>>> grep -i '"expired" : true' SSTables.txt | wc -l
>>> 16439
>>> grep -i '"expired" : false' SSTables.txt | wc -l
>>> 2657
>>>
>>> ttl is 4 hours.
>>>
>>> INSERT INTO keyspace."TABLE_NAME" ("column1", "column2") VALUES (?, ?)
>>> USING TTL ?;   -- TTL bound as 14400 (4 hours)
>>> SELECT * FROM keyspace."TABLE_NAME" WHERE "column1" = ?;
>>>
>>> Metric to scan tombstones:
>>>
>>> increase(cassandra_Table_TombstoneScannedHistogram{keyspace="mykeyspace",Table="tablename",function="Count"}[5m])
>>>
>>> During peak hours we only have a couple of hundred inserts and 5-8k
>>> reads/s per node.
>>> ```
>>>
>>> ```
>>> tablestats
>>> Read Count: 605231874
>>> Read Latency: 0.021268529760215503 ms.
>>> Write Count: 2763352
>>> Write Latency: 0.027924007871599422 ms.
>>> Pending Flushes: 0
>>> Table: name
>>> SSTable count: 1
>>> Space used (live): 1413203
>>> Space used (total): 1413203
>>> Space used by snapshots (total): 0
>>> Off heap memory used (total): 28813
>>> SSTable Compression Ratio: 0.5015090954531143
>>> Number of partitions (estimate): 19568
>>> Memtable cell count: 573
>>> Memtable data size: 22971
>>> Memtable off heap memory used: 0
>>> Memtable switch count: 6
>>> Local read count: 529868919
>>> Local read latency: 0.020 ms
>>> Local write count: 2707371
>>> Local write latency: 0.024 ms
>>> Pending flushes: 0
>>> Percent repaired: 0.0
>>> Bloom filter false positives: 1
>>> Bloom filter false ratio: 0.00000
>>> Bloom filter space used: 23888
>>> Bloom filter off heap memory used: 23880
>>> Index summary off heap memory used: 4717
>>> Compression metadata off heap memory used: 216
>>> Compacted partition minimum bytes: 73
>>> Compacted partition maximum bytes: 124
>>> Compacted partition mean bytes: 99
>>> Average live cells per slice (last five minutes): 1.0
>>> Maximum live cells per slice (last five minutes): 1
>>> Average tombstones per slice (last five minutes): 1.0
>>> Maximum tombstones per slice (last five minutes): 1
>>> Dropped Mutations: 0
>>>
>>> histograms
>>> Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
>>>                            (micros)      (micros)         (bytes)
>>> 50%             0.00          20.50         17.08              86           1
>>> 75%             0.00          24.60         20.50             124           1
>>> 95%             0.00          35.43         29.52             124           1
>>> 98%             0.00          35.43         42.51             124           1
>>> 99%             0.00          42.51         51.01             124           1
>>> Min             0.00           8.24          5.72              73           0
>>> Max             1.00          42.51        152.32             124           1
>>> ```
>>>
>>> 3 nodes in dc1 and 3 nodes in dc2 in the cluster, with AWS EC2 instance
>>> type m4.xlarge.
>>>
>>> On Sat, Feb 23, 2019, 7:47 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> Would also be good to see your schema (anonymized if needed) and the
>>>> select queries you're running.
>>>>
>>>> --
>>>> Jeff Jirsa
>>>>
>>>>
>>>> On Feb 23, 2019, at 4:37 PM, Rahul Reddy <rahulreddy1...@gmail.com>
>>>> wrote:
>>>>
>>>> Thanks Jeff,
>>>>
>>>> I have gcgs set to 10 mins and also changed the table ttl to 5 hours,
>>>> compared to the insert ttl of 4 hours. Tracing doesn't show any
>>>> tombstone scans for the reads, and the log doesn't show tombstone scan
>>>> alerts either. Yet while reads are happening at 5-8k reads/s per node
>>>> during the peak hours, the metric shows a tombstone scan count reaching
>>>> 1M.
>>>>
>>>> On Fri, Feb 22, 2019, 11:46 AM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>
>>>>> If all of your data is TTL'd and you never explicitly delete a cell
>>>>> without using a TTL, you can probably drop your GCGS to 1 hour (or
>>>>> less).
>>>>>
>>>>> Which compaction strategy are you using? You need a way to clear out
>>>>> those tombstones. There exist tombstone compaction sub-properties that
>>>>> can help encourage compaction to grab sstables just because they're
>>>>> full of tombstones, which will probably help you.
>>>>>
>>>>> --
>>>>> Jeff Jirsa
>>>>>
>>>>>
>>>>> On Feb 22, 2019, at 8:37 AM, Kenneth Brotman
>>>>> <kenbrot...@yahoo.com.invalid> wrote:
>>>>>
>>>>> Can we see the histogram? Why wouldn't you at times have that many
>>>>> tombstones? Makes sense.
>>>>>
>>>>>
>>>>> Kenneth Brotman
>>>>>
>>>>>
>>>>> *From:* Rahul Reddy [mailto:rahulreddy1...@gmail.com]
>>>>> *Sent:* Thursday, February 21, 2019 7:06 AM
>>>>> *To:* user@cassandra.apache.org
>>>>> *Subject:* Tombstones in memtable
>>>>>
>>>>>
>>>>> We have a small table; records are about 5k.
>>>>>
>>>>> All the inserts come with a 4 hr ttl, the table-level ttl is 1 day, and
>>>>> gc grace seconds is 3 hours. We do 5k reads a second during peak load,
>>>>> and during the peak load we are seeing alerts for the tombstone scanned
>>>>> histogram reaching a million.
>>>>>
>>>>> Cassandra version is 3.11.1. Please let me know how these tombstone
>>>>> scans can be avoided in the memtable.
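
For anyone landing on this thread later: Jeff's suggestions (smaller compression chunks, a more tombstone-friendly compaction setup) would look roughly like the CQL below, written against the anonymized schema from earlier in the thread. This is only a sketch: the 4K chunk size comes from Jeff's message, but the TWCS window size and the tombstone threshold values are illustrative assumptions to validate against your own workload before applying.

```
-- Sketch only, against the anonymized table above.
-- Smaller compression chunks (Jeff suggested 4K instead of 64K):
ALTER TABLE keyspace."table"
    WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': '4'};

-- TWCS tends to expire TTL'd data much faster than STCS; the 1-hour
-- window is an assumption for a 4-hour TTL, not from the thread:
ALTER TABLE keyspace."table"
    WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                       'compaction_window_unit': 'HOURS',
                       'compaction_window_size': '1'};

-- Or, staying on STCS, the tombstone compaction sub-properties Jeff
-- mentioned (values here are illustrative defaults to tune):
ALTER TABLE keyspace."table"
    WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                       'tombstone_threshold': '0.2',
                       'unchecked_tombstone_compaction': 'true'};
```

Since each ALTER replaces the whole compaction map, pick one of the two compaction variants rather than running both.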