[ 
https://issues.apache.org/jira/browse/CASSANDRA-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651500#comment-17651500
 ] 

Branimir Lambov commented on CASSANDRA-18123:
---------------------------------------------

You are right: this is not currently a problem here. It is only something to 
be aware of, and to fix or work around, once a compaction strategy needs to 
split the flush output.

I had thought that per-disk splitting on flush could trigger this, but that is 
handled inside the flushing code rather than in the writer.
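
To illustrate what would go wrong if a writer did split its output while 
reusing one collector, here is a minimal, self-contained sketch. It does not 
use Cassandra's classes: {{Collector}}, {{write}} and {{estimateKeys}} are 
hypothetical names, an exact {{HashSet}} stands in for the cardinality 
estimator, and combining the inputs' estimators by set union at compaction 
time is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the problem; none of these types are Cassandra's.
final class SharedCollectorSketch {

    /** Stand-in for a metadata collector: remembers every key it was given. */
    static final class Collector {
        final Set<String> keys = new HashSet<>();

        void addKey(String key) {
            keys.add(key);
        }
    }

    /**
     * Writes one "sstable" per entry of outputs and returns the collector
     * recorded for each of them. With reuseCollector set, every output
     * records the same collector instance.
     */
    static List<Collector> write(List<List<String>> outputs, boolean reuseCollector) {
        Collector shared = new Collector();
        List<Collector> recorded = new ArrayList<>();
        for (List<String> output : outputs) {
            Collector collector = reuseCollector ? shared : new Collector();
            for (String key : output) {
                collector.addKey(key);
            }
            recorded.add(collector);
        }
        return recorded;
    }

    /** Assumed compaction-time estimate: distinct keys over the inputs' collectors. */
    static int estimateKeys(List<Collector> inputs) {
        Set<String> union = new HashSet<>();
        for (Collector collector : inputs) {
            union.addAll(collector.keys);
        }
        return union.size();
    }

    public static void main(String[] args) {
        // Flush output split into three sstables with two keys each.
        List<List<String>> outputs = List.of(
                List.of("a", "b"), List.of("c", "d"), List.of("e", "f"));

        // Compact only the first sstable, which really holds two keys.
        System.out.println(estimateKeys(write(outputs, true).subList(0, 1)));  // prints 6
        System.out.println(estimateKeys(write(outputs, false).subList(0, 1))); // prints 2
    }
}
```

In this model the reused collector makes a single sstable look as if it held 
every key of the flush, so anything sized from that estimate is sized for the 
whole flush rather than for the data actually being compacted.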

> Reuse of metadata collector can break key count calculation
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-18123
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18123
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Compaction
>            Reporter: Branimir Lambov
>            Priority: Normal
>             Fix For: 3.0.x, 3.11.x, 4.0.x, 4.1.x, 4.x
>
>
> When flushing a memtable we currently pass an already constructed 
> {{MetadataCollector}} to the {{SSTableMultiWriter}} that writes the 
> sstables. The writer may decide to split the data into multiple sstables 
> (e.g. for separate disks, or driven by the compaction strategy). If it does, 
> the cardinality estimation component of the reused {{MetadataCollector}} 
> contains the data for all of them, and that is what gets recorded for each 
> individual sstable.
> As a result, when such sstables are compacted, the estimated number of keys 
> in the resulting sstables, which determines the size of the bloom filter for 
> the compaction result, is far too high.
> This makes L1 bloom filters much bigger than they should be. One example 
> (from testing of the upcoming CEP-26, after inserting 100GB of data with 
> 10% reads):
> (current)
> {code}
>               Bloom filter false positives: 22627369
>               Bloom filter false ratio: 0.02257
>               Bloom filter space used: 1848247864
>               Bloom filter off heap memory used: 2338964088
> {code}
> (fixed)
> {code}
>               Bloom filter false positives: 24426545
>               Bloom filter false ratio: 0.02429
>               Bloom filter space used: 1118910096
>               Bloom filter off heap memory used: 1532357432
> {code}
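
For a sense of scale, the two outputs quoted above can be compared directly. 
The snippet below is only arithmetic on the reported numbers; the class and 
method names are hypothetical.

```java
// Arithmetic on the figures reported in the issue; names are illustrative only.
final class BloomFilterSavings {

    /** Bytes saved going from the current build to the fixed one. */
    static long savedBytes(long before, long after) {
        return before - after;
    }

    /** Saving as a percentage of the original size, rounded to one decimal. */
    static double percentSaved(long before, long after) {
        return Math.round(1000.0 * (before - after) / before) / 10.0;
    }

    public static void main(String[] args) {
        long spaceBefore = 1848247864L, spaceAfter = 1118910096L;     // space used
        long offHeapBefore = 2338964088L, offHeapAfter = 1532357432L; // off heap memory used

        System.out.println(savedBytes(spaceBefore, spaceAfter));       // prints 729337768
        System.out.println(percentSaved(spaceBefore, spaceAfter));     // prints 39.5
        System.out.println(savedBytes(offHeapBefore, offHeapAfter));   // prints 806606656
        System.out.println(percentSaved(offHeapBefore, offHeapAfter)); // prints 34.5
    }
}
```

In this one test run, bloom filter space drops by roughly 39% and off-heap 
memory by roughly 34%, while the reported false ratio moves from 0.02257 to 
0.02429.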



