[ 
https://issues.apache.org/jira/browse/CASSANDRA-21345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18078440#comment-18078440
 ] 

Matt Byrd commented on CASSANDRA-21345:
---------------------------------------

cc [~marcuse]
added a provisional PR here:

https://github.com/apache/cassandra/pull/4786

May need to do some more reasoning of the correctness properties of doing this 
and whether it's definitely safe.
So far I can't find anything that seems like a reasonable state to begin with 
where this noticeable reduces correctness.

I've attached also the perf differences of running various micro bench runs 
with/without the key makeAdd record change and also skipping cleanup, as you'll 
see cleanup dominates the profile after the patch is applied.

For speed of running the benchmark by default doesn't flush a bunch of sstables 
for the initial state construction (i.e adding them to tracker and other 
machinery), it just simulates a bunch of files on disk so we see the readdir 
behavior dominating. 

{code}
ant microbench-with-profiler -Dbenchmark.name=MemtableFlushBench \              
                                                                                
                                                                                
                        
                                                      
-Dprofiler.opts="event=wall;threads=true;output=flamegraph;simple=true;ann=true"
 \                                                                              
                                                                                
                    
                                                        -Djmh.args="-p 
skipCleanup=true"  
{code} 
[^new-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-false-flame-cpu-forward.html]
  
[^new-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-true-flame-cpu-forward.html]
  
[^old-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-false-flame-cpu-forward.html]
  
[^old-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-true-flame-cpu-forward.html]
 
 


> avoid listing files in LogRecord when creating a new SSTableWriter for flush
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21345
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21345
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction, Local/Memtable, Local/Startup and 
> Shutdown
>            Reporter: Matt Byrd
>            Assignee: Matt Byrd
>            Priority: Normal
>         Attachments: flushwriter-stacktrace, 
> new-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-false-flame-cpu-forward.html,
>  
> new-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-true-flame-cpu-forward.html,
>  
> old-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-false-flame-cpu-forward.html,
>  
> old-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-true-flame-cpu-forward.html
>
>
> When flushing a memtable to disk, we create a LogRecord by attempting to list 
> the entire directory to see how many files match the theoretical prefix for 
> sstables we'll be creating.
> This is potentially quite expensive, e.g when using LCS and having O(100k) 
> files in a given directory, see attached stack-trace.
> It also doesn't seem strictly necessary since at flush time, no such sstable 
> component files should exist yet. The other point in time this method is 
> called is on remove, initially to just construct a LogRecord to locate the 
> initial ADD LogRecord (and the remove construction does another directory 
> listing independently).
> The file listing is used in a couple of ways for LogRecord construction, both 
> to get the count of components we expect, as well also the last updated 
> timestamp of said components.
> For ADD records we hardcode the last updated timestamp to 0 and the component 
> count is derived via effectively max(expected component count, any files 
> found on disk with prefix), see:
> {code:java}
>         return make(type, getExistingFiles(absoluteTablePath), 
> table.getAllFilePaths().size(), absoluteTablePath);
> {code}
> and then eventually the number of files found on disk via positive timestamp:
> {code:java}
>  LogRecord(type, absolutePath, lastModified, Math.max(minFiles, 
> positiveTSCount));
> {code}
> I'm interested in seeing if we can instead simply always using the expected 
> component count on flush (minFiles), this leads to deterministic behavior in 
> both call sites.
> We likely need to reason carefully about this change given potential 
> correctness concerns.
> However I'm optimistic that it may be feasible, since for one we already 
> assert right before creating the ADD record for flushing, that there aren't 
> already some components on disk:
> {code:java}
>         Set<Component> existingComponents = Sets.filter(components, c -> 
> descriptor.fileFor(c).exists());
>         assert existingComponents.isEmpty() : String.format("Cannot create a 
> new SSTable in directory %s as component files %s already exist there",
>                                                             
> descriptor.directory,
>                                                             
> existingComponents);
>         txn.trackNew(this);
> {code}
>  
> There are maybe some strange edge-cases to reason about where a bug or 
> operator error  results in placing random components that are somehow 
> disjoint or misnamed in the directory prior to the ADD record being created, 
> between it's creation and the search for it in the REMOVE record and 
> thereafter. 
> A lot of startup and recovery mechanism either ignore ADD files, or only make 
> use of their absolute path.
> If anyone has any inklings of problems which may arise or other approaches 
> please let me know.
> I suppose one compromise solution is to fall-back on potentially rather 
> listing the entire directory, just searching explicitly for any of the 
> possible components which could exist via e.g a handful of explicit stat sys 
> calls.
> I feel like this would be quite a lot more complicated for potentially little 
> to no benefit but we could opt to do that instead I we find some concrete 
> problems with just taking the component count.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to