Hi Jeff,
The cluster was built with 2.1 and has been upgraded to 3.11. It's using 
"num_tokens: 64" and RF=3 for all keyspaces in all DCs. There are over 100 nodes 
in each of 6 rings. I don't see any repair/streaming messages in system.log, so 
this shouldn't be related to repair. I can see flush messages for BOTH the major 
app tables AND the system.sstable_activity table, but the number of SSTables is 
much higher for the app tables.
Thanks,
Jiayong Sun
    On Friday, August 13, 2021, 01:43:06 PM PDT, Jeff Jirsa <jji...@gmail.com> 
wrote:  
 
A very large cluster using vnodes will cause lots of small sstables to stream 
in during repair if the cluster is out of sync. This is one of the reasons that 
the default number of vnodes was decreased in 4.0. How many nodes in the 
cluster, how many DCs, how many vnodes per node, and how many replicas per DC? 
You can confirm or eliminate this possibility by checking the origin of the 
tiny sstables: are they ACTUALLY flushed from the memtable, or are they 
streamed in via repair? Are they all from the sstable_activity table, or are 
they from the main app table?
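
A rough way to check (the log strings and the log path below are from memory 
of a 3.11 install and may differ on your build; the data path matches the one 
in your log excerpt):

  # SSTables written by memtable flushes are logged as "Completed flushing ..."
  grep -c 'Completed flushing' /app/cassandra/logs/debug.log*

  # SSTables that arrived via repair come in through streaming sessions
  grep -ic 'stream' /app/cassandra/logs/debug.log*

  # Count live SSTables per table directory to see where the tiny files pile up
  for d in /app/cassandra/data/*/*/; do
    printf '%6d  %s\n' "$(ls "$d"*Data.db 2>/dev/null | wc -l)" "$d"
  done | sort -rn | head -20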

On Fri, Aug 13, 2021 at 1:36 PM Bowen Song <bo...@bso.ng> wrote:

  
Hi Jiayong,
 

 
 
That doesn't really match the situation described in the SO question. I 
suspected it was related to repairing a table with MV and large partitions, but 
based on the information you've given, I was clearly wrong.
 
Partitions of a few hundred MB are not exactly unusual; I don't see how that 
alone could lead to frequent SSTable flushing. A repair session that takes 
weeks to complete is a bit worrying in terms of performance and 
maintainability, but again it should not cause this issue.
 
Since we don't know the cause of it, I can see two possible solutions - either 
replace the "broken" node, or dig into the logs (remember to turn on the debug 
logs) and try to identify the root cause. I personally would recommend 
replacing the problematic node as a quick win.
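
For the log-digging route, a rough sketch of what I would try (the logger name, 
log path and sed pattern below are examples, not prescriptions):

  # Raise verbosity at runtime (reverts on restart)
  nodetool setlogginglevel org.apache.cassandra.db DEBUG

  # Then see which tables the flushes come from, and how often
  grep 'Enqueuing flush of' /app/cassandra/logs/debug.log \
    | sed 's/.*Enqueuing flush of \([^:]*\):.*/\1/' | sort | uniq -c | sort -rn

  # If you go the replacement route instead, the new node would start with
  # -Dcassandra.replace_address_first_boot=<ip_of_the_broken_node> in its JVM options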
 
 

 
 
Cheers,
 
Bowen
 
 On 13/08/2021 20:31, Jiayong Sun wrote:
  
  Hi Bowen, 
We do have a Reaper repair job scheduled periodically, and it can take days or 
even weeks to complete one round of repair due to the large number of 
rings/nodes. However, we have paused repair since we started facing this issue. 
We do not use MVs in this cluster. There is one major table taking 95% of the 
disk storage and workload, but its partition size is around 30 MB. There are a 
couple of small tables with a max partition size over several hundred MB, but 
their total data size is just a few GB. 
  Any thoughts? 
Thanks,
Jiayong 
      On Friday, August 13, 2021, 03:32:45 AM PDT, Bowen Song <bo...@bso.ng> 
wrote:  
  
     
Hi Jiayong,
 

 
 
Sorry I didn't make it clear in my previous email. When I commented on the 
RAID0 setup, it was only a comment on the RAID0 setup vs JBOD, and that was not 
in relation to the SSTable flushing issue. The part of my previous email after 
the "On the frequent SSTable flush issue" line is the part related to the 
SSTable flushing issue, and those two questions at the end of it remain valid:
 
    
   - Did you run repair?
   - Do you use materialized views?
 
and, if I may, I'd also like to add another question:
    
   - Do you have large (> 100 MB) partitions?
 Those are the 3 things mentioned in the SO question. I'm trying to find the 
connections between the issue you are experiencing and the issue described in 
the SO question.
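
If it helps, here is a rough sketch of how to check all three from one of the 
nodes (commands from memory; the log path and keyspace name are placeholders):

  # 1. Any recent repair activity in the logs?
  grep -ci 'repair' /app/cassandra/logs/system.log

  # 2. Any materialized views defined?
  cqlsh -e "SELECT keyspace_name, view_name FROM system_schema.views;"

  # 3. Any large partitions? Check the maximum partition size per table
  nodetool tablestats <your_keyspace> | grep -i 'partition maximum'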
 

 
 
Cheers,
 
Bowen
 
 

 
  On 13/08/2021 01:36, Jiayong Sun wrote:
  
 
      Hello Bowen, 
Thanks for your response. Yes, we are aware of the trade-off between RAID0 and 
individual JBOD disks, but all of our clusters use this RAID0 configuration 
through Azure, and we only see this issue on this cluster, so it's hard to 
conclude that the root cause is the disks. This looks more workload related, 
and we are seeking feedback here on any other parameters in the yaml that we 
could tune for this. 
Thanks again,
Jiayong Sun 
      On Thursday, August 12, 2021, 04:55:51 AM PDT, Bowen Song <bo...@bso.ng> 
wrote:  
  
     
Hello Jiayong,
 

 
 
Using multiple disks in a RAID0 for the Cassandra data directory is not 
recommended. You will get better fault tolerance, and often better performance 
too, with multiple data directories, one on each disk.
 
If you stick with RAID0, it's not 4 disks, it's 1 from Cassandra's point of 
view, because any read or write operation has to touch all 4 member disks. 
Therefore, 4 flush writers doesn't make much sense.
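
For comparison, a JBOD-style setup would list one data directory per physical 
disk. A quick way to see what the node has vs what Cassandra is configured 
with (the mount points and cassandra.yaml path below are assumptions for your 
images):

  lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
  grep -A 6 'data_file_directories' /etc/cassandra/cassandra.yaml

  # e.g. a JBOD layout in cassandra.yaml might look like:
  #   data_file_directories:
  #       - /data1/cassandra
  #       - /data2/cassandra
  #       - /data3/cassandra
  #       - /data4/cassandra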
 
On the frequent SSTable flush issue, a quick internet search leads me to:
 
 
* an old bug in Cassandra 2.1 - CASSANDRA-8409 - which shouldn't affect 3.x at all
 
* a StackOverflow question that may be related
 
 
Did you run repair? Do you use materialized views?
 
 

 
 
Regards,
 
Bowen
 
 

 
  On 11/08/2021 15:58, Jiayong Sun wrote:
  
 
      Hi Erick, 
The nodes have 4 SSDs (1 TB each, but we only use 2.4 TB of space; current 
disk usage is about 50%) in RAID0. Based on the number of disks, we increased 
memtable_flush_writers to 4 instead of the default of 2.

For the following we set:
- max heap size: 31 GB
- memtable_heap_space_in_mb: (default)
- memtable_offheap_space_in_mb: (default)
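
For reference, a quick way to pull these values off a node (the paths below 
assume a package-style install, and the heap may instead be set in 
cassandra-env.sh):

  grep -E '^-Xm[sx]' /etc/cassandra/jvm.options
  grep -E 'memtable_(flush_writers|heap_space_in_mb|offheap_space_in_mb)' \
      /etc/cassandra/cassandra.yaml
  nodetool info | grep -i heap
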
In the logs, we also noticed the system.sstable_activity table has hundreds of 
MB or even GBs of data and is constantly flushing:
DEBUG [NativePoolCleaner] <timestamp> ColumnFamilyStore.java:932 - Enqueuing flush of sstable_activity: 0.293KiB (0%) on-heap, 0.107KiB (0%) off-heap
DEBUG [NonPeriodicTasks:1] <timestamp> SSTable.java:105 - Deleting sstable:/app/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/md-103645-big
DEBUG [NativePoolCleaner] <timestamp> ColumnFamilyStore.java:1322 - Flushing largest CFS(Keyspace='system', ColumnFamily='sstable_activity') to free up room. Used total: 0.06/1.00, live: 0.00/0.00, flushing: 0.02/0.29, this: 0.00/0.00
Thanks,
Jiayong Sun

On Wednesday, August 11, 2021, 12:06:27 AM PDT, Erick Ramirez 
<erick.rami...@datastax.com> wrote:
  
       4 flush writers isn't bad since the default is 2. It doesn't make a 
difference if you have fast disks (like NVMe SSDs) because only 1 thread gets 
used. 
But if flushes are slow, the work gets distributed across the 4 flush writers, 
so you end up with smaller flush sizes, although it's difficult to tell how 
tiny the SSTables would be without analysing the logs and the overall 
performance of your cluster. 
Was there a specific reason you decided to bump it up to 4? I'm just trying 
to get a sense of why you did it since it might provide some clues. Out of 
curiosity, what do you have set for the following?
- max heap size
- memtable_heap_space_in_mb
- memtable_offheap_space_in_mb