[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168822#comment-14168822 ]

Nikolai Grigoriev edited comment on CASSANDRA-7949 at 10/12/14 11:59 PM:
-------------------------------------------------------------------------

I did another round of testing and I can confirm my previous suspicion. Once LCS 
goes into "STCS fallback" mode, there seems to be a point of no return. After 
loading a fairly large amount of data I end up with a number of large sstables 
(from a few GB to 200+ GB). After that the cluster simply goes downhill and never 
recovers. Even with no traffic except the repair service (DSE OpsCenter), the 
number of pending compactions never declines; it actually grows. The sstables 
also keep growing in size until one of the compactions runs out of disk space 
and crashes the node.

I also believe that once a node is in this state there is no way out. The 
sstablesplit tool, as far as I understand, cannot be used on a live node, and it 
splits the data in a single thread. I have measured its throughput on my system: 
about 13 MB/s on average, so splitting all of these large sstables would take 
many DAYS.
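To make the "many days" claim concrete, here is a back-of-the-envelope estimate for just the three largest sstables mentioned further down (55 GB, 70 GB and 370 GB; the sizes and the 13 MB/s rate are from my own measurements, so treat this as a rough sketch):

```shell
# Rough single-threaded sstablesplit time estimate at the measured ~13 MB/s.
# Sizes are the three largest sstables observed on my nodes (illustrative).
total_mb=$(( (370 + 70 + 55) * 1024 ))   # 370 GB + 70 GB + 55 GB, in MB
rate_mb_s=13
hours=$(( total_mb / rate_mb_s / 3600 ))
echo "~${hours} hours for the three largest sstables alone"
```

And that is before touching the other thousand-plus files, and with the node down the whole time.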

I have an idea that might actually help: the JVM property from CASSANDRA-6621 
seems to be exactly what I need right now. I have tried it, and so far, while 
compacting, my nodes produce only sstables of the target size, i.e. (I may be 
wrong, but it looks this way) Cassandra splits the large sstables into small 
ones while the nodes stay online. If it continues like this I can hope to 
eventually get rid of the mega-huge sstables, at which point LCS performance 
should return to normal. Will provide an update later.
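For reference, the CASSANDRA-6621 change is controlled by a JVM system property, typically added to cassandra-env.sh. The property name below is my reading of that ticket, so verify it against your Cassandra version before relying on it:

```shell
# In cassandra-env.sh: disable the STCS-in-L0 fallback so L0 sstables are
# compacted with the LCS target size instead of being merged into huge files.
# (Property name assumed from CASSANDRA-6621 -- check your version's source.)
JVM_OPTS="$JVM_OPTS -Dcassandra.disable_stcs_in_l0=true"
```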



> LCS compaction low performance, many pending compactions, nodes are almost idle
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7949
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7949
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: DSE 4.5.1-1, Cassandra 2.0.8
>            Reporter: Nikolai Grigoriev
>         Attachments: iostats.txt, nodetool_compactionstats.txt, 
> nodetool_tpstats.txt, pending compactions 2day.png, system.log.gz, vmstat.txt
>
>
> I've been evaluating a new cluster of 15 nodes (32 cores, 6x800 GB SSD disks + 
> 2x600 GB SAS, 128 GB RAM, OEL 6.5), and I've built a simulator that creates 
> load similar to the load of our future product. Before running the simulator 
> I had to pre-generate enough data. This was done using Java code and the 
> DataStax Java driver. Without going deep into details: two tables have been 
> generated, each currently with about 55M rows and between a few dozen and a 
> few thousand columns per row.
> The data generation process produced a massive amount of non-overlapping data, 
> so the activity was write-only and highly parallel. This is not the type of 
> traffic the system will ultimately deal with; in the future it will be a mix 
> of reads and updates to existing data. This is just to explain the choice of 
> LCS, not to mention the expensive SSD disk space.
> At some point while generating the data I noticed that compactions started to 
> pile up. I knew I was overloading the cluster, but I still wanted the 
> generation test to complete, expecting to give the cluster enough time 
> afterwards to finish the pending compactions and get ready for real traffic.
> However, after the storm of write requests was stopped, I noticed that the 
> number of pending compactions remained constant (and even climbed up a little) 
> on all nodes. After trying to tune some parameters (like setting the 
> compaction bandwidth cap to 0) I noticed a strange pattern: the nodes were 
> compacting one of the CFs in a single stream, using virtually no CPU and no 
> disk I/O, and this took hours. It would then be followed by a short burst of a 
> few dozen compactions running in parallel (CPU at 2000%, some disk I/O, up to 
> 10-20%) before getting stuck again for many hours doing one compaction at a 
> time. So it looks like this:
> # nodetool compactionstats
> pending tasks: 3351
>   compaction type   keyspace   table         completed     total           unit    progress
>   Compaction        myks       table_list1   66499295588   1910515889913   bytes   3.48%
> Active compaction remaining time : n/a
> # df -h
> ...
> /dev/sdb        1.5T  637G  854G  43% /cassandra-data/disk1
> /dev/sdc        1.5T  425G  1.1T  29% /cassandra-data/disk2
> /dev/sdd        1.5T  429G  1.1T  29% /cassandra-data/disk3
> # find . -name '*table_list1*Data*' | grep -v snapshot | wc -l
> 1310
> Among these files I see:
> 1043 files of 161 MB (my sstable size is 160 MB)
> 9 large files: 3 between 1 and 2 GB, 3 of 5-8 GB, and one each of 55 GB, 
> 70 GB and 370 GB
> 263 files of various sizes, between a few dozen KB and 160 MB
> I ran the heavy load for about 1.5 days, it has been close to 3 days since, 
> and the number of pending compactions does not go down.
> I have applied one of the not-so-obvious recommendations, disabling 
> multithreaded compaction, and that seems to help a bit: about half of the 
> nodes in the cluster started to have fewer pending compactions. But even those 
> sit idle most of the time, lazily compacting in one stream with CPU at ~140% 
> and occasionally doing bursts of compaction work for a few minutes.
> I am wondering whether this is really a bug, or something in the LCS logic 
> that manifests itself only in such an edge case, where lots of unique data is 
> loaded quickly.
> By the way, I see this pattern only for one of the two tables, the one that 
> has about 4 times more data than the other (space-wise; the number of rows is 
> the same). It looks like all these pending compactions really belong to that 
> larger table.
> I'll be attaching the relevant logs shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
