[jira] [Commented] (CASSANDRA-7949) LCS compaction low performance, many pending compactions, nodes are almost idle
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223036#comment-14223036 ] Nikolai Grigoriev commented on CASSANDRA-7949:
--
I have recently realized that there may be a relatively cheap (operationally and development-wise) workaround for that limitation. It would also partially address the problem with bootstrapping a new node. The root cause of all this is a large amount of data in a single CF on a single node when using LCS for that CF. The performance of a single compaction task running on a single thread is limited anyway. One of the obvious ways to break this limitation is to shard the data across multiple clones of that CF at the application level. Something as dumb as "row key hash mod X", adding this suffix to the CF name. In my case it looks like having X=4 would be more than enough to solve the problem.

LCS compaction low performance, many pending compactions, nodes are almost idle
---
Key: CASSANDRA-7949
URL: https://issues.apache.org/jira/browse/CASSANDRA-7949
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: DSE 4.5.1-1, Cassandra 2.0.8
Reporter: Nikolai Grigoriev
Attachments: iostats.txt, nodetool_compactionstats.txt, nodetool_tpstats.txt, pending compactions 2day.png, system.log.gz, vmstat.txt

I've been evaluating a new cluster of 15 nodes (32 cores, 6x800GB SSD disks + 2x600GB SAS, 128GB RAM, OEL 6.5) and I've built a simulator that creates a load similar to the load in our future product. Before running the simulator I had to pre-generate enough data. This was done using Java code and the DataStax Java driver. Without going deep into details: two tables have been generated. Each table currently has about 55M rows and between a few dozen and a few thousand columns in each row. This data generation process was generating a massive amount of non-overlapping data. Thus, the activity was write-only and highly parallel.

This is not the type of traffic that the system will ultimately have to deal with; it will be a mix of reads and updates to the existing data in the future. This is just to explain the choice of LCS, not to mention the expensive SSD disk space. At some point while generating the data I noticed that the compactions started to pile up. I knew that I was overloading the cluster, but I still wanted the generation test to complete. I was expecting to give the cluster enough time to finish the pending compactions and get ready for real traffic. However, after the storm of write requests had been stopped, I noticed that the number of pending compactions remained constant (and even climbed up a little bit) on all nodes. After trying to tune some parameters (like setting the compaction bandwidth cap to 0) I noticed a strange pattern: the nodes were compacting one of the CFs in a single stream using virtually no CPU and no disk I/O. This process was taking hours. After that it would be followed by a short burst of a few dozen compactions running in parallel (CPU at 2000%, some disk I/O - up to 10-20%) and then getting stuck again for many hours doing one compaction at a time. So it looks like this:

{code}
# nodetool compactionstats
pending tasks: 3351
compaction type  keyspace  table        completed    total          unit   progress
Compaction       myks      table_list1  66499295588  1910515889913  bytes  3.48%
Active compaction remaining time : n/a

# df -h
...
/dev/sdb  1.5T  637G  854G  43%  /cassandra-data/disk1
/dev/sdc  1.5T  425G  1.1T  29%  /cassandra-data/disk2
/dev/sdd  1.5T  429G  1.1T  29%  /cassandra-data/disk3

# find . -name "*table_list1*Data*" | grep -v snapshot | wc -l
1310
{code}

Among these files I see:
- 1043 files of 161MB (my sstable size is 160MB)
- 9 large files - 3 between 1 and 2GB, 3 of 5-8GB, 55GB, 70GB and 370GB
- 263 files of various sizes - between a few dozen KB and 160MB

I've been running the heavy load for about 1.5 days, it's been close to 3 days after that, and the number of pending compactions does not go down. I have applied one of the not-so-obvious recommendations, to disable multithreaded compactions, and that seems to be helping a bit - I see some nodes starting to have fewer pending compactions. About half of the cluster, in fact. But even there I see they are sitting idle most of the time, lazily compacting in one stream with CPU at ~140% and occasionally doing bursts of compaction work for a few minutes. I am wondering if this is really a bug or something in the LCS logic that would manifest itself only in such an edge case scenario where I have loaded lots of unique data quickly. By the way, I see this pattern only for one of the two tables - the one that has about 4 times more data than the other (space-wise; the number of rows is the same). Looks like all these pending compactions are really only for that larger table. I'll be attaching the relevant logs shortly.
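The sharding workaround from the comment above ("row key hash mod X" appended to the CF name) could be sketched at the application level roughly as follows. This is a hypothetical illustration, not code from this ticket; the function names, the use of CRC32 as the hash, and X=4 are all assumptions:

```python
import zlib

NUM_SHARDS = 4  # "X" from the comment; 4 was estimated to be enough here

def shard_table(base_table: str, row_key: str) -> str:
    """Pick one of NUM_SHARDS clones of a CF from a stable hash of the
    row key, e.g. mytable -> mytable_2."""
    # zlib.crc32 is stable across processes (unlike Python's built-in hash()),
    # so every client maps a given key to the same shard.
    shard = zlib.crc32(row_key.encode("utf-8")) % NUM_SHARDS
    return f"{base_table}_{shard}"

# The same key always maps to the same shard, so reads and writes agree,
# and each clone holds roughly 1/NUM_SHARDS of the data per node.
assert shard_table("mytable", "user42") == shard_table("mytable", "user42")
```

Each clone then carries roughly a quarter of the data, and LCS can run one single-threaded compaction stream per clone in parallel.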
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208154#comment-14208154 ] Nikolai Grigoriev commented on CASSANDRA-7949:
--
I had to rebuild one of the nodes in that test cluster. After bootstrapping it I checked the results - I had over 6.5K pending compactions and many large sstables (between a few GB and 40-60GB). I knew that under traffic this would *never* return to a reasonable number of pending compactions. I decided to give it another try, enable the option from CASSANDRA-6621 and re-bootstrap. This time I did not end up with huge sstables but, I think, it will also never recover. This is, essentially, what the node does most of the time:
{code}
pending tasks: 7217
compaction type  keyspace  table     completed   total        unit   progress
Compaction       myks      mytable1  5434997373  10667184206  bytes  50.95%
Compaction       myks      mytable2  1080506914  7466286503   bytes  14.47%
Active compaction remaining time : 0h00m09s
{code}
while:
{code}
# nodetool cfstats myks.mytable1
Keyspace: myks
	Read Count: 49783
	Read Latency: 38.612470602414476 ms.
	Write Count: 521971
	Write Latency: 1.3617571608384373 ms.
	Pending Tasks: 0
		Table: mytable1
		SSTable count: 7893
		SSTables in each level: [7828/4, 10, 56, 0, 0, 0, 0, 0, 0]
		Space used (live), bytes: 1181508730955
		Space used (total), bytes: 1181509085659
		SSTable Compression Ratio: 0.3068450302663634
		Number of keys (estimate): 28180352
		Memtable cell count: 153554
		Memtable data size, bytes: 41190431
		Memtable switch count: 178
		Local read count: 49826
		Local read latency: 38.886 ms
		Local write count: 522464
		Local write latency: 1.392 ms
		Pending tasks: 0
		Bloom filter false positives: 11802553
		Bloom filter false ratio: 0.98767
		Bloom filter space used, bytes: 17686928
		Compacted partition minimum bytes: 104
		Compacted partition maximum bytes: 3379391
		Compacted partition mean bytes: 142171
		Average live cells per slice (last five minutes): 537.5
		Average tombstones per slice (last five minutes): 0.0
{code}
By the way, this is the picture from another node that functions normally:
{code}
# nodetool cfstats myks.mytable1
Keyspace: myks
	Read Count: 4638154
	Read Latency: 20.784106776316612 ms.
	Write Count: 15067667
	Write Latency: 1.7291775639188205 ms.
	Pending Tasks: 0
		Table: mytable1
		SSTable count: 4561
		SSTables in each level: [37/4, 15/10, 106/100, 1053/1000, 3350, 0, 0, 0, 0]
		Space used (live), bytes: 1129716897255
		Space used (total), bytes: 1129752918759
		SSTable Compression Ratio: 0.33488717551698993
		Number of keys (estimate): 25036672
		Memtable cell count: 334212
		Memtable data size, bytes: 115610737
		Memtable switch count: 4476
		Local read count: 4638155
		Local read latency: 20.784 ms
		Local write count: 15067679
		Local write latency: 1.729 ms
		Pending tasks: 0
		Bloom filter false positives: 104377
		Bloom filter false ratio: 0.59542
		Bloom filter space used, bytes: 20319608
		Compacted partition minimum bytes: 104
		Compacted partition maximum bytes: 3379391
		Compacted partition mean bytes: 152368
		Average live cells per slice (last five minutes): 529.5
		Average tombstones per slice (last five minutes): 0.0
{code}
So, not only has the streaming created an excessive amount of sstables, the compactions are not advancing at all. In fact, the number of pending compactions grows slowly on that (first) node. New L0 sstables get added because the write activity is taking place. Just simple math: if I take the compaction throughput of the node when it uses only one thread and compare it to my write rate, I think the latter is about 4x the former. Under these conditions this node will never recover - while having plenty of resources and very fast I/O.
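The "simple math" above can be made concrete: whenever the sustained write rate exceeds the single-threaded compaction throughput, the backlog grows without bound. A rough sketch (the specific MB/s rates are illustrative assumptions, not measurements from this ticket):

```python
def backlog_after(hours: float, write_mb_s: float, compact_mb_s: float,
                  initial_backlog_mb: float = 0.0) -> float:
    """Pending-compaction backlog (MB) after `hours`, assuming writes add
    to the backlog at write_mb_s while a single compaction thread drains
    it at compact_mb_s."""
    delta = (write_mb_s - compact_mb_s) * 3600 * hours
    return max(0.0, initial_backlog_mb + delta)

# With writes at ~4x compaction throughput (say 52 vs 13 MB/s), the
# backlog grows by over 3 TB per day - the node can never catch up.
growth_per_day_mb = backlog_after(24, 52.0, 13.0)
```

The only stable regimes are write rate below compaction throughput, or more parallel compaction streams (which is exactly what the sharding workaround buys).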
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203751#comment-14203751 ] Nikolai Grigoriev commented on CASSANDRA-7949:
--
Here is another extreme (but, unfortunately, real) example of LCS going a bit crazy.
{code}
# nodetool cfstats myks.mytable
Keyspace: myks
	Read Count: 3006212
	Read Latency: 21.02595119106703 ms.
	Write Count: 11226340
	Write Latency: 1.8405579886231844 ms.
	Pending Tasks: 0
		Table: wm_contacts
		SSTable count: 6530
		SSTables in each level: [2369/4, 10, 104/100, 1043/1000, 3004, 0, 0, 0, 0]
		Space used (live), bytes: 1113384288740
		Space used (total), bytes: 1113406795020
		SSTable Compression Ratio: 0.3307170610260717
		Number of keys (estimate): 26294144
		Memtable cell count: 782994
		Memtable data size, bytes: 213472460
		Memtable switch count: 3493
		Local read count: 3006239
		Local read latency: 21.026 ms
		Local write count: 11226517
		Local write latency: 1.841 ms
		Pending tasks: 0
		Bloom filter false positives: 41835779
		Bloom filter false ratio: 0.97500
		Bloom filter space used, bytes: 19666944
		Compacted partition minimum bytes: 104
		Compacted partition maximum bytes: 3379391
		Compacted partition mean bytes: 139451
		Average live cells per slice (last five minutes): 444.0
		Average tombstones per slice (last five minutes): 0.0
{code}
{code}
# nodetool compactionstats
pending tasks: 190
compaction type  keyspace  table     completed   total        unit   progress
Compaction       myks      mytable2  7198353690  7446734394   bytes  96.66%
Compaction       myks      mytable   4851429651  10717052513  bytes  45.27%
Active compaction remaining time : 0h00m04s
{code}
Note the cfstats. The number of sstables at L0 is insane. Yet, C* is sitting quietly compacting the data using 2 cores out of 32. Once it gets into this state I immediately start seeing large sstables forming - instead of 256MB, sstables of 1-2GB and more start appearing. And it creates a snowball effect.
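To read the "SSTables in each level" line above: cfstats prints "count/limit" for any level that exceeds its nominal budget, where L1 holds about 10 sstables and each deeper level holds 10x more (with L0 shown against a threshold of 4). A small sketch of that convention, assuming the usual fanout of 10:

```python
def lcs_level_limit(level: int, fanout: int = 10) -> int:
    """Nominal LCS sstable budget per level: L1 ~fanout sstables, each
    deeper level fanout times more. L0 is special; cfstats shows a
    threshold of 4 for it."""
    return 4 if level == 0 else fanout ** level

def overfull_levels(counts):
    """Given the cfstats 'SSTables in each level' counts, return the
    levels exceeding their budget (the ones cfstats prints as n/limit)."""
    return [lvl for lvl, n in enumerate(counts) if n > lcs_level_limit(lvl)]

# The stuck node above: 2369 sstables sitting in L0 against a threshold
# of 4, plus L2 and L3 slightly over budget - matching the printed
# [2369/4, 10, 104/100, 1043/1000, 3004, ...].
levels = [2369, 10, 104, 1043, 3004, 0, 0, 0, 0]
```

With 2369 sstables in L0, almost every read has to consult hundreds of candidate sstables, which is also why the bloom filter false ratio above is 0.975.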
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176628#comment-14176628 ] Marcus Eriksson commented on CASSANDRA-7949:
--
The first comment has a link to a github branch. But it is against trunk, so don't use it in production (of course).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176884#comment-14176884 ] Nikolai Grigoriev commented on CASSANDRA-7949:
--
Then I doubt I can really try it. We are quite close to production deployment, and trying something that far from what we will use in prod is pointless (for me, not for the fix ;) ).
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176426#comment-14176426 ] Nikolai Grigoriev commented on CASSANDRA-7949:
--
[~krummas] Marcus, which patch are you talking about? I am running the latest DSE with Cassandra 2.0.10.
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174702#comment-14174702 ] Nikolai Grigoriev commented on CASSANDRA-7949:
--
Update: using the property from CASSANDRA-6621 does help to get out of this state. My cluster is slowly digesting the large sstables and creating a bunch of nice small sstables from them. It is slower than using sstablesplit, I believe, because it actually does real compactions and, thus, processes and reprocesses different sets of sstables. My understanding is that every time I get a new bunch of L0 sstables there is a phase for updating the other levels, and it repeats and repeats. With that property set I see that my total number of sstables grows, my number of huge sstables decreases and the average sstable size decreases as a result. My conclusions so far:

1. STCS fallback in LCS is a double-edged sword. It is needed to prevent flooding the node with tons of small sstables resulting from ongoing writes. These small ones are often much smaller than the configured target size and they need to be merged. But the use of STCS also results in the generation of super-sized sstables. These become a large headache when the fallback stops and LCS is supposed to resume normal operations. It appears to me (my humble opinion) that the fallback should be done to some kind of specialized "rescue" STCS flavor that merges the small sstables to approximately the LCS target sstable size BUT DOES NOT create sstables that are much larger than the target size. With this approach LCS would resume normal operations much faster once the cause of the fallback (abnormally high write load) is gone.

2. LCS has a major (performance?) issue when you have super-large sstables in the system. It often gets stuck with a single long (many hours) compaction stream that, by itself, will increase the probability of another STCS fallback even with a reasonable write load. As a possible workaround I was recommended to consider running multiple C* instances on our relatively powerful machines - to significantly reduce the amount of data per node and increase compaction throughput.

3. In existing systems, depending on the severity of the STCS fallback work, the fix from CASSANDRA-6621 may help to recover while keeping the nodes up. It will take a very long time to recover, but the nodes will be online.

4. Recovery (see above) is very long. It is much, much longer than the duration of the stress period that causes the condition. In my case I was writing like crazy for about 4 days and it's been over a week of compactions after that. I am still very far from 0 pending compactions. Considering this, it makes sense to artificially throttle the write speed when generating the data (like in the use case I described in previous comments). The extra time spent writing the data will still be significantly shorter than the time required to recover from the consequences of abusing the available write bandwidth.
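The throttling suggested in point 4 can be done entirely client-side in the data generator. A minimal token-bucket rate limiter, as a generic sketch (not code from this ticket; the class name and the cap near compaction throughput are assumptions):

```python
import time

class WriteThrottle:
    """Token bucket capping the data-generation write rate (bytes/s) so
    that compactions can keep up. Assumes each individual write is
    smaller than one second's budget."""
    def __init__(self, bytes_per_sec: float):
        self.rate = bytes_per_sec
        self.allowance = bytes_per_sec  # allow a burst of up to one second
        self.last = time.monotonic()

    def acquire(self, nbytes: int) -> None:
        """Block until nbytes may be written."""
        while True:
            now = time.monotonic()
            # refill tokens for the elapsed time, capped at one second's worth
            self.allowance = min(self.rate,
                                 self.allowance + (now - self.last) * self.rate)
            self.last = now
            if self.allowance >= nbytes:
                self.allowance -= nbytes
                return
            # sleep just long enough for the missing tokens to refill
            time.sleep((nbytes - self.allowance) / self.rate)

# Hypothetical usage in the generator loop, capped near the observed
# single-threaded compaction throughput:
#   throttle = WriteThrottle(13 * 1024 * 1024)
#   throttle.acquire(len(payload)); session.execute(insert_stmt, payload)
```

Keeping the sustained write rate at or below what compaction can drain prevents the STCS fallback from triggering in the first place.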
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168822#comment-14168822 ] Nikolai Grigoriev commented on CASSANDRA-7949: -- I did another round of testing and I can confirm my previous suspicion. If LCS goes into STCS fallback mode there seems to be some kind of point of no return. After loading fairly large amount of data I end up with a number of large (from few Gb to 200+Gb) sstables. After that the cluster simply goes downhill - it never recovers. Even if there is no traffic except the repair service (DSE OpsCenter) the number of pending compactions never declines. It actually grows. Sstables also grow and grow in size until the moment one of the compactions runs out of disk space and crashes the node. Also I believe once in this state there is no way out. sstablesplit tool, as far as I understand, cannot be used with the live node. And the tool splits the data in single thread. I have measured its performance on my system, it processes about 13Mb/s on average, thus, to split all these large sstables it would take many DAYS. LCS compaction low performance, many pending compactions, nodes are almost idle --- Key: CASSANDRA-7949 URL: https://issues.apache.org/jira/browse/CASSANDRA-7949 Project: Cassandra Issue Type: Bug Components: Core Environment: DSE 4.5.1-1, Cassandra 2.0.8 Reporter: Nikolai Grigoriev Attachments: iostats.txt, nodetool_compactionstats.txt, nodetool_tpstats.txt, pending compactions 2day.png, system.log.gz, vmstat.txt I've been evaluating new cluster of 15 nodes (32 core, 6x800Gb SSD disks + 2x600Gb SAS, 128Gb RAM, OEL 6.5) and I've built a simulator that creates the load similar to the load in our future product. Before running the simulator I had to pre-generate enough data. This was done using Java code and DataStax Java driver. To avoid going deep into details, two tables have been generated. 
Each table currently has about 55M rows, with between a few dozen and a few thousand columns in each row. The data generation process produced a massive amount of non-overlapping data, so the activity was write-only and highly parallel. This is not the type of traffic the system will ultimately have to deal with; in the future it will be a mix of reads and updates to existing data. I mention this only to explain the choice of LCS, not to mention the expensive SSD disk space.

At some point while generating the data I noticed that the compactions started to pile up. I knew I was overloading the cluster, but I still wanted the generation test to complete, expecting to give the cluster enough time afterwards to finish the pending compactions and get ready for real traffic. However, after the storm of write requests stopped, the number of pending compactions remained constant (and even climbed up a little bit) on all nodes. After trying to tune some parameters (like setting the compaction bandwidth cap to 0) I noticed a strange pattern: the nodes were compacting one of the CFs in a single stream, using virtually no CPU and no disk I/O. This would go on for hours, then be followed by a short burst of a few dozen compactions running in parallel (CPU at 2000%, some disk I/O, up to 10-20%), and then the node would get stuck again for many hours doing one compaction at a time. So it looks like this:

{code}
# nodetool compactionstats
pending tasks: 3351
compaction type   keyspace   table         completed     total           unit    progress
Compaction        myks       table_list1   66499295588   1910515889913   bytes   3.48%
Active compaction remaining time : n/a

# df -h
...
/dev/sdb   1.5T  637G  854G  43%  /cassandra-data/disk1
/dev/sdc   1.5T  425G  1.1T  29%  /cassandra-data/disk2
/dev/sdd   1.5T  429G  1.1T  29%  /cassandra-data/disk3

# find . -name '*table_list1*Data*' | grep -v snapshot | wc -l
1310
{code}

Among these files I see:
- 1043 files of 161MB (my sstable size is 160MB)
- 9 large files: 3 between 1 and 2GB, 3 of 5-8GB, and one each of 55GB, 70GB and 370GB
- 263 files of various sizes, between a few dozen KB and 160MB

I ran the heavy load for about 1.5 days, it has been close to 3 days since it stopped, and the number of pending compactions does not go down. I have applied one of the not-so-obvious recommendations, disabling multithreaded compactions, and that seems to be helping a bit: some nodes (about half of the cluster, in fact) have started to have fewer pending compactions. But even those nodes sit idle most of the time, lazily compacting in one stream with CPU at ~140% and occasionally doing bursts of compaction work for a few minutes.

I am wondering whether this is really a bug or something in the LCS logic that manifests itself only in such an edge case, where lots of unique data is loaded quickly. By the way, I see this pattern for only one of the two tables: the one that has about 4 times more data than the other (space-wise; the number of rows is the same). It looks like all these pending compactions are really only for that larger table. I'll be attaching the relevant logs shortly.
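The "many DAYS" estimate for single-threaded sstablesplit earlier in this thread is easy to sanity-check with back-of-envelope arithmetic. A sketch, using the measured ~13 MB/s throughput and the approximate sizes of the 9 large files from the listing above:

```python
# Rough estimate of single-threaded sstablesplit time at the measured
# ~13 MB/s, for the 9 "large" sstables listed above (sizes approximate:
# 3 between 1-2 GB, 3 of 5-8 GB, then 55, 70 and 370 GB).
large_sstables_gb = [1.5, 1.5, 1.5, 6.0, 6.0, 6.0, 55.0, 70.0, 370.0]

total_mb = sum(large_sstables_gb) * 1024   # ~517 GB expressed in MB
hours = total_mb / 13 / 3600               # seconds -> hours
print(f"~{hours:.0f} hours just for this node's large sstables")
```

Over eleven hours per node, offline, and only for the largest files; repeated across a 15-node cluster that adds up to days of downtime.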
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162263#comment-14162263 ] Nikolai Grigoriev commented on CASSANDRA-7949: -- It seems that what I am suffering from in this specific test is similar to CASSANDRA-6621. When I write all-unique data to create my initial snapshot, I am effectively doing something similar to what happens when a new node is bootstrapped, I think.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152569#comment-14152569 ] Nikolai Grigoriev commented on CASSANDRA-7949: -- Upgraded to Cassandra 2.0.10 (via DSE 4.5.2) today, switched my tables that used STCS to LCS, and restarted. For the last 8 hours I have observed this on all nodes:

{code}
# nodetool compactionstats
pending tasks: 13808
compaction type   keyspace     table     completed      total           unit    progress
Compaction        mykeyspace   table_1   528230773591   1616185183262   bytes   32.68%
Compaction        mykeyspace   table_2   456361916088   4158821946280   bytes   10.97%
Active compaction remaining time : 3h57m56s
{code}

At the beginning of these 8 hours the remaining time was about 4h08m. CPU activity: almost nothing (between 2 and 3 cores); disk I/O: nearly zero. So it clearly compacts in one thread per table and makes almost no progress.
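The nodetool figures above are internally consistent, and the reported remaining time can be inverted into the throughput it implicitly assumes, which is a quick way to see why the estimate and the observed near-zero disk I/O don't match. A sketch (byte counts are the ones printed above; the inference about what the estimate assumes is mine):

```python
# Verify progress = completed/total for the compactionstats output above,
# then back out the throughput the "remaining time" figure presumes.
rows = [
    # (table, completed bytes, total bytes, reported progress %)
    ("table_1", 528230773591, 1616185183262, 32.68),
    ("table_2", 456361916088, 4158821946280, 10.97),
]
remaining_bytes = 0
for name, completed, total, reported in rows:
    pct = 100.0 * completed / total
    assert abs(pct - reported) < 0.01      # matches nodetool's output
    remaining_bytes += total - completed

remaining_s = 3 * 3600 + 57 * 60 + 56      # "3h57m56s"
mib_per_s = remaining_bytes / remaining_s / 2**20
print(f"remaining time assumes ~{mib_per_s:.0f} MiB/s sustained")  # ~320
```

Roughly 320 MiB/s sustained, versus the near-zero I/O the node is actually doing, so the displayed remaining time is clearly not derived from the current compaction rate.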
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146632#comment-14146632 ] Nikolai Grigoriev commented on CASSANDRA-7949: -- Maybe this is not related, but I have another small cluster with similar data, which I have just upgraded to 2.0.10 (the original open-source version, not DSE). On all machines in this cluster I have many thousands of sstables, all 160MB except a few smaller ones. So they are all in L0; no L1 or higher-level sstables exist. LCS is used, and there is even incoming traffic writing into that keyspace. Yet the number of pending compactions is 0, and nodetool compact returns immediately.
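For scale, a toy model of standard LCS level sizing (fanout of 10, so L1 holds ~10 sstables, L2 ~100, and so on) shows how many levels a few thousand L0 sstables should eventually populate. This is a simplification of mine, not Cassandra's actual pending-task estimator, but it illustrates why a pending count of zero in that situation looks suspicious:

```python
# Simplified LCS sizing: each level holds 10x the sstables of the one
# below it (L1=10, L2=100, L3=1000, ...). Not Cassandra's real estimator.
FANOUT = 10

def level_capacity(level: int) -> int:
    """Max sstables a level is meant to hold under a fanout of 10."""
    return FANOUT ** level

def levels_needed(n_l0: int) -> int:
    """Smallest top level L1..Lk able to absorb n_l0 L0 sstables."""
    remaining, level = n_l0, 0
    while remaining > 0:
        level += 1
        remaining -= level_capacity(level)
    return level

# A few thousand 160 MB L0 sstables imply the data has to flow up to L4,
# i.e. a long queue of L0->L1 (and upward) compactions -- not zero.
print(levels_needed(5000))  # -> 4
```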
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143118#comment-14143118 ] Marcus Eriksson commented on CASSANDRA-7949: -- I think this could be fixed by CASSANDRA-7745.
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143214#comment-14143214 ] Nikolai Grigoriev commented on CASSANDRA-7949: -- Update: I have completed my last data-writing test, so now I have enough data to start another phase. I ran that last test with the compaction strategy set to STCS, but with compactions disabled for the duration of the test. Once all writers had finished, I re-enabled compactions. In under one day STCS completed the job on all nodes; I ended up with a few dozen (~40 or so) large sstables and a total of about 23TB of data on 15 nodes. I switched back to LCS this morning and immediately observed the hockey stick on the pending-compactions graph. Now each node reports about 8-10K pending compactions; they are all compacting in one stream per CF and consuming virtually no resources:

{code}
# nodetool compactionstats
pending tasks: 9900
compaction type   keyspace   table        completed     total           unit    progress
Compaction        testks     test_list2   26630083587   812539331642    bytes   3.28%
Compaction        testks     test_list1   24071738534   1994877844635   bytes   1.21%
Active compaction remaining time : 2h16m55s

# w
 13:41:45 up 23 days, 18:13,  2 users,  load average: 1.81, 2.13, 2.51
...

# iostat -mdx 5
Linux 3.8.13-44.el6uek.x86_64 (cassandra01.mydomain.com)  22/09/14  _x86_64_  (32 CPU)

Device:  rrqm/s  wrqm/s     r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb        0.00    5.73   88.00   13.33   5.47   5.16    214.84      0.51   5.08   0.39   3.98
sda        0.00    8.16    0.13   65.80   0.00   3.28    101.80      0.06   0.87   0.11   0.71
sdc        0.00    4.93   75.05   13.34   4.67   5.42    233.62      0.49   5.55   0.39   3.42
sdd        0.00    5.82   86.40   14.10   5.37   5.52    221.83      0.56   5.59   0.38   3.81

Device:  rrqm/s  wrqm/s     r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb        0.00    0.00  134.60    0.00   8.37   0.00    127.30      0.06   0.42   0.42   5.64
sda        0.00   13.00    0.00  220.40   0.00   0.96      8.94      0.01   0.05   0.01   0.32
sdc        0.00    0.00   36.40    0.00   2.27   0.00    128.00      0.01   0.41   0.41   1.50
sdd        0.00    0.00   21.20    0.00   1.32   0.00    128.00      0.00   0.19   0.19   0.40
{code}
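The reported 8-10K pending compactions per node lines up with a back-of-envelope count of how many fixed-size sstables the data represents per node. This is simplified reasoning on my part (LCS's real estimator is more involved), using the 23TB / 15 nodes / 160MB figures from the comment above:

```python
# Rough cross-check: total data cut into 160 MB sstables, per node.
TOTAL_TB, NODES, SSTABLE_MB = 23, 15, 160

per_node_mb = TOTAL_TB * 1024 * 1024 / NODES
approx_sstables = per_node_mb / SSTABLE_MB
print(f"~{approx_sstables:.0f} sstables' worth of data per node")  # ~10049
```

About 10K, squarely in the observed 8-10K range: essentially all of the node's data still has to be rewritten into leveled sstables.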
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143357#comment-14143357 ] Marcus Eriksson commented on CASSANDRA-7949: -- So, after switching back to LCS and letting it compact, you are bound to do a lot of L0 to L1 compaction in the beginning, since all sstables are in level 0 and need to pass through L1 before making it to the higher levels. L0 to L1 compactions usually include _all_ L1 sstables, which means that only one of them can proceed at a time. Looking at your compactionstats, you have one ~2TB compaction going on, probably between L0 and L1, that needs to finish before higher-level compactions can continue.
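The serialization Marcus describes can be sketched as a simple scheduling constraint: two compactions may run in parallel only if their input sstable sets are disjoint, and every L0->L1 task claims all of L1. This is an illustrative model, not Cassandra's code:

```python
def can_run_concurrently(inputs_a, inputs_b):
    """Two compaction tasks may overlap in time only if they share no
    input sstables."""
    return not (set(inputs_a) & set(inputs_b))

l1 = [f"l1-{i}" for i in range(10)]

# Each candidate L0->L1 compaction must include every L1 sstable...
task_a = ["l0-a"] + l1
task_b = ["l0-b"] + l1
print(can_run_concurrently(task_a, task_b))   # -> False: they serialize

# ...whereas compactions on disjoint higher-level sstables can parallelize.
print(can_run_concurrently(["l2-1", "l3-4"], ["l2-7", "l3-9"]))  # -> True
```

With thousands of L0 sstables, nearly every runnable task is an L0->L1 task, so the node grinds through them one at a time: exactly the one-stream-per-CF pattern reported above.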
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143365#comment-14143365 ]

Benedict commented on CASSANDRA-7949:
-------------------------------------

That still sounds like fairly suboptimal behaviour, but it sounds like CASSANDRA-6696 should help to address it. When there is some time, we should also reintroduce a more functional multi-threaded compaction. It should be quite achievable to build one that is correct, safe and faster for these scenarios.

LCS compaction low performance, many pending compactions, nodes are almost idle
-------------------------------------------------------------------------------

                Key: CASSANDRA-7949
                URL: https://issues.apache.org/jira/browse/CASSANDRA-7949
            Project: Cassandra
         Issue Type: Bug
         Components: Core
        Environment: DSE 4.5.1-1, Cassandra 2.0.8
           Reporter: Nikolai Grigoriev
        Attachments: iostats.txt, nodetool_compactionstats.txt, nodetool_tpstats.txt, pending compactions 2day.png, system.log.gz, vmstat.txt

I have been evaluating a new cluster of 15 nodes (32 cores, 6x800 GB SSD + 2x600 GB SAS disks, 128 GB RAM, OEL 6.5) and have built a simulator that creates a load similar to that of our future product. Before running the simulator I had to pre-generate enough data. This was done using Java code and the DataStax Java driver. Without going deep into details, two tables have been generated. Each table currently has about 55M rows, with between a few dozen and a few thousand columns in each row. The data generation process produced a massive amount of non-overlapping data, so the activity was write-only and highly parallel.

This is not the type of traffic the system will ultimately have to deal with; in the future it will be a mix of reads and updates to existing data. I mention this only to explain the choice of LCS, not to mention the expensive SSD disk space. At some point while generating the data I noticed that compactions started to pile up. I knew I was overloading the cluster, but I still wanted the generation test to complete, expecting to give the cluster enough time afterwards to finish the pending compactions and get ready for real traffic.

However, after the storm of write requests had been stopped, the number of pending compactions remained constant (and even climbed up a little bit) on all nodes. After trying to tune some parameters (like setting the compaction bandwidth cap to 0) I noticed a strange pattern: the nodes were compacting one of the CFs in a single stream, using virtually no CPU and no disk I/O. This process would take hours. It would then be followed by a short burst of a few dozen compactions running in parallel (CPU at 2000%, some disk I/O - up to 10-20%), after which the node would get stuck again for many hours doing one compaction at a time. So it looks like this:

# nodetool compactionstats
pending tasks: 3351
  compaction type   keyspace        table     completed           total   unit  progress
       Compaction       myks  table_list1   66499295588   1910515889913  bytes     3.48%
Active compaction remaining time : n/a

# df -h
...
/dev/sdb  1.5T  637G  854G  43%  /cassandra-data/disk1
/dev/sdc  1.5T  425G  1.1T  29%  /cassandra-data/disk2
/dev/sdd  1.5T  429G  1.1T  29%  /cassandra-data/disk3

# find . -name '*table_list1*Data*' | grep -v snapshot | wc -l
1310

Among these files I see:
- 1043 files of 161 MB (my sstable size is 160 MB)
- 9 large files: 3 between 1 and 2 GB, 3 of 5-8 GB, and one each of 55 GB, 70 GB and 370 GB
- 263 files of various sizes, between a few dozen KB and 160 MB

I ran the heavy load for about 1.5 days; it has now been close to 3 days since then and the number of pending compactions does not go down. I have applied one of the not-so-obvious recommendations (disabling multithreaded compaction) and that seems to be helping a bit: some nodes - about half of the cluster, in fact - have started to have fewer pending compactions. But even there the nodes sit idle most of the time, lazily compacting in one stream with CPU at ~140%, occasionally doing bursts of compaction work for a few minutes.

I am wondering whether this is really a bug, or something in the LCS logic that manifests itself only in such an edge-case scenario where a lot of unique data has been loaded quickly. By the way, I see this pattern for only one of the two tables - the one that has about 4 times more data than the other (space-wise; the number of rows is the same). It looks like all these pending compactions are really only for that larger table. I'll be attaching the relevant logs shortly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
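The sstable census in the report above (counting data files and bucketing them by size) can be reproduced with a small shell function; a sketch assuming GNU find and the table name from the report, with the data-directory paths passed in as arguments:

```shell
# sstable_census DIR...
# Counts *table_list1*Data* files outside snapshots and buckets them by size:
#   huge    = >= 1 GiB (should not exist with a 160 MB sstable size)
#   full    = >= 150 MiB (normally sized sstables)
#   partial = anything smaller
sstable_census() {
  find "$@" -name '*table_list1*Data*' ! -path '*snapshot*' -printf '%s\n' |
  awk '{ if ($1 >= 1073741824)      huge++
         else if ($1 >= 157286400)  full++
         else                       partial++ }
       END { printf "full=%d partial=%d huge=%d\n", full+0, partial+0, huge+0 }'
}

# On the reporter's layout this would be invoked roughly as:
# sstable_census /cassandra-data/disk1 /cassandra-data/disk2 /cassandra-data/disk3
```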
[jira] [Commented] (CASSANDRA-7949) LCS compaction low performance, many pending compactions, nodes are almost idle
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141080#comment-14141080 ]

Philip Thompson commented on CASSANDRA-7949:
--------------------------------------------

Did switching to STCS end up solving your issue?
[jira] [Commented] (CASSANDRA-7949) LCS compaction low performance, many pending compactions, nodes are almost idle
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141087#comment-14141087 ]

Nikolai Grigoriev commented on CASSANDRA-7949:
----------------------------------------------

Yes and no. Yes - the number of pending compactions started to go down and I ended up with fewer (and larger) sstables. But I think the issue is more about LCS compaction performance: is it normal that LCS cannot efficiently use the host's resources while having tons of pending compactions?
[jira] [Commented] (CASSANDRA-7949) LCS compaction low performance, many pending compactions, nodes are almost idle
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141140#comment-14141140 ]

Benedict commented on CASSANDRA-7949:
-------------------------------------

I must admit, the behaviour does sound very suspicious to me. However, looking at the logs it appears that the problem may be related to pendingTasks being an _estimate_: the compaction manager repeatedly looks for work to do and says nope, nothing. Is it possible you were generating sstables that were contiguous and non-overlapping, so that there were no candidates for compaction? LCS is not my area of expertise, but [~krummas] should certainly have a look. Estimating ~3k compaction tasks and finding _none_ (repeatedly) seems off to me. At the very least we should fix the estimate, but I suspect something else may be happening.
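Benedict's contiguous-and-non-overlapping hypothesis can be illustrated with a toy model (this is a sketch of the general LCS idea, not Cassandra's actual candidate-selection code): an sstable picked from level N is merged with every sstable in level N+1 whose token range overlaps it, so bulk-loaded data written in disjoint token ranges yields empty overlap sets and no real merge work.

```python
def overlapping(candidate, next_level):
    """Return sstables in next_level whose [first, last] token
    range overlaps the candidate's range (closed intervals)."""
    first, last = candidate
    return [(f, l) for (f, l) in next_level if not (l < first or f > last)]

# Contiguous, non-overlapping ranges, as hypothesised for the bulk load:
level1 = [(0, 99), (100, 199), (200, 299)]

# A candidate beyond the loaded ranges finds nothing to merge with,
# so the "compaction" degenerates to a cheap single-sstable move:
print(overlapping((300, 399), level1))   # -> []

# An overlapping candidate, by contrast, picks up real merge partners:
print(overlapping((150, 250), level1))   # -> [(100, 199), (200, 299)]
```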
[jira] [Commented] (CASSANDRA-7949) LCS compaction low performance, many pending compactions, nodes are almost idle
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141154#comment-14141154 ]

Nikolai Grigoriev commented on CASSANDRA-7949:
----------------------------------------------

I understand that it is an estimate, but my cluster spent almost 3 full days trying to work through that estimate with little progress. About 1.5 days of data injection followed by 3 days of compaction with no progress - that does not sound right. And STCS was able to crunch most of the data in about one day after the switch. I strongly suspect that loading (rather than updating) the data at a high rate resulted in some sort of edge-case scenario for LCS. But considering that the cluster could not recover in a reasonable amount of time (exceeding the original load time by a factor of 2+), I do believe that either something needs to be improved in the LCS logic, or some kind of diagnostic message needs to be generated to request a specific action from the cluster owner. In my case the problem was easy to spot because it was highly visible - but if this happens to one of 50 CFs, it may take a while before someone notices the endless compactions.
[jira] [Commented] (CASSANDRA-7949) LCS compaction low performance, many pending compactions, nodes are almost idle
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141167#comment-14141167 ]

Benedict commented on CASSANDRA-7949:
-------------------------------------

Well, my point is that the server _thought there was no work to do_. The only logical explanation is that it was either wrong, or the data was generated in a manner that actually _didn't_ need compacting as much as the estimate thought (and the estimate was wrong). The estimate being _so_ out of whack seems unlikely to me, but it is a possibility. It's worth mentioning that there definitely _were_ compactions happening, just not very many. It's possible these in-progress compactions were preventing other compactions that should have been happening from taking place, since sstables cannot participate in other compactions once they're already involved in one. This _does_ seem like something we need to investigate and understand thoroughly, whatever it turns out to be.
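The blocking effect Benedict describes can be sketched with another toy model (again not Cassandra's code; sstable names are hypothetical): sstables participating in a running compaction are effectively "marked compacting", so any other task whose input set intersects the running one cannot start, and a single huge multi-hour compaction can starve an otherwise long queue.

```python
# Inputs of a hypothetical long-running compaction (e.g. one involving
# the 370 GB file from the report) that are unavailable to other tasks:
running = {"big-370G", "sst-12", "sst-13"}

def can_start(task_inputs, compacting=running):
    """A task may start only if none of its input sstables
    are already involved in a running compaction."""
    return compacting.isdisjoint(task_inputs)

print(can_start({"sst-13", "sst-40"}))   # False: shares sst-13 with the running task
print(can_start({"sst-50", "sst-51"}))   # True: disjoint inputs may proceed
```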
[jira] [Commented] (CASSANDRA-7949) LCS compaction low performance, many pending compactions, nodes are almost idle
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141174#comment-14141174 ]

Nikolai Grigoriev commented on CASSANDRA-7949:
----------------------------------------------

Just a small clarification: "just not very many" is not exactly what I observed. Mostly there was one active compaction, but once in a while there was a burst of compactions with high CPU usage, GOSSIP issues caused by the nodes being less responsive, etc.
[jira] [Commented] (CASSANDRA-7949) LCS compaction low performance, many pending compactions, nodes are almost idle
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141204#comment-14141204 ]

Jeremiah Jordan commented on CASSANDRA-7949:
--------------------------------------------

[~benedict] I don't think there were ever *no* compactions. Just periods where there was one compaction going on that was blocking any concurrent compactions from happening.
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14141289#comment-14141289 ] Benedict commented on CASSANDRA-7949: bq. I don't think there were ever *no* compactions
I stated that there were compactions happening. But there seems to be something wrong when you have 3k estimated compactions and only 1 taking place.
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137054#comment-14137054 ] Nikolai Grigoriev commented on CASSANDRA-7949: I see. I could try to switch to STCS now and see what happens. My concern is that the issue seems to be permanent. Even after last night, none of the nodes (being virtually idle - the load was over) was able to eat through the pending compactions. And, to my surprise, half of the nodes in the cluster do not even compact fast enough - look at the graphs attached.
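For anyone following along, switching a table's compaction strategy (as contemplated above) can be done online via CQL; a sketch, using the keyspace and table names from the compactionstats output in the issue description:

```sql
-- Switch the hot table from LCS to STCS online (no restart required);
-- existing sstables are gradually reorganized under the new strategy.
ALTER TABLE myks.table_list1
  WITH compaction = { 'class': 'SizeTieredCompactionStrategy' };
```

Switching back to LCS later (with the 160 MB sstable size used here) is the mirror-image statement with `'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160`.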
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138137#comment-14138137 ] Nikolai Grigoriev commented on CASSANDRA-7949: Just an update: I switched to STCS early this morning and by now half of the nodes are getting close to zero pending compactions. Half of the remaining nodes seem to be behind, but they are compacting at full speed (smoke coming from the lab ;) ) and I see the number of pending compactions going down on them as well. On the nodes where compactions are almost over, the number of sstables is now very small - less than a hundred.
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136549#comment-14136549 ] Nikolai Grigoriev commented on CASSANDRA-7949: The attached system.log already includes the output from log4j.logger.org.apache.cassandra.db.compaction (except log4j.logger.org.apache.cassandra.db.compaction.ParallelCompactionIterable).
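For anyone reproducing this on Cassandra 2.0 / DSE 4.5, the compaction logging referenced above is controlled through the log4j server configuration; a sketch (the exact file path varies by install, and the INFO line for the parallel iterator is just one way to keep that logger quieter):

```properties
# conf/log4j-server.properties - enable verbose compaction logging
log4j.logger.org.apache.cassandra.db.compaction=DEBUG
# keep the parallel-compaction iterator at the default level to reduce noise
log4j.logger.org.apache.cassandra.db.compaction.ParallelCompactionIterable=INFO
```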
[ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136712#comment-14136712 ] Jeremiah Jordan commented on CASSANDRA-7949: For the initial load you probably want to disable STCS in L0 (CASSANDRA-6621), or maybe use STCS and then switch to LCS when the load is over.
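The STCS-in-L0 fallback introduced by CASSANDRA-6621 is gated behind a JVM system property; a sketch of disabling it, assuming the `cassandra.disable_stcs_in_l0` property name (verify against the source for your exact version before relying on it):

```shell
# conf/cassandra-env.sh - stop LCS from falling back to size-tiered
# compaction in L0 during bulk loads (behavior added by CASSANDRA-6621);
# requires a node restart to take effect
JVM_OPTS="$JVM_OPTS -Dcassandra.disable_stcs_in_l0=true"
```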