Hi,

We have a cluster of 6 nodes (RF=3) running Cassandra 2.0.11 on c3.4xlarge instances, with about 50 column families. The Cassandra heap takes 8 GB of the 30 GB available on each instance.

We mainly store time-series-like data, where each data point is a binary blob of 5-20 KB. We use wide rows and try to put into the same row all the data we usually need in a single query (but not more than that). As a result, our application logic is very simple (on average a single query is enough to read the data we need) and read/write response times are very satisfactory.
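For context, here is a minimal sketch of the kind of layout we mean, written with the DataStax Python driver. The keyspace, table and column names, the contact point, and the daily bucketing are illustrative only, not our actual schema:

from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()  # illustrative contact point

# Keyspace for the example; RF=3 matches our cluster.
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS metrics WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 3}"
)
session.set_keyspace("metrics")

# One partition per agent per day, clustered by timestamp; each data point
# is an opaque blob of a few KB (protobuf in our case).
session.execute("""
    CREATE TABLE IF NOT EXISTS points_by_agent (
        agent_id text,
        day      text,       -- partition bucket, e.g. '2014-12-01'
        ts       timestamp,
        payload  blob,       -- 5-20 KB serialized data point
        PRIMARY KEY ((agent_id, day), ts)
    )
""")

# The "single query" read path: one partition slice returns everything the
# application needs for a typical request.
read_stmt = session.prepare(
    "SELECT ts, payload FROM points_by_agent "
    "WHERE agent_id = ? AND day = ? AND ts >= ? AND ts < ?"
)
rows = session.execute(
    read_stmt,
    ("agent-42", "2014-12-01", datetime(2014, 12, 1, 8), datetime(2014, 12, 1, 12)),
)

With a layout like this, everything a request needs lives in one wide partition, which is what makes the single-query read possible, but it is also what makes individual partitions large.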
Here are cfhistograms and cfstats for our heaviest CF:

SSTables per Read
1 sstables: 3198856, 2 sstables: 45

Write Latency (microseconds)
4 us: 37, 5 us: 1247, 6 us: 9987, 7 us: 31442, 8 us: 66121, 10 us: 400503, 12 us: 1158329, 14 us: 2873934, 17 us: 11843616, 20 us: 24464275, 24 us: 30574717, 29 us: 24351624, 35 us: 16788801, 42 us: 3935374, 50 us: 797781, 60 us: 272160, 72 us: 121819, 86 us: 64641, 103 us: 41085, 124 us: 33618, 149 us: 199463, 179 us: 255445, 215 us: 38238, 258 us: 12300, 310 us: 5307, 372 us: 3180, 446 us: 2443, 535 us: 1773, 642 us: 1314, 770 us: 991, 924 us: 748, 1109 us: 606, 1331 us: 465, 1597 us: 433, 1916 us: 453, 2299 us: 484, 2759 us: 983, 3311 us: 976, 3973 us: 338, 4768 us: 312, 5722 us: 237, 6866 us: 198, 8239 us: 163, 9887 us: 138, 11864 us: 115, 14237 us: 231, 17084 us: 550, 20501 us: 603, 24601 us: 635, 29521 us: 875, 35425 us: 731, 42510 us: 497, 51012 us: 476, 61214 us: 347, 73457 us: 331, 88148 us: 273, 105778 us: 143, 126934 us: 92, 152321 us: 47, 182785 us: 16, 219342 us: 5, 263210 us: 2, 315852 us: 2, 379022 us: 1, 454826 us: 1, 545791 us: 1, 654949 us: 0, 785939 us: 0, 943127 us: 1, 1131752 us: 1

Read Latency (microseconds)
20 us: 1, 24 us: 9, 29 us: 18, 35 us: 96, 42 us: 6989, 50 us: 113305, 60 us: 552348, 72 us: 772329, 86 us: 654019, 103 us: 578404, 124 us: 300364, 149 us: 111522, 179 us: 37385, 215 us: 18353, 258 us: 10733, 310 us: 7915, 372 us: 9406, 446 us: 7645, 535 us: 2773, 642 us: 1323, 770 us: 1351, 924 us: 953, 1109 us: 857, 1331 us: 1122, 1597 us: 800, 1916 us: 806, 2299 us: 686, 2759 us: 581, 3311 us: 671, 3973 us: 318, 4768 us: 318, 5722 us: 226, 6866 us: 164, 8239 us: 161, 9887 us: 134, 11864 us: 125, 14237 us: 184, 17084 us: 285, 20501 us: 315, 24601 us: 378, 29521 us: 431, 35425 us: 468, 42510 us: 469, 51012 us: 466, 61214 us: 407, 73457 us: 337, 88148 us: 297, 105778 us: 242, 126934 us: 135, 152321 us: 109, 182785 us: 57, 219342 us: 41, 263210 us: 28, 315852 us: 16, 379022 us: 12, 454826 us: 6, 545791 us: 6, 654949 us: 0, 785939 us: 0, 943127 us: 0, 1131752 us: 2

Partition Size (bytes)
3311 bytes: 1, 3973 bytes: 2, 4768 bytes: 0, 5722 bytes: 2, 6866 bytes: 0, 8239 bytes: 0, 9887 bytes: 2, 11864 bytes: 1, 14237 bytes: 0, 17084 bytes: 0, 20501 bytes: 0, 24601 bytes: 0, 29521 bytes: 3, 35425 bytes: 0, 42510 bytes: 1, 51012 bytes: 1, 61214 bytes: 1, 73457 bytes: 3, 88148 bytes: 1, 105778 bytes: 5, 126934 bytes: 2, 152321 bytes: 4, 182785 bytes: 65, 219342 bytes: 165, 263210 bytes: 268, 315852 bytes: 201, 379022 bytes: 30, 454826 bytes: 248, 545791 bytes: 16, 654949 bytes: 41, 785939 bytes: 259, 943127 bytes: 547, 1131752 bytes: 243, 1358102 bytes: 176, 1629722 bytes: 59, 1955666 bytes: 37, 2346799 bytes: 41, 2816159 bytes: 78, 3379391 bytes: 243, 4055269 bytes: 122, 4866323 bytes: 209, 5839588 bytes: 220, 7007506 bytes: 266, 8409007 bytes: 77, 10090808 bytes: 103, 12108970 bytes: 1, 14530764 bytes: 2, 17436917 bytes: 7, 20924300 bytes: 410, 25109160 bytes: 76

Cell Count per Partition
3 cells: 5, 4 cells: 0, 5 cells: 0, 6 cells: 2, 7 cells: 0, 8 cells: 0, 10 cells: 2, 12 cells: 1, 14 cells: 0, 17 cells: 0, 20 cells: 1, 24 cells: 3, 29 cells: 1, 35 cells: 1, 42 cells: 0, 50 cells: 0, 60 cells: 3, 72 cells: 0, 86 cells: 1, 103 cells: 0, 124 cells: 11, 149 cells: 3, 179 cells: 4, 215 cells: 10, 258 cells: 13, 310 cells: 2181, 372 cells: 2, 446 cells: 2, 535 cells: 2, 642 cells: 4, 770 cells: 7, 924 cells: 488, 1109 cells: 3, 1331 cells: 24, 1597 cells: 143, 1916 cells: 332, 2299 cells: 2, 2759 cells: 5, 3311 cells: 483, 3973 cells: 0, 4768 cells: 2, 5722 cells: 1, 6866 cells: 1, 8239 cells: 0, 9887 cells: 2, 11864 cells: 244, 14237 cells: 1, 17084 cells: 248, 20501 cells: 1, 24601 cells: 1, 29521 cells: 1, 35425 cells: 2, 42510 cells: 1, 51012 cells: 2, 61214 cells: 237

Read Count: 3202919
Read Latency: 0.16807454013042478 ms.
Write Count: 118568574
Write Latency: 0.026566498615391967 ms.
Pending Tasks: 0
Table: protobuf_by_agent1
SSTable count: 49
SSTables in each level: [1, 11/10, 37, 0, 0, 0, 0, 0, 0]
Space used (live), bytes: 6934395462
Space used (total), bytes: 6936424794
SSTable Compression Ratio: 0.17466856216707344
Number of keys (estimate): 7040
Memtable cell count: 47412
Memtable data size, bytes: 171225925
Memtable switch count: 4164
Local read count: 3202919
Local read latency: 0.168 ms
Local write count: 118568575
Local write latency: 0.027 ms
Pending tasks: 0
Bloom filter false positives: 286
Bloom filter false ratio: 0.00000
Bloom filter space used, bytes: 11480
Compacted partition minimum bytes: 2760
Compacted partition maximum bytes: 107964792
Compacted partition mean bytes: 9316821
Average live cells per slice (last five minutes): 1.0
Average tombstones per slice (last five minutes): 0.0

(Notice how "Compacted partition maximum bytes" is about 108 MB, while the distribution reported by cfhistograms is much lower than that.)

The other day a few nodes went down. The heap was full and full GCs were running constantly. Tracking down the problem, it looked like we had misconfigured compaction throughput on some nodes, and when pending compactions went above a certain threshold (~200) the issue started happening. Normally, heap usage is almost always under 4 GB. Restarting the nodes and temporarily setting compaction throughput to unlimited solved the issue.

I was able to reproduce the issue in a testing environment simply by loading data into the cluster and setting compaction throughput to a very low value. Again, when pending compactions went above the threshold, nodes started going down.

My assumption is that, with rows this wide, compaction has to bring a lot of data into memory, and that this gets worse when there are many pending compactions. I did the following experiments:

1) I repeated the same experiment, changing the partition key so that each row was effectively about 1/10th the size of the rows in the original scenario (roughly the kind of change sketched after this list). In this case I didn't experience any node failure, even when pending compactions went as high as 900.

2) I repeated the same experiment with Cassandra 2.1.2. Even with the original partition key, nodes never went down. I'm not sure I understand why; the only thing I can say is that even after a couple of hours pending compactions never exceeded 50 (whereas on 2.0.11 it took only about 20 minutes to reach the point of failure).
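To make experiment 1 concrete: it amounted to using a finer partition bucket, so that each partition holds roughly a tenth of the data. A hypothetical variant of the earlier sketch, bucketed by hour instead of by day (again, the names and exact granularity are illustrative, not exactly what we ran):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("metrics")  # illustrative, as in the sketch above

# Same layout as before, but one partition per agent per hour, so each
# partition ends up roughly an order of magnitude smaller.
session.execute("""
    CREATE TABLE IF NOT EXISTS points_by_agent_hour (
        agent_id text,
        hour     text,       -- e.g. '2014-12-01T13'
        ts       timestamp,
        payload  blob,
        PRIMARY KEY ((agent_id, hour), ts)
    )
""")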