Hi,

We have a Cassandra 2.0.11 cluster of 6 nodes (RF=3) on c3.4xlarge instances,
with about 50 column families. The Cassandra heap takes 8GB out of the 30GB of
RAM on each instance.
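
In cassandra-env.sh the heap is sized along these lines (only MAX_HEAP_SIZE
reflects our actual setting; the new-gen size is just an illustrative
placeholder):

    MAX_HEAP_SIZE="8G"     # 8GB heap on a 30GB instance
    HEAP_NEWSIZE="800M"    # illustrative value only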

We mainly store time series-like data, where each data point is a binary blob
of 5-20KB. We use wide rows and try to put in the same row all the data we
usually need in a single query (but not more than that). As a result, our
application logic is very simple (on average a read is a single query) and
read/write response times are very satisfactory. A rough sketch of the kind of
table we use follows; after that are cfhistograms and cfstats for our heaviest
CF.
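
In CQL terms it looks something like this (a hypothetical sketch: column names
and types are illustrative, not our exact schema; the leveled compaction
matches the per-level SSTable counts in the stats below):

    CREATE TABLE protobuf_by_agent1 (
        agent_id    text,        -- partition key: one wide row per agent
        event_time  timestamp,   -- clustering column: one entry per data point
        payload     blob,        -- the 5-20KB binary blob
        PRIMARY KEY (agent_id, event_time)
    ) WITH compaction = {'class': 'LeveledCompactionStrategy'};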

SSTables per Read
1 sstables: 3198856
2 sstables: 45

Write Latency (microseconds)
      4 us: 37
      5 us: 1247
      6 us: 9987
      7 us: 31442
      8 us: 66121
     10 us: 400503
     12 us: 1158329
     14 us: 2873934
     17 us: 11843616
     20 us: 24464275
     24 us: 30574717
     29 us: 24351624
     35 us: 16788801
     42 us: 3935374
     50 us: 797781
     60 us: 272160
     72 us: 121819
     86 us: 64641
    103 us: 41085
    124 us: 33618
    149 us: 199463
    179 us: 255445
    215 us: 38238
    258 us: 12300
    310 us: 5307
    372 us: 3180
    446 us: 2443
    535 us: 1773
    642 us: 1314
    770 us: 991
    924 us: 748
   1109 us: 606
   1331 us: 465
   1597 us: 433
   1916 us: 453
   2299 us: 484
   2759 us: 983
   3311 us: 976
   3973 us: 338
   4768 us: 312
   5722 us: 237
   6866 us: 198
   8239 us: 163
   9887 us: 138
  11864 us: 115
  14237 us: 231
  17084 us: 550
  20501 us: 603
  24601 us: 635
  29521 us: 875
  35425 us: 731
  42510 us: 497
  51012 us: 476
  61214 us: 347
  73457 us: 331
  88148 us: 273
 105778 us: 143
 126934 us: 92
 152321 us: 47
 182785 us: 16
 219342 us: 5
 263210 us: 2
 315852 us: 2
 379022 us: 1
 454826 us: 1
 545791 us: 1
 654949 us: 0
 785939 us: 0
 943127 us: 1
1131752 us: 1

Read Latency (microseconds)
     20 us: 1
     24 us: 9
     29 us: 18
     35 us: 96
     42 us: 6989
     50 us: 113305
     60 us: 552348
     72 us: 772329
     86 us: 654019
    103 us: 578404
    124 us: 300364
    149 us: 111522
    179 us: 37385
    215 us: 18353
    258 us: 10733
    310 us: 7915
    372 us: 9406
    446 us: 7645
    535 us: 2773
    642 us: 1323
    770 us: 1351
    924 us: 953
   1109 us: 857
   1331 us: 1122
   1597 us: 800
   1916 us: 806
   2299 us: 686
   2759 us: 581
   3311 us: 671
   3973 us: 318
   4768 us: 318
   5722 us: 226
   6866 us: 164
   8239 us: 161
   9887 us: 134
  11864 us: 125
  14237 us: 184
  17084 us: 285
  20501 us: 315
  24601 us: 378
  29521 us: 431
  35425 us: 468
  42510 us: 469
  51012 us: 466
  61214 us: 407
  73457 us: 337
  88148 us: 297
 105778 us: 242
 126934 us: 135
 152321 us: 109
 182785 us: 57
 219342 us: 41
 263210 us: 28
 315852 us: 16
 379022 us: 12
 454826 us: 6
 545791 us: 6
 654949 us: 0
 785939 us: 0
 943127 us: 0
1131752 us: 2

Partition Size (bytes)
    3311 bytes: 1
    3973 bytes: 2
    4768 bytes: 0
    5722 bytes: 2
    6866 bytes: 0
    8239 bytes: 0
    9887 bytes: 2
   11864 bytes: 1
   14237 bytes: 0
   17084 bytes: 0
   20501 bytes: 0
   24601 bytes: 0
   29521 bytes: 3
   35425 bytes: 0
   42510 bytes: 1
   51012 bytes: 1
   61214 bytes: 1
   73457 bytes: 3
   88148 bytes: 1
  105778 bytes: 5
  126934 bytes: 2
  152321 bytes: 4
  182785 bytes: 65
  219342 bytes: 165
  263210 bytes: 268
  315852 bytes: 201
  379022 bytes: 30
  454826 bytes: 248
  545791 bytes: 16
  654949 bytes: 41
  785939 bytes: 259
  943127 bytes: 547
 1131752 bytes: 243
 1358102 bytes: 176
 1629722 bytes: 59
 1955666 bytes: 37
 2346799 bytes: 41
 2816159 bytes: 78
 3379391 bytes: 243
 4055269 bytes: 122
 4866323 bytes: 209
 5839588 bytes: 220
 7007506 bytes: 266
 8409007 bytes: 77
10090808 bytes: 103
12108970 bytes: 1
14530764 bytes: 2
17436917 bytes: 7
20924300 bytes: 410
25109160 bytes: 76

Cell Count per Partition
    3 cells: 5
    4 cells: 0
    5 cells: 0
    6 cells: 2
    7 cells: 0
    8 cells: 0
   10 cells: 2
   12 cells: 1
   14 cells: 0
   17 cells: 0
   20 cells: 1
   24 cells: 3
   29 cells: 1
   35 cells: 1
   42 cells: 0
   50 cells: 0
   60 cells: 3
   72 cells: 0
   86 cells: 1
  103 cells: 0
  124 cells: 11
  149 cells: 3
  179 cells: 4
  215 cells: 10
  258 cells: 13
  310 cells: 2181
  372 cells: 2
  446 cells: 2
  535 cells: 2
  642 cells: 4
  770 cells: 7
  924 cells: 488
 1109 cells: 3
 1331 cells: 24
 1597 cells: 143
 1916 cells: 332
 2299 cells: 2
 2759 cells: 5
 3311 cells: 483
 3973 cells: 0
 4768 cells: 2
 5722 cells: 1
 6866 cells: 1
 8239 cells: 0
 9887 cells: 2
11864 cells: 244
14237 cells: 1
17084 cells: 248
20501 cells: 1
24601 cells: 1
29521 cells: 1
35425 cells: 2
42510 cells: 1
51012 cells: 2
61214 cells: 237


Read Count: 3202919
Read Latency: 0.16807454013042478 ms.
Write Count: 118568574
Write Latency: 0.026566498615391967 ms.
Pending Tasks: 0
  Table: protobuf_by_agent1
  SSTable count: 49
  SSTables in each level: [1, 11/10, 37, 0, 0, 0, 0, 0, 0]
  Space used (live), bytes: 6934395462
  Space used (total), bytes: 6936424794
  SSTable Compression Ratio: 0.17466856216707344
  Number of keys (estimate): 7040
  Memtable cell count: 47412
  Memtable data size, bytes: 171225925
  Memtable switch count: 4164
  Local read count: 3202919
  Local read latency: 0.168 ms
  Local write count: 118568575
  Local write latency: 0.027 ms
  Pending tasks: 0
  Bloom filter false positives: 286
  Bloom filter false ratio: 0.00000
  Bloom filter space used, bytes: 11480
  Compacted partition minimum bytes: 2760
  Compacted partition maximum bytes: 107964792
  Compacted partition mean bytes: 9316821
  Average live cells per slice (last five minutes): 1.0
  Average tombstones per slice (last five minutes): 0.0

(Note how "Compacted partition maximum bytes" is ~108MB, even though the
partition size distribution reported by cfhistograms tops out much lower than
that.)

The other day a few nodes went down: the heap was full and full GCs were
running constantly. Tracking down the problem, it looked like we had
misconfigured compaction throughput on some nodes, and once pending
compactions rose above a certain threshold (~200) the issue started. Normally,
heap usage stays well under 4GB. Restarting the nodes and temporarily
unthrottling compaction (setting the throughput limit to unlimited) solved the
issue.
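
Concretely, the recovery steps were along these lines (standard nodetool
commands; values as described above):

    # watch the number of pending compactions ("pending tasks" in the output)
    nodetool compactionstats

    # after restarting a node, remove the compaction throttle entirely
    # (0 means unlimited for setcompactionthroughput)
    nodetool setcompactionthroughput 0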

I was able to reproduce this issue in a testing environment just by loading
data into the cluster while compaction throughput was set to a very low value:
once pending compactions went above the threshold, nodes started going down.
My assumption is that, with rows this wide, compaction has to bring a lot of
data into memory, and this gets worse when many compactions are pending. I did
the following experiments:

1) I repeated the same experiment with a different partition key, so that each
row was effectively about 1/10th the size of the rows in the original scenario
(conceptually, splitting each original row across several smaller partitions;
see the sketch after this list). In this case I didn't experience any node
failures, even when pending compactions went as high as 900.

2) I repeated the same experiment with Cassandra 2.1.2. Even with the original
partition key, nodes never went down. I'm not sure I understand why; the only
thing I can say is that even after a couple of hours, pending compactions
never exceeded 50 (whereas on 2.0.11 it took only about 20 minutes to reach
the point of failure).
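
The partition key change in experiment 1 was conceptually like adding a time
bucket to the partition key, spreading each agent's data over several smaller
partitions (again a hypothetical sketch with illustrative names, not our exact
schema):

    CREATE TABLE protobuf_by_agent_bucketed (
        agent_id    text,
        time_bucket int,         -- e.g. a day number; splits one wide row into ~10 smaller ones
        event_time  timestamp,
        payload     blob,
        PRIMARY KEY ((agent_id, time_bucket), event_time)
    ) WITH compaction = {'class': 'LeveledCompactionStrategy'};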

Do you have any best practices on sizing wide rows? Are we doing something
wrong by using rows this wide, or is the problem we experienced unrelated?

Thanks a lot.
