Hmm, thanks for reading through it. I initially followed some (perhaps too old) maintenance scripts, which included a weekly 'nodetool compact'. Is there a way for me to undo the damage? Tombstones will be a very important issue for me, since the dataset is essentially a rolling one that relies heavily on TTLs.
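For what it's worth, here is roughly what I'm considering trying on one node first, based on your remarks. This is only a sketch: the keyspace/table names, paths and numbers are placeholders, so please tell me if I'm heading in the wrong direction:

  # Option 1: let single-SSTable tombstone compactions purge the expired
  # TTL data sitting in the giant SSTables left behind by 'nodetool compact'.
  # (my_keyspace/my_table are placeholders; 0.2 is just the default threshold)
  cqlsh -e "ALTER TABLE my_keyspace.my_table
            WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                               'unchecked_tombstone_compaction': 'true',
                               'tombstone_threshold': '0.2'};"

  # Option 2: with the node stopped, split the giant SSTables back into
  # ~50 MB pieces so SizeTiered has similarly-sized files to compact them
  # against again. sstablesplit must be run while Cassandra is down; I think
  # it lives under tools/bin on tarball installs, and the data path below is
  # just the default location on my boxes.
  nodetool drain
  sudo service cassandra stop
  sstablesplit -s 50 /var/lib/cassandra/data/my_keyspace/my_table-*/*-Data.db
  sudo service cassandra start

Does either of those sound reasonable, or would you go another way entirely?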
On Sun, Oct 26, 2014 at 6:04 PM, DuyHai Doan <doanduy...@gmail.com> wrote:

> "Should doing a major compaction on those nodes lead to a restructuration of the SSTables?" --> Beware of the major compaction on SizeTiered: it will create 2 giant SSTables, and the expired/outdated/tombstone columns in these big files will never be cleaned, since the SSTables will never get a chance to be compacted again.
>
> Essentially, to reduce the fragmentation of small SSTables you can stay with SizeTiered compaction and play around with the compaction properties (the thresholds) to make C* group a bunch of files each time it compacts, so that the file number shrinks to a reasonable count.
>
> Since you're using C* 2.1 and anti-compaction has been introduced, I hesitate to advise you to use Leveled compaction as a work-around to reduce the SSTable count.
>
> Things are a little bit more complicated because of the incremental repair process (I don't know whether you're using incremental repair or not in production). The Dev blog says that Leveled compaction is performed only on repaired SSTables, while the un-repaired ones still use SizeTiered; more details here: http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1
>
> Regards
>
>
> On Sun, Oct 26, 2014 at 9:44 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>
>> If the issue is related to I/O, you're going to want to determine if you're saturated. Take a look at `iostat -dmx 1`; you'll see avgqu-sz (queue size) and svctm (service time). The higher those numbers are, the more overwhelmed your disk is.
>>
>> On Sun, Oct 26, 2014 at 12:01 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>
>> > Hello Maxime
>> >
>> > Increasing the flush writers won't help if your disk I/O is not keeping up.
>> >
>> > I've had a look into the log file; below are some remarks:
>> >
>> > 1) There are a lot of SSTables on disk for some tables (events for example, but not only). I've seen that some compactions are taking up to 32 SSTables (which corresponds to the default max value for SizeTiered compaction).
>> >
>> > 2) There is a secondary index that I found suspicious: loc.loc_id_idx. As its name implies, I have the impression that it's an index on the id of the loc, which would lead to almost a 1-1 relationship between the indexed value and the original loc. Such indexes should be avoided because they do not perform well. If it's not an index on the loc_id, please disregard my remark.
>> >
>> > 3) There is a clear imbalance of SSTable count on some nodes. In the log, I saw:
>> >
>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.20] 2014-10-25 02:21:43,360 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 163 files(4 111 187 195 bytes), sending 0 files(0 bytes)
>> >
>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.81] 2014-10-25 02:21:46,121 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 154 files(3 332 779 920 bytes), sending 0 files(0 bytes)
>> >
>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.71] 2014-10-25 02:21:50,494 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 1315 files(4 606 316 933 bytes), sending 0 files(0 bytes)
>> >
>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.217] 2014-10-25 02:21:51,036 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 1640 files(3 208 023 573 bytes), sending 0 files(0 bytes)
>> >
>> > As you can see, the existing 4 nodes are streaming data to the new node, and on average the data set size is about 3.3 - 4.5 GB per node. However, the number of SSTables is around 150 files for nodes xxxx.xxxx.xxxx.20 and xxxx.xxxx.xxxx.81, but goes through the roof to reach 1315 files for xxxx.xxxx.xxxx.71 and 1640 files for xxxx.xxxx.xxxx.217.
>> >
>> > The total data set size is roughly the same, but the file number is x10, which means that you'll have a bunch of tiny files.
>> >
>> > I guess that upon reception of those files there will be a massive flush to disk, explaining the behaviour you're facing (flush storm).
>> >
>> > I would suggest looking on nodes xxxx.xxxx.xxxx.71 and xxxx.xxxx.xxxx.217 to check the total SSTable count for each table, to confirm this intuition.
>> >
>> > Regards
>> >
>> >
>> > On Sun, Oct 26, 2014 at 4:58 PM, Maxime <maxim...@gmail.com> wrote:
>> >
>> >> I've emailed you a raw log file of an instance of this happening.
>> >>
>> >> I've been monitoring more closely the timing of events in tpstats and the logs, and I believe this is what is happening:
>> >>
>> >> - For some reason, C* decides to provoke a flush storm (I say "some reason"; I'm sure there is one, but I have had difficulty determining the behaviour changes between 1.* and more recent releases).
>> >> - So we see ~3000 flushes being enqueued.
>> >> - This happens so suddenly that even boosting the number of flush writers to 20 does not suffice. I don't even see "all time blocked" numbers for it before C* stops responding. I suspect this is due to the sudden OOM and GC occurring.
>> >> - The last tpstats output that comes back before the node goes down indicates 20 active, 3000 pending and the rest 0. It's by far the most anomalous activity.
>> >>
>> >> Is there a way to throttle down this generation of flushes? C* complains if I set the queue_size to any value (deprecated now?), and boosting the threads does not seem to help, since even at 20 we're an order of magnitude off.
>> >>
>> >> Suggestions? Comments?
>> >>
>> >>
>> >> On Sun, Oct 26, 2014 at 2:26 AM, DuyHai Doan <doanduy...@gmail.com> wrote:
>> >>
>> >>> Hello Maxime
>> >>>
>> >>> Can you put the complete logs and config somewhere? It would be interesting to know what the cause of the OOM is.
>> >>>
>> >>> On Sun, Oct 26, 2014 at 3:15 AM, Maxime <maxim...@gmail.com> wrote:
>> >>>
>> >>>> Thanks a lot, that is comforting. We are also small at the moment, so I can definitely relate to the idea of keeping things small and simple, at a level where it just works.
>> >>>>
>> >>>> I see the new Apache version has a lot of fixes, so I will try to upgrade before I look into downgrading.
>> >>>>
>> >>>>
>> >>>> On Saturday, October 25, 2014, Laing, Michael <michael.la...@nytimes.com> wrote:
>> >>>>
>> >>>>> Since no one else has stepped in...
>> >>>>>
>> >>>>> We have run clusters with ridiculously small nodes - I have a production cluster in AWS with 4GB nodes, each with 1 CPU and disk-based instance storage. It works fine, but you can see those little puppies struggle...
>> >>>>>
>> >>>>> And I ran into problems such as you observe...
>> >>>>>
>> >>>>> Upgrading Java to the latest 1.7 and - most importantly - reverting to the default configuration, esp. for heap, seemed to settle things down completely. Also make sure that you are using the 'recommended production settings' from the docs on your boxen.
>> >>>>>
>> >>>>> However, we are running 2.0.x, not 2.1.0, so YMMV.
>> >>>>>
>> >>>>> And we are switching to 15GB nodes with 2 heftier CPUs each and SSD storage - still a 'small' machine, but much more reasonable for C*.
>> >>>>>
>> >>>>> However, I can't say I am an expert, since I deliberately keep things so simple that we do not encounter problems - it just works, so I dig into other stuff.
>> >>>>>
>> >>>>> ml
>> >>>>>
>> >>>>>
>> >>>>> On Sat, Oct 25, 2014 at 5:22 PM, Maxime <maxim...@gmail.com> wrote:
>> >>>>>
>> >>>>>> Hello, I've been trying to add a new node to my cluster (4 nodes) for a few days now.
>> >>>>>>
>> >>>>>> I started by adding a node similar to my current configuration, 4 GB of RAM + 2 cores on DigitalOcean. However, every time I would end up getting OOM errors after many log entries of the type:
>> >>>>>>
>> >>>>>> INFO [SlabPoolCleaner] 2014-10-25 13:44:57,240 ColumnFamilyStore.java:856 - Enqueuing flush of mycf: 5383 (0%) on-heap, 0 (0%) off-heap
>> >>>>>>
>> >>>>>> leading to:
>> >>>>>>
>> >>>>>> ka-120-Data.db (39291 bytes) for commitlog position ReplayPosition(segmentId=1414243978538, position=23699418)
>> >>>>>> WARN [SharedPool-Worker-13] 2014-10-25 13:48:18,032 AbstractTracingAwareExecutorService.java:167 - Uncaught exception on thread Thread[SharedPool-Worker-13,5,main]: {}
>> >>>>>> java.lang.OutOfMemoryError: Java heap space
>> >>>>>>
>> >>>>>> Thinking it had to do with either compaction or streaming (2 activities I've had tremendous issues with in the past), I tried to slow down setstreamthroughput to extremely low values, all the way to 5. I also tried setting setcompactionthroughput to 0, and then, after reading that in some cases that might be too fast, down to 8. Nothing worked; it merely vaguely changed the mean time to OOM, but not in a way indicating either was anywhere near a solution.
>> >>>>>>
>> >>>>>> The nodes were configured with 2 GB of heap initially; I tried to crank it up to 3 GB, stressing the host memory to its limit.
>> >>>>>>
>> >>>>>> After doing some exploration (I am considering writing a Cassandra Ops documentation with lessons learned, since there seems to be little of it in organized fashion), I read that some people had strange issues on lower-end boxes like that, so I bit the bullet and upgraded my new node to an 8GB + 4 core instance, which was anecdotally better.
>> >>>>>>
>> >>>>>> To my complete shock, the exact same issues are present, even after raising the heap to 6 GB. I figure it can't be a "normal" situation anymore, but must be a bug somehow.
>> >>>>>>
>> >>>>>> My cluster is 4 nodes, RF of 2, about 160 GB of data across all nodes. About 10 CF of varying sizes. Runtime writes are between 300 to 900 / second. Cassandra 2.1.0, nothing too wild.
>> >>>>>>
>> >>>>>> Has anyone encountered these kinds of issues before? I would really enjoy hearing about the experiences of people trying to run small-sized clusters like mine. From everything I read, Cassandra operations go very well on large (16 GB + 8 cores) machines, but I'm sad to report I've had nothing but trouble trying to run on smaller machines; perhaps I can learn from others' experience?
>> >>>>>>
>> >>>>>> Full logs can be provided to anyone interested.
>> >>>>>>
>> >>>>>> Cheers
>>
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> twitter: rustyrazorblade