Hmm, thanks for reading through it. I initially followed some (perhaps too old) maintenance scripts, which included a weekly 'nodetool compact'. Is there a way for me to undo the damage? Tombstones will be a very important issue for me, since the dataset is essentially a rolling one that relies heavily on TTLs.
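For what it's worth, here is roughly what I'm considering trying on one node first, based on your remarks. This is only a sketch: the keyspace/table names, paths and numbers are placeholders, so please tell me if I'm heading in the wrong direction:

  # Option 1: let single-SSTable tombstone compactions purge the expired
  # TTL data sitting in the giant SSTables left behind by 'nodetool compact'.
  # (my_keyspace/my_table are placeholders; 0.2 is just the default threshold)
  cqlsh -e "ALTER TABLE my_keyspace.my_table
            WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                               'unchecked_tombstone_compaction': 'true',
                               'tombstone_threshold': '0.2'};"

  # Option 2: with the node stopped, split the giant SSTables back into
  # ~50 MB pieces so SizeTiered has similarly-sized files to compact them
  # against again. sstablesplit must be run while Cassandra is down; I think
  # it lives under tools/bin on tarball installs, and the data path below is
  # just the default location on my boxes.
  nodetool drain
  sudo service cassandra stop
  sstablesplit -s 50 /var/lib/cassandra/data/my_keyspace/my_table-*/*-Data.db
  sudo service cassandra start

Does either of those sound reasonable, or would you go another way entirely?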
On Sun, Oct 26, 2014 at 6:04 PM, DuyHai Doan <doanduy...@gmail.com> wrote:

> "Should doing a major compaction on those nodes lead to a restructuration of the SSTables?" --> Beware of the major compaction on SizeTiered: it will create 2 giant SSTables, and the expired/outdated/tombstone columns in these big files will never be cleaned, since the SSTables will never get a chance to be compacted again.
>
> Essentially, to reduce the fragmentation of small SSTables you can stay with SizeTiered compaction and play around with the compaction properties (the thresholds) to make C* group a bunch of files each time it compacts, so that the file number shrinks to a reasonable count.
>
> Since you're using C* 2.1 and anti-compaction has been introduced, I hesitate to advise you to use Leveled compaction as a work-around to reduce the SSTable count.
>
> Things are a little bit more complicated because of the incremental repair process (I don't know whether you're using incremental repair or not in production). The Dev blog says that Leveled compaction is performed only on repaired SSTables, while the un-repaired ones still use SizeTiered; more details here: http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1
>
> Regards
>
>
> On Sun, Oct 26, 2014 at 9:44 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>
>> If the issue is related to I/O, you're going to want to determine if you're saturated. Take a look at `iostat -dmx 1`; you'll see avgqu-sz (queue size) and svctm (service time). The higher those numbers are, the more overwhelmed your disk is.
>>
>> On Sun, Oct 26, 2014 at 12:01 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>
>> > Hello Maxime
>> >
>> > Increasing the flush writers won't help if your disk I/O is not keeping up.
>> >
>> > I've had a look into the log file; below are some remarks:
>> >
>> > 1) There are a lot of SSTables on disk for some tables (events for example, but not only). I've seen that some compactions are taking up to 32 SSTables (which corresponds to the default max value for SizeTiered compaction).
>> >
>> > 2) There is a secondary index that I found suspicious: loc.loc_id_idx. As its name implies, I have the impression that it's an index on the id of the loc, which would lead to almost a 1-1 relationship between the indexed value and the original loc. Such indexes should be avoided because they do not perform well. If it's not an index on the loc_id, please disregard my remark.
>> >
>> > 3) There is a clear imbalance of SSTable count on some nodes. In the log, I saw:
>> >
>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.20] 2014-10-25 02:21:43,360 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 163 files(4 111 187 195 bytes), sending 0 files(0 bytes)
>> >
>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.81] 2014-10-25 02:21:46,121 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 154 files(3 332 779 920 bytes), sending 0 files(0 bytes)
>> >
>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.71] 2014-10-25 02:21:50,494 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 1315 files(4 606 316 933 bytes), sending 0 files(0 bytes)
>> >
>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.217] 2014-10-25 02:21:51,036 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 1640 files(3 208 023 573 bytes), sending 0 files(0 bytes)
>> >
>> > As you can see, the existing 4 nodes are streaming data to the new node, and on average the data set size is about 3.3 - 4.5 GB per node. However, the number of SSTables is around 150 files for nodes xxxx.xxxx.xxxx.20 and xxxx.xxxx.xxxx.81, but goes through the roof to reach 1315 files for xxxx.xxxx.xxxx.71 and 1640 files for xxxx.xxxx.xxxx.217.
>> >
>> > The total data set size is roughly the same, but the file number is x10, which means that you'll have a bunch of tiny files.
>> >
>> > I guess that upon reception of those files there will be a massive flush to disk, explaining the behaviour you're facing (flush storm).
>> >
>> > I would suggest looking on nodes xxxx.xxxx.xxxx.71 and xxxx.xxxx.xxxx.217 to check the total SSTable count for each table, to confirm this intuition.
>> >
>> > Regards
>> >
>> >
>> > On Sun, Oct 26, 2014 at 4:58 PM, Maxime <maxim...@gmail.com> wrote:
>> >
>> >> I've emailed you a raw log file of an instance of this happening.
>> >>
>> >> I've been monitoring more closely the timing of events in tpstats and the logs, and I believe this is what is happening:
>> >>
>> >> - For some reason, C* decides to provoke a flush storm (I say "some reason"; I'm sure there is one, but I have had difficulty determining the behaviour changes between 1.* and more recent releases).
>> >> - So we see ~3000 flushes being enqueued.
>> >> - This happens so suddenly that even boosting the number of flush writers to 20 does not suffice. I don't even see "all time blocked" numbers for it before C* stops responding. I suspect this is due to the sudden OOM and GC occurring.
>> >> - The last tpstats output that comes back before the node goes down indicates 20 active, 3000 pending and the rest 0. It's by far the most anomalous activity.
>> >>
>> >> Is there a way to throttle down this generation of flushes? C* complains if I set the queue_size to any value (deprecated now?), and boosting the threads does not seem to help, since even at 20 we're an order of magnitude off.
>> >>
>> >> Suggestions? Comments?
>> >>
>> >>
>> >> On Sun, Oct 26, 2014 at 2:26 AM, DuyHai Doan <doanduy...@gmail.com> wrote:
>> >>
>> >>> Hello Maxime
>> >>>
>> >>> Can you put the complete logs and config somewhere? It would be interesting to know what the cause of the OOM is.
>> >>>
>> >>> On Sun, Oct 26, 2014 at 3:15 AM, Maxime <maxim...@gmail.com> wrote:
>> >>>
>> >>>> Thanks a lot, that is comforting. We are also small at the moment, so I can definitely relate to the idea of keeping things small and simple, at a level where it just works.
>> >>>>
>> >>>> I see the new Apache version has a lot of fixes, so I will try to upgrade before I look into downgrading.
>> >>>>
>> >>>>
>> >>>> On Saturday, October 25, 2014, Laing, Michael <michael.la...@nytimes.com> wrote:
>> >>>>
>> >>>>> Since no one else has stepped in...
>> >>>>>
>> >>>>> We have run clusters with ridiculously small nodes - I have a production cluster in AWS with 4GB nodes, each with 1 CPU and disk-based instance storage. It works fine, but you can see those little puppies struggle...
>> >>>>>
>> >>>>> And I ran into problems such as you observe...
>> >>>>>
>> >>>>> Upgrading Java to the latest 1.7 and - most importantly - reverting to the default configuration, esp. for heap, seemed to settle things down completely. Also make sure that you are using the 'recommended production settings' from the docs on your boxen.
>> >>>>>
>> >>>>> However, we are running 2.0.x, not 2.1.0, so YMMV.
>> >>>>>
>> >>>>> And we are switching to 15GB nodes with 2 heftier CPUs each and SSD storage - still a 'small' machine, but much more reasonable for C*.
>> >>>>>
>> >>>>> However, I can't say I am an expert, since I deliberately keep things so simple that we do not encounter problems - it just works, so I dig into other stuff.
>> >>>>>
>> >>>>> ml
>> >>>>>
>> >>>>>
>> >>>>> On Sat, Oct 25, 2014 at 5:22 PM, Maxime <maxim...@gmail.com> wrote:
>> >>>>>
>> >>>>>> Hello, I've been trying to add a new node to my cluster (4 nodes) for a few days now.
>> >>>>>>
>> >>>>>> I started by adding a node similar to my current configuration, 4 GB of RAM + 2 cores on DigitalOcean. However, every time I would end up getting OOM errors after many log entries of the type:
>> >>>>>>
>> >>>>>> INFO [SlabPoolCleaner] 2014-10-25 13:44:57,240 ColumnFamilyStore.java:856 - Enqueuing flush of mycf: 5383 (0%) on-heap, 0 (0%) off-heap
>> >>>>>>
>> >>>>>> leading to:
>> >>>>>>
>> >>>>>> ka-120-Data.db (39291 bytes) for commitlog position ReplayPosition(segmentId=1414243978538, position=23699418)
>> >>>>>> WARN [SharedPool-Worker-13] 2014-10-25 13:48:18,032 AbstractTracingAwareExecutorService.java:167 - Uncaught exception on thread Thread[SharedPool-Worker-13,5,main]: {}
>> >>>>>> java.lang.OutOfMemoryError: Java heap space
>> >>>>>>
>> >>>>>> Thinking it had to do with either compaction or streaming (2 activities I've had tremendous issues with in the past), I tried to slow down setstreamthroughput to extremely low values, all the way to 5. I also tried setting setcompactionthroughput to 0, and then, after reading that in some cases that might be too fast, down to 8. Nothing worked; it merely vaguely changed the mean time to OOM, but not in a way indicating either was anywhere near a solution.
>> >>>>>>
>> >>>>>> The nodes were configured with 2 GB of heap initially; I tried to crank it up to 3 GB, stressing the host memory to its limit.
>> >>>>>>
>> >>>>>> After doing some exploration (I am considering writing a Cassandra Ops documentation with lessons learned, since there seems to be little of it in organized fashion), I read that some people had strange issues on lower-end boxes like that, so I bit the bullet and upgraded my new node to an 8GB + 4 core instance, which was anecdotally better.
>> >>>>>>
>> >>>>>> To my complete shock, the exact same issues are present, even after raising the heap to 6 GB. I figure it can't be a "normal" situation anymore, but must be a bug somehow.
>> >>>>>>
>> >>>>>> My cluster is 4 nodes, RF of 2, about 160 GB of data across all nodes. About 10 CF of varying sizes. Runtime writes are between 300 to 900 / second. Cassandra 2.1.0, nothing too wild.
>> >>>>>>
>> >>>>>> Has anyone encountered these kinds of issues before? I would really enjoy hearing about the experiences of people trying to run small-sized clusters like mine. From everything I read, Cassandra operations go very well on large (16 GB + 8 cores) machines, but I'm sad to report I've had nothing but trouble trying to run on smaller machines; perhaps I can learn from others' experience?
>> >>>>>>
>> >>>>>> Full logs can be provided to anyone interested.
>> >>>>>>
>> >>>>>> Cheers
>>
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> twitter: rustyrazorblade