Currently each node has 4x 1 TB SATA disks. In MySQL we currently have 15 TB
with no replication. Moving this to Cassandra at replication factor 3 means
45 TB, assuming space usage stays the same (it is probably more). We had
assumed a 30-node cluster with 4 TB per node (120 TB total) would suffice,
with headroom for compaction and growth.

SSD drives in this size range for 30 nodes are not cost-feasible for us. We
could use 15k SAS drives to get more spindles, but then our per-node cost
goes up. I guess I naively thought Cassandra would do its magic and a few
commodity SATA drives would be fine.

We do not strictly *need* 10k writes/node/sec 24 hours a day, but if we
cannot get very good write performance the switch from MySQL becomes harder
to justify. We can currently restore a 2.5 TB MySQL dump backup (plain old
INSERT statements) in 4-5 days. I expect as much or more from Cassandra, yet
right now I feel years away from simply loading 2+ TB into Cassandra without
hitting so many issues.
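
To put that restore rate in perspective, here is roughly what it works out
to in Python (the average row size is only a guess on my part):

tb = 2.5
days = 4.5                                    # 4-5 day restore window
seconds = days * 24 * 3600

mb_per_sec = tb * 1024 * 1024 / seconds       # ~6.7 MB/s sustained
avg_row_bytes = 1024                          # pure guess at our average row size
rows_per_sec = mb_per_sec * 1024 * 1024 / avg_row_bytes   # ~6,900 rows/s

print(round(mb_per_sec, 1), round(rows_per_sec))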

What hardware is really required for a 100+ TB cluster with sustained write
performance near 10k writes/sec per node? If the answer is SSD, what can be
expected from 15k SAS drives, and what from SATA?
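
I have also tried some very rough spindle math (the per-drive IOPS figures
and the random-I/O budget per node below are ballpark assumptions on my
part, not benchmarks):

# How many drives per node if compaction/reads end up random-I/O bound.
drive_iops = {
    "7.2k SATA": 100,     # assumed random IOPS per drive
    "15k SAS": 180,
    "SATA SSD": 10000,
}
target_iops = 2000        # assumed random IOPS budget per node

for drive, iops in drive_iops.items():
    drives_needed = -(-target_iops // iops)   # ceiling division
    print(drive, drives_needed)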

Thank you for your advice; I am struggling with how to make this work. Any
insight you can provide would be greatly appreciated.



On Sun, Aug 22, 2010 at 8:58 AM, Benjamin Black <b...@b3k.us> wrote:

> How much storage do you need?  240G SSDs quite capable of saturating a
> 3Gbps SATA link are $600.  Larger ones are also available with similar
> performance.  Perhaps you could share a bit more about the storage and
> performance requirements.  How using SSDs to sustain 10k writes/sec PER NODE
> WITH LINEAR SCALING "breaks down the commodity server concept" eludes
> me.
>
>
> b
>
> On Sat, Aug 21, 2010 at 11:27 PM, Wayne <wav...@gmail.com> wrote:
> > Thank you for the advice, I will try these settings. I am running
> defaults
> > right now. The disk subsystem is one SATA disk for commitlog and 4 SATA
> > disks in raid 0 for the data.
> >
> > From your email you are implying this hardware can not handle this level
> of
> > sustained writes? That kind of breaks down the commodity server concept
> for
> > me. I have never used anything but a 15k SAS disk (fastest disk money
> could
> > buy until SSD) ALWAYS with a database. I have tried to throw out that
> > mentality here but are you saying nothing has really changed? Spindles
> > spindles spindles as fast as you can afford is what I have always
> known...I
> > guess that applies here? Do I need to spend $10k per node instead of
> $3.5k
> > to get SUSTAINED 10k writes/sec per node?
> >
> >
> >
> > On Sat, Aug 21, 2010 at 11:03 PM, Benjamin Black <b...@b3k.us> wrote:
> >>
> >> My guess is that you have (at least) 2 problems right now:
> >>
> >> You are writing 10k ops/sec to each node, but have default memtable
> >> flush settings.  This is resulting in memtable flushing every 30
> >> seconds (default ops flush setting is 300k).  You thus have a
> >> proliferation of tiny sstables and are seeing minor compactions
> >> triggered every couple of minutes.
> >>
> >> You have started a major compaction which is now competing with those
> >> near constant minor compactions for far too little I/O (3 SATA drives
> >> in RAID0, perhaps?).  Normally, this would result in a massive
> >> ballooning of your heap use as all sorts of activities (like memtable
> >> flushes) backed up, as well.
> >>
> >> I suggest you increase the memtable flush ops to at least 10 (million)
> >> if you are going to sustain that many writes/sec, along with an
> >> increase in the flush MB to match, based on your typical bytes/write
> >> op.  Long term, this level of write activity demands a lot faster
> >> storage (iops and bandwidth).
> >>
> >>
> >> b
> >> On Sat, Aug 21, 2010 at 2:18 AM, Wayne <wav...@gmail.com> wrote:
> >> > I am already running with those options. I thought maybe that is why
> >> > they
> >> > never get completed as they keep getting pushed down in priority? I am
> >> > getting timeouts now and then but for the most part the cluster keeps
> >> > running. Is it normal/ok for the repair and compaction to take so
> long?
> >> > It
> >> > has been over 12 hours since they were submitted.
> >> >
> >> > On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis <jbel...@gmail.com>
> >> > wrote:
> >> >>
> >> >> yes, the AES is the repair.
> >> >>
> >> >> if you are running linux, try adding the options to reduce compaction
> >> >> priority from
> >> >> http://wiki.apache.org/cassandra/PerformanceTuning
> >> >>
> >> >> On Sat, Aug 21, 2010 at 3:17 AM, Wayne <wav...@gmail.com> wrote:
> >> >> > I could tell from munin that the disk utilization was getting crazy
> >> >> > high,
> >> >> > but the strange thing is that it seemed to "stall". The utilization
> >> >> > went
> >> >> > way
> >> >> > down and everything seemed to flatten out. Requests piled up and
> the
> >> >> > node
> >> >> > was doing nothing. It did not "crash" but was left in a useless
> >> >> > state. I
> >> >> > do
> >> >> > not have access to the tpstats when that occurred. Attached is the
> >> >> > munin
> >> >> > chart, and you can see the flat line after Friday at noon.
> >> >> >
> >> >> > I have reduced the writers from 10 per to 8 per node and they seem
> to
> >> >> > be
> >> >> > still running, but I am afraid they are barely hanging on. I ran
> >> >> > nodetool
> >> >> > repair after rebooting the failed node and I do not think the
> repair
> >> >> > ever
> >> >> > completed. I also later ran compact on each node and some it
> finished
> >> >> > but
> >> >> > some it did not. Below is the tpstats currently for the node I had
> to
> >> >> > restart. Is the AE-SERVICE-STAGE the repair and compaction queued
> up?
> >> >> > It
> >> >> > seems several nodes are not getting enough free cycles to keep up.
> >> >> > They
> >> >> > are
> >> >> > not timing out (30 sec timeout) for the most part but they are also
> >> >> > not
> >> >> > able
> >> >> > to compact. Is this normal? Do I just give it time? I am migrating
> >> >> > 2-3
> >> >> > TB of
> >> >> > data from Mysql so the load is constant and will be for days and it
> >> >> > seems
> >> >> > even with only 8 writer processes per node I am maxed out.
> >> >> >
> >> >> > Thanks for the advice. Any more pointers would be greatly
> >> >> > appreciated.
> >> >> >
> >> >> > Pool Name                    Active   Pending      Completed
> >> >> > FILEUTILS-DELETE-POOL             0         0           1868
> >> >> > STREAM-STAGE                      1         1              2
> >> >> > RESPONSE-STAGE                    0         2      769158645
> >> >> > ROW-READ-STAGE                    0         0         140942
> >> >> > LB-OPERATIONS                     0         0              0
> >> >> > MESSAGE-DESERIALIZER-POOL         1         0     1470221842
> >> >> > GMFD                              0         0         169712
> >> >> > LB-TARGET                         0         0              0
> >> >> > CONSISTENCY-MANAGER               0         0              0
> >> >> > ROW-MUTATION-STAGE                0         1      865124937
> >> >> > MESSAGE-STREAMING-POOL            0         0              6
> >> >> > LOAD-BALANCER-STAGE               0         0              0
> >> >> > FLUSH-SORTER-POOL                 0         0              0
> >> >> > MEMTABLE-POST-FLUSHER             0         0           8088
> >> >> > FLUSH-WRITER-POOL                 0         0           8088
> >> >> > AE-SERVICE-STAGE                  1        34             54
> >> >> > HINTED-HANDOFF-POOL               0         0              7
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Fri, Aug 20, 2010 at 11:56 PM, Bill de hÓra <b...@dehora.net>
> >> >> > wrote:
> >> >> >>
> >> >> >> On Fri, 2010-08-20 at 19:17 +0200, Wayne wrote:
> >> >> >>
> >> >> >> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
> >> >> >> > MessageDeserializationTask.java (line 47) dropping message
> >> >> >> > (1,078,378ms past timeout)
> >> >> >> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
> >> >> >> > MessageDeserializationTask.java (line 47) dropping message
> >> >> >> > (1,078,378ms past timeout)
> >> >> >>
> >> >> >> MESSAGE-DESERIALIZER-POOL usually backs up when other stages are
> >> >> >> bogged
> >> >> >> downstream, (eg here's Ben Black describing the symptom when the
> >> >> >> underlying cause is running out of disk bandwidth, well worth a
> >> >> >> watch
> >> >> >> http://riptano.blip.tv/file/4012133/).
> >> >> >>
> >> >> >> Can you send all of nodetool tpstats?
> >> >> >>
> >> >> >> Bill
> >> >> >>
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Jonathan Ellis
> >> >> Project Chair, Apache Cassandra
> >> >> co-founder of Riptano, the source for professional Cassandra support
> >> >> http://riptano.com
> >> >
> >> >
> >
> >
>
