Is the need for 10k/sec/node just for bulk loading of data or is it how your app will operate normally? Those are very different things.

On Sun, Aug 22, 2010 at 4:11 AM, Wayne <wav...@gmail.com> wrote:
> Currently each node has 4x1TB SATA disks. In MySQL we have 15 TB currently with no replication. To move this to Cassandra at replication factor 3 we need 45 TB, assuming space usage is the same, but it is probably more. We had assumed a 30-node cluster with 4 TB per node would suffice, with headroom for compaction and for growth (120 TB).
>
> SSD drives for 30 nodes in this size range are not cost feasible for us. We can try to use 15k SAS drives and have more spindles, but then our per-node cost goes up. I guess I naively thought Cassandra would do its magic and a few commodity SATA hard drives would be fine.
>
> Our performance requirement does not need 10k writes/node/sec 24 hours a day, but if we can not get really good performance the switch from MySQL becomes harder to rationalize. We can currently restore a 2.5 TB MySQL dump (plain old insert statements) in 4-5 days. I expect as much or more from Cassandra, and I feel years away from simply loading 2+ TB into Cassandra without so many issues.
>
> What is really required in hardware for a 100+ TB cluster with sustained write performance near 10k/sec? If the answer is SSD, what can be expected from 15k SAS drives, and what from SATA?
>
> Thank you for your advice, I am struggling with how to make this work. Any insight you can provide would be greatly appreciated.
>
> On Sun, Aug 22, 2010 at 8:58 AM, Benjamin Black <b...@b3k.us> wrote:
>> How much storage do you need? 240GB SSDs quite capable of saturating a 3Gbps SATA link are $600. Larger ones are also available with similar performance. Perhaps you could share a bit more about the storage and performance requirements. How using SSDs to sustain 10k writes/sec PER NODE WITH LINEAR SCALING "breaks down the commodity server concept" eludes me.
>>
>> b
>>
>> On Sat, Aug 21, 2010 at 11:27 PM, Wayne <wav...@gmail.com> wrote:
>>> Thank you for the advice, I will try these settings. I am running defaults right now. The disk subsystem is one SATA disk for the commitlog and 4 SATA disks in RAID 0 for the data.
>>>
>>> From your email you are implying this hardware can not handle this level of sustained writes? That kind of breaks down the commodity server concept for me. I have never used anything but 15k SAS disks (the fastest disks money could buy until SSD) with a database. I have tried to throw out that mentality here, but are you saying nothing has really changed? Spindles, spindles, spindles, as fast as you can afford, is what I have always known... I guess that applies here? Do I need to spend $10k per node instead of $3.5k to get SUSTAINED 10k writes/sec per node?
>>>
>>> On Sat, Aug 21, 2010 at 11:03 PM, Benjamin Black <b...@b3k.us> wrote:
>>>> My guess is that you have (at least) 2 problems right now:
>>>>
>>>> You are writing 10k ops/sec to each node, but have default memtable flush settings. This is resulting in memtable flushing every 30 seconds (the default ops flush setting is 300k). You thus have a proliferation of tiny sstables and are seeing minor compactions triggered every couple of minutes.
>>>>
>>>> You have started a major compaction which is now competing with those near-constant minor compactions for far too little I/O (3 SATA drives in RAID0, perhaps?). Normally, this would result in a massive ballooning of your heap use as all sorts of activities (like memtable flushes) backed up as well.
>>>>
>>>> I suggest you increase the memtable flush ops to at least 10 (million) if you are going to sustain that many writes/sec, along with an increase in the flush MB to match, based on your typical bytes/write op. Long term, this level of write activity demands a lot faster storage (iops and bandwidth).
>>>>
>>>> b
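
For concreteness, Benjamin's suggestion maps onto the memtable settings in a 0.6-era storage-conf.xml along roughly the following lines. This is only a sketch: the element names assume the 0.6 config format this cluster would have been running, and the values are illustrative, not recommendations (the throughput figure assumes roughly 100 bytes per write op).

    <!-- Sketch of the memtable flush settings Benjamin describes
         (0.6-era storage-conf.xml). Values are illustrative only. -->
    <Storage>
      ...
      <!-- Default is 0.3 (300k ops); at ~10k writes/sec that flushes every ~30s. -->
      <MemtableOperationsInMillions>10</MemtableOperationsInMillions>
      <!-- Raise the size trigger to match: 1024 MB assumes ~100 bytes per write op
           (10M ops x 100 B ~= 1 GB). Size this from your actual write payloads. -->
      <MemtableThroughputInMB>1024</MemtableThroughputInMB>
      <!-- Keep the time-based trigger from undercutting the other two. -->
      <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
      ...
    </Storage>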
>>>> On Sat, Aug 21, 2010 at 2:18 AM, Wayne <wav...@gmail.com> wrote:
>>>>> I am already running with those options. I thought maybe that is why they never get completed, as they keep getting pushed down in priority? I am getting timeouts now and then, but for the most part the cluster keeps running. Is it normal/ok for the repair and compaction to take so long? It has been over 12 hours since they were submitted.
>>>>>
>>>>> On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>>>> Yes, the AES is the repair.
>>>>>>
>>>>>> If you are running Linux, try adding the options to reduce compaction priority from http://wiki.apache.org/cassandra/PerformanceTuning
>>>>>>
>>>>>> On Sat, Aug 21, 2010 at 3:17 AM, Wayne <wav...@gmail.com> wrote:
>>>>>>> I could tell from munin that the disk utilization was getting crazy high, but the strange thing is that it seemed to "stall". The utilization went way down and everything seemed to flatten out. Requests piled up and the node was doing nothing. It did not "crash" but was left in a useless state. I do not have access to the tpstats from when that occurred. Attached is the munin chart, and you can see the flat line after Friday at noon.
>>>>>>>
>>>>>>> I have reduced the writers from 10 per node to 8 per node and they seem to still be running, but I am afraid they are barely hanging on. I ran nodetool repair after rebooting the failed node and I do not think the repair ever completed. I also later ran compact on each node; on some it finished but on some it did not. Below is the current tpstats for the node I had to restart. Is the AE-SERVICE-STAGE the repair and compaction queued up? It seems several nodes are not getting enough free cycles to keep up. They are not timing out (30 sec timeout) for the most part, but they are also not able to compact. Is this normal? Do I just give it time? I am migrating 2-3 TB of data from MySQL, so the load is constant and will be for days, and it seems even with only 8 writer processes per node I am maxed out.
>>>>>>>
>>>>>>> Thanks for the advice. Any more pointers would be greatly appreciated.
>>>>>>>
>>>>>>> Pool Name                    Active   Pending    Completed
>>>>>>> FILEUTILS-DELETE-POOL             0         0         1868
>>>>>>> STREAM-STAGE                      1         1            2
>>>>>>> RESPONSE-STAGE                    0         2    769158645
>>>>>>> ROW-READ-STAGE                    0         0       140942
>>>>>>> LB-OPERATIONS                     0         0            0
>>>>>>> MESSAGE-DESERIALIZER-POOL         1         0   1470221842
>>>>>>> GMFD                              0         0       169712
>>>>>>> LB-TARGET                         0         0            0
>>>>>>> CONSISTENCY-MANAGER               0         0            0
>>>>>>> ROW-MUTATION-STAGE                0         1    865124937
>>>>>>> MESSAGE-STREAMING-POOL            0         0            6
>>>>>>> LOAD-BALANCER-STAGE               0         0            0
>>>>>>> FLUSH-SORTER-POOL                 0         0            0
>>>>>>> MEMTABLE-POST-FLUSHER             0         0         8088
>>>>>>> FLUSH-WRITER-POOL                 0         0         8088
>>>>>>> AE-SERVICE-STAGE                  1        34           54
>>>>>>> HINTED-HANDOFF-POOL               0         0            7
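
The compaction-priority options Jonathan points to above, as listed on the PerformanceTuning wiki of that era, amount to roughly the following additions to conf/cassandra.in.sh. A sketch only: the two -XX flags are standard HotSpot options, but treat the cassandra.compaction.priority property name as an assumption to verify against the wiki and your Cassandra version.

    # Sketch of the Linux compaction-priority tuning referenced above; verify the
    # cassandra.compaction.priority property name against your Cassandra version.
    JVM_OPTS="$JVM_OPTS -XX:+UseThreadPriorities"           # enable Java thread priorities
    JVM_OPTS="$JVM_OPTS -XX:ThreadPriorityPolicy=42"        # allow lowering priorities without root
    JVM_OPTS="$JVM_OPTS -Dcassandra.compaction.priority=1"  # run compaction at minimum priority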
>>>>>>>
>>>>>>> On Fri, Aug 20, 2010 at 11:56 PM, Bill de hÓra <b...@dehora.net> wrote:
>>>>>>>> On Fri, 2010-08-20 at 19:17 +0200, Wayne wrote:
>>>>>>>>> WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602 MessageDeserializationTask.java (line 47) dropping message (1,078,378ms past timeout)
>>>>>>>>> WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602 MessageDeserializationTask.java (line 47) dropping message (1,078,378ms past timeout)
>>>>>>>>
>>>>>>>> MESSAGE-DESERIALIZER-POOL usually backs up when other stages downstream are bogged down (e.g. here is Ben Black describing the symptom when the underlying cause is running out of disk bandwidth, well worth a watch: http://riptano.blip.tv/file/4012133/).
>>>>>>>>
>>>>>>>> Can you send all of nodetool tpstats?
>>>>>>>>
>>>>>>>> Bill
>>>>>>
>>>>>> --
>>>>>> Jonathan Ellis
>>>>>> Project Chair, Apache Cassandra
>>>>>> co-founder of Riptano, the source for professional Cassandra support
>>>>>> http://riptano.com
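
Collecting what Bill asks for from every node in the ring can be scripted; a minimal sketch, assuming 0.6-era nodetool flags and the default JMX port of the time (the hostnames are placeholders):

    # Gather nodetool tpstats from each node in turn. Node names are hypothetical,
    # and the -host/-port flags plus port 8080 reflect 0.6-era defaults -- adjust
    # for your environment and Cassandra version.
    for h in node01 node02 node03; do
      echo "===== $h ====="
      bin/nodetool -host "$h" -port 8080 tpstats
    done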