I am already running with those options. I thought maybe that is why they never complete, since they keep getting pushed down in priority? I am getting timeouts now and then, but for the most part the cluster keeps running. Is it normal/OK for the repair and compaction to take this long? It has been over 12 hours since they were submitted.
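For reference, these are the options I added to JVM_OPTS in cassandra.in.sh on each node (copied from the PerformanceTuning wiki page as best I can tell, so correct me if this is not the current recommendation):

    # reduce compaction thread priority per the wiki page
    JVM_OPTS="$JVM_OPTS -XX:+UseThreadPriorities"
    JVM_OPTS="$JVM_OPTS -XX:ThreadPriorityPolicy=42"
    JVM_OPTS="$JVM_OPTS -Dcassandra.compaction.priority=1"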
On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> yes, the AES is the repair.
>
> if you are running linux, try adding the options to reduce compaction
> priority from http://wiki.apache.org/cassandra/PerformanceTuning
>
> On Sat, Aug 21, 2010 at 3:17 AM, Wayne <wav...@gmail.com> wrote:
> > I could tell from munin that the disk utilization was getting crazy high,
> > but the strange thing is that it seemed to "stall". The utilization went
> > way down and everything seemed to flatten out. Requests piled up and the
> > node was doing nothing. It did not "crash" but was left in a useless
> > state. I do not have access to the tpstats from when that occurred.
> > Attached is the munin chart, and you can see the flat line after Friday
> > at noon.
> >
> > I have reduced the writers from 10 to 8 per node and they seem to still
> > be running, but I am afraid they are barely hanging on. I ran nodetool
> > repair after rebooting the failed node, and I do not think the repair
> > ever completed. I also later ran compact on each node; on some it
> > finished, but on some it did not. Below is the current tpstats output for
> > the node I had to restart. Is the AE-SERVICE-STAGE the repair and
> > compaction queued up? It seems several nodes are not getting enough free
> > cycles to keep up. They are not timing out (30 sec timeout) for the most
> > part, but they are also not able to compact. Is this normal? Do I just
> > give it time? I am migrating 2-3 TB of data from MySQL, so the load is
> > constant and will be for days, and it seems even with only 8 writer
> > processes per node I am maxed out.
> >
> > Thanks for the advice. Any more pointers would be greatly appreciated.
> >
> > Pool Name                    Active   Pending      Completed
> > FILEUTILS-DELETE-POOL             0         0           1868
> > STREAM-STAGE                      1         1              2
> > RESPONSE-STAGE                    0         2      769158645
> > ROW-READ-STAGE                    0         0         140942
> > LB-OPERATIONS                     0         0              0
> > MESSAGE-DESERIALIZER-POOL         1         0     1470221842
> > GMFD                              0         0         169712
> > LB-TARGET                         0         0              0
> > CONSISTENCY-MANAGER               0         0              0
> > ROW-MUTATION-STAGE                0         1      865124937
> > MESSAGE-STREAMING-POOL            0         0              6
> > LOAD-BALANCER-STAGE               0         0              0
> > FLUSH-SORTER-POOL                 0         0              0
> > MEMTABLE-POST-FLUSHER             0         0           8088
> > FLUSH-WRITER-POOL                 0         0           8088
> > AE-SERVICE-STAGE                  1        34             54
> > HINTED-HANDOFF-POOL               0         0              7
> >
> > On Fri, Aug 20, 2010 at 11:56 PM, Bill de hÓra <b...@dehora.net> wrote:
> >>
> >> On Fri, 2010-08-20 at 19:17 +0200, Wayne wrote:
> >>
> >> > WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
> >> > MessageDeserializationTask.java (line 47) dropping message
> >> > (1,078,378ms past timeout)
> >> > WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
> >> > MessageDeserializationTask.java (line 47) dropping message
> >> > (1,078,378ms past timeout)
> >>
> >> MESSAGE-DESERIALIZER-POOL usually backs up when stages downstream of it
> >> are bogged down (e.g. here is Ben Black describing the symptom when the
> >> underlying cause is running out of disk bandwidth, well worth a watch:
> >> http://riptano.blip.tv/file/4012133/).
> >>
> >> Can you send all of nodetool tpstats?
> >>
> >> Bill
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
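P.S. In case it helps anyone else following along, this is the rough loop I have been using to check whether AE-SERVICE-STAGE is actually draining (the host address is a placeholder; adjust the nodetool flags to your version):

    # sample AE-SERVICE-STAGE once a minute to see if Completed is climbing
    while true; do
        nodetool -host 10.0.0.1 tpstats | grep AE-SERVICE-STAGE
        sleep 60
    done

If Completed keeps climbing while Pending sits at 34, I assume the repair/compaction work is at least making progress, just slowly.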