> > Overuse of the cluster was one thing I was thinking about, and I have
> > requested two new nodes (it was something already planned anyway). But the
> > pattern of high CPU load is only visible on one or two of the nodes; the
> > rest are working correctly. That made me think that adding two new nodes
> > may not help.
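If only one or two nodes are hot, it is worth checking whether those nodes are doing disproportionate work before adding hardware. A rough sketch, not a definitive procedure: it assumes SSH access to each node and that nodetool is on the PATH there; the host list is taken from this thread and may need adjusting.

```shell
# Compare load average and thread-pool backlog across all nodes.
HOSTS="172.31.7.243 172.31.7.245 172.31.7.246 172.31.7.247 172.31.7.232 172.31.7.233 172.31.7.244"
for h in $HOSTS; do
  echo "== $h =="
  # uptime: load average vs. the other nodes.
  # tpstats: look for stages with growing Pending counts on the hot nodes only.
  ssh "$h" 'uptime; nodetool tpstats | grep -E "Pending|Mutation|Read"'
done
```

If the pending mutation counts climb on just the hot nodes while the others stay near zero, the problem is uneven load rather than overall cluster capacity.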
Well, then you could try replacing this node as soon as you have more nodes available. I would use this procedure, as I believe it is the most efficient one: http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html

Yet I believe it might not be a hardware or cluster throughput issue, and if it is a hardware issue you will probably want to dig into it, as this machine is yours and not a virtual one. You might want to reuse it anyway.

Some questions about the machines and their usage:

Disk: What disk hardware and configuration do you use? What does "iostat -mx 5 100" give you? How is iowait? Any errors in the system / kernel logs?

CPU: How heavily used are the CPUs in general / in the worst cases? What is the load average / max, and how many cores do the CPUs have?

RAM: You are using a 10 GB heap and CMS, right? You seem to say that GC activity looks OK, can you confirm? How much total RAM are the machines using?

The point here is to see if we can spot the bottleneck. If there is none, Cassandra is probably badly configured at some point.

> Running the deletes slower at a constant pace sounds good and definitely I
> will try that.

Are clients and queries well configured to use all the nodes evenly? Are the deletes well balanced too? If not, balancing the usage of the nodes will probably alleviate things.

> The update of Cassandra is a good point, but I am afraid that if I start
> the updates right now the timeout problems will appear again. Are
> compactions executed during an update? If not, I think it is safe to
> update the cluster.

I do not recommend you to upgrade right now, indeed. Yet I would do it ASAP (= as soon as the cluster is healthy and clients are compatible with the new version). You should always start operations with a healthy cluster, or you might end up in a worse situation. Compactions will run normally.
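To gather answers to the disk / CPU / RAM questions above, a few standard commands on one of the hot nodes are usually enough. This is a hedged sketch, not an exact recipe: the log path and the Cassandra process name used for jstat are assumptions that may differ on your systems.

```shell
# Disk: per-device throughput, utilization and iowait, sampled over time.
iostat -mx 5 100

# Kernel / system logs: look for hardware errors (log path varies by distro).
dmesg | tail -n 50
grep -iE "error|fail" /var/log/messages | tail -n 50

# CPU: load average vs. number of cores.
uptime
nproc

# RAM: total memory usage on the box.
free -m

# JVM heap / GC: sample GC activity of the Cassandra process
# (assumes the process matches "CassandraDaemon"; 10 samples, 5 s apart).
jstat -gcutil "$(pgrep -f CassandraDaemon)" 5000 10
```

High iowait with near-saturated disk utilization points at storage; a load average well above the core count with low iowait points at CPU; frequent full GCs in the jstat output point at heap pressure.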
Make sure not to run any streaming process (repairs / bootstrap / node removal) during the upgrade and while you have not yet run "nodetool upgradesstables". There is a lot of information out there about upgrades.

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-04-05 10:32 GMT+02:00 Paco Trujillo <f.truji...@genetwister.nl>:

> Hi daemeon
>
> We have checked the network and it is OK; in fact the nodes are connected
> to each other via a dedicated network.
>
> From: daemeon reiydelle [mailto:daeme...@gmail.com]
> Sent: maandag 4 april 2016 18:42
> To: user@cassandra.apache.org
> Subject: Re: all the nodes are not reachable when running massive deletes
>
> Network issues. Could be jumbo frames not consistent or other.
>
> sent from my mobile
> Daemeon C.M. Reiydelle
> USA 415.501.0198
> London +44.0.20.8144.9872
>
> On Apr 4, 2016 5:34 AM, "Paco Trujillo" <f.truji...@genetwister.nl> wrote:
>
> Hi everyone
>
> We are having problems with our cluster (7 nodes, version 2.0.17) when
> running "massive deletes" on one of the nodes (via the cql command line).
> At the beginning everything is fine, but after a while we start getting
> constant NoHostAvailableException errors from the DataStax driver:
>
> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
> All host(s) tried for query failed (tried:
>   /172.31.7.243:9042 (com.datastax.driver.core.exceptions.DriverException:
>     Timeout while trying to acquire available connection (you may want to
>     increase the driver number of per-host connections)),
>   /172.31.7.245:9042 (com.datastax.driver.core.exceptions.DriverException:
>     Timeout while trying to acquire available connection (you may want to
>     increase the driver number of per-host connections)),
>   /172.31.7.246:9042 (com.datastax.driver.core.exceptions.DriverException:
>     Timeout while trying to acquire available connection (you may want to
>     increase the driver number of per-host connections)),
>   /172.31.7.247:9042, /172.31.7.232:9042, /172.31.7.233:9042,
>   /172.31.7.244:9042
>   [only showing errors of first 3 hosts, use getErrors() for more details])
>
> All the nodes are up:
>
> UN  172.31.7.244  152.21 GB  256  14.5%  58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
> UN  172.31.7.245  168.4 GB   256  14.5%  bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
> UN  172.31.7.246  177.71 GB  256  13.7%  8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
> UN  172.31.7.247  158.57 GB  256  14.1%  94022081-a563-4042-81ab-75ffe4d13194  RAC1
> UN  172.31.7.243  176.83 GB  256  14.6%  0dda3410-db58-42f2-9351-068bdf68f530  RAC1
> UN  172.31.7.233  159 GB     256  13.6%  01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
> UN  172.31.7.232  166.05 GB  256  15.0%  4d009603-faa9-4add-b3a2-fe24ec16a7c1
>
> but two of them have a high CPU load, especially 172.31.7.232, because I am
> running a lot of deletes using cqlsh on that node.
>
> I know that deletes generate tombstones, but with 7 nodes in the cluster I
> do not think it is normal that all the hosts become inaccessible.
> We have a replication factor of 3, and for the deletes I am not using any
> consistency level (so it is using the default, ONE).
>
> I checked the nodes with a lot of CPU (near 96%), and the GC activity
> remains at 1.6% (using only 3 GB of the 10 GB assigned). But looking at the
> thread pool stats, the pending column for the mutation stage grows without
> stopping; could that be the problem?
>
> I cannot find the reason for the timeouts. I have already increased the
> timeouts, but I do not think that is a solution, because the timeouts
> indicate another type of error. Does anyone have a tip for determining
> where the problem is?
>
> Thanks in advance
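On the "run the deletes slower at a constant pace" suggestion from earlier in the thread: one simple way to throttle cqlsh-driven deletes is to split the statements into chunks, set an explicit consistency level, and pause between chunks. This is only an illustrative sketch; the file name, chunk size, pause length, and target host are assumptions to adapt.

```shell
#!/bin/sh
# deletes.cql is assumed to hold one DELETE statement per line.
# Split it into chunks of 500 statements each.
split -l 500 deletes.cql chunk_
for f in chunk_*; do
  # Prepend an explicit consistency level instead of relying on the default ONE.
  { echo "CONSISTENCY QUORUM;"; cat "$f"; } | cqlsh 172.31.7.232
  sleep 5   # constant pace: gives compaction and the mutation stage room to drain
done
```

Pointing different chunks at different coordinator nodes (instead of always 172.31.7.232) would also spread the coordinator work that currently lands on a single node.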