Network issues. Could be inconsistent jumbo frame settings across the nodes, or something similar.
Sent from my mobile
Daemeon C.M. Reiydelle
USA 415.501.0198
London +44.0.20.8144.9872

On Apr 4, 2016 5:34 AM, "Paco Trujillo" <f.truji...@genetwister.nl> wrote:

> Hi everyone
>
> We are having problems with our cluster (7 nodes, version 2.0.17) when
> running "massive deletes" on one of the nodes (via the cql command line).
> At the beginning everything is fine, but after a while we start getting
> constant NoHostAvailableExceptions from the DataStax driver:
>
> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
> All host(s) tried for query failed (tried: /172.31.7.243:9042
> (com.datastax.driver.core.exceptions.DriverException: Timeout while trying
> to acquire available connection (you may want to increase the driver number
> of per-host connections)), /172.31.7.245:9042
> (com.datastax.driver.core.exceptions.DriverException: Timeout while trying
> to acquire available connection (you may want to increase the driver number
> of per-host connections)), /172.31.7.246:9042
> (com.datastax.driver.core.exceptions.DriverException: Timeout while trying
> to acquire available connection (you may want to increase the driver number
> of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042,
> /172.31.7.233:9042, /172.31.7.244:9042 [only showing errors of first 3
> hosts, use getErrors() for more details])
>
> All the nodes are up and running:
>
> UN  172.31.7.244  152.21 GB  256  14.5%  58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
> UN  172.31.7.245  168.4 GB   256  14.5%  bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
> UN  172.31.7.246  177.71 GB  256  13.7%  8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
> UN  172.31.7.247  158.57 GB  256  14.1%  94022081-a563-4042-81ab-75ffe4d13194  RAC1
> UN  172.31.7.243  176.83 GB  256  14.6%  0dda3410-db58-42f2-9351-068bdf68f530  RAC1
> UN  172.31.7.233  159 GB     256  13.6%  01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
> UN  172.31.7.232  166.05 GB  256  15.0%  4d009603-faa9-4add-b3a2-fe24ec16a7c1
>
> but two of them have a high CPU load, especially 172.31.7.232, because I
> am running a lot of deletes using cqlsh on that node.
>
> I know that deletes generate tombstones, but with 7 nodes in the cluster
> I do not think it is normal that all the hosts become inaccessible.
>
> We have a replication factor of 3, and for the deletes I am not setting
> any consistency level (so it is using the default, ONE).
>
> I checked the nodes with high CPU (near 96%): GC activity stays around
> 1.6%, using only 3 GB of the 10 GB assigned. But looking at the thread
> pool stats, the pending column for the mutation stage grows without
> stopping. Could that be the problem?
>
> I cannot find what is causing the timeouts. I have already increased the
> timeout settings, but I do not think that is a solution, because the
> timeouts point to another type of error. Does anyone have a tip on how to
> determine where the problem is?
>
> Thanks in advance
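The driver error above explicitly suggests raising the number of per-host connections. With the DataStax Java driver 2.x this is done through `PoolingOptions` at `Cluster` build time. A minimal sketch (the contact point is taken from the thread; the connection counts are illustrative, not recommendations):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

public class PoolConfig {
    public static void main(String[] args) {
        // Raise the per-host connection pool limits so requests queue less
        // while waiting for a free connection (the condition behind the
        // "Timeout while trying to acquire available connection" error).
        PoolingOptions pooling = new PoolingOptions()
                .setCoreConnectionsPerHost(HostDistance.LOCAL, 4)
                .setMaxConnectionsPerHost(HostDistance.LOCAL, 8);

        Cluster cluster = Cluster.builder()
                .addContactPoint("172.31.7.243")
                .withPoolingOptions(pooling)
                .build();

        // ... use cluster.connect(...) as usual ...
        cluster.close();
    }
}
```

Note that a bigger pool only hides the symptom if the nodes themselves are falling behind; the growing mutation-stage backlog described above points at server-side pressure, not just client-side pool exhaustion.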
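Since the deletes currently run at the default consistency level ONE against a single coordinator, one common mitigation is to issue them through the driver with an explicit consistency level and a throttle, spreading load across the cluster instead of flooding one node from cqlsh. A hedged sketch, assuming a hypothetical keyspace `my_keyspace`, table `my_table`, and partition keys:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ThrottledDeletes {
    public static void main(String[] args) throws InterruptedException {
        Cluster cluster = Cluster.builder()
                .addContactPoint("172.31.7.243")
                .build();
        Session session = cluster.connect("my_keyspace"); // hypothetical keyspace

        String[] keys = {"k1", "k2", "k3"}; // hypothetical partition keys to delete
        for (String key : keys) {
            SimpleStatement delete =
                    new SimpleStatement("DELETE FROM my_table WHERE id = ?", key);
            // QUORUM makes the coordinator wait for 2 of 3 replicas (RF=3),
            // surfacing replica overload early instead of silently queuing writes.
            delete.setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(delete);

            // Crude client-side throttle so the mutation stage on the
            // coordinator does not accumulate an unbounded pending backlog.
            Thread.sleep(10);
        }
        cluster.close();
    }
}
```

The driver's token-aware load balancing will also route each delete to a replica for its key, which the single-node cqlsh session cannot do.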