> > Overuse of the cluster was one thing I was thinking about, and I have
> > requested two new nodes (it was something already planned anyway). But the
> > pattern of high CPU load is only visible on one or two of the nodes; the
> > rest are working correctly. That made me think that adding two new nodes
> > may not help.
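If only one or two nodes are hot, it is worth checking whether those nodes are doing disproportionate work before adding hardware. A rough sketch, not a definitive procedure: it assumes SSH access to each node and that nodetool is on the PATH there; the host list is taken from this thread and may need adjusting.

```shell
# Compare load average and thread-pool backlog across all nodes.
HOSTS="172.31.7.243 172.31.7.245 172.31.7.246 172.31.7.247 172.31.7.232 172.31.7.233 172.31.7.244"
for h in $HOSTS; do
  echo "== $h =="
  # uptime: load average vs. the other nodes.
  # tpstats: look for stages with growing Pending counts on the hot nodes only.
  ssh "$h" 'uptime; nodetool tpstats | grep -E "Pending|Mutation|Read"'
done
```

If the pending mutation counts climb on just the hot nodes while the others stay near zero, the problem is uneven load rather than overall cluster capacity.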
Well, then you could try replacing this node as soon as you have more nodes available. I would use this procedure, as I believe it is the most efficient one: http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html

Yet I believe it might not be a hardware or cluster throughput issue, and if it is a hardware issue you will probably want to dig into it, as this machine is yours and not a virtual one. You might want to reuse it anyway.

Some questions about the machines and their usage:

Disk: What disk hardware and configuration do you use? What does "iostat -mx 5 100" give you? How is iowait? Any errors in the system / kernel logs?

CPU: How heavily used are the CPUs in general / in the worst cases? What is the load average / max, and how many cores do the CPUs have?

RAM: You are using a 10 GB heap and CMS, right? You seem to say that GC activity looks OK, can you confirm? How much total RAM are the machines using?

The point here is to see if we can spot the bottleneck. If there is none, Cassandra is probably badly configured at some point.

> Running the deletes slower at a constant pace sounds good and definitely I
> will try that.

Are clients and queries well configured to use all the nodes evenly? Are the deletes well balanced too? If not, balancing the usage of the nodes will probably alleviate things.

> The update of Cassandra is a good point, but I am afraid that if I start
> the updates right now the timeout problems will appear again. Are
> compactions executed during an update? If not, I think it is safe to
> update the cluster.

I do not recommend you to upgrade right now, indeed. Yet I would do it ASAP (= as soon as the cluster is healthy and clients are compatible with the new version). You should always start operations with a healthy cluster, or you might end up in a worse situation. Compactions will run normally.
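To gather answers to the disk / CPU / RAM questions above, a few standard commands on one of the hot nodes are usually enough. This is a hedged sketch, not an exact recipe: the log path and the Cassandra process name used for jstat are assumptions that may differ on your systems.

```shell
# Disk: per-device throughput, utilization and iowait, sampled over time.
iostat -mx 5 100

# Kernel / system logs: look for hardware errors (log path varies by distro).
dmesg | tail -n 50
grep -iE "error|fail" /var/log/messages | tail -n 50

# CPU: load average vs. number of cores.
uptime
nproc

# RAM: total memory usage on the box.
free -m

# JVM heap / GC: sample GC activity of the Cassandra process
# (assumes the process matches "CassandraDaemon"; 10 samples, 5 s apart).
jstat -gcutil "$(pgrep -f CassandraDaemon)" 5000 10
```

High iowait with near-saturated disk utilization points at storage; a load average well above the core count with low iowait points at CPU; frequent full GCs in the jstat output point at heap pressure.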
Make sure not to run any streaming process (repairs / bootstrap / node removal) during the upgrade and while you have not yet run "nodetool upgradesstables". There is a lot of information out there about upgrades.

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-04-05 10:32 GMT+02:00 Paco Trujillo <f.truji...@genetwister.nl>:

> Hi daemeon
>
> We have checked the network and it is OK; in fact the nodes are connected
> to each other via a dedicated network.
>
> From: daemeon reiydelle [mailto:daeme...@gmail.com]
> Sent: maandag 4 april 2016 18:42
> To: user@cassandra.apache.org
> Subject: Re: all the nodes are not reachable when running massive deletes
>
> Network issues. Could be jumbo frames not consistent or other.
>
> sent from my mobile
> Daemeon C.M. Reiydelle
> USA 415.501.0198
> London +44.0.20.8144.9872
>
> On Apr 4, 2016 5:34 AM, "Paco Trujillo" <f.truji...@genetwister.nl> wrote:
>
> Hi everyone
>
> We are having problems with our cluster (7 nodes, version 2.0.17) when
> running "massive deletes" on one of the nodes (via the cql command line).
> At the beginning everything is fine, but after a while we start getting
> constant NoHostAvailableException errors from the DataStax driver:
>
> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
> All host(s) tried for query failed (tried:
>   /172.31.7.243:9042 (com.datastax.driver.core.exceptions.DriverException:
>     Timeout while trying to acquire available connection (you may want to
>     increase the driver number of per-host connections)),
>   /172.31.7.245:9042 (com.datastax.driver.core.exceptions.DriverException:
>     Timeout while trying to acquire available connection (you may want to
>     increase the driver number of per-host connections)),
>   /172.31.7.246:9042 (com.datastax.driver.core.exceptions.DriverException:
>     Timeout while trying to acquire available connection (you may want to
>     increase the driver number of per-host connections)),
>   /172.31.7.247:9042, /172.31.7.232:9042, /172.31.7.233:9042,
>   /172.31.7.244:9042
>   [only showing errors of first 3 hosts, use getErrors() for more details])
>
> All the nodes are up:
>
> UN  172.31.7.244  152.21 GB  256  14.5%  58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
> UN  172.31.7.245  168.4 GB   256  14.5%  bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
> UN  172.31.7.246  177.71 GB  256  13.7%  8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
> UN  172.31.7.247  158.57 GB  256  14.1%  94022081-a563-4042-81ab-75ffe4d13194  RAC1
> UN  172.31.7.243  176.83 GB  256  14.6%  0dda3410-db58-42f2-9351-068bdf68f530  RAC1
> UN  172.31.7.233  159 GB     256  13.6%  01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
> UN  172.31.7.232  166.05 GB  256  15.0%  4d009603-faa9-4add-b3a2-fe24ec16a7c1
>
> but two of them have a high CPU load, especially 172.31.7.232, because I am
> running a lot of deletes using cqlsh on that node.
>
> I know that deletes generate tombstones, but with 7 nodes in the cluster I
> do not think it is normal that all the hosts become inaccessible.
> We have a replication factor of 3, and for the deletes I am not using any
> consistency level (so it is using the default, ONE).
>
> I checked the nodes with a lot of CPU (near 96%), and the GC activity
> remains at 1.6% (using only 3 GB of the 10 GB assigned). But looking at the
> thread pool stats, the pending column for the mutation stage grows without
> stopping; could that be the problem?
>
> I cannot find the reason for the timeouts. I have already increased the
> timeouts, but I do not think that is a solution, because the timeouts
> indicate another type of error. Does anyone have a tip for determining
> where the problem is?
>
> Thanks in advance
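On the "run the deletes slower at a constant pace" suggestion from earlier in the thread: one simple way to throttle cqlsh-driven deletes is to split the statements into chunks, set an explicit consistency level, and pause between chunks. This is only an illustrative sketch; the file name, chunk size, pause length, and target host are assumptions to adapt.

```shell
#!/bin/sh
# deletes.cql is assumed to hold one DELETE statement per line.
# Split it into chunks of 500 statements each.
split -l 500 deletes.cql chunk_
for f in chunk_*; do
  # Prepend an explicit consistency level instead of relying on the default ONE.
  { echo "CONSISTENCY QUORUM;"; cat "$f"; } | cqlsh 172.31.7.232
  sleep 5   # constant pace: gives compaction and the mutation stage room to drain
done
```

Pointing different chunks at different coordinator nodes (instead of always 172.31.7.232) would also spread the coordinator work that currently lands on a single node.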