Yes. 192 GB.

On Thu, Mar 11, 2021 at 10:29 AM Kane Wilson <k...@raft.so> wrote:

> That is a very large heap.  I presume you are using G1GC? How much memory
> do your servers have?
>
> raft.so - Cassandra consulting, support, managed services
>
> On Thu., 11 Mar. 2021, 18:29 Gil Ganz, <gilg...@gmail.com> wrote:
>
>> I always prefer to do a decommission, but the issue here is that these servers
>> are on-prem, and disks die from time to time.
>> It's a very large cluster, spread across multiple datacenters around the world,
>> so it can take some time before we have a replacement, which is why we usually
>> need to run removenode in such cases.
>>
>> Other than that there are no issues in the cluster and the load is
>> reasonable. When this happens, following a removenode, this huge number of
>> pending NTR is what I see; the weird thing is that it's only on some nodes.
>> I have been running with a very small
>> native_transport_max_concurrent_requests_in_bytes setting for a few days
>> now on some nodes (a few MB, compared to the default of 0.8 of a 60 GB heap).
>> It looks like it's good enough for the app, so I will roll it out to the
>> entire DC and test removal again.
>>
>>
>> On Tue, Mar 9, 2021 at 10:51 AM Kane Wilson <k...@raft.so> wrote:
>>
>>> It's unlikely to help in this case, but you should be using nodetool
>>> decommission on the node you want to remove rather than removenode from
>>> another node (and definitely don't force removal).
>>>
>>> native_transport_max_concurrent_requests_in_bytes defaults to 10% of the
>>> heap, which I suppose, depending on your configuration, could result in a
>>> smaller number of concurrent requests than before. It's worth a shot to set
>>> it higher and see if the issue is related. Is this the only issue you see
>>> on the cluster? I assume load on the cluster is still low/reasonable and
>>> the only symptom you're seeing is the increased pending NTR requests?
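>>>
>>> (As a rough worked number under that assumption: with a 60 GB heap the
>>> default would be about 0.1 x 60 GB = 6 GB of in-flight request bytes.)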
>>>
>>> raft.so - Cassandra consulting, support, and managed services
>>>
>>>
>>> On Mon, Mar 8, 2021 at 10:47 PM Gil Ganz <gilg...@gmail.com> wrote:
>>>
>>>>
>>>> Hey,
>>>> We have a 3.11.9 cluster (recently upgraded from 2.1.14), and after the
>>>> upgrade we have an issue when we remove a node.
>>>>
>>>> The moment I run the removenode command, 3 servers in the same DC start
>>>> to have a high number of pending native-transport-requests (getting to
>>>> around 1M) and clients have issues because of that. We are using vnodes
>>>> (32), so I don't see why 3 servers would be busier than the others (RF is
>>>> 3, but I don't see why that would be related).
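>>>>
>>>> (For context, the pending count can be watched per node with something
>>>> like: nodetool tpstats | grep Native-Transport-Requests.)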
>>>>
>>>> Each node has a few TB of data, and in the past we were able to remove a
>>>> node in about half a day. Today what happens is that in the first 1-2 hours
>>>> we have these issues with some nodes, then things go quiet, the remove is
>>>> still running and clients are ok. A few hours later the same issue is back
>>>> (with the same nodes as the problematic ones) and clients have issues again,
>>>> leading us to run removenode force.
>>>>
>>>> Reducing the stream throughput and the number of compactors has helped
>>>> to mitigate the issues a bit, but we still have this problem of pending
>>>> native-transport-requests getting to insane numbers and clients suffering,
>>>> eventually forcing us to run removenode force. Any ideas?
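>>>>
>>>> (For reference, the runtime knobs involved are along these lines; the
>>>> numbers are only examples, not a recommendation:
>>>>   nodetool setstreamthroughput 20       # outbound streaming cap, megabits/s
>>>>   nodetool setcompactionthroughput 16   # compaction cap, MB/s
>>>> plus concurrent_compactors in cassandra.yaml for the number of compactors.)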
>>>>
>>>> I saw that since 3.11.6 there is a parameter,
>>>> native_transport_max_concurrent_requests_in_bytes. I'm looking into setting
>>>> it, as perhaps that will keep the number of pending tasks from getting so
>>>> high.
>>>>
>>>> Gil
>>>>
>>>
