Fwd: Re: How to gracefully decommission a highly loaded node?

2018-12-06 Thread onmstester onmstester
After a few hours, I just removed the node. Then I decommissioned another node,
which finished successfully (the writer app was down, so there was no pressure on the
cluster). I then started a third node's decommission. Since I didn't have time to
wait for the decommission to finish, I started the writer application when
most of the decommissioning node's streaming was done and only a few GBs to
two other nodes remained to be streamed. After 12 hours I checked the
decommissioning node, and netstats says: LEAVING, Restore Replica Count! So I
just ran removenode on this one too. Is there something wrong with
decommissioning while someone is writing to the cluster? Using Apache Cassandra
3.11.2. Sent using Zoho Mail

 Forwarded message 
From : onmstester onmstester
To : "user"
Date : Wed, 05 Dec 2018 09:00:34 +0330
Subject : Fwd: Re: How to gracefully decommission a highly loaded node?
 Forwarded message 
After a long time stuck in LEAVING and "not doing any streams", I killed the
Cassandra process and restarted it, then ran nodetool decommission again (the
DataStax recipe for a stuck decommission). Now it says LEAVING, "unbootstrap
$(the node id)". What's going on? Should I forget about decommission and just
remove the node? There is an issue to make decommission resumable:
https://issues.apache.org/jira/browse/CASSANDRA-12008
but I couldn't figure out how this is supposed to work. I was expecting that
after restarting the stuck-decommission Cassandra, it would resume the
decommissioning process, but the node came back UN after the restart. Sent
using Zoho Mail
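
For reference, a minimal sketch of the sequence described above (these are
standard nodetool subcommands; the host ID placeholder is mine, and falling
back to removenode is only what this thread ended up doing, not a general
recommendation):

  # on the leaving node: check the mode and any active streams
  nodetool netstats
  nodetool status

  # if the decommission is stuck with no streams: restart Cassandra, then retry
  nodetool decommission

  # last resort, run from another live node, with the host ID from 'nodetool status'
  nodetool removenode <host-id>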
 Forwarded message 
From : Simon Fontana Oscarsson
To : "user@cassandra.apache.org"
Date : Tue, 04 Dec 2018 15:20:15 +0330
Subject : Re: How to gracefully decommission a highly loaded node?

Hi,

If it already uses 100 % CPU I have a hard time seeing it being able to do a
decommission while serving requests. If you have a lot of free space I would
first try nodetool disableautocompaction. If you don't see any progress in
nodetool netstats, you can also disablebinary, disablethrift and disablehandoff
to stop serving client requests (a command sketch follows at the end of this
message).

-- 
SIMON FONTANA OSCARSSON
Software Developer

Ericsson
Ölandsgatan 1
37133 Karlskrona, Sweden
simon.fontana.oscars...@ericsson.com
www.ericsson.com

On tis, 2018-12-04 at 14:21 +0330, onmstester onmstester wrote:
> One node suddenly uses 100% CPU. I suspect hardware problems and do not
> have time to trace that, so I decided to just remove the node from the
> cluster, but although the node state changed to UL, there is no sign of
> leaving: the node is still compacting, flushing memtables and writing
> mutations, and CPU has been at 100% for hours since. Is there any means to
> force a Cassandra node to just decommission and stop doing normal things?
> Due to W.CL=ONE, I cannot use removenode and shut down the node.
> Best Regards
> Sent using Zoho Mail
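
A minimal sketch of the sequence Simon suggests above, run on the overloaded
node (all standard nodetool subcommands):

  nodetool disableautocompaction   # stop compactions competing for CPU/IO
  nodetool disablebinary           # stop serving CQL clients
  nodetool disablethrift           # stop serving Thrift clients
  nodetool disablehandoff          # stop hinted handoff delivery
  nodetool decommission            # then watch progress with 'nodetool netstats'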


Re: Sporadic high IO bandwidth and Linux OOM killer

2018-12-06 Thread Riccardo Ferrari
Alex,

I had a few instances in the past that were showing that unresponsiveness
behaviour. Back then I saw with iotop/htop/dstat that the system was stuck
on a single thread processing (full throttle) for seconds. According to
iotop that was the kswapd0 process. That system was an Ubuntu 16.04,
actually "Ubuntu 16.04.4 LTS".

From there I started to dig into why a kswapd process was involved on a system
with no swap, and found that it is used for mmapping. This erratic (allow me to
say erratic) behaviour was not showing up when I was on 3.0.6 but started
right after upgrading to 3.0.17.

By "load" I refer to the load as reported by `nodetool status`. On my
systems, when disk_access_mode is auto (read: mmap), the virtual memory is the
sum of the node load plus the JVM heap size. Of course this is just what I
noted on my systems; I'm not really sure whether that is the case on yours too.
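
A rough way to eyeball the relation described above on a given node (assuming
nodetool and standard Linux tools; exact output layout varies):

  nodetool status                  # per-node data load
  nodetool info | grep -i heap     # configured and used JVM heap
  ps -o vsz=,rss=,cmd= -C java     # virtual vs. resident size of the Cassandra JVM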

I hope someone with more experience than me will add a comment about your
settings. Reading the configuration file, writers and compactors should be
2 at minimum. I can confirm that when I tried in the past to change
concurrent_compactors to 1, I had really bad things happening (high system
load, high message drop rate, ...).
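
A quick way to inspect the settings being discussed here (the cassandra.yaml
path is an assumption; the throughput command is standard nodetool):

  grep -E '^(concurrent_reads|concurrent_writes|concurrent_compactors)' /etc/cassandra/cassandra.yaml
  nodetool getcompactionthroughput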

I have the "feeling" that, when running on constrained hardware, the underlying
kernel optimization is a must. I agree with Jonathan H. that you should
think about increasing the instance size; CPU and memory matter a lot.

Best,


On Wed, Dec 5, 2018 at 10:36 PM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Wed, 5 Dec 2018, 19:34 Riccardo Ferrari 
>> Hi Alex,
>>
>> I saw that behaviour in the past.
>>
>
> Riccardo,
>
> Thank you for the reply!
>
> Do you refer to the kswapd issue only, or have you observed more problems
> that match the behavior I have described?
>
> I can tell you the kswapd0 usage is connected to the `disk_access_mode`
>> property. On 64-bit systems it defaults to mmap.
>>
>
> Hm, that's interesting, I will double-check.
>
> That also explains why your virtual memory is so high (it somehow matches
>> the node load, right?).
>>
>
> Not sure what you mean by "load" here. We have a bit less than 1.5TB
> per node on average.
>
> Regards,
> --
> Alex
>
>


Re: Sporadic high IO bandwidth and Linux OOM killer

2018-12-06 Thread Oleksandr Shulgin
On Thu, Dec 6, 2018 at 11:14 AM Riccardo Ferrari  wrote:

>
> I had a few instances in the past that were showing that unresponsiveness
> behaviour. Back then I saw with iotop/htop/dstat that the system was stuck
> on a single thread processing (full throttle) for seconds. According to
> iotop that was the kswapd0 process. That system was an Ubuntu 16.04,
> actually "Ubuntu 16.04.4 LTS".
>

Riccardo,

Did you by chance also observe Linux OOM?  How long did the
unresponsiveness last in your case?

> From there I started to dig into why a kswapd process was involved on a system
> with no swap, and found that it is used for mmapping. This erratic (allow me to
> say erratic) behaviour was not showing up when I was on 3.0.6 but started
> right after upgrading to 3.0.17.
>
> By "load" I refer to the load as reported by `nodetool status`. On my
> systems, when disk_access_mode is auto (read: mmap), the virtual memory is the
> sum of the node load plus the JVM heap size. Of course this is just what I
> noted on my systems; I'm not really sure whether that is the case on yours too.
>

I've checked and indeed we are using disk_access_mode=auto (well,
implicitly, because it's not even part of the config file anymore):
DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap.
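
For what it's worth, disk_access_mode can still be set explicitly even though
it was dropped from the shipped cassandra.yaml; a sketch, assuming the usual
config path (it is an undocumented setting, so test before relying on it):

  grep -i disk_access_mode /etc/cassandra/cassandra.yaml || echo "not set -> auto (mmap data + index on 64-bit)"
  # possible explicit values to add to cassandra.yaml:
  #   disk_access_mode: mmap_index_only   # mmap only index files
  #   disk_access_mode: standard          # no mmapping at all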

> I hope someone with more experience than me will add a comment about your
> settings. Reading the configuration file, writers and compactors should be
> 2 at minimum. I can confirm that when I tried in the past to change
> concurrent_compactors to 1, I had really bad things happening (high system
> load, high message drop rate, ...).
>

As I've mentioned, we did not observe any other issues with the current
setup: system load is reasonable, no dropped messages, no big number of
hints, request latencies are OK, no big number of pending compactions.
Also during repair everything looks fine.

> I have the "feeling" that, when running on constrained hardware, the underlying
> kernel optimization is a must. I agree with Jonathan H. that you should
> think about increasing the instance size; CPU and memory matter a lot.
>

How did you solve your issue in the end?  You didn't roll back to 3.0.6?
Did you tune kernel parameters?  Which ones?

Thank you!
--
Alex


streaming errors with sstableloader

2018-12-06 Thread Ivan Iliev
Hello community,

I'm receiving some strange streaming errors while trying to restore certain
sstable snapshots with sstableloader to a new cluster.
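
For reference, a sketch of the kind of invocation used here (the backup path,
keyspace and table names are placeholders; -d takes the initial contact nodes):

  # sstableloader expects a <keyspace>/<table> directory containing the sstables,
  # so snapshot files are usually copied or hard-linked into such a layout first
  sstableloader -d 10.35.81.76,10.35.81.79,10.35.81.88 /path/to/backup/my_keyspace/my_table/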

While the cluster is up and running and the nodes are communicating with
each other, I can see streams failing to the nodes for no obvious reason,
and the only exception thrown is:

ERROR 14:00:08,403 [Stream #3d572210-f95f-11e8-bf2d-01149b1d085c] Streaming error occurred on session with peer 10.35.81.88
java.lang.NullPointerException: null
   at org.apache.cassandra.db.SerializationHeader$Component.access$400(SerializationHeader.java:271) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.db.SerializationHeader$Serializer.serialize(SerializationHeader.java:445) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.messages.FileMessageHeader$FileMessageHeaderSerializer.serialize(FileMessageHeader.java:216) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.messages.OutgoingFileMessage.serialize(OutgoingFileMessage.java:94) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:52) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:41) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:50) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:408) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:380) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at java.lang.Thread.run(Thread.java:748) [na:1.8.0_191]
progress: [/10.35.81.88]0:0/3 0  % [/10.35.81.79]0:1/3 0  % [cassandra01-test.sofia.elex.be/10.35.81.76]0:1/3 0  % total: 0% 2.652KiB/s (avg: 2.652KiB/s)
progress: [/10.35.81.88]0:0/3 0  % [/10.35.81.79]0:1/3 0  % [cassandra01-test.sofia.elex.be/10.35.81.76]0:1/3 0  % total: 0% 0.000KiB/s (avg: 2.651KiB/s)
progress: [/10.35.81.88]0:0/3 0  % [/10.35.81.79]0:1/3 0  % [cassandra01-test.sofia.elex.be/10.35.81.76]0:1/3 0  % total: 0% 0.000KiB/s (avg: 2.650KiB/s)
ERROR 14:00:08,406 [Stream #3d572210-f95f-11e8-bf2d-01149b1d085c] Streaming error occurred on session with peer 10.35.81.79
java.lang.NullPointerException: null
   at org.apache.cassandra.db.SerializationHeader$Component.access$400(SerializationHeader.java:271) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.db.SerializationHeader$Serializer.serialize(SerializationHeader.java:445) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.messages.FileMessageHeader$FileMessageHeaderSerializer.serialize(FileMessageHeader.java:216) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.messages.OutgoingFileMessage.serialize(OutgoingFileMessage.java:94) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:52) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:41) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:50) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:408) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:380) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at java.lang.Thread.run(Thread.java:748) [na:1.8.0_191]
progress: [/10.35.81.88]0:0/3 0  % [/10.35.81.79]0:1/3 0  % [cassandra01-test.sofia.elex.be/10.35.81.76]0:1/3 0  % total: 0% 0.000KiB/s (avg: 2.650KiB/s)
ERROR 14:00:08,407 [Stream #3d572210-f95f-11e8-bf2d-01149b1d085c] Remote peer 10.35.81.88 failed stream session.
ERROR 14:00:08,408 [Stream #3d572210-f95f-11e8-bf2d-01149b1d085c] Streaming error occurred on session with peer 10.35.81.76
java.lang.NullPointerException: null
   at org.apache.cassandra.db.SerializationHeader$Component.access$400(SerializationHeader.java:271) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.db.SerializationHeader$Serializer.serialize(SerializationHeader.java:445) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.messages.FileMessageHeader$FileMessageHeaderSerializer.serialize(FileMessageHeader.java:216) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.cassandra.streaming.messages.OutgoingFileMessage.serialize(OutgoingFileMessage.java:94) ~[apache-cassandra-3.11.3.jar:3.11.3]
   at org.apache.ca

Re: Sporadic high IO bandwidth and Linux OOM killer

2018-12-06 Thread Riccardo Ferrari
Hi,

To be honest I've never seen the OOM killer in action on those instances. My Xmx
was 8GB just like yours, and that makes me think you have some other process
competing for memory. Is that the case? Do you have any cron job, any backup,
anything that could trigger the OOM killer?
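
A quick way to answer both questions on the affected instance (assuming
standard Linux tooling; log locations may differ):

  dmesg -T | grep -iE 'out of memory|oom-killer'   # did the kernel OOM killer fire, and against which process?
  ps -eo pid,rss,comm --sort=-rss | head           # current biggest resident-memory consumers
  crontab -l; ls /etc/cron.d/                      # scheduled jobs that could spike memory use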

My unresponsiveness was seconds long. This is/was bad because the gossip
protocol was going crazy marking nodes down, with all the consequences
this can have in a distributed system: think about hints, the dynamic snitch,
and whatever else depends on node availability ...
Can you share some numbers about your `tpstats` or system load in general?
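
The kind of numbers being asked for here can be gathered with standard nodetool
and basic system tools, e.g.:

  nodetool tpstats                 # thread pool pending/blocked/dropped counters
  nodetool tablestats | head -50   # per-table latencies and sizes
  uptime; iostat -x 5 3            # system load and a short disk-utilisation sample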

On the tuning side I just went through the following article:
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html
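
A rough sketch of the kind of kernel settings that article covers (the values
below are commonly cited recommendations from memory, not quoted from the page,
so double-check against it):

  sudo swapoff -a                           # Cassandra strongly prefers no swap
  sudo sysctl -w vm.max_map_count=1048575   # room for heavy mmap usage
  sudo sysctl -w vm.zone_reclaim_mode=0
  # plus raised limits for the cassandra user in /etc/security/limits.d/, e.g.:
  #   cassandra - memlock unlimited
  #   cassandra - nofile  100000
  #   cassandra - nproc   32768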

No rollbacks, just moving forward! Right now we are upgrading the instance
size to something more recent than m1.xlarge (for many different reasons,
including security, ECU and network). Nevertheless, it might be a good idea
to upgrade to the 3.x branch to leverage its better off-heap memory
management.

Best,


On Thu, Dec 6, 2018 at 2:33 PM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Thu, Dec 6, 2018 at 11:14 AM Riccardo Ferrari 
> wrote:
>
>>
>> I had a few instances in the past that were showing that unresponsiveness
>> behaviour. Back then I saw with iotop/htop/dstat that the system was stuck
>> on a single thread processing (full throttle) for seconds. According to
>> iotop that was the kswapd0 process. That system was an Ubuntu 16.04,
>> actually "Ubuntu 16.04.4 LTS".
>>
>
> Riccardo,
>
> Did you by chance also observe Linux OOM?  How long did the
> unresponsiveness last in your case?
>
>> From there I started to dig into why a kswapd process was involved on a system
>> with no swap, and found that it is used for mmapping. This erratic (allow me to
>> say erratic) behaviour was not showing up when I was on 3.0.6 but started
>> right after upgrading to 3.0.17.
>>
>> By "load" I refer to the load as reported by `nodetool status`. On my
>> systems, when disk_access_mode is auto (read: mmap), the virtual memory is the
>> sum of the node load plus the JVM heap size. Of course this is just what I
>> noted on my systems; I'm not really sure whether that is the case on yours too.
>>
>
> I've checked and indeed we are using disk_access_mode=auto (well,
> implicitly, because it's not even part of the config file anymore):
> DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap.
>
>> I hope someone with more experience than me will add a comment about your
>> settings. Reading the configuration file, writers and compactors should be
>> 2 at minimum. I can confirm that when I tried in the past to change
>> concurrent_compactors to 1, I had really bad things happening (high system
>> load, high message drop rate, ...).
>>
>
> As I've mentioned, we did not observe any other issues with the current
> setup: system load is reasonable, no dropped messages, no big number of
> hints, request latencies are OK, no big number of pending compactions.
> Also during repair everything looks fine.
>
>> I have the "feeling" that, when running on constrained hardware, the underlying
>> kernel optimization is a must. I agree with Jonathan H. that you should
>> think about increasing the instance size; CPU and memory matter a lot.
>>
>
> How did you solve your issue in the end?  You didn't roll back to 3.0.6?
> Did you tune kernel parameters?  Which ones?
>
> Thank you!
> --
> Alex
>
>


Determine size of duplicate data

2018-12-06 Thread Ian Spence
Hello,

Is there a way to determine the size of “duplicate” data, i.e. data that a node
no longer owns after expanding the ring? The data that is removed by a nodetool
cleanup?
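
There is no direct counter for this, but a rough way to estimate it is to
measure the data directory before and after cleanup (data path and keyspace
name are placeholders; snapshots under the same path would skew the numbers):

  du -sh /var/lib/cassandra/data/my_keyspace   # size before
  nodetool cleanup my_keyspace                 # drops data this node no longer owns
  du -sh /var/lib/cassandra/data/my_keyspace   # size after; the difference is the no-longer-owned data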

Thank you!

Ian Spence
intermediate devops engineer
Global Relay

ian.spe...@globalrelay.net
Cell: +1 (778) 227-0552



Tool to decide which node to decommission (using vnodes)

2018-12-06 Thread John Sumsion
Here is a tool I worked on that figures out which node to decommission so that
you are left with the most even token balance afterwards.

https://github.com/jdsumsion/vnode-decommission-calculator

Feel free to use or enhance as you desire.


John...


Re: Migrating from DSE5.1.2 to Opensource cassandra

2018-12-06 Thread Jonathan Koppenhofer
Just to add a few additional notes on the in-place replacement (a rough shell
sketch of these steps follows the list):
* We had to remove system.local and system.peers.
* Since we remove those system tables, you also have to put
replace_address_first_boot in cassandra-env with the node's own IP address.
* We also temporarily add the node as a seed to keep the node from
bootstrapping.
* Don't forget to switch your config back to "normal" after the node
is back up and running.
* Probably unrelated to this process, but even after drain, when we
originally stopped the node, we noticed DSE did not clean up the commitlogs
even though the logs said those files were drained. So we had to forcefully
remove the commitlogs before bringing the node back up.
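
A rough shell sketch of the in-place steps listed above (paths, service name
and the exact place to put the flag are assumptions based on this thread; treat
it as an outline to adapt and test, not a recipe):

  nodetool drain && sudo service cassandra stop

  # remove the DSE-flavoured local/peers system tables and stale commitlogs
  sudo rm -rf /var/lib/cassandra/data/system/local-*/ /var/lib/cassandra/data/system/peers-*/
  sudo rm -rf /var/lib/cassandra/commitlog/*

  # have the node reclaim its own ranges instead of bootstrapping fresh
  echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<this_node_ip>"' \
      | sudo tee -a /etc/cassandra/cassandra-env.sh
  # temporarily list this node's own IP in the seed list of cassandra.yaml, then:
  sudo service cassandra start

  # once the node is back UN, revert cassandra-env.sh and the seed list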

Finally... Be sure you test this pretty well. We did this on many clusters,
but your mileage may vary depending on version of DSE and the features you
use.



On Thu, Dec 6, 2018, 1:23 AM Brooke Thorley wrote:

> Jonathan's high-level process for in-place conversion looks right.
>
> To answer your original question about versioning, the DSE release notes list
> the equivalent Cassandra version as 3.11.0:
>
> DataStax Enterprise 5.1.2 - DataStax Enterprise 5.1.10: Apache Cassandra™ 3.11.0 (updated)
>
>
> Kind Regards,
> *Brooke Thorley*
> *VP Technical Operations & Customer Services*
> supp...@instaclustr.com | support.instaclustr.com
>
>
> On Thu, 6 Dec 2018 at 17:19, Dor Laor  wrote:
>
>> An alternative approach is to form another new cluster and leave the
>> original cluster alive (many times it's a must, since it needs to be 24x7
>> online). Double write to the two clusters and later migrate the data to the
>> new one, either by taking a snapshot and passing those files to the new
>> cluster or with sstableloader. With this procedure, you'll need to have the
>> same token range ownership.
>>
>> Another solution is to migrate using Spark, which will do a full table scan.
>> We have generic code that does it and we can open source it. This way the
>> new cluster can be of any size, and speed is also good with large amounts of
>> data (100s of TB). This process is also restartable, as it takes days to
>> transfer such an amount of data.
>>
>> Good luck
>>
>> On Tue, Dec 4, 2018 at 9:04 PM dinesh.jo...@yahoo.com.INVALID
>>  wrote:
>>
>>> Thanks, nice summary of the overall process.
>>>
>>> Dinesh
>>>
>>>
>>> On Tuesday, December 4, 2018, 9:38:47 PM EST, Jonathan Koppenhofer <
>>> j...@koppedomain.com> wrote:
>>>
>>>
>>> Unfortunately, we found this to be a little tricky. We did migrations
>>> from DSE 4.8 and 5.0 to OSS 3.0.x, so you may run into additional issues. I
>>> will also say your best option may be to install a fresh cluster and stream
>>> the data. This wasn't feasible for us given the size and scale, the time
>>> frames and the infrastructure restrictions we had. I will have to review my
>>> notes for more detail, but off the top of my head, for an in place
>>> migration...
>>>
>>> Pre-upgrade
>>> * Be sure you are not using any Enterprise features like Search or
>>> Graph. Not only are there no equivalent features in open source, but
>>> these features require proprietary classes to be in the classpath, or
>>> Cassandra will not even start up.
>>> * By default, I think DSE uses its own custom authenticators,
>>> authorizers, and such. Make sure what you are doing has an open source
>>> equivalent.
>>> * The DSE system keyspaces use custom replication strategies. Convert
>>> these to NTS before upgrade.
>>> * Otherwise, follow the same processes you would do before an upgrade
>>> (repair, snapshot, etc)
>>>
>>> Upgrade
>>> * The easy part is just replacing the binaries as you would in normal
>>> upgrade. Drain and stop the existing node first. You can also do this same
>>> process in a rolling fashion to maintain availability. In our case, we were
>>> doing an in-place upgrade and reusing the same IPs
>>> * DSE unfortunately creates a custom column in a system table that
>>> requires you to remove one (or more) system tables (peers?) to be able to
>>> start the node. You delete these system tables by removing the sstables on
>>> disk while the node is down. This is a bit of a headache if using vnodes.
>>> As we are using vnodes, it required us to manually specify num tokens, and
>>> the