Re: Datafile Corruption

2019-08-08 Thread Philip Ó Condúin
Hi Jon,

Good question, I'm not sure if we're using NVMe; I don't see /dev/nvme, but
we could still be using it.
We're using *Cisco UCS C220 M4 SFF* so I'm just going to check the spec.

Our kernel is the following; we're using Red Hat, so I'm told we can't
upgrade the version until the next major release anyway.
root@cass 0 17:32:28 ~ # uname -r
3.10.0-957.5.1.el7.x86_64
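
For reference, a quick way to confirm whether any NVMe devices are present (commands are purely illustrative):

ls /dev/nvme* 2>/dev/null || echo "no NVMe block devices found"
lsblk -d -o NAME,ROTA,MODEL    # ROTA=0 means non-rotational (SSD/NVMe), 1 means spinning disk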

Cheers,
Phil

On Thu, 8 Aug 2019 at 17:35, Jon Haddad  wrote:

> Any chance you're using NVMe with an older Linux kernel?  I've seen a
> *lot* of filesystem errors from using older CentOS versions.  You'll want
> to be using a version > 4.15.
>
> On Thu, Aug 8, 2019 at 9:31 AM Philip Ó Condúin  wrote:
> [quoted message snipped]

Re: Datafile Corruption

2019-08-08 Thread Jon Haddad
Any chance you're using NVMe with an older Linux kernel?  I've seen a *lot*
of filesystem errors from using older CentOS versions.  You'll want to be
using a version > 4.15.

On Thu, Aug 8, 2019 at 9:31 AM Philip Ó Condúin 
wrote:

> [quoted message snipped]

Re: Datafile Corruption

2019-08-08 Thread Philip Ó Condúin
*@Jeff *- If it were hardware, that would explain it all, but do you think
it's possible for every server in the cluster to have a hardware issue?
The data is sensitive and the customer would lose their mind if I sent it
off-site, which is a pity because I could really do with the help.
The corruption is occurring irregularly on every server, instance and
column family in the cluster.  Out of 72 instances, we are getting maybe 10
corrupt files per day.
We are using vnodes (256) and it is happening in both DCs.

*@Asad *- internode compression is set to ALL on every server.  I have
checked the packet counters for the private interconnect and I can't see any
dropped packets; there are dropped packets on other interfaces, but not on
the private ones.  I will get the network team to double-check this.
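
For what it's worth, a quick way to eyeball per-interface drop/error counters (the interface name below is just a placeholder) might be:

ip -s link show bond1
ethtool -S bond1 | grep -iE 'drop|err|crc'    # NIC-level counters, if ethtool supports the interface
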
The corruption is only on the application schema; we are not getting
corruption on any system or Cassandra-internal keyspaces.  Corruption is
happening in both DCs.  We are getting corruption for the one application
schema we have, across all tables in the keyspace; it's not limited to one
table.
I'm not sure why the app team decided not to use the default compression; I
must ask them.
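
As a sanity check on the compression side, something like this (keyspace/table names are placeholders) would show what compressor each table is actually configured with:

cqlsh 127.0.0.1 -e "DESCRIBE TABLE app_keyspace.app_table;" | grep -A 3 compression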



I have been checking /var/log/messages today, going back a few weeks,
and can see a serious number of broken pipe errors across all servers and
instances.
Here is a snippet from one server; most of the broken pipe errors look similar:

Jul  9 03:00:08  cassandra: INFO  02:00:08 Writing
Memtable-sstable_activity@1126262628(43.631KiB serialized bytes, 18072 ops,
0%/0% of on/off-heap limit)
Jul  9 03:00:13  kernel: fnic_handle_fip_timer: 8 callbacks suppressed
Jul  9 03:00:19  kernel: fnic_handle_fip_timer: 8 callbacks suppressed
Jul  9 03:00:22  cassandra: ERROR 02:00:22 Got an IOException during write!
Jul  9 03:00:22  cassandra: java.io.IOException: Broken pipe
Jul  9 03:00:22  cassandra: at sun.nio.ch.FileDispatcherImpl.write0(Native
Method) ~[na:1.8.0_172]
Jul  9 03:00:22  cassandra: at
sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_172]
Jul  9 03:00:22  cassandra: at
sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_172]
Jul  9 03:00:22  cassandra: at sun.nio.ch.IOUtil.write(IOUtil.java:65)
~[na:1.8.0_172]
Jul  9 03:00:22  cassandra: at
sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
~[na:1.8.0_172]
Jul  9 03:00:22  cassandra: at
org.apache.thrift.transport.TNonblockingSocket.write(TNonblockingSocket.java:165)
~[libthrift-0.9.2.jar:0.9.2]
Jul  9 03:00:22  cassandra: at
com.thinkaurelius.thrift.util.mem.Buffer.writeTo(Buffer.java:104)
~[thrift-server-0.3.7.jar:na]
Jul  9 03:00:22  cassandra: at
com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.streamTo(FastMemoryOutputTransport.java:112)
~[thrift-server-0.3.7.jar:na]
Jul  9 03:00:22  cassandra: at
com.thinkaurelius.thrift.Message.write(Message.java:222)
~[thrift-server-0.3.7.jar:na]
Jul  9 03:00:22  cassandra: at
com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.handleWrite(TDisruptorServer.java:598)
[thrift-server-0.3.7.jar:na]
Jul  9 03:00:22  cassandra: at
com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.processKey(TDisruptorServer.java:569)
[thrift-server-0.3.7.jar:na]
Jul  9 03:00:22  cassandra: at
com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.select(TDisruptorServer.java:423)
[thrift-server-0.3.7.jar:na]
Jul  9 03:00:22  cassandra: at
com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.run(TDisruptorServer.java:383)
[thrift-server-0.3.7.jar:na]
Jul  9 03:00:25  kernel: fnic_handle_fip_timer: 8 callbacks suppressed
Jul  9 03:00:30  cassandra: ERROR 02:00:30 Got an IOException during write!
Jul  9 03:00:30  cassandra: java.io.IOException: Broken pipe
Jul  9 03:00:30  cassandra: at sun.nio.ch.FileDispatcherImpl.write0(Native
Method) ~[na:1.8.0_172]
Jul  9 03:00:30  cassandra: at
sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_172]
Jul  9 03:00:30  cassandra: at
sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_172]
Jul  9 03:00:30  cassandra: at sun.nio.ch.IOUtil.write(IOUtil.java:65)
~[na:1.8.0_172]
Jul  9 03:00:30  cassandra: at
sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
~[na:1.8.0_172]
Jul  9 03:00:30  cassandra: at
org.apache.thrift.transport.TNonblockingSocket.write(TNonblockingSocket.java:165)
~[libthrift-0.9.2.jar:0.9.2]
Jul  9 03:00:30  cassandra: at
com.thinkaurelius.thrift.util.mem.Buffer.writeTo(Buffer.java:104)
~[thrift-server-0.3.7.jar:na]
Jul  9 03:00:30  cassandra: at
com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.streamTo(FastMemoryOutputTransport.java:112)
~[thrift-server-0.3.7.jar:na]
Jul  9 03:00:30  cassandra: at
com.thinkaurelius.thrift.Message.write(Message.java:222)
~[thrift-server-0.3.7.jar:na]
Jul  9 03:00:30  cassandra: at
com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.handleWrite(TDisruptorServer.java:598)
[thrift-server-0.3.7.jar:na]
Jul  9 03:00:30  cassandra: at

RE: Datafile Corruption

2019-08-08 Thread ZAIDI, ASAD A
Did you check that packets are NOT being dropped on the network interfaces the
Cassandra instances are using (ifconfig -a)? Internode compression is set for
all endpoints, so maybe the network is playing a role here?
Is this corruption limited to certain keyspaces/tables or DCs, or is it
widespread? From the log snippet you shared it looked like only a specific
keyspace/table is affected; is that correct?
When you remove a corrupted sstable of a certain table, I guess you verify all
nodes for corrupted sstables of the same table (maybe with the nodetool scrub
tool) so as to limit the spread of the corruption, right?
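
For reference, a rough sketch of that check on one node (keyspace/table names are placeholders; scrub normally snapshots the sstables before rewriting them):

nodetool scrub my_keyspace my_table    # online scrub on the local node
sstablescrub my_keyspace my_table      # offline alternative, run while the instance is stopped
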
Just curious: you're not using the LZ4/default compressor for all tables;
there must be some reason for that.



From: Philip Ó Condúin [mailto:philipocond...@gmail.com]
Sent: Thursday, August 08, 2019 6:20 AM
To: user@cassandra.apache.org
Subject: Re: Datafile Corruption

[quoted message snipped]

Re: Datafile Corruption

2019-08-08 Thread Jeff Jirsa
The corrupt block exception from the compressor in 2.1/2.2 is something I don’t 
recall ever being attributed to anything other than bad hardware, so that seems 
by far the most likely option. 

The corruption that the compressor is catching says the checksum written 
immediately after the compressed block doesn’t match when read back. 

How sensitive is this data? Would you be able to send one of the corrupt data 
files to a developer to check? Or is it something like PII you can’t share? 

Have you found this corruption on every single instance? Are you using single 
tokens or vnodes? Is it happening in both dcs? 
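
One quick way to answer the "every instance" question, assuming logs in the usual places (the path and unit name below are just examples, the unit name borrowed from the journalctl snippet elsewhere in this thread):

grep -c 'CorruptSSTableException' /var/log/cassandra/system.log
journalctl -u cassmeta-cass_b.service --since '2019-07-01' | grep -c 'CorruptSSTableException'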




> On Aug 8, 2019, at 4:20 AM, Philip Ó Condúin  wrote:
>
> [quoted message snipped]

Re: Rebuilding a node without clients hitting it

2019-08-08 Thread Cyril Scetbon
Thanks Jeff, that’s the type of parameter I was looking for but I missed it 
when I first read it. We’ll ensure that dynamic snitch is enabled.
—
Cyril Scetbon

> On Aug 5, 2019, at 11:23 PM, Jeff Jirsa  wrote:
> 
> You can make THAT less likely with some snitch trickery (setting the badness 
> for the rebuilding host) via jmx 
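
A minimal way to confirm the dynamic snitch settings, assuming a default config location, might be:

grep -E '^dynamic_snitch' /etc/cassandra/cassandra.yaml
# expect something like:
#   dynamic_snitch: true
#   dynamic_snitch_badness_threshold: 0.1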



Re: Datafile Corruption

2019-08-08 Thread Philip Ó Condúin
Hi All,

Thank you so much for the replies.

Currently, I have the following list that can potentially cause some sort
of corruption in a Cassandra cluster.


   - Sudden Power cut  -  *We have had no power cuts in the datacenters*
   - Network Issues - *no network issues from what I can tell*
   - Disk full - *I don't think this is an issue for us, see disks below.*
   - An issue in the Cassandra version, like CASSANDRA-13752 - *couldn't find
   any Jira issues similar to ours.*
   - Bit Flips - *we have compression enabled so I don't think this should
   be an issue (a quick checksum check is sketched after this list).*
   - Repair during upgrade has caused corruption too - *we have not
   upgraded*
   - Dropping and adding columns with the same name but a different type - *I
   will need to ask the apps team how they are using the database.*

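As a side note on the bit-flip item above, one way to surface checksum mismatches proactively rather than waiting for repair to trip over them, assuming the verify tooling is available in this 2.2.13 build, might be:

nodetool verify my_keyspace my_table     # checks sstable checksums without rewriting anything
sstableverify my_keyspace my_table       # offline equivalent, run against a stopped instance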


OK, let me try to explain the issue we are having; I am under a lot of
pressure from above to get this fixed and I can't figure it out.

This is a PRE-PROD environment.

   - 2 datacenters.
   - 9 physical servers in each datacenter
   - 4 Cassandra instances on each server
   - 72 Cassandra instances across the 2 data centres, 36 in site A, 36 in
   site B.


We also have 2 Reaper Nodes we use for repair.  One reaper node in each
datacenter each running with its own Cassandra back end in a cluster
together.

OS Details [Red Hat Linux]
cass_a@x 0 10:53:01 ~ $ uname -a
Linux x 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018
x86_64 x86_64 x86_64 GNU/Linux

cass_a@x 0 10:57:31 ~ $ cat /etc/*release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"

Storage Layout
cass_a@xx 0 10:46:28 ~ $ df -h
Filesystem                Size  Used Avail Use% Mounted on
/dev/mapper/vg01-lv_root   20G  2.2G   18G  11% /
devtmpfs                   63G     0   63G   0% /dev
tmpfs                      63G     0   63G   0% /dev/shm
tmpfs                      63G  4.1G   59G   7% /run
tmpfs                      63G     0   63G   0% /sys/fs/cgroup
>> 4 cassandra instances
/dev/sdd                  1.5T  802G  688G  54% /data/ssd4
/dev/sda                  1.5T  798G  692G  54% /data/ssd1
/dev/sdb                  1.5T  681G  810G  46% /data/ssd2
/dev/sdc                  1.5T  558G  932G  38% /data/ssd3

Cassandra load is about 200GB and the rest of the space is snapshots
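
If it's useful, a quick way to see how much of that space the snapshots are holding (paths are illustrative) might be:

find /data/ssd* -type d -name snapshots -exec du -sh {} + 2>/dev/null | sort -h | tail
nodetool listsnapshots    # per-instance view, if this version ships the command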

CPU
cass_a@x 127 10:58:47 ~ $ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
CPU(s):                64
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2

*Description of problem:*
During repair of the cluster, we are seeing multiple corruptions in the log
files on a lot of instances.  There seems to be no pattern to the
corruption.  It seems that the repair job is finding all the corrupted
files for us.  The repair will hang on the node where the corrupted file is
found.  To fix this we remove/rename the datafile and bounce the Cassandra
instance.  Our hardware/OS team have stated there is no problem on their
side.  I do not believe it is the repair that is causing the corruption.
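
For anyone following along, a rough sketch of that remove-and-bounce workaround (unit name, paths and the sstable generation are placeholders taken loosely from the examples in this thread):

systemctl stop cassmeta-cass_b.service
mkdir -p /x/ssd2/quarantine
mv /x/ssd2/data/KeyspaceMetadata/<table-dir>/*-1234-* /x/ssd2/quarantine/   # move every component of the affected sstable generation, not just the Data.db
systemctl start cassmeta-cass_b.service
nodetool repair KeyspaceMetadata    # re-stream the removed data onto this node from its replicas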

We have maintenance scripts that run every night, running compactions and
creating snapshots.  I decided to turn these off, fix any corruptions we
currently had, and run a major compaction on all nodes; once this was done we
had a "clean" cluster and we left it alone for a few days.  After that we
noticed one new corruption in the cluster.  This datafile was created after I
turned off the maintenance scripts, so my theory that the scripts were causing
the issue was wrong.  We then kicked off another repair and started to find
more corrupt files created after the maintenance scripts were turned off.


So let me give you an example of a corrupted file, and maybe someone can
work through it with me?

When this corrupted file was reported in the log it looks like it was the
repair that found it.

$ journalctl -u cassmeta-cass_b.service --since "2019-08-07 22:25:00"
--until "2019-08-07 22:45:00"

Aug 07 22:30:33 cassandra[34611]: INFO  21:30:33 Writing
Memtable-compactions_in_progress@830377457(0.008KiB serialized bytes, 1
ops, 0%/0% of on/off-heap limit)
Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Failed creating a merkle
tree for [repair #9587a200-b95a-11e9-8920-9f72868b8375 on
KeyspaceMetadata/x, (-1476350953672479093,-1474461
Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Exception in thread
Thread[ValidationExecutor:825,1,main]
Aug 07 22:30:33 cassandra[34611]: org.apache.cassandra.io.FSReadError:
org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted:
/x/ssd2/data/KeyspaceMetadata/x-1e453cb0
Aug 07 22:30:33 cassandra[34611]: at
org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:365)
~[apache-cassandra-2.2.13.jar:2.2.13]
Aug 07 22:30:33 cassandra[34611]: at
org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:361)
~[apache-cassandra-2.2.13.jar:2.2.13]
Aug 07 22:30:33 cassandra[34611]: at

Re: Repairs/compactions on tables with solr indexes

2019-08-08 Thread Dinesh Joshi
Hi Ayub,

DSE is a DataStax product and this is the Apache Cassandra mailing list. Could 
you reach out to DataStax?

Dinesh

> On Aug 7, 2019, at 11:17 PM, Ayub M  wrote:
> [quoted message snipped]





RE: Backups in Cassandra

2019-08-08 Thread Rhys.Campbell
Just to add to this…

We do snapshot, incremental and commitlog backups along with schema and config 
backups. All is copied to S3 although we do keep a small number of snapshots / 
inc / commitlog on the local node in the rare event they are needed.

We have written some Ansible to restore the whole cluster. If your cluster is 
beyond a trivial number of nodes then some type of manageable automation is 
required.
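
As a rough illustration of the snapshot-to-S3 step (bucket name, tag and paths are made up):

nodetool snapshot -t nightly-$(date +%F) my_keyspace
aws s3 sync /var/lib/cassandra/data/my_keyspace s3://my-backup-bucket/$(hostname)/ --exclude '*' --include '*/snapshots/nightly-*/*'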

Cheers,

R

From: cclive1601你 
Sent: 08 August 2019 04:30
To: user@cassandra.apache.org
Subject: Re: Backups in Cassandra

We have also built backup and restore for Apache Cassandra. The backup process
is:
1. do incremental backups of flushed sstables, and incremental backups of the
commitlog;
2. take a snapshot of the cluster periodically; the meta info (token and table
info) also needs to be backed up;
3. on exceptions like a node joining, moving (if any) or leaving, refresh the
meta info backup.

Restore:
1. use the incremental sstables to reduce the number of commitlogs to restore,
since log replay costs a lot of time;
2. all sstables can be bulk-loaded, either with nodetool refresh (my restore
cluster then needs the same number of nodes as the backup) or with
sstableloader (slower than refresh, though the loader does not need the
node count to match the backup).
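
To make that concrete, a minimal sketch of the two restore paths (keyspace/table names, host and paths are placeholders; incremental backups assume incremental_backups: true in cassandra.yaml):

nodetool refresh my_keyspace my_table                     # path A: copy the sstables into the table directory on a matching topology, then load them without a restart
sstableloader -d 10.0.0.1 /path/to/my_keyspace/my_table   # path B: stream into a cluster of any topology; slower than refresh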

Connor Lin (linba...@gmail.com) wrote on Thu, 8 Aug 2019 at 10:17 AM:
Hi Krish,

It is recommended to have backups. Although I haven't practiced it myself, I
find this might be helpful:
https://thelastpickle.com/blog/2018/04/03/cassandra-backup-and-restore-aws-ebs.html


Sincerely yours,

Connor Lin


On Thu, Aug 8, 2019 at 5:47 AM Krish Donald (gotomyp...@gmail.com) wrote:
Hi Folks,

First question: do you take backups of your Cassandra cluster?
If the answer is yes, then these questions follow:
1. How do you take the backup?
  1.1) Is it only a snapshot?
  1.2) We are on AWS with a very large cluster, around 51 nodes with 1TB of
data on each node.
  1.3) Do you take the backup and move it to S3?

2. If you take backups, how has the restore process worked for you?

Thanks
Krish


--
you are the apple of my eye !


Repairs/compactions on tables with solr indexes

2019-08-08 Thread Ayub M
Hello, we are using a DSE Search workload, with Search and Cassandra running
on the same nodes/JVM.

1. When repairs are run, do they initiate rebuilds of the Solr indexes? Do
they rebuild only when some data is actually repaired?
2. How about compactions, do they trigger any search index rebuilds? I guess
not, since the data is not changing, but I'm not sure. Or maybe when
compaction cleans tombstones; how does Solr handle deleted data?
3. Is it generally a good idea to run both Cassandra and Search on the same
node/JVM? Are there any potential issues that could arise from such a setup,
or is it a good way to set things up since the data is colocated on the same
nodes?

Regards,
Ayub