Re: [EXTERNAL] Apache Cassandra upgrade path
Hi,

Here are some upgrade options:

- Standard rolling upgrade: node by node.
- Fast rolling upgrade: rack by rack. If clients use CL=LOCAL_ONE then it's OK as long as one rack is UP. For higher CLs it's possible as long as you have no more than one replica per rack, e.g. CL=LOCAL_QUORUM with RF=3 and 2 racks is a *BAD* setup, but RF=3 with 3 racks is OK.
- Double writes to another cluster: easy for short-TTL data (e.g. a TTL of a few days). When possible, this option is not only the safest but also allows major changes (e.g. the partitioner for legacy clusters). And of course it's a good opportunity to use new cloud instance types, change the number of vnodes, etc.

As Sean said, it's not possible for C* servers to stream data to each other when their streaming versions differ. There is no workaround. You can check that here: https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/streaming/messages/StreamMessage.java#L35 The community plans to work on this limitation to make streaming possible between different major versions starting from C* 4.x.

Last but not least, don't forget to take snapshots (+ backup) and to prepare a rollback script (see the snapshot sketch at the end of this thread). The system keyspace will be automatically snapshotted by Cassandra when the new version starts: the rollback script should be based on that snapshot for the system part. New data (both commitlog and SSTables flushed in the 3.11 format) will be lost even with such a script, but it's useful to test it and to have it ready for D-day. (See also the snapshot_before_compaction setting, but it might be useless depending on your procedure.)

Romain

On Friday, July 26, 2019 at 23:51:52 UTC+2, Jai Bheemsen Rao Dhanwada wrote:

Yes, correct, it doesn't work for the servers. Trying to see if anyone has a workaround for this issue? (Maybe changing the protocol version during the upgrade?)

On Fri, Jul 26, 2019 at 1:11 PM Durity, Sean R wrote:

This would handle the client protocol, but not the streaming protocol between nodes.

Sean Durity – Staff Systems Engineer, Cassandra

From: Alok Dwivedi
Sent: Friday, July 26, 2019 3:21 PM
To: user@cassandra.apache.org
Subject: Re: [EXTERNAL] Apache Cassandra upgrade path

Hi Sean,

The recommended practice for upgrades is to explicitly control the protocol version in your application during the upgrade process. The protocol version is negotiated on the first connection, and by chance the driver can talk to an already-upgraded node first, which means it will negotiate a higher version that is not compatible with the nodes still on the lower Cassandra version. So initially you set a lower version that is the lowest common denominator for the mixed-mode cluster, then remove the explicit setting once the upgrade has completed.

Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")
        .withProtocolVersion(ProtocolVersion.V2)
        .build();

Refer here for more information if using the Java driver: https://docs.datastax.com/en/developer/java-driver/3.7/manual/native_protocol/#protocol-version-with-mixed-clusters The same applies to drivers in other languages.

Thanks,
Alok Dwivedi
Senior Consultant
https://www.instaclustr.com/

On Fri, 26 Jul 2019 at 20:03, Jai Bheemsen Rao Dhanwada wrote:

Thanks Sean. In my use case all my clusters are multi-DC, and I am trying my best to upgrade ASAP; however, there is a chance of a node dying mid-upgrade since all machines are VMs. Also, my keyspaces are not uniform across DCs: some are replicated to all DCs and some to just one DC, so I am worried there.
Is there a way to override the protocol version until the upgrade is done and then change it back once the upgrade is completed?

On Fri, Jul 26, 2019 at 11:42 AM Durity, Sean R wrote:

What you have seen is totally expected. You can't stream between different major versions of Cassandra. Get the upgrade done, then worry about any down hardware. If you are using DCs, upgrade one DC at a time, so that there is an available environment in case of any disasters. My advice, though, is to get through the rolling upgrade process as quickly as possible. Don't stay in a mixed state very long. The cluster will function fine in a mixed state - except for those streaming operations: no repairs, no bootstraps.

Sean Durity – Staff Systems Engineer, Cassandra

From: Jai Bheemsen Rao Dhanwada
Sent: Friday, July 26, 2019 2:24 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Apache Cassandra upgrade path

Hello,

I am trying to upgrade Apache Cassandra from 2.1.16 to 3.11.3. The regular rolling upgrade process works fine without any issues. However, I am running into an issue where, if a node with the older version dies (hardware failure) and a new node comes up and tries to bootstrap, it fails. I tried two combinations:

1. Joining the replacement node with the 2.1.16 version of Cassandra. In this case nodes
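[Editor's sketch of the snapshot advice above, with assumptions: nodetool on the PATH, default data directories, and a hypothetical keyspace name "my_ks". A real backup also needs the snapshot copied off-node, since snapshots are hard links on the same disk.]

#!/usr/bin/env bash
# Pre-upgrade snapshot sketch (keyspace name "my_ks" is a placeholder).
set -euo pipefail

TAG="pre-upgrade-$(date +%Y%m%d-%H%M)"

nodetool flush                      # flush memtables to SSTables first
nodetool snapshot -t "$TAG" my_ks   # hard-links live SSTables under snapshots/$TAG

# List what was created; copy it off-node for a real backup -
# hard links alone don't survive a disk failure.
nodetool listsnapshots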
Re: Cluster configuration issue
128 GB RAM -> that's good news, you have plenty of room to increase the Cassandra heap size. You can start with, let's say, 12 GB in jvm.options, or 24 GB if you use G1 GC (see the sketch after this thread). Let us know if the node starts and if DEBUG/TRACE is useful. You can also try the "strace -f -p ..." command to see if the process is doing something when it's stuck, but Cassandra has a lot of threads...

On Friday, November 9, 2018 at 19:13:51 UTC+1, Francesco Messere wrote:

Hi Romain,

Yes, I modified the .yaml after the issue. The problem is this: if I restart a node in DC-FIRENZE, it does not start up. I tried first one node and then the second one, with the same results.

These are the server resources.

Memory (128 GB):

              total        used        free      shared  buff/cache   available
Mem:      131741388    13858952    72649704      124584    45232732   116825040
Swap:      16777212           0    16777212

CPU (lscpu):

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    1
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               1213.523
CPU max MHz:           2900.0000
CPU min MHz:           1200.0000
BogoMIPS:              4399.97
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23

There is nothing in the server logs. On Monday I will activate debug and try again to start up the Cassandra node.

Thanks,
Francesco Messere

On 09/11/2018 18:51, Romain Hardouin wrote:

Ok, so all nodes in Firenze are down. I thought only one was KO. After a first look at cassandra.yaml, the only issue I saw is the seeds: the line you commented out was correct (one seed per DC). But I guess you modified it after the issue. You should fix the swap issue. Also, can you add more heap to Cassandra? By the way, what are the specs of the servers (RAM, CPU, etc.)? Did you check the Linux system log? And Cassandra's debug.log? You can even enable TRACE logs in logback.xml ( https://github.com/apache/cassandra/blob/cassandra-3.11.3/conf/logback.xml#L100 ) then try to restart a node in Firenze to see where it blocks, but if it's due to low resources, a hardware issue, or swap it won't be useful. Let's give it a try anyway.

On Friday, November 9, 2018 at 18:20:57 UTC+1, Francesco Messere wrote:

Hi Romain,

You are right, it's not possible to work in these towns; fortunately I live in Pisa :-). I saw the errors and corrected them, except the swap one. The process is stuck; I let it run for one day without results.

This is the output of nodetool status from the nodes that are up and running (DC-MILANO):

/conf/CASSANDRA_SHARE_PROD_conf/bin/cassandra-3.11.3/bin/nodetool -h 192.168.71.210 -p 17052 status

Datacenter: DC-FIRENZE
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load      Tokens  Owns (effective)  Host ID                               Rack
DN  192.168.204.175  ?         256     100.0%            a3c8626e-afab-413e-a153-cccfd0b26d06  RACK1
DN  192.168.204.176  ?         256     100.0%            67738ca8-f1f5-46a9-9d23-490bbebcffaa  RACK1

Datacenter: DC-MILANO
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load      Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.71.210   5.95 GiB  256     100.0%            210f0cdd-abee-4fc0-abd3-ecdab618606e  RACK1
UN  192.168.71.211   5.83 GiB  256     100.0%            96c30edd-4e6c-4952-82d4-dfdf67f6a06f  RACK1

and this is the describecluster command output:

/conf/CASSANDRA_SHARE_PROD_conf/bin/cassandra-3.11.3/bin/nodetool -h 192.168.71.210 -p 17052 describecluster

Cluster Information:
    Name: CASSANDRA_3
    Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
    DynamicEndPointSnitch: enabled
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        6bdd4617-658e-375e-8503-7158df833495: [192.168.71.210, 192.168.71.211]
        UNREACHABLE: [192.168.204.175, 192.168.204.176]

Attached is the cassandra.yaml file.

Regards,
Francesco Messere

On 09/11/2018 17:48, Romain Hardouin wrote:

Hi Francesco, it can't work! Milano and Firenze, oh boy, Calcio vs Calcio Storico X-D Ok, more seriously, "Updating t
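[Editor's sketch of the heap advice above: these are the standard fixed-heap lines in conf/jvm.options; 12 GB is the figure suggested in the thread, not a universal recommendation.]

# jvm.options - set min and max heap to the same value to avoid resizing pauses
-Xms12G
-Xmx12G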
Re: Cluster configuration issue
Ok, so all nodes in Firenze are down. I thought only one was KO. After a first look at cassandra.yaml, the only issue I saw is the seeds: the line you commented out was correct (one seed per DC). But I guess you modified it after the issue. You should fix the swap issue. Also, can you add more heap to Cassandra? By the way, what are the specs of the servers (RAM, CPU, etc.)? Did you check the Linux system log? And Cassandra's debug.log? You can even enable TRACE logs in logback.xml ( https://github.com/apache/cassandra/blob/cassandra-3.11.3/conf/logback.xml#L100 ) then try to restart a node in Firenze to see where it blocks, but if it's due to low resources, a hardware issue, or swap it won't be useful. Let's give it a try anyway.

On Friday, November 9, 2018 at 18:20:57 UTC+1, Francesco Messere wrote:

Hi Romain,

You are right, it's not possible to work in these towns; fortunately I live in Pisa :-). I saw the errors and corrected them, except the swap one. The process is stuck; I let it run for one day without results.

This is the output of nodetool status from the nodes that are up and running (DC-MILANO):

/conf/CASSANDRA_SHARE_PROD_conf/bin/cassandra-3.11.3/bin/nodetool -h 192.168.71.210 -p 17052 status

Datacenter: DC-FIRENZE
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load      Tokens  Owns (effective)  Host ID                               Rack
DN  192.168.204.175  ?         256     100.0%            a3c8626e-afab-413e-a153-cccfd0b26d06  RACK1
DN  192.168.204.176  ?         256     100.0%            67738ca8-f1f5-46a9-9d23-490bbebcffaa  RACK1

Datacenter: DC-MILANO
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load      Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.71.210   5.95 GiB  256     100.0%            210f0cdd-abee-4fc0-abd3-ecdab618606e  RACK1
UN  192.168.71.211   5.83 GiB  256     100.0%            96c30edd-4e6c-4952-82d4-dfdf67f6a06f  RACK1

and this is the describecluster command output:

/conf/CASSANDRA_SHARE_PROD_conf/bin/cassandra-3.11.3/bin/nodetool -h 192.168.71.210 -p 17052 describecluster

Cluster Information:
    Name: CASSANDRA_3
    Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
    DynamicEndPointSnitch: enabled
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        6bdd4617-658e-375e-8503-7158df833495: [192.168.71.210, 192.168.71.211]
        UNREACHABLE: [192.168.204.175, 192.168.204.176]

Attached is the cassandra.yaml file.

Regards,
Francesco Messere

On 09/11/2018 17:48, Romain Hardouin wrote:

Hi Francesco, it can't work! Milano and Firenze, oh boy, Calcio vs Calcio Storico X-D Ok, more seriously, "Updating topology ..." is not a problem. But you have low resources and system misconfiguration:

- Small heap size: 3.867 GiB. From the logs: "Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out, especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root."
- System settings: swap should be disabled, bad system limits, etc. From the logs: "Cassandra server running in degraded mode. Is swap disabled? : false, Address space adequate? : true, nofile limit adequate? : true, nproc limit adequate? : false" For system tuning see https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

You said "Cassandra node did not start up". What is the problem exactly? Is the process stuck, or does it die? What do you see with "nodetool status" on the nodes that are up and running? By the way, cassandra-topology.properties is not required with GossipingPropertyFileSnitch (unless you are migrating from PropertyFileSnitch).
Best,
Romain

On Friday, November 9, 2018 at 11:34:16 UTC+1, Francesco Messere wrote:

Hi all,

I have a problem with a distributed cluster configuration. This is a test environment:

- Cassandra version is 3.11.3
- 2 sites, Milan and Florence
- 2 servers on each site
- 1 common cluster name and 2 DCs

The first installation and startup went OK; all the nodes were present in the cluster. The issue started after a server reboot in the FLORENCE DC. The Cassandra node did not start up, and the last line written in system.log is:

INFO [ScheduledTasks:1] 2018-11-09 10:36:54,306 TokenMetadata.java:498 - Updating topology for all endpoints that have changed

The only way I found to correct this is to clean up the node, remove it from the cluster, and re-join it. How can I solve it?

Here are the configuration files:

less cassandra-topology.properties
# Unless required by applicable law or agreed to in writing, software
# distributed under the Licens
Re: Cluster configuration issue
Hi Francesco, it can't work! Milano and Firenze, oh boy, Calcio vs Calcio Storico X-D Ok, more seriously, "Updating topology ..." is not a problem. But you have low resources and system misconfiguration:

- Small heap size: 3.867 GiB. From the logs: "Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out, especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root."
- System settings: swap should be disabled, bad system limits, etc. From the logs: "Cassandra server running in degraded mode. Is swap disabled? : false, Address space adequate? : true, nofile limit adequate? : true, nproc limit adequate? : false" For system tuning see https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

You said "Cassandra node did not start up". What is the problem exactly? Is the process stuck, or does it die? What do you see with "nodetool status" on the nodes that are up and running? By the way, cassandra-topology.properties is not required with GossipingPropertyFileSnitch (unless you are migrating from PropertyFileSnitch).

Best,
Romain

On Friday, November 9, 2018 at 11:34:16 UTC+1, Francesco Messere wrote:

Hi all,

I have a problem with a distributed cluster configuration. This is a test environment:

- Cassandra version is 3.11.3
- 2 sites, Milan and Florence
- 2 servers on each site
- 1 common cluster name and 2 DCs

The first installation and startup went OK; all the nodes were present in the cluster. The issue started after a server reboot in the FLORENCE DC. The Cassandra node did not start up, and the last line written in system.log is:

INFO [ScheduledTasks:1] 2018-11-09 10:36:54,306 TokenMetadata.java:498 - Updating topology for all endpoints that have changed

The only way I found to correct this is to clean up the node, remove it from the cluster, and re-join it. How can I solve it?

Here are the configuration files:

less cassandra-topology.properties

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Cassandra Node IP=Data Center:Rack
192.168.204.175=DC-FIRENZE:RACK1
192.168.204.176=DC-FIRENZE:RACK1
192.168.71.210=DC-MILANO:RACK1
192.168.71.211=DC-MILANO:RACK1

# default for unknown nodes
default=DC-FIRENZE:r1

# Native IPv6 is supported, however you must escape the colon in the IPv6 Address
# Also be sure to comment out JVM_OPTS="$JVM_OPTS -Djava.net.preferIPv4Stack=true"
# in cassandra-env.sh
#fe80\:0\:0\:0\:202\:b3ff\:fe1e\:8329=DC1:RAC3

cassandra-rackdc.properties

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# These properties are used with GossipingPropertyFileSnitch and will
# indicate the rack and dc for this node
dc=DC-FIRENZE
rack=RACK1

# Add a suffix to a datacenter name. Used by the Ec2Snitch and Ec2MultiRegionSnitch
# to append a string to the EC2 region name.
#dc_suffix=

# Uncomment the following line to make this snitch prefer the internal ip when possible, as the Ec2MultiRegionSnitch does.
# prefer_local=true

Attached is the system.log file.

Regards,
Francesco Messere
Re: Connections info
Note that one "user"/application can open multiple connections. The number of Thrift connections is also available in JMX if you run a legacy application. Max is right; regarding where they come from, you can use lsof. For instance on AWS - but you can adapt it to your needs:

IP=...
REGION=...
ssh $IP "sudo lsof -i -n | grep 9042 | grep -Po '(?<=->)[^:]+' | sort -u" | xargs -P 20 -I '{}' aws --output json --region $REGION ec2 describe-instances --filter Name=private-ip-address,Values={} --query 'Reservations[].Instances[*].Tags[*]' | jq '.[0][0] | map(select(.Key == "Name")) | .[0].Value' | sort | uniq -c

You'll get the number of client instances grouped by AWS name:

      3 "name_ABC"
     15 "name_example"
     37 "name_test"

Best,
Romain

On Friday, October 5, 2018 at 06:28:51 UTC+2, Max C. wrote:

Looks like the number of connections is available in JMX as:

org.apache.cassandra.metrics:type=Client,name=connectedNativeClients

http://cassandra.apache.org/doc/4.0/operating/metrics.html
"Number of clients connected to this nodes native protocol server"

As for where they're coming from - I'm not sure how to get that from JMX. Maybe you'll have to use "lsof" or something to get that.

- Max

On Oct 4, 2018, at 8:57 pm, Abdul Patel wrote:

Hi All,

Can we get the number of users connected to each node in Cassandra? Also, can we get which app node they are connecting from?
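[Editor's note: if the AWS name lookup isn't needed, a simpler hedged variant of the same lsof idea counts established native-protocol connections per client IP directly on a node. It assumes the default native port 9042 and IPv4 clients.]

# Count established native-protocol connections per client IP (run on a C* node).
sudo lsof -nP -iTCP:9042 -sTCP:ESTABLISHED \
  | awk 'NR > 1 { split($9, ends, "->"); split(ends[2], client, ":"); print client[1] }' \
  | sort | uniq -c | sort -rn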
Re: Simple upgrade for outdated cluster
Also, you didn't mention which C* 2.0 version you're using, but prior to upgrading to 2.1.20, make sure to use the latest 2.0 - or at least >= 2.0.7.

On Friday, August 3, 2018 at 13:03:39 UTC+2, Romain Hardouin wrote:

Hi Joel,

No, it's not supported. C* 2.0 can't stream data to C* 3.11. Do the 2.0 -> 2.1.20 upgrade, then you'll be able to upgrade to 3.11.3, i.e. 2.1.20 -> 3.11.3. You can upgrade to 3.0.17 as an intermediate step (I would do it), but don't upgrade to 2.2. Also make sure to read https://github.com/apache/cassandra/blob/cassandra-3.11/NEWS.txt carefully. It's a long read but it's important: there are lots of changes between all these versions.

Best,
Romain

On Friday, August 3, 2018 at 11:40:26 UTC+2, Joel Samuelsson wrote:

Hi,

We have a pretty outdated Cassandra cluster running version 2.0.x. Instead of doing step-by-step upgrades (2.0 -> 2.1, 2.1 -> 2.2, 2.2 -> 3.0, 3.0 -> 3.11.x), would it be possible to add new nodes with a recent version (say 3.11.x) and start decommissioning the old ones until we have a cluster with only 3.11.x?

Best regards,
Joel
Re: Simple upgrade for outdated cluster
Hi Joel,

No, it's not supported. C* 2.0 can't stream data to C* 3.11. Do the 2.0 -> 2.1.20 upgrade, then you'll be able to upgrade to 3.11.3, i.e. 2.1.20 -> 3.11.3. You can upgrade to 3.0.17 as an intermediate step (I would do it), but don't upgrade to 2.2. Also make sure to read https://github.com/apache/cassandra/blob/cassandra-3.11/NEWS.txt carefully. It's a long read but it's important: there are lots of changes between all these versions.

Best,
Romain

On Friday, August 3, 2018 at 11:40:26 UTC+2, Joel Samuelsson wrote:

Hi,

We have a pretty outdated Cassandra cluster running version 2.0.x. Instead of doing step-by-step upgrades (2.0 -> 2.1, 2.1 -> 2.2, 2.2 -> 3.0, 3.0 -> 3.11.x), would it be possible to add new nodes with a recent version (say 3.11.x) and start decommissioning the old ones until we have a cluster with only 3.11.x?

Best regards,
Joel
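[Editor's sketch of one hop of the rolling upgrade described above, one node at a time. The package and service names are assumptions (Debian-style packaging); adapt to your install method.]

#!/usr/bin/env bash
set -euo pipefail

nodetool drain                           # flush memtables and stop accepting writes
sudo service cassandra stop
sudo apt-get install cassandra=2.1.20    # hypothetical pinned version for the 2.0 -> 2.1 hop
sudo service cassandra start

# Once the node is back UN in `nodetool status`, rewrite SSTables to the new format:
nodetool upgradesstables

# Repeat on the next node; only start the next major-version hop when every
# node is upgraded and upgradesstables has completed cluster-wide.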
Re: Rocksandra blog post
Rocksandra is very interesting for key/value data models. Let's hope it will land in C* upstream in the near future thanks to pluggable storage. Thanks Dikang!

On Tuesday, March 6, 2018 at 10:06:16 UTC+1, Kyrylo Lebediev wrote:

Thanks for sharing, Dikang! Impressive results.

As you plugged in a different storage engine, it's interesting how you're dealing with compactions in Rocksandra. Is there still the concept of immutable SSTables + compaction strategies, or was it changed somehow?

Best,
Kyrill

From: Dikang Gu
Sent: Monday, March 5, 2018 8:26 PM
To: d...@cassandra.apache.org; cassandra
Subject: Rocksandra blog post

As some of you already know, the Instagram Cassandra team is working on the project to use RocksDB as Cassandra's storage engine. Today, we published a blog post about the work we have done, and more excitingly, we published the benchmark metrics in an AWS environment. Check it out here: https://engineering.instagram.com/open-sourcing-a-10x-reduction-in-apache-cassandra-tail-latency-d64f86b43589

Thanks,
Dikang
Re: What kind of Automation you have for Cassandra related operations on AWS ?
At Teads we use Terraform, Chef, Packer and Rundeck for our AWS infrastructure. I'll publish a blog post on Medium which talks about that; it's in the pipeline. Terraform is awesome.

Best,
Romain

On Friday, February 9, 2018 at 00:57:01 UTC+1, Ben Wood wrote:

Shameless plug of our (DC/OS) Apache Cassandra service: https://docs.mesosphere.com/services/cassandra/2.0.3-3.0.14. You must run DC/OS, but it will handle:

- Restarts
- Replacement of nodes
- Modification of configuration
- Backups and restores (to S3)

On Thu, Feb 8, 2018 at 3:46 PM, Krish Donald wrote:

Hi All,

What kind of automation do you have for Cassandra-related operations on AWS, like restacking, restarting the cluster, changing cassandra.yaml parameters, etc.?

Thanks

--
Ben Wood
Software Engineer - Data Agility
Mesosphere
Re: Heavy one-off writes best practices
We use Spark2Cassandra (this fork works with C* 3.0: https://github.com/leoromanovsky/Spark2Cassandra ). SSTables are streamed to Cassandra by Spark2Cassandra, so you need to open port 7000 accordingly. During the benchmark we used 25 EMR nodes, but in production we use fewer nodes to be more gentle with Cassandra.

Best,
Romain

On Tuesday, February 6, 2018 at 16:05:16 UTC+1, Julien Moumne wrote:

This does look like a very viable solution. Thanks. Could you give us some pointers/documentation on:

- How can we build such SSTables using Spark jobs? Maybe https://github.com/Netflix/sstable-adaptor ?
- How do we send these tables to Cassandra? Does a simple SCP work?
- What is the recommended size for SSTables for when the data does not fit in a single executor?

On 5 February 2018 at 18:40, Romain Hardouin wrote:

Hi Julien,

We have such a use case on some clusters. If you want to insert big batches at a fast pace, the only viable solution is to generate SSTables on the Spark side and stream them to C*. Last time we benchmarked such a job we achieved 1.3 million partitions inserted per second on a 3-node C* test cluster - which is impossible with regular inserts.

Best,
Romain

On Monday, February 5, 2018 at 03:54:09 UTC+1, kurt greaves wrote:

> Would you know if there is evidence that inserting skinny rows in sorted order (no batching) helps C*?

This won't have any effect, as each insert will be handled separately by the coordinator (or a different coordinator, even). Sorting is also very unlikely to help even if you did batch.

> Also, in the case of wide rows, is there evidence that sorting clustering keys within partition batches helps ease C*'s job?

No evidence; it seems very unlikely.

--
Julien MOUMNÉ
Software Engineering - Data Science
Mail: jmoumne@deezer.com
12 rue d'Athènes 75009 Paris - France
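[Editor's note on the "does a simple SCP work?" question above: no - the SSTables have to be *streamed* so each replica receives the token ranges it owns. A hedged sketch with the stock sstableloader tool; paths and hosts are placeholders.]

# Stream locally generated SSTables into a live cluster with sstableloader.
# The directory layout must be <parent>/<keyspace>/<table>/ with the SSTable
# files inside; my_ks, my_table and the contact hosts are placeholders.
sstableloader -d 10.0.0.1,10.0.0.2 /tmp/generated/my_ks/my_table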
Re: Heavy one-off writes best practices
Hi Julien,

We have such a use case on some clusters. If you want to insert big batches at a fast pace, the only viable solution is to generate SSTables on the Spark side and stream them to C*. Last time we benchmarked such a job we achieved 1.3 million partitions inserted per second on a 3-node C* test cluster - which is impossible with regular inserts.

Best,
Romain

On Monday, February 5, 2018 at 03:54:09 UTC+1, kurt greaves wrote:

> Would you know if there is evidence that inserting skinny rows in sorted order (no batching) helps C*?

This won't have any effect, as each insert will be handled separately by the coordinator (or a different coordinator, even). Sorting is also very unlikely to help even if you did batch.

> Also, in the case of wide rows, is there evidence that sorting clustering keys within partition batches helps ease C*'s job?

No evidence; it seems very unlikely.
Re: Meltdown/Spectre Linux patch - Performance impact on Cassandra?
Hi,

We also noticed a CPU increase - both system and user - on our c3.4xlarge fleet. So far it's really visible with max(%user) and especially max(%system), which has doubled! I graphed a "writes/s / %system" ratio; it's interesting to see how the value dropped yesterday. You can see it here: https://ibb.co/dnVcHG

For reference: https://aws.amazon.com/fr/security/security-bulletins/AWS-2018-013/

Best,
Romain

On Friday, January 5, 2018 at 13:09:35 UTC+1, Steinmaurer, Thomas wrote:

Hello,

does anybody already have some experience/results on whether a patched Linux kernel (regarding Meltdown/Spectre) affects the performance of Cassandra negatively? In production, with all nodes running in AWS on m4.xlarge, we have seen up to a 50% relative CPU increase (e.g. AVG CPU from 40% => 60%) since Jan 4, 2018, most likely correlating with Amazon finishing the patching of the underlying hypervisor infrastructure…

Anybody else seeing a similar CPU increase?

Thanks,
Thomas
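[Editor's sketch of how the %user/%system tracking above could be collected, assuming the sysstat package is installed; the archive file path is system-dependent.]

# Sample CPU usage every 60 seconds, 60 times (~1 hour);
# %usr and %sys are the columns that doubled after the hypervisor patches.
sar -u 60 60

# Or pull historical samples for a past day from sysstat's archive
# (file name/location varies by distro, e.g. /var/log/sysstat/saDD on Debian):
sar -u -f /var/log/sysstat/sa04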
Re: unrecognized column family in logs
Does "nodetool describecluster" shows an actual schema disagreement?You can try "nodetool resetlocalschema" to fix the issue on the node experiencing disagreement. Romain Le jeudi 9 novembre 2017 à 02:55:22 UTC+1, Erick Ramirez a écrit : It looks like you have a schema disagreement in your cluster which you need to look into. And you're right since that column family ID is equivalent to Friday, June 24, 2016 10:14:49 AM PDT. Have a look at the table IDs in system.schema_columnfamilies for clues. Cheers! On Thu, Nov 9, 2017 at 4:50 AM, Anubhav Kale wrote: Hello, We run Cassandra 2.1.13 and since last few days we’re seeing below in logs occasionally. The node then becomes unresponsive to cqlsh. ERROR [SharedPool-Worker-2] 2017-11-08 17:02:32,362 CommitLogSegment.java:441 - Attempted to write commit log entry for unrecognized column family: 2854d160-3a2f-11e6-925c- b143135bdc80 https://github.com/mariusae/ cassandra/blob/master/src/ java/org/apache/cassandra/db/ commitlog/CommitLogSegment. java#L95 The column family has heavy writes, but it hasn’t changed schema / load wise recently. How can this be troubleshooted / fixed ? Thanks !
Re: How to do cassandra routine maintenance
Hi,

You should read about repair maintenance: http://cassandra.apache.org/doc/latest/operating/repair.html Consider installing and running Cassandra Reaper to do so: http://cassandra-reaper.io/

STCS doesn't work well with TTLs. I saw you have done some tuning; it's hard to say if it's OK without knowing the workload. LCS is better for TTLs (but requires fast disks), and if you're working with time series, consider TWCS (see the sketch after this thread). If the CPUs are not overloaded you can also consider Snappy compression (by the way, check the compression ratio). Again, depending on your data model and your queries, chunk_length_in_kb might be increased to get more effective compression (generally speaking we tend to lower it to improve read latency).

Best,
Romain

On Saturday, September 2, 2017 at 04:17:22 UTC+2, qf zhou wrote:

I am using a cluster with 3 Cassandra nodes; the cluster version is 3.0.9. Each day about 200~300 million records are inserted into the cluster. As time goes by, more and more data occupies more and more disk space. Currently, the data distribution on each node is the following:

UN  172.20.5.4  2.5 TiB   256  66.3%  c5271e74-19a1-4cee-98d7-dc169cf87e95  rack1
UN  172.20.5.2  1.73 TiB  256  67.0%  c623bbc0-9839-4d2d-8ff3-db7115719d59  rack1
UN  172.20.5.3  1.86 TiB  256  66.7%  c555e44c-9590-4f45-aea4-f5eca68180b2  rack1

There is only one datacenter. The compaction strategy is here:

compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '12', 'tombstone_threshold': '0.1', 'unchecked_tombstone_compaction': 'true'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 864
AND gc_grace_seconds = 432000

I really want to know how to do Cassandra routine maintenance. I found the data seems to grow faster and the disk is under heavy load. Sometimes I find data inconsistency: two different results appear for the same query. So what should I do to keep the cluster healthy? How do I maintain the cluster? I would appreciate some help very much! Thanks a lot!
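[Editor's sketch of the TWCS switch suggested above, for a time-series table. TWCS ships with 3.0.8+/3.8+; the keyspace/table names and the one-day window are placeholders to adapt. Expect a burst of recompaction right after the change.]

cqlsh -e "ALTER TABLE my_ks.my_table WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'DAYS',
  'compaction_window_size': '1'
};"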
Re: Cassandra 3.11 is compacting forever
Hi,

It might be useful to enable compaction logging with the log_all subproperty (see the sketch below).

Best,
Romain

On Friday, September 8, 2017 at 00:15:19 UTC+2, kurt greaves wrote:

Might be worth turning on debug logging for that node, and when the compaction kicks off and CPU skyrockets, send through the logs.
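[Editor's sketch of enabling that subproperty on one table; log_all exists as a compaction subproperty in 3.6+, and the keyspace/table/strategy here are placeholders - keep your table's existing compaction class and options, just add 'log_all'.]

cqlsh -e "ALTER TABLE my_ks.my_table WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'log_all': 'true'
};"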
Re: Splitting Cassandra Cluster between AWS availability zones
Hi,

Before: 1 cluster with 2 DCs, 3 nodes in each DC.
Now: 1 cluster with 1 DC, 6 nodes in this DC.

Is that right? If yes, depending on the RF - and assuming NetworkTopologyStrategy - I would do:

- RF = 2 => 2 C* racks, one rack in each AZ
- RF = 3 => 3 C* racks, one rack in each AZ

In other words, I would align C* racks and AZs (see the sketch below). Note that AWS charges for inter-AZ traffic, a.k.a. Regional Data Transfer.

Best,
Romain

On Tuesday, March 7, 2017 at 18:36, tommaso barbugli wrote:

Hi Richard,

It depends on the snitch and the replication strategy in use. Here's a link to a blog post about how we deployed C* on 3 AZs: http://highscalability.com/blog/2016/8/1/how-to-setup-a-highly-available-multi-az-cassandra-cluster-o.html

Best,
Tommaso

On Mar 7, 2017 18:05, "Ney, Richard" wrote:

We've collapsed our 2-DC, 3-node Cassandra clusters into a single 6-node Cassandra cluster split between two AWS availability zones. Are there any behaviors we need to take into account to ensure the Cassandra cluster's stability with this configuration?

RICHARD NEY
TECHNICAL DIRECTOR, RESEARCH & DEVELOPMENT
UNITED STATES
richard@aspect.com
aspect.com
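[Editor's sketch of the rack/AZ alignment with GossipingPropertyFileSnitch: one cassandra-rackdc.properties per node, with the rack mirroring the node's AZ. The dc name and region are example values.]

# cassandra-rackdc.properties on a node running in eu-west-1a
dc=my_dc
rack=eu-west-1a

# ...and on a node running in eu-west-1b:
# dc=my_dc
# rack=eu-west-1b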
Re: Attached profiled data but need help understanding it
Hi Kant,

You'll find more information about ixgbevf here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/sriov-networking.html (see the check below). I repeat myself, but don't underestimate VM placement: same AZ? same placement group? etc. Note that LWTs are not discouraged, but as the doc says: "[...] reserve lightweight transactions for those situations where they are absolutely necessary". I hope you'll be able to achieve what you want with more powerful VMs. Let us know!

Best,
Romain

On Monday, March 6, 2017 at 10:49, Kant Kodali wrote:

Hi Romain,

We may be able to achieve what we need without LWTs, but that would require a bunch of changes on the application side and possibly introducing caching layers and designing a solution around that. But for now, we are constrained to use LWTs for another month or so. All said, I still would like to see the discouraged features such as LWTs, secondary indexes, and triggers get better over time, as that would really benefit users.

Agreed, high park/unpark is a sign of excessive context switching, but any ideas why this is happening? Yes, today we will be experimenting with c3.2xlarge to see what the numbers look like, and we will slowly scale up from there.

How do I make sure I install the ixgbevf driver? Don't m4.xlarge or c3.2xlarge already have it? When I googled "ixgbevf driver" it tells me it is an Ethernet driver... I thought all instances run on Ethernet on AWS by default. Can you please give more context on this?

Thanks,
Kant

On Fri, Mar 3, 2017 at 4:42 AM, Romain Hardouin wrote:

Also, I should have mentioned that it would be a good idea to spawn your three benchmark instances in the same AZ, then try with one instance in each AZ to see how network latency affects your LWT rate. The lowest latency is achievable with three instances in the same placement group, of course, but that's kinda dangerous for production.
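[Editor's sketch answering the "how do I make sure" question above; ethtool ships with most distros, and the interface name (eth0) varies by instance type.]

# Which driver backs the primary interface? With SR-IOV enhanced networking
# enabled, this should report "driver: ixgbevf".
ethtool -i eth0

# Is the module available/loaded at all?
modinfo ixgbevf | head -n 3
lsmod | grep ixgbevf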
Re: question of keyspace that just disappeared
I suspect a lack of 3.x reliability. Cassandra could have given up with dropped messages, but not with a "drop keyspace". I mean, I have already seen Spark jobs with too many executors produce a high load average on a DC. I saw a C* node with a 1-minute load average of 140 that could still serve a P99 read latency of 40 ms. But I never saw a disappearing keyspace. There are old tickets regarding C* 1.x, but as far as I remember they were due to a create/drop/create keyspace sequence.

On Friday, March 3, 2017 at 13:44, George Webster wrote:

Thank you for your reply, and good to know about the debug statements. We never dropped or re-created the keyspace before. We haven't even performed writes to that keyspace in months. I also checked the permissions of the Apache user: that user had read-only access. Unfortunately, I reverted from a backup recently; I cannot say for sure anymore whether I saw something in the system tables before the revert. Anyway, hopefully it was just a fluke. We have some crazy ML libraries running on it; maybe Cassandra just gave up? Oh well, Cassandra is a champ and we haven't really had issues with it before.

On Thu, Mar 2, 2017 at 6:51 PM, Romain Hardouin wrote:

Did you inspect the system tables to see if there are some traces of your keyspace? Did you ever drop and re-create this keyspace before? The lines appear in debug because the fd interval is > 2 seconds (logs are in nanoseconds). You can override the intervals via the -Dcassandra.fd_initial_value_ms and -Dcassandra.fd_max_interval_ms properties. Are you sure you didn't have these lines in the debug logs before? I used to see them a lot prior to increasing the intervals to 4 seconds.

Best,
Romain

On Tuesday, February 28, 2017 at 18:25, George Webster wrote:

Hey Cassandra Users,

We recently encountered an issue with a keyspace that just disappeared. I was curious if anyone has had this occur before and can provide some insight. We are using Cassandra 3.10, 2 DCs with 3 nodes each. The data was still located in the storage folder but is not visible inside Cassandra.

I searched the logs for any hints of errors or commands being executed that could have caused the loss of a keyspace. Unfortunately I found nothing. The only unusual issue I saw in the logs was a series of read timeouts that occurred right around when the keyspace went away.
Since then, I see numerous entries in the debug log like the following:

DEBUG [GossipStage:1] 2017-02-28 18:14:12,580 FailureDetector.java:457 - Ignoring interval time of 2155674599 for /x.x.x.12
DEBUG [GossipStage:1] 2017-02-28 18:14:16,580 FailureDetector.java:457 - Ignoring interval time of 2945213745 for /x.x.x.81
DEBUG [GossipStage:1] 2017-02-28 18:14:19,590 FailureDetector.java:457 - Ignoring interval time of 2006530862 for /x.x.x.69
DEBUG [GossipStage:1] 2017-02-28 18:14:27,434 FailureDetector.java:457 - Ignoring interval time of 3441841231 for /x.x.x.82
DEBUG [GossipStage:1] 2017-02-28 18:14:29,588 FailureDetector.java:457 - Ignoring interval time of 2153964846 for /x.x.x.82
DEBUG [GossipStage:1] 2017-02-28 18:14:33,582 FailureDetector.java:457 - Ignoring interval time of 2588593281 for /x.x.x.82
DEBUG [GossipStage:1] 2017-02-28 18:14:37,588 FailureDetector.java:457 - Ignoring interval time of 2005305693 for /x.x.x.69
DEBUG [GossipStage:1] 2017-02-28 18:14:38,592 FailureDetector.java:457 - Ignoring interval time of 2009244850 for /x.x.x.82
DEBUG [GossipStage:1] 2017-02-28 18:14:43,584 FailureDetector.java:457 - Ignoring interval time of 2149192677 for /x.x.x.69
DEBUG [GossipStage:1] 2017-02-28 18:14:45,605 FailureDetector.java:457 - Ignoring interval time of 2021180918 for /x.x.x.85
DEBUG [GossipStage:1] 2017-02-28 18:14:46,432 FailureDetector.java:457 - Ignoring interval time of 2436026101 for /x.x.x.81
DEBUG [GossipStage:1] 2017-02-28 18:14:46,432 FailureDetector.java:457 - Ignoring interval time of 2436187894 for /x.x.x.82

During the time of the disappearing keyspace we had two concurrent activities:

1) Running a Spark job (via HDP 2.5.3 in YARN) that was performing a countByKey. It was using the keyspace that disappeared. The operation crashed.
2) We created a new keyspace to test our schema. The only "fancy" thing in that keyspace is a few materialized view tables. Data was being loaded into that keyspace during the crash. The load process was extracting information and then just writing to Cassandra.

Any ideas? Anyone seen this before?

Thanks,
George
Re: Attached profiled data but need help understanding it
Also, I should have mentioned that it would be a good idea to spawn your three benchmark instances in the same AZ, then try with one instance in each AZ to see how network latency affects your LWT rate. The lowest latency is achievable with three instances in the same placement group, of course, but that's kinda dangerous for production.
Re: Attached profiled data but need help understanding it
Hi Kant,

> By backporting you mean I should cherry-pick the CASSANDRA-11966 commit and compile from source?

Yes.

Regarding the network utilization: you checked throughput, but latency is more important for LWT. That's why you should make sure your m4 instances (both C* and client) are using the ixgbevf driver. I agree 1500 writes/s is not impressive, but 4 vCPUs is low. It depends on the workload, but my experience is that an AWS instance starts to be powerful with 16 vCPUs (e.g. c3.4xlarge). And beware of EBS (again, that's my experience, YMMV). High park/unpark is a sign of excessive context switching. If I were you, I would run an LWT benchmark with 3 x c3.4xlarge or c3.8xlarge (32 vCPUs, SSD instance store). Spawn spot instances to save money, and be sure to tune cassandra.yaml accordingly, e.g. concurrent_writes.

Finally, a naive question, but I must ask you... are you really sure you need LWT? Can't you achieve your goal without it?

Best,
Romain

On Thursday, March 2, 2017 at 10:31, Kant Kodali wrote:

Hi Romain,

Any ideas on this? I am not sure why there is so much time being spent in the park and unpark methods as shown by the thread dump. Also, could you please look into my responses from the other email? It would greatly help.

Thanks,
Kant

On Tue, Feb 28, 2017 at 10:20 PM, Kant Kodali wrote:

Hi Romain,

I am using Cassandra version 3.0.9, and here is the generated report (graphical view) of my thread dump as well, just sending this over in case it helps.

Thanks,
Kant

On Tue, Feb 28, 2017 at 7:51 PM, Kant Kodali wrote:

Hi Romain,

Thanks again. My responses are inline.

Kant

On Tue, Feb 28, 2017 at 10:04 AM, Romain Hardouin wrote:

> we are currently using 3.0.9. should we use 3.8 or 3.10

No, don't use 3.X in production unless you really need a major feature. I would advise sticking to 3.0.X (i.e. 3.0.11 now). You can backport CASSANDRA-11966 easily, but of course you have to deploy from source as a prerequisite.

By backporting you mean I should cherry-pick the CASSANDRA-11966 commit and compile from source?

> I haven't done any tuning yet.

So that's good news, because maybe there is room for improvement.

> Can I change this on a running instance? If so, how? Or does it require downtime?

You can throttle compaction at runtime with "nodetool setcompactionthroughput" (see the sketch after this thread). Be sure to read about all the nodetool commands; some of them are really useful for day-to-day tuning/management. If GC is fine, then check other things -> "[...] different pool sizes for NTR, concurrent reads and writes, compaction executors, etc. Also check if you can improve network latency (e.g. VF or ENA on AWS)." Regarding thread pools, some of them can be resized at runtime via JMX.

> 5000 is the target.

Right now you have reached 1500. Is that per node or for the cluster? We don't know your setup, so it's hard to say whether it's doable. Can you provide more details? VMs, physical nodes, #nodes, etc. Generally speaking, LWT should be seldom used. AFAIK you won't achieve 10,000 writes/s per node. Maybe someone on the list has already done some tuning for a heavy LWT workload?

1500 total for the cluster. I have an 8-node Cassandra cluster. Each node is an AWS m4.xlarge instance (so 4 vCPUs, 16 GB, 1 Gbit network = 125 MB/s). I have 1 node (m4.xlarge) for my application, which just inserts a bunch of data, and each insert is an LWT. I tested the network throughput of the node: I can get up to 98 MB/s. Now, when I start my application, I see that the Cassandra nodes' receive rate/throughput is about 4 MB/s (yes, it is in megabytes; I checked this by running sudo iftop -B).
The disk I/O is about the same, and the Cassandra process CPU usage is about 360% (the max is 400% since it is a 4-core machine). The application node's transmit throughput is about 6 MB/s. So even with a 4 MB/s receive throughput at the Cassandra node, the CPU is almost maxed out. I am not sure what this says about Cassandra? What I can tell is that the network is way underutilized and that 8 nodes are unnecessary, so we plan to bring it down to 4 nodes, except each node this time will have 8 cores. All said, I am still not sure how to scale up from 1500 writes/sec.

Best,
Romain
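[Editor's sketch of the runtime compaction throttling referenced above; the 16 MB/s value is an example, not a recommendation.]

# Throttle compaction to 16 MB/s at runtime, without a restart (0 disables throttling).
nodetool setcompactionthroughput 16

# Watch the effect on the compaction backlog:
nodetool compactionstats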
Re: question of keyspace that just disappeared
Did you inspect the system tables to see if there are some traces of your keyspace? Did you ever drop and re-create this keyspace before? The lines appear in debug because the fd interval is > 2 seconds (logs are in nanoseconds). You can override the intervals via the -Dcassandra.fd_initial_value_ms and -Dcassandra.fd_max_interval_ms properties (see the sketch after this thread). Are you sure you didn't have these lines in the debug logs before? I used to see them a lot prior to increasing the intervals to 4 seconds.

Best,
Romain

On Tuesday, February 28, 2017 at 18:25, George Webster wrote:

Hey Cassandra Users,

We recently encountered an issue with a keyspace that just disappeared. I was curious if anyone has had this occur before and can provide some insight. We are using Cassandra 3.10, 2 DCs with 3 nodes each. The data was still located in the storage folder but is not visible inside Cassandra.

I searched the logs for any hints of errors or commands being executed that could have caused the loss of a keyspace. Unfortunately I found nothing. The only unusual issue I saw in the logs was a series of read timeouts that occurred right around when the keyspace went away. Since then, I see numerous entries in the debug log like the following:

DEBUG [GossipStage:1] 2017-02-28 18:14:12,580 FailureDetector.java:457 - Ignoring interval time of 2155674599 for /x.x.x.12
DEBUG [GossipStage:1] 2017-02-28 18:14:16,580 FailureDetector.java:457 - Ignoring interval time of 2945213745 for /x.x.x.81
DEBUG [GossipStage:1] 2017-02-28 18:14:19,590 FailureDetector.java:457 - Ignoring interval time of 2006530862 for /x.x.x.69
DEBUG [GossipStage:1] 2017-02-28 18:14:27,434 FailureDetector.java:457 - Ignoring interval time of 3441841231 for /x.x.x.82
DEBUG [GossipStage:1] 2017-02-28 18:14:29,588 FailureDetector.java:457 - Ignoring interval time of 2153964846 for /x.x.x.82
DEBUG [GossipStage:1] 2017-02-28 18:14:33,582 FailureDetector.java:457 - Ignoring interval time of 2588593281 for /x.x.x.82
DEBUG [GossipStage:1] 2017-02-28 18:14:37,588 FailureDetector.java:457 - Ignoring interval time of 2005305693 for /x.x.x.69
DEBUG [GossipStage:1] 2017-02-28 18:14:38,592 FailureDetector.java:457 - Ignoring interval time of 2009244850 for /x.x.x.82
DEBUG [GossipStage:1] 2017-02-28 18:14:43,584 FailureDetector.java:457 - Ignoring interval time of 2149192677 for /x.x.x.69
DEBUG [GossipStage:1] 2017-02-28 18:14:45,605 FailureDetector.java:457 - Ignoring interval time of 2021180918 for /x.x.x.85
DEBUG [GossipStage:1] 2017-02-28 18:14:46,432 FailureDetector.java:457 - Ignoring interval time of 2436026101 for /x.x.x.81
DEBUG [GossipStage:1] 2017-02-28 18:14:46,432 FailureDetector.java:457 - Ignoring interval time of 2436187894 for /x.x.x.82

During the time of the disappearing keyspace we had two concurrent activities:

1) Running a Spark job (via HDP 2.5.3 in YARN) that was performing a countByKey. It was using the keyspace that disappeared. The operation crashed.
2) We created a new keyspace to test our schema. The only "fancy" thing in that keyspace is a few materialized view tables. Data was being loaded into that keyspace during the crash. The load process was extracting information and then just writing to Cassandra.

Any ideas? Anyone seen this before?

Thanks,
George
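[Editor's sketch of where those failure-detector properties go - appended to the JVM options in cassandra-env.sh; 4000 ms is the value mentioned in the thread, not a recommendation.]

# In cassandra-env.sh: relax the gossip failure-detector intervals to 4 s.
JVM_OPTS="$JVM_OPTS -Dcassandra.fd_initial_value_ms=4000"
JVM_OPTS="$JVM_OPTS -Dcassandra.fd_max_interval_ms=4000"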
Re: AWS NVMe i3 instances performances
Thanks for your feedback, Daemeon! I'm a bit disappointed, and I hope that some system settings will allow us to leverage NVMe :-/ Which i3 instances did you benchmark? Did you have "preview access" to i3? Or was it available in a specific region before the announcement?

Best,
Romain

On Wednesday, March 1, 2017 at 17:44, daemeon reiydelle wrote:

We did. We found (on CentOS and Ubuntu, both for application-compatibility reasons) that there is somewhat less IO and better CPU throughput at that price point. At the time my optimization work for that client ended, Amazon was looking at the IO issue, as perhaps the frame configurations needed further optimization. This was 2 months ago. A very superficial test (no kernel tuning) done last month seems to indicate the same tradeoffs. Testing was performed in both cases with the C* stress tool and with CI test suites. Does this help?

...

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Wed, Mar 1, 2017 at 3:30 AM, Romain Hardouin wrote:

Hi all,

AWS launched i3 instances a few days ago*. NVMe SSDs seem very promising! Has someone already benchmarked an i3 with Cassandra, e.g. i2 vs i3? If yes, with which OS and kernel version? Did you do any system tuning for NVMe, e.g. PCIe IRQs? etc.

We plan to run some benchmarks, but Debian is not listed as a supported OS, so we have to upgrade our kernel and see if it works :P Here is what we have in mind for the time being:

* OS: Debian
* Kernel: v4.9
* IRQ: try several configurations

Also, I would like to compare performance between our Debian AMI and a standard Amazon Linux AMI.

Thanks!

[*] https://aws.amazon.com/fr/blogs/aws/now-available-i3-instances-for-demanding-io-intensive-applications/
AWS NVMe i3 instances performances
Hi all,

AWS launched i3 instances a few days ago*. NVMe SSDs seem very promising! Has someone already benchmarked an i3 with Cassandra, e.g. i2 vs i3? If yes, with which OS and kernel version? Did you do any system tuning for NVMe, e.g. PCIe IRQs? etc.

We plan to run some benchmarks, but Debian is not listed as a supported OS, so we have to upgrade our kernel and see if it works :P Here is what we have in mind for the time being:

* OS: Debian
* Kernel: v4.9
* IRQ: try several configurations

Also, I would like to compare performance between our Debian AMI and a standard Amazon Linux AMI.

Thanks!

[*] https://aws.amazon.com/fr/blogs/aws/now-available-i3-instances-for-demanding-io-intensive-applications/
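[Editor's note: a few quick checks that could be part of such a benchmark setup - a hedged sketch; device names and sysfs paths depend on the kernel and instance.]

# Is the NVMe device visible, and which model?
lsblk -o NAME,MODEL,SIZE

# I/O scheduler in use for the NVMe device ("none"/"noop" is typical for NVMe):
cat /sys/block/nvme0n1/queue/scheduler

# How NVMe interrupts are spread across CPUs (relevant to the IRQ tuning above):
grep nvme /proc/interrupts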
Re: Attached profiled data but need help understanding it
> we are currently using 3.0.9. should we use 3.8 or 3.10

No, don't use 3.X in production unless you really need a major feature. I would advise sticking to 3.0.X (i.e. 3.0.11 now). You can backport CASSANDRA-11966 easily, but of course you have to deploy from source as a prerequisite.

> I haven't done any tuning yet.

So that's good news, because maybe there is room for improvement.

> Can I change this on a running instance? If so, how? Or does it require downtime?

You can throttle compaction at runtime with "nodetool setcompactionthroughput". Be sure to read about all the nodetool commands; some of them are really useful for day-to-day tuning/management. If GC is fine, then check other things -> "[...] different pool sizes for NTR, concurrent reads and writes, compaction executors, etc. Also check if you can improve network latency (e.g. VF or ENA on AWS)." Regarding thread pools, some of them can be resized at runtime via JMX.

> 5000 is the target.

Right now you have reached 1500. Is that per node or for the cluster? We don't know your setup, so it's hard to say whether it's doable. Can you provide more details? VMs, physical nodes, #nodes, etc. Generally speaking, LWT should be seldom used. AFAIK you won't achieve 10,000 writes/s per node. Maybe someone on the list has already done some tuning for a heavy LWT workload?

Best,
Romain
Re: Attached profiled data but need help understanding it
Hi,

Regarding shared pool workers, see CASSANDRA-11966. You may have to backport it depending on your Cassandra version. Did you try lowering the compaction throughput to see if it helps? Be sure to keep an eye on pending compactions, the SSTable count, and SSTables per read, of course. "alloc" is the memory allocation rate. You can see that compactions are GC intensive. You won't be able to achieve impressive writes/s with LWT, but maybe there is room for improvement. Try GC tuning, different pool sizes for NTR, concurrent reads and writes, compaction executors, etc. Also check if you can improve network latency (e.g. VF or ENA on AWS). What LWT rate do you want to achieve?

Best,
Romain

On Monday, February 27, 2017 at 12:48, Kant Kodali wrote:

Also attached is a flame graph generated from a thread dump.

On Mon, Feb 27, 2017 at 2:32 AM, Kant Kodali wrote:

Hi,

Attached are the stats of my Cassandra node running on a 4-core CPU. I am using the sjk-plus tool for the first time, so what are the things I should watch out for in my attached screenshot? I can see the CPU is almost maxed out, but should I say that is because of compaction or the shared-worker-pool threads (which, btw, I don't know what they are doing; perhaps I need to take a thread dump)? Also, what is alloc for each thread?

I have an insert-heavy workload (almost like an ingest running against the Cassandra cluster), and in my case all writes are LWTs. The current throughput is 1500 writes/sec, where each write is about 1 KB. How can I tune something for higher throughput? Any pointers or suggestions would help.

Thanks much,
Kant
Re: secondary index on static column
Hi,

Sorry for the delay. I created a ticket with steps to reproduce the issue: https://issues.apache.org/jira/browse/CASSANDRA-13277

Best,
Romain

On Thursday, February 2, 2017 at 16:53, Micha wrote:

Hi,

It's 3.9, installed on a jessie system. For me it's like this: I have a three-node cluster. When creating the keyspace with replication factor 3, it works. When creating the keyspace with replication factor 2, it doesn't work and shows the weird behavior. This is a fresh install; I have also tried it multiple times and the result is the same. As SASI indexes work, I use those. But I would like to solve this.

Cheers,
Michael

On 02.02.2017 15:06, Romain Hardouin wrote:

> Hi,
>
> What's your C* 3.X version?
> I've just tested it on 3.9 and it works:
>
> cqlsh> SELECT * FROM test.idx_static where id2=22;
>
>  id  | added                           | id2 | source | dest
> -----+---------------------------------+-----+--------+------
>  id1 | 2017-01-27 23:00:00.000000+0000 |  22 |   src1 | dst1
>
> (1 rows)
>
> Maybe your dataset is incorrect (try on a new table) or you hit a bug.
>
> Best,
>
> Romain

On Friday, January 27, 2017 at 9:44, Micha wrote:

Hi,

I'm quite new to Cassandra, and although there is much info on the net, sometimes I cannot find the solution to a problem. In this case, I have a secondary index on a static column and I don't understand the answer I get from my select.

A cut-down version of the table is:

create table demo (id text, id2 bigint static, added timestamp, source text static, dest text, primary key (id, added));

create index on demo (id2);

id and id2 match one to one. I make one insert:

insert into demo (id, id2, added, source, dest) values ('id1', 22, '2017-01-28', 'src1', 'dst1');

The "select * from demo;" gives the expected answer of the one inserted row. But "select * from demo where id2=22" gives 70 rows as the result (all the same). Why? I have read https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive but I don't get it...

Thanks for answering,
Michael
Re: secondary index on static column
Hi,

What's your C* 3.X version? I've just tested it on 3.9 and it works:

cqlsh> SELECT * FROM test.idx_static where id2=22;

 id  | added                           | id2 | source | dest
-----+---------------------------------+-----+--------+------
 id1 | 2017-01-27 23:00:00.000000+0000 |  22 |   src1 | dst1

(1 rows)

Maybe your dataset is incorrect (try on a new table) or you hit a bug.

Best,
Romain

On Friday, January 27, 2017 at 9:44, Micha wrote:

Hi,

I'm quite new to Cassandra, and although there is much info on the net, sometimes I cannot find the solution to a problem. In this case, I have a secondary index on a static column and I don't understand the answer I get from my select.

A cut-down version of the table is:

create table demo (id text, id2 bigint static, added timestamp, source text static, dest text, primary key (id, added));

create index on demo (id2);

id and id2 match one to one. I make one insert:

insert into demo (id, id2, added, source, dest) values ('id1', 22, '2017-01-28', 'src1', 'dst1');

The "select * from demo;" gives the expected answer of the one inserted row. But "select * from demo where id2=22" gives 70 rows as the result (all the same). Why? I have read https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive but I don't get it...

Thanks for answering,
Michael
Re: Global TTL vs Insert TTL
Default TTL is nice to provide information on tables for ops guys. I mean we know that data in such tables are ephemeral at a glance. Le Mercredi 1 février 2017 21h47, Carlos Rolo a écrit : Awsome to know this! Thanks Jon and DuyHai! Regards, Carlos Juzarte RoloCassandra Consultant / Datastax Certified Architect / Cassandra MVP Pythian - Love your data rolo@pythian | Twitter: @cjrolo | Skype: cjr2k3 | Linkedin: linkedin.com/in/carlosjuzarterolo Mobile: +351 918 918 100 www.pythian.com On Wed, Feb 1, 2017 at 6:57 PM, Jonathan Haddad wrote: The optimization is there. The entire sstable can be dropped but it's not because of the default TTL. The default TTL only applies if a TTL isn't specified explicitly. The default TTL can't be used to drop a table automatically since it can be overridden at insert time. Check out this example. The first insert uses the default TTL. The second insert overrides the default. Using the default TTL to drop the sstable would be pretty terrible in this case: CREATE TABLE test.b ( k int PRIMARY KEY, v int) WITH default_time_to_live = 1; insert into b (k, v) values (1, 1); cqlsh:test> select k, v, TTL(v) from b where k = 1; k | v | ttl(v)---+---+ 1 | 1 | 9943 (1 rows) cqlsh:test> insert into b (k, v) values (2, 1) USING TTL ;cqlsh:test> select k, v, TTL(v) from b where k = 2; k | v | ttl(v)---+---+-- 2 | 1 | 9995 (1 rows) TL;DR: The default TTL is there as a convenience so you don't have to keep the TTL in your code. From a performance perspective it does not matter. Jon On Wed, Feb 1, 2017 at 10:39 AM DuyHai Doan wrote: I was referring to this JIRA https://issues.apache. org/jira/browse/CASSANDRA-3974 when talking about dropping entire SSTable at compaction time But the JIRA is pretty old and it is very possible that the optimization is no longer there On Wed, Feb 1, 2017 at 6:53 PM, Jonathan Haddad wrote: This is incorrect, there's no optimization used that references the table level TTL setting. The max local deletion time is stored in table metadata. See org.apache.cassandra.io. sstable.metadata. StatsMetadata# maxLocalDeletionTime in the Cassandra 3.0 branch. The default ttl is stored here: org.apache.cassandra. schema.TableParams# defaultTimeToLive and is never referenced during compaction. Here's an example from a table I created without a default TTL, you can use the sstablemetadata tool to see: jhaddad@rustyrazorblade ~/dev/cassandra/data/data/ test$ ../../../tools/bin/ sstablemetadata a- 7bca6b50e8a511e6869a5596edf4dd 35/mc-1-big-Data.db.SSTable max local deletion time: 1485980862 On Wed, Feb 1, 2017 at 6:59 AM DuyHai Doan wrote: Global TTL is better than dynamic runtime TTL Why ? Because Global TTL is a table property and Cassandra can perform optimization when compacting. For example if it can see than the maxTimestamp of an SSTable is older than the table Global TTL, the SSTable can be entirely dropped during compaction Using dynamic TTL at runtime, since Cassandra doesn't how and cannot track each individual TTL value, the previous optimization is not possible (even if you always use the SAME TTL for all query, Cassandra is not supposed to know that) On Wed, Feb 1, 2017 at 3:01 PM, Cogumelos Maravilha wrote: Thank you all, for your answers. On 02/01/2017 01:06 PM, Carlos Rolo wrote: To reinforce Alain statement: "I would say that the unsafe part is more about using C* 3.9" this is key. You would be better on 3.0.x unless you need features on the 3.x series. 
Regards, Carlos Juzarte Rolo Cassandra Consultant / Datastax Certified Architect / Cassandra MVP Pythian - Love your data rolo@pythian | Twitter: @cjrolo | Skype: cjr2k3 | Linkedin: linkedin.com/in/carlosjuzarterolo Mobile: +351 918 918 100 www.pythian.com

On Wed, Feb 1, 2017 at 8:32 AM, Alain RODRIGUEZ wrote: "Is it safe to use TWCS in C* 3.9?" I would say that the unsafe part is more about using C* 3.9 than using TWCS in C* 3.9 :-). I see no reason to say TWCS would be specifically unsafe in C* 3.9, but I might be missing something. Going from STCS to TWCS is often smooth; coming from LCS you might expect an extra load compacting a lot (all?) of the SSTables, from what we have seen in the field. In this case, be sure that your compaction options are safe enough to handle this. TWCS is even easier to use on C* 3.0.8+ and C* 3.8+ as it became the new default time series compaction strategy, replacing DTCS, so no extra jar is needed: you can enable TWCS like any other built-in compaction strategy. C*heers, --- Alain Rodriguez - @arodream - al...@thelastpickle.com France The Last Pickle - Apache Cassandra Consulting http://www.thelastpickle.com

2017-01-31 23:29 GMT+01:00 Cogumelos Maravilha: Hi Alain, Thanks for your response and the links. I've also checked "Time series data model and tombstones". Is it safe to use TWCS in C* 3.9?
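To make Alain's point concrete, here is a minimal sketch of switching a table to TWCS with the Java driver. The contact point, keyspace, table name, and window settings are hypothetical; pick a window unit/size that yields a reasonable number of windows over your data's TTL:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class EnableTwcs {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                // On C* 3.0.8+/3.8+ TWCS ships with Cassandra, so no extra jar is needed.
                session.execute("ALTER TABLE mykeyspace.timeseries WITH compaction = {"
                        + " 'class': 'TimeWindowCompactionStrategy',"
                        + " 'compaction_window_unit': 'DAYS',"
                        + " 'compaction_window_size': '1' }");
            }
        }
    }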
Re: Is this normal!?
Just a side note: increase the system_auth keyspace replication factor if you're using authentication.

On Thursday, January 12, 2017 at 2:52 PM, Alain RODRIGUEZ wrote: Hi, "Nodetool repair always lists lots of data and never stays repaired. I think." This might be the reason: "incremental: true". Incremental repair is the default in your version. It marks data as being repaired in order to repair each piece of data only once. It is a clever feature, but with some caveats. I would read about it, as it is not trivial to understand the impacts; in some cases it can create issues and not be such a good idea to use incremental repairs. Make sure to run a full repair instead when a node goes down, for example. C*heers, --- Alain Rodriguez - @arodream - alain@thelastpickle.com France The Last Pickle - Apache Cassandra Consulting http://www.thelastpickle.com

2017-01-11 15:21 GMT+01:00 Cogumelos Maravilha: Nodetool repair always lists lots of data and never stays repaired. I think. Cheers

On 01/11/2017 02:15 PM, Hannu Kröger wrote: > Just to understand: what exactly is the problem? > Cheers, > Hannu
>> On 11 Jan 2017, at 16.07, Cogumelos Maravilha wrote:
>> Cassandra 3.9.
>> nodetool status
>> Datacenter: dc1
>> ===============
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> -- Address Load Tokens Owns (effective) Host ID Rack
>> UN 10.0.120.145 1.21 MiB 256 49.5% da6683cd-c3cf-4c14-b3cc-e7af4080c24f rack1
>> UN 10.0.120.179 1020.51 KiB 256 48.1% fb695bea-d5e8-4bde-99db-9f756456a035 rack1
>> UN 10.0.120.55 1.02 MiB 256 53.3% eb911989-3555-4aef-b11c-4a684a89a8c4 rack1
>> UN 10.0.120.46 1.01 MiB 256 49.1% 8034c30a-c1bc-44d4-bf84-36742e0ec21c rack1
>> nodetool repair
>> [2017-01-11 13:58:27,274] Replication factor is 1. No repair is needed for keyspace 'system_auth'
>> [2017-01-11 13:58:27,284] Starting repair command #4, repairing keyspace system_traces with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 515)
>> [2017-01-11 14:01:55,628] Repair session 82a25960-d806-11e6-8ac4-73b93fe4986d for range [(-1278992819359672027,-1209509957304098060], (-2593749995021251600,-2592266543457887959], (-6451044457481580778,-6438233936014720969], (-1917989291840804877,-1912580903456869648], (-3693090304802198257,-3681923561719364766], (-380426998894740867,-350094836653869552], (1890591246410309420,1899294587910578387], (6561031217224224632,6580230317350171440], ... 4 pages of data ..., (6033828815719998292,6079920177089043443]] finished (progress: 1%)
>> [2017-01-11 13:58:27,986] Repair completed successfully
>> [2017-01-11 13:58:27,988] Repair command #4 finished in 0 seconds
>> nodetool gcstats
>> Interval (ms)  Max GC Elapsed (ms)  Total GC Elapsed (ms)  Stdev GC Elapsed (ms)  GC Reclaimed (MB)  Collections  Direct Memory Bytes
>>        360134                   23                     23                      0          333975216            1                   -1
>> (wait)
>> nodetool gcstats
>> Interval (ms)  Max GC Elapsed (ms)  Total GC Elapsed (ms)  Stdev GC Elapsed (ms)  GC Reclaimed (MB)  Collections  Direct Memory Bytes
>>         60016                    0                      0                    NaN                  0            0                   -1
>> nodetool repair
>> [2017-01-11 14:00:45,888] Replication factor is 1.
>> No repair is needed for keyspace 'system_auth'
>> [2017-01-11 14:00:45,896] Starting repair command #5, repairing keyspace system_traces with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 515)
>> ... 4 pages of data ..., (94613607632078948,219237792837906432], (6033828815719998292,6079920177089043443]] finished (progress: 1%)
>> [2017-01-11 14:00:46,567] Repair completed successfully
>> [2017-01-11 14:00:46,576] Repair command #5 finished in 0 seconds
>> nodetool gcstats
>> Interval (ms)  Max GC Elapsed (ms)  Total GC Elapsed (ms)  Stdev GC Elapsed (ms)  GC Reclaimed (MB)  Collections  Direct Memory Bytes
>>          9169                   25                     25                      0          330518688            1                   -1
>> Always in loop, I think!
>> Thanks in advance.
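Romain's side note about system_auth is visible in the log above ("Replication factor is 1. No repair is needed for keyspace 'system_auth'"): with RF=1, losing a single node can lock users out. A minimal sketch of the fix with the Java driver; the datacenter name and the RF value are assumptions (here RF matches the 4-node dc1 shown in the nodetool status output):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class FixSystemAuthRf {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                session.execute("ALTER KEYSPACE system_auth WITH replication ="
                        + " {'class': 'NetworkTopologyStrategy', 'dc1': 4}");
                // Then run a repair of system_auth on each node:
                //   nodetool repair system_auth
            }
        }
    }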
Re: Openstack and Cassandra
Kilo is a bit old, but the good news is that CPU pinning is available, which IMHO is a must to run C* in production. Of course your bottleneck will be the shared HDDs. Best, Romain

On Tuesday, December 27, 2016 at 10:21 AM, Shalom Sagges wrote: Hi Romain, Thanks for the input! We currently use the Kilo release of OpenStack. Are you aware of any known bugs/issues with this release? We definitely defined anti-affinity rules regarding spreading C* on different hosts. (I surely don't want to be woken up at night due to a failed host ;-) ) Regarding Trove, I doubt we'll use it in production any time soon. Thanks again! Shalom Sagges | DBA | T: +972-74-700-4035

On Mon, Dec 26, 2016 at 7:37 PM, Romain Hardouin wrote: Hi Shalom, I assume you'll use KVM virtualization, so pay attention to your stack at every level: - Nova: e.g. CPU pinning, NUMA awareness if relevant, etc. Have a look at extra specs. - libvirt - KVM - QEMU. You may also be interested in resource quotas on the other OpenStack VMs that will be colocated with the C* VMs. Don't forget to define anti-affinity rules in order to spread out your C* VMs on different hosts. Finally, watch out for the versions of libvirt/KVM/QEMU; some optimizations/bugs are good to know about. Out of curiosity, which OpenStack release are you using? You may be interested in Trove, but C* support is for testing only. Best, Romain
Re: Openstack and Cassandra
Hi Shalom, I assume you'll use KVM virtualization, so pay attention to your stack at every level: - Nova: e.g. CPU pinning, NUMA awareness if relevant, etc. Have a look at extra specs. - libvirt - KVM - QEMU. You may also be interested in resource quotas on the other OpenStack VMs that will be colocated with the C* VMs. Don't forget to define anti-affinity rules in order to spread out your C* VMs on different hosts. Finally, watch out for the versions of libvirt/KVM/QEMU; some optimizations/bugs are good to know about. Out of curiosity, which OpenStack release are you using? You may be interested in Trove, but C* support is for testing only. Best, Romain
Repair: huge boost on C* 2.1 with CASSANDRA-12580
Hi all, Many people here have trouble with repair, so I would like to share my experience regarding the backport of CASSANDRA-12580 "Fix merkle tree size calculation" (thanks Paulo!) to our C* 2.1.16. I was expecting some minor improvements, but the results are impressive on some tables. Because of a slow VPN between our EU and US AWS DCs, the massive drop in overstreaming is a big win for us. On top of that, before the backport I used to see many RepairExceptions that increased during each repair. With this fix the graph shows only one exception on one node, so we can say it's negligible. Such exceptions are not critical because Cassandra-reaper makes a retry, but it's a waste of time. I run repairs on tables set by set (some sets of tables being more critical, etc.). The most impressive result so far for a set is: * Before: 23 days (days, not hours) * With CASSANDRA-12580: 16 hours (yes, hours!) The improvement is not always dramatic (e.g. 8 hours instead of 39 hours on another set) but still significant and valuable. Moreover, considering that: * repair is a mandatory operation in many use cases * Paulo already made the patch for 2.1 * C* 2.1 is widely used (the most used?) I think this bugfix is critical - from an Ops point of view - and should land in 2.1.17 to be available to people who don't deploy from source. Best, Romain
Re: cassandra dump file path
Hi Jean, I had the same problem. I removed the lines in the /etc/init.d/cassandra template (we use Chef to deploy) and now the HeapDumpPath is not overridden anymore. The same goes for -XX:ErrorFile. Best, Romain

On Tuesday, October 4, 2016 at 9:25 AM, Jean Carlo wrote: Yes, we did it. So if the parameter in cassandra-env.sh is used only if we have an OOM, what is the definition of -XX:HeapDumpPath=/var/lib/cassandra/java_1475461286.hprof in /etc/init.d/cassandra for? Saludos Jean Carlo "The best way to predict the future is to invent it" Alan Kay

On Tue, Oct 4, 2016 at 2:58 AM, Yabin Meng wrote: Have you restarted Cassandra after making changes in cassandra-env.sh? Yabin

On Mon, Oct 3, 2016 at 7:44 AM, Jean Carlo wrote: OK, I got the answer to one of my questions. In the script /etc/init.d/cassandra we set the path for the heap dump by default in the cassandra_home. Now the thing I don't understand is: why are the dumps located in the path set by /etc/init.d/cassandra and not by the conf file cassandra-env.sh? Anyone any idea? Saludos Jean Carlo

On Mon, Oct 3, 2016 at 12:00 PM, Jean Carlo wrote: Hi, I see in the log of my Cassandra node that the parameter -XX:HeapDumpPath is set two times. INFO [main] 2016-10-03 04:21:29,941 CassandraDaemon.java:205 - JVM Arguments: [-ea, -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar, -XX:+CMSClassUnloadingEnabled, -XX:+UseThreadPriorities, -XX:ThreadPriorityPolicy=42, -Xms6G, -Xmx6G, -Xmn600M, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/cassandra/dumps/cassandra-1475461287-pid34435.hprof, -Xss256k, -XX:StringTableSize=103, -XX:+UseParNewGC, -XX:+UseConcMarkSweepGC, -XX:+CMSParallelRemarkEnabled, -XX:SurvivorRatio=8, -XX:MaxTenuringThreshold=1, -XX:CMSInitiatingOccupancyFraction=30, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+UseTLAB, -XX:CompileCommandFile=/etc/cassandra/hotspot_compiler, -XX:CMSWaitDuration=1, -XX:+CMSParallelInitialMarkEnabled, -XX:+CMSEdenChunksRecordAlways, -XX:CMSWaitDuration=1, -XX:+UseCondCardMark, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintGCApplicationStoppedTime, -Xloggc:/var/opt/hosting/log/cassandra/gc.log, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=20, -XX:GCLogFileSize=20M, -Djava.net.preferIPv4Stack=true, -Dcom.sun.management.jmxremote.port=7199, -Dcom.sun.management.jmxremote.rmi.port=7199, -Dcom.sun.management.jmxremote.ssl=false, -Dcom.sun.management.jmxremote.authenticate=false, -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password, -Djava.io.tmpdir=/var/opt/hosting/db/cassandra/tmp, -javaagent:/usr/share/cassandra/lib/jolokia-jvm-1.0.6-agent.
jar=port=8778,host=0.0.0.0, -Dcassandra.auth_bcrypt_gensalt_log2_rounds=4, -Dlogback.configurationFile=logback.xml, -Dcassandra.logdir=/var/log/cassandra, -Dcassandra.storagedir=, -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid, -XX:HeapDumpPath=/var/lib/cassandra/java_1475461286.hprof, -XX:ErrorFile=/var/lib/cassandra/hs_err_1475461286.log] This option is defined in cassandra-env.sh:

if [ "x$CASSANDRA_HEAPDUMP_DIR" != "x" ]; then
    JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=$CASSANDRA_HEAPDUMP_DIR/cassandra-`date +%s`-pid$$.hprof"
fi

and we set the value of CASSANDRA_HEAPDUMP_DIR to /cassandra/dumps/ beforehand. It seems that Cassandra does not take the conf in cassandra-env.sh into account and only honors the last HeapDumpPath set: /var/lib/cassandra/java_1475461286.hprof. This causes problems when we have to dump the heap, because Cassandra uses a disk that is not suitable for it. Is XX:HeapDumpPath set in another place/file that I don't know about? Thxs Jean Carlo "The best way to predict the future is to invent it" Alan Kay
Re: TRUNCATE throws OperationTimedOut randomly
Hi, @Edward: > In older versions you can not control when this call will timeout
Actually, truncate_request_timeout_in_ms has been available for many years, starting from 1.2. Maybe you have another setting in mind? @George: Try to put the Cassandra logs in DEBUG. (A driver-side sketch follows the quoted thread below.) Best, Romain

On Wednesday, September 28, 2016 at 8:31 PM, George Sigletos wrote: Even when I set a lower request timeout in order to trigger a timeout, still no WARN or ERROR in the logs.

On Wed, Sep 28, 2016 at 8:22 PM, George Sigletos wrote: Hi Joaquin, Unfortunately neither WARN nor ERROR was found in the system logs across the cluster when executing truncate. Sometimes it executes immediately, other times it takes 25 seconds, given that I have connected with --request-timeout=30 seconds. The nodes are a bit busy compacting. On a freshly restarted cluster, truncate seems to work without problems. Some warnings that I see around that time, but not exactly when executing truncate, are: WARN [CompactionExecutor:2] 2016-09-28 20:03:29,646 SSTableWriter.java:241 - Compacting large partition system/hints:6f2c3b31-4975-470b-8f91-e706be89a83a (133819308 bytes) Kind regards, George

On Wed, Sep 28, 2016 at 7:54 PM, Joaquin Casares wrote: Hi George, Try grepping for WARN and ERROR in the system.logs across all nodes when you run the command. Could you post any of the recent stacktraces that you see? Cheers, Joaquin Casares Consultant, Austin, TX Apache Cassandra Consulting http://www.thelastpickle.com

On Wed, Sep 28, 2016 at 12:43 PM, George Sigletos wrote: Thanks a lot for your reply. I understand that truncate is an expensive operation. But throwing a timeout while truncating a table that is already empty? A workaround is to set a high --request-timeout when connecting. Even 20 seconds is not always enough. Kind regards, George

On Wed, Sep 28, 2016 at 6:59 PM, Edward Capriolo wrote: Truncate does a few things (based on version): - truncate takes snapshots - truncate causes a flush - in very old versions, truncate causes a schema migration. In newer versions like Cassandra 3.4 you have this knob:

# How long the coordinator should wait for truncates to complete
# (This can be much longer, because unless auto_snapshot is disabled
# we need to flush first so we can snapshot before removing the data.)
truncate_request_timeout_in_ms: 60000

In older versions you cannot control when this call will time out; it is fairly normal that it does!

On Wed, Sep 28, 2016 at 12:50 PM, George Sigletos wrote: Hello, I keep executing a TRUNCATE command on an empty table and it throws OperationTimedOut randomly:

cassandra@cqlsh> truncate test.mytable;
OperationTimedOut: errors={}, last_host=cassiebeta-01
cassandra@cqlsh> truncate test.mytable;
OperationTimedOut: errors={}, last_host=cassiebeta-01

Having a 3 node cluster running 2.1.14. No connectivity problems. Has anybody come across the same error? Thanks, George
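As referenced above, a client-side complement to George's --request-timeout workaround: a minimal sketch with the Java driver (assuming a 3.x driver, where a per-statement read timeout is available; contact point and table name are the placeholders from the thread):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class TruncateWithTimeout {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                SimpleStatement truncate = new SimpleStatement("TRUNCATE test.mytable");
                // The coordinator flushes and snapshots before removing data, so give
                // the client more headroom than truncate_request_timeout_in_ms.
                truncate.setReadTimeoutMillis(120000);
                session.execute(truncate);
            }
        }
    }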
Re: Optimising the data model for reads
Hi Julian, The problem with any deletes here is that you can *read* potentially many tombstones. I mean you have two concerns: 1. Avoid reading tombstones during a query. 2. Evict tombstones as quickly as possible to reclaim disk space. The first point is a data model consideration. Generally speaking, to avoid reading tombstones we have to think about order. Let's take an example not related to your data model: say you have an "updated_at" column, and maybe you always want to read the newest data (e.g. < 7 days) while the oldest data will be TTL'ed (tombstones). If you order your data by "updated_at DESC" (and TTL > 7 days and there are no manual deletes) you won't read tombstones. The second point depends on many factors: gc_grace, compaction strategy, compaction throughput, number of compactors, IO performance, #CPUs, ... Also, with such a data model, you will have unbalanced data distribution. What if a user has 1,000,000 files or more? You can use a composite partition key to avoid that: PRIMARY KEY ((userid, fileid), ...). The data distribution will be much better and, on top of that, you won't read tombstones when a file is deleted (because you won't query that partition key at all). *However, if you always read many files per user, each query will hit many nodes.* You have to decide depending on the query pattern, the average/max number of files per user, the average/max file size, etc. Regarding the compaction strategy, LCS is good for read-heavy workloads, but you need good disk IO and enough CPUs/vCPUs (watch out if your write workload is quite heavy). LCS will compact frequently so, *if tombstones are evictable*, they will be evicted faster than with STCS. As you mentioned, you have 10 days of gc_grace, so you might consider lowering this value if maintenance repairs run within a few hours/days. LCS does a good job with updates, and that gives me an idea: what about soft deletes? A clustering column "status int" could do the trick. Let's say 1 => "live file", 2 => "to delete". When a user deletes a file, you set the "status" to 2 and write the userid and fileid to a table "files_to_delete" (the partition key can be the date of the day if there are not millions of deletions per day). Then a batch job can run during off-peak hours to delete, i.e. add a tombstone on, the files to delete. In read queries you would have to add "WHERE status = 1 AND ...". Again, it's just an idea that crossed my mind, I never tested this model, but maybe you can think about it. The bonus is that you can "undelete" a file as long as the batch job has not been triggered. A sketch of this idea follows the quoted message below. Best, Romain

On Thursday, September 29, 2016 at 11:31 AM, Thomas Julian wrote: Hello, I have created a column family for User File Management.

CREATE TABLE "UserFile" ("USERID" bigint, "FILEID" text, "FILETYPE" int, "FOLDER_UID" text, "FILEPATHINFO" text, "JSONCOLUMN" text, PRIMARY KEY ("USERID","FILEID"));

Sample entry: (4*003, 3f9**6a1, null, 2, [{"FOLDER_TYPE":"-1","UID":"1","FOLDER":"\"HOME\""}], {"filename":"untitled","size":1,"kind":-1,"where":""})

Queries:
Select "USERID","FILEID","FILETYPE","FOLDER_UID","JSONCOLUMN" from "UserFile" where "USERID"= and "FILEID" in (,,...)
Select "USERID","FILEID","FILEPATHINFO" from "UserFile" where "USERID"= and "FILEID" in (,,...)

This column family was working perfectly in our lab. I was able to fetch the results for the queries stated above in less than 10 ms. I deployed this in production (Cassandra 2.1.13). It was working perfectly for a month or two. But now, at times, the queries are taking 5 s to 10 s.
On analysing further, I found that a few users are deleting files very frequently. This generates too many tombstones. I have set gc_grace_seconds to the default 10 days and I have chosen SizeTieredCompactionStrategy. I want to optimise this data model for read efficiency. Any help is much appreciated. Best Regards, Julian.
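A minimal sketch of Romain's soft-delete idea (untested, as he says himself) with the Java driver; the schema is simplified and all names are hypothetical:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class SoftDeleteSketch {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("test")) {
                // status right after the partition key: 1 => live file, 2 => to delete
                session.execute("CREATE TABLE IF NOT EXISTS user_file (userid bigint,"
                        + " status int, fileid text, jsoncolumn text,"
                        + " PRIMARY KEY (userid, status, fileid))");
                // Deletions recorded per day, scanned by the off-peak batch job
                session.execute("CREATE TABLE IF NOT EXISTS files_to_delete (day text,"
                        + " userid bigint, fileid text, PRIMARY KEY (day, userid, fileid))");
                // Soft delete of one file: mark it and log it for the batch job.
                // Undelete stays possible until the batch job runs.
                session.execute("INSERT INTO user_file (userid, status, fileid)"
                        + " VALUES (42, 2, 'f1')");
                session.execute("INSERT INTO files_to_delete (day, userid, fileid)"
                        + " VALUES ('2016-09-29', 42, 'f1')");
                // The batch job later writes the real tombstones, e.g.:
                //   DELETE FROM user_file WHERE userid = 42 AND status = 1 AND fileid = 'f1';
                // Live reads filter on the status clustering column:
                //   SELECT fileid FROM user_file WHERE userid = 42 AND status = 1;
            }
        }
    }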
Re: Nodetool repair
OK. If you still have issues after setting streaming_socket_timeout_in_ms != 0, consider increasing request_timeout_in_ms to a high value, say 1 or 2 minutes. See the comments in https://issues.apache.org/jira/browse/CASSANDRA-7904. Regarding 2.1, be sure to test incremental repair on your data before running it in production ;-) Romain
Re: Nodetool repair
Alain, you replied faster, I didn't see your answer :-D
Re: Nodetool repair
Hi, @Matija: George wrote that he uses C* 2.0.9, so the Spotify master is OK for him :-) But you're right about C* >= 2.1; we also use a fork to run it against our 2.1 clusters. @George: your repair might be slow and not necessarily stuck. As Alain said, check the progress with nodetool netstats. Did you set streaming_socket_timeout_in_ms to a value different from 0? What is the value of request_timeout_in_ms? Also, I suggest you upgrade to the latest 2.0.x (i.e. 2.0.17). No need to upgrade SSTables, but be sure to read https://github.com/apache/cassandra/blob/cassandra-2.0/NEWS.txt. Again, you should have a look at cassandra-reaper and its GUI; you will have a progress bar to follow the repair. Finally, if you want to kill a repair you can invoke forceTerminateAllRepairSessions with jmxterm on each node:
1. nodetool stop VALIDATION
2. echo run -b org.apache.cassandra.db:type=StorageService forceTerminateAllRepairSessions | java -jar /tmp/jmxterm/jmxterm-1.0-alpha-4-uber.jar -l 127.0.0.1:7199
jmxterm download: http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar Best, Romain

On Thursday, September 22, 2016 at 4:45 PM, "Li, Guangxing" wrote: Romain, I had another repair that seems to have hung last night. When I did 'nodetool tpstats' on the nodes, I saw the following on the node where I initiated the repair: AntiEntropySessions 1 1. On all other nodes, I saw: AntiEntropySessions 0 0. When I checked the logs for the pattern "session completed successfully" in system.log, I saw that the last finished range occurred 14 hours ago. So I think it is safe to say that the repair has hung somehow. In order to start another repair, do we need to 'kill' this repair? If so, how do we do that? Thanks. George.

On Thu, Sep 22, 2016 at 6:23 AM, Romain Hardouin wrote: I meant that pending (and active) AntiEntropySessions are a simple way to check if a repair is still running on a cluster. Also have a look at Cassandra reaper:
- https://github.com/spotify/cassandra-reaper
- https://github.com/spodkowinski/cassandra-reaper-ui
Best, Romain

On Wednesday, September 21, 2016 at 10:32 PM, "Li, Guangxing" wrote: Romain, I started running a new repair. If I see such behavior again, I will try what you mentioned. Thanks.
Re: Nodetool repair
I meant that pending (and active) AntiEntropySessions are a simple way to check if a repair is still running on a cluster. Also have a look at Cassandra reaper:
- https://github.com/spotify/cassandra-reaper
- https://github.com/spodkowinski/cassandra-reaper-ui
Best, Romain

On Wednesday, September 21, 2016 at 10:32 PM, "Li, Guangxing" wrote: Romain, I started running a new repair. If I see such behavior again, I will try what you mentioned. Thanks.
Re: Nodetool repair
Do you see any pending AntiEntropySessions (not AntiEntropyStage) with nodetool tpstats on the nodes? Romain

On Wednesday, September 21, 2016 at 4:45 PM, "Li, Guangxing" wrote: Alain, my script actually greps through all the log files, including the system.log.* ones. So it was probably due to a failed session. So now my script assumes the repair has finished (possibly due to failure) if it does not see any more repair-related logs after 2 hours. Thanks. George.

On Wed, Sep 21, 2016 at 3:03 AM, Alain RODRIGUEZ wrote: Hi George, That's the best way to monitor repairs "out of the box" I could think of. When you're not seeing 2048 (in your case), it might be due to log rotation or to a session failure. Have you had a look at repair failures? "I am wondering why the implementor did not put something in the log (e.g. ... Repair command #41 has ended...) to clearly state that the repair has completed." +1, and some information about the ranges successfully repaired and the ranges that failed would be a very good thing as well. It would be easy to then read the repair result and know what to do next (re-run repair on some ranges, move to the next node, etc.).

2016-09-20 17:00 GMT+02:00 Li, Guangxing: Hi, I am using version 2.0.9. I have been looking into the logs to see if a repair is finished. Each time a repair is started on a node, I see a log line like "INFO [Thread-112920] 2016-09-16 19:00:43,805 StorageService.java (line 2646) Starting repair command #41, repairing 2048 ranges for keyspace groupmanager" in system.log. So I know that I am expecting to see 2048 log lines like "INFO [AntiEntropySessions:109] 2016-09-16 19:27:20,662 RepairSession.java (line 282) [repair #8b910950-7c43-11e6-88f3-f147ea74230b] session completed successfully". Once I see 2048 such log lines, I know this repair has completed. But this is not dependable, since sometimes I see fewer than 2048 yet I know there is no repair going on, since I do not see any trace of repair in system.log for a long time. So it seems to me that there is a clear way to tell that a repair has started, but there is no clear way to tell that a repair has ended. The only thing you can do is to watch the log, and if you do not see repair activity for a long time, the repair is done somehow. I am wondering why the implementor did not put something in the log (e.g. "... Repair command #41 has ended...") to clearly state that the repair has completed. Thanks. George.

On Tue, Sep 20, 2016 at 2:54 AM, Jens Rantil wrote: On Mon, Sep 19, 2016 at 3:07 PM Alain RODRIGUEZ wrote: ... - The size of your data - The number of vnodes - The compaction throughput - The streaming throughput - The hardware available - The load of the cluster - ... I've also heard that the number of clustering keys per partition key could have an impact. Might be worth investigating. Cheers, Jens -- Jens Rantil, Backend Developer @ Tink. Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden. For urgent matters you can reach me at +46-708-84 18 32.
Re: High load on few nodes in a DC.
Hi, Do you shuffle the replicas with TokenAwarePolicy? See TokenAwarePolicy(LoadBalancingPolicy childPolicy, boolean shuffleReplicas). (A sketch follows the quoted thread below.) Best, Romain

On Tuesday, September 20, 2016 at 3:47 PM, Pranay akula wrote: I was able to find the hotspots causing the load, but these partitions are only KBs in size, with no tombstones and only 2 sstables. What else do I need to debug to find the reason for the high load on some nodes? We are also using unlogged batches; could that be the reason? How do I find which node is serving as coordinator for unlogged batches? We are using the token-aware policy. Thanks

On Mon, Sep 19, 2016 at 12:29 PM, Pranay akula wrote: I was able to see the most used partitions, but the nodes with less load are serving more read and write requests for those particular partitions when compared to the nodes with high load. How can I find out if these nodes are serving as coordinators for those read and write requests? How can I find the token range for these particular partitions and which node is the primary for them? Thanks

On Mon, Sep 19, 2016 at 11:04 AM, Pranay akula wrote: Hi Jeff, Thanks, we are using RF 3 and Cassandra version 2.1.8. Thanks, Pranay.

On Mon, Sep 19, 2016 at 10:55 AM, Jeff Jirsa wrote: Is your replication_factor 2? Or is it 3? What version are you using? The most likely answer is some individual partition that's either being written/read more than others, or is somehow impacting the cluster (wide rows are a natural candidate). You don't mention your version, but most modern versions of Cassandra ship with 'nodetool toppartitions', which will help you identify frequently written/read partitions - perhaps you can use that to identify a hotspot due to some external behavior (some partition being read thousands of times, over and over, could certainly drive up load). - Jeff

From: Pranay akula Reply-To: "user@cassandra.apache.org" Date: Monday, September 19, 2016 at 7:53 AM To: "user@cassandra.apache.org" Subject: High load on few nodes in a DC. When our cluster was under load, I am seeing 1 or 2 nodes consistently under more load compared to the others in the DC. I am not seeing any GC pauses or wide partitions. Can this be because those nodes are continuously serving as coordinators? How can I find the reason for the high load on those two nodes? We are using vnodes. Thanks, Pranay.
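As referenced above, a minimal sketch of Romain's suggestion with the Java driver (2.1/3.x era, matching the constructor signature he quotes; the contact point is a placeholder). Shuffling spreads each request across all replicas of a partition instead of always picking the first one, which helps even out per-node load:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class ShuffledTokenAware {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    // shuffleReplicas = true: don't always hit the same replica first
                    .withLoadBalancingPolicy(new TokenAwarePolicy(
                            DCAwareRoundRobinPolicy.builder().build(), true))
                    .build();
            // ... create sessions as usual; close() the cluster on shutdown
        }
    }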
Re: Export/Importing keyspace from a different sized cluster
Also, for testing purposes, you can send only one replica set to the test DC. For instance with RF=3 and 3 C* racks, you can just rsync/sstableload one rack. It will be faster and OK for tests. Best, Romain

On Tuesday, September 20, 2016 at 3:28 AM, Michael Laws wrote: I put together a shell wrapper around nodetool/sstableloader that I've been running for the past few years - https://github.com/AppliedInfrastructure/cassandra-snapshot-tools. Always seemed to work well for these kinds of scenarios... Never really had to think about where SSTables were on the filesystem, etc. Mike

From: Justin Sanciangco [mailto:jsancian...@blizzard.com] Sent: Monday, September 19, 2016 6:20 PM To: user@cassandra.apache.org Subject: RE: Export/Importing keyspace from a different sized cluster I am running cqlsh 5.0.1 | Cassandra 2.1.11.969 | DSE 4.8.3 | CQL spec 3.2.1. Doing the below command seemed to work: sstableloader -d Thanks for the help!

From: Jeff Jirsa [mailto:jeff.ji...@crowdstrike.com] Sent: Monday, September 19, 2016 5:49 PM To: user@cassandra.apache.org Subject: Re: Export/Importing keyspace from a different sized cluster Something like that, depending on your version (which you didn't specify). Note, though, that sstableloader is notoriously picky about the path to sstables. In particular, it really really really wants a directory structure that matches the directory structure on disk, and wants you to be at the equivalent of the parent/data_files_directory (so if you dump your sstables at /path/to/data/keyspace/table/, you'd want to run sstableloader from /path/to/data/ and provide keyspace/table/ as the location).

From: Justin Sanciangco Reply-To: "user@cassandra.apache.org" Date: Monday, September 19, 2016 at 5:44 PM To: "user@cassandra.apache.org" Subject: RE: Export/Importing keyspace from a different sized cluster So if I rsync the sstables, say from source node 1 and source node 2, to target node 1, would I just run the command like this? From the target host: sstableloader -d

From: Jeff Jirsa [mailto:jeff.ji...@crowdstrike.com] Sent: Monday, September 19, 2016 4:45 PM To: user@cassandra.apache.org Subject: Re: Export/Importing keyspace from a different sized cluster You can ship the sstables to the destination (or any other server with the Cassandra binary tools installed) via ssh/rsync and run sstableloader on the destination cluster as well.

From: Justin Sanciangco Reply-To: "user@cassandra.apache.org" Date: Monday, September 19, 2016 at 2:49 PM To: "user@cassandra.apache.org" Subject: Export/Importing keyspace from a different sized cluster Hello, Assuming I can't get ports opened from the source to the target cluster to run sstableloader, what methods can I use to load a single keyspace from one cluster to another cluster of a different size? Appreciate the help... Thanks, Justin
Re: Optimal value for concurrent_reads for a single NVMe Disk
Hi, You should run a benchmark with cassandra-stress to find the sweet spot. With NVMe I guess you can start with a high value, 128? Please let us know the results of your findings; it's interesting to know if we can go crazy with such pieces of hardware :-) Best, Romain

On Tuesday, September 20, 2016 at 12:11 PM, Thomas Julian wrote: Hello, We are using Cassandra 2.1.13 with each node having an NVMe disk with a total capacity of 1.2 TB and an allotted capacity of 880 GB. We would like to increase the default value of 32 for the param concurrent_reads. But the document says: "(Default: 32) Note: For workloads with more data than can fit in memory, the bottleneck is reads fetching data from disk. Setting to (16 × number_of_drives) allows operations to queue low enough in the stack so that the OS and drives can reorder them. The default setting applies to both logical volume managed (LVM) and RAID drives." https://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html#reference_ds_qfg_n1r_1k__concurrent_reads According to this hardware specification, what could be the optimal value for concurrent_reads? Best Regards, Julian.
Re: How to alter the default value for concurrent_compactors
Hi, You can read and write the values of the following MBean via JMX: org.apache.cassandra.db:type=CompactionManager, attributes CoreCompactorThreads and MaximumCompactorThreads. If you modify CoreCompactorThreads it will be effective immediately: assuming you have some pending compactions, you will see N lines in nodetool compactionstats, where N = the value you set. (A JMX sketch follows the quoted message below.) Best, Romain

On Tuesday, September 20, 2016 at 1:50 PM, Thomas Julian wrote: Hello, We have commented out "concurrent_compactors" in our Cassandra 2.1.13 installation. We would like to review this setting, as some issues indicate that the default configuration may affect read/write performance. https://issues.apache.org/jira/browse/CASSANDRA-8787 https://issues.apache.org/jira/browse/CASSANDRA-7139 Where can we see the value set for concurrent_compactors in our setup? Is it possible to update this configuration without a restart? Best Regards, Julian.
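As referenced above, a minimal sketch of the MBean access in plain Java JMX (the host, port, and new thread count are placeholders; 7199 is the usual Cassandra JMX port):

    import javax.management.Attribute;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class CompactorThreads {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName cm = new ObjectName("org.apache.cassandra.db:type=CompactionManager");
                // Read the current values
                System.out.println("Core: " + mbs.getAttribute(cm, "CoreCompactorThreads"));
                System.out.println("Max:  " + mbs.getAttribute(cm, "MaximumCompactorThreads"));
                // Effective immediately, but lost on restart: also persist
                // concurrent_compactors in cassandra.yaml.
                mbs.setAttribute(cm, new Attribute("CoreCompactorThreads", 4));
            }
        }
    }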
Re: large system hint partition
Hi, > More recent (I think 2.2) don't have this problem since they write hints to the file system as per the commit log
Flat-file hints were implemented starting from 3.0: https://issues.apache.org/jira/browse/CASSANDRA-6230 Best, Romain
Re: Is a blob storage cost of cassandra is the same as bigint storage cost for long variables?
Note that LZ4 compression is used by default. If you want to disable compression you can do this: CREATE/ALTER TABLE ... WITH compression = { 'sstable_compression' : '' }; Best, Romain

On Friday, September 9, 2016 at 8:12 AM, Alexandr Porunov wrote: Hello Romain, Thank you very much for the explanation! I have just run a simple test to compare both situations. I ran two equivalent VMs.

Machine 1:
CREATE KEYSPACE "test" WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
CREATE TABLE test.simple ( id bigint PRIMARY KEY);

Machine 2:
CREATE KEYSPACE "test" WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
CREATE TABLE test.simple ( id blob PRIMARY KEY);

And I put 13421772 primary keys from 1 to 13421772 into both machines. Results: Machine 1: size of the data folder: 495864; Machine 2: size of the data folder: 495004. So there is almost no difference between them (the blob storage even happened to come out slightly smaller). I am happy about it because I need to store specially encoded primary keys of 80 bits each, so I can use a blob as the primary key without hesitation. Best regards, Alexandr

On Fri, Sep 9, 2016 at 1:20 AM, Romain Hardouin wrote: Hi, Disk-wise it's the same because a bigint is serialized as an 8-byte ByteBuffer, and if you store a Long as bytes in a blob column it will take 8 bytes too, right? The difference is the validation. The blob ByteBuffer will be stored as is, whereas the bigint will be validated. So technically the bigint is slower, but I guess that's not noticeable. Yes, you can use a blob as a partition key. I would use the bigint both for validation and clarity. Best, Romain

On Wednesday, September 7, 2016 at 10:54 PM, Alexandr Porunov wrote: Hello, I need to store a "Long" Java variable. The question is whether the storage cost is the same for storing the hex representation of the "Long" variable in a blob as for storing the "Long" variable in a bigint. Are there any performance pros or cons? Is it OK to use a blob as a primary key? Sincerely, Alexandr
Re: Is a blob storage cost of cassandra is the same as bigint storage cost for long variables?
Hi, Disk-wise it's the same because a bigint is serialized as an 8-byte ByteBuffer, and if you store a Long as bytes in a blob column it will take 8 bytes too, right? The difference is the validation. The blob ByteBuffer will be stored as is, whereas the bigint will be validated. So technically the bigint is slower, but I guess that's not noticeable. Yes, you can use a blob as a partition key. I would use the bigint both for validation and clarity. (A serialization sketch follows the quoted message below.) Best, Romain

On Wednesday, September 7, 2016 at 10:54 PM, Alexandr Porunov wrote: Hello, I need to store a "Long" Java variable. The question is whether the storage cost is the same for storing the hex representation of the "Long" variable in a blob as for storing the "Long" variable in a bigint. Are there any performance pros or cons? Is it OK to use a blob as a primary key? Sincerely, Alexandr
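As referenced above, a small sketch of the serialization point in plain Java (the driver calls shown in comments are from the DataStax Java driver; Alexandr's 80-bit keys would simply use a 10-byte buffer instead):

    import java.nio.ByteBuffer;

    public class LongAsBlob {
        public static void main(String[] args) {
            long key = 42L;
            // A bigint travels as an 8-byte buffer; the same long stored in a
            // blob column costs exactly the same 8 bytes, just without validation.
            ByteBuffer blobKey = (ByteBuffer) ByteBuffer.allocate(8).putLong(key).flip();
            System.out.println(blobKey.remaining()); // 8
            // With the Java driver:
            //   boundStatement.setBytes(0, blobKey);  // blob column
            //   boundStatement.setLong(0, key);       // bigint column
        }
    }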
Re: Read timeouts on primary key queries
Is it still fast if you specify CONSISTENCY LOCAL_QUORUM in cqlsh? Romain

On Wednesday, September 7, 2016 at 1:56 PM, Joseph Tech wrote: Thanks, Romain, for the detailed explanation. We use log4j 2 and I have added the driver logging for slow/error queries; will see if it helps to provide any pattern once in Prod. I tried getendpoints and getsstables for some of the timed-out keys and most of them listed only 1 SSTable. There were a few which showed 2 SSTables. There is no specific trend on the keys, it's completely based on the user access, and the same keys return results instantly from cqlsh.

On Tue, Sep 6, 2016 at 1:57 PM, Romain Hardouin wrote: There is nothing special in the two sstablemetadata outputs, but if the timeouts are due to a network split or an overwhelmed node or something like that, you won't see anything here. That said, if you have the keys which produced the timeouts then, yes, you can look for a regular pattern (i.e. always the same keys?). You can find the sstables for a given key with nodetool: nodetool getendpoints. Then you can run the following command on one/each node of the endpoints: nodetool getsstables. If many sstables are shown by the previous command it means that your data is fragmented, but thanks to LCS this number should be low. I think the most useful actions now would be:

1) Enable DEBUG for o.a.c.db.ConsistencyLevel; it won't spam your log file, you will see the following when errors occur: "Local replicas [, ...] are insufficient to satisfy LOCAL_QUORUM requirement of X live nodes in ''". You are using C* 2.1 but you can have a look at the C* 2.2 logback.xml: https://github.com/apache/cassandra/blob/cassandra-2.2/conf/logback.xml I'm using it in production; it's better because it creates a separate debug.log file with an asynchronous appender. Watch out when enabling it, because the default logback configuration sets all of o.a.c at DEBUG: <logger name="org.apache.cassandra" level="DEBUG"/>. Instead you can set: <logger name="org.apache.cassandra.db.ConsistencyLevel" level="DEBUG"/>. Also, if you want to restrict debug.log to the DEBUG level only (instead of DEBUG+INFO+...) you can add a LevelFilter to ASYNCDEBUGLOG in logback.xml:

<filter class="ch.qos.logback.classic.filter.LevelFilter">
  <level>DEBUG</level>
  <onMatch>ACCEPT</onMatch>
  <onMismatch>DENY</onMismatch>
</filter>

Thus, the debug.log file will be empty unless some consistency issues happen.

2) Enable slow queries logging at the driver level with a QueryLogger:

Cluster cluster = ...
// log queries longer than 1 second, see also withDynamicThreshold
QueryLogger queryLogger = QueryLogger.builder(cluster).withConstantThreshold(1000).build();
cluster.register(queryLogger);

Then in your driver logback file: <logger name="com.datastax.driver.core.QueryLogger.SLOW" level="DEBUG"/>

3) And/or: you mentioned that you use DSE, so you can enable slow queries logging in dse.yaml (cql_slow_log_options). Best, Romain

On Monday, September 5, 2016 at 8:05 PM, Joseph Tech wrote: Attached are the sstablemetadata outputs from 2 SSTables of size 28 MB and 52 MB (out2). The records are inserted with different TTLs based on their nature; test records with 1 day, typeA records with 6 months, typeB records with 1 year, etc. There are also explicit DELETEs from this table, though at a much lower rate than the inserts. I am not sure how to interpret this output, or if the right SSTables were picked. Please advise. Is there a way to get the sstables corresponding to the keys that timed out, though they are accessible later?

On Mon, Sep 5, 2016 at 10:58 PM, Anshu Vajpayee wrote: We have seen read timeout issues in Cassandra due to a high droppable tombstone ratio for the repository. Please check for a high droppable tombstone ratio for your repo.
On Mon, Sep 5, 2016 at 8:11 PM, Romain Hardouin wrote: Yes, dclocal_read_repair_chance will reduce the cross-DC traffic and latency, so you can swap the values ( https://issues.apache.org/jira/browse/CASSANDRA-7320 ). I guess sstable_size_in_mb was set to 50 because back in the day (C* 1.0) the default size was way too small: 5 MB. So maybe someone in your company tried "10 * the default", i.e. 50 MB. Now the default is 160 MB. I don't say to change the value, but just keep in mind that you're using a small value here; it could help you someday. Regarding the cells, the histograms show an *estimation* of the min, p50, ..., p99, max of cells based on SSTable metadata. On your screenshot, the max is 4768. So you have a partition key with ~4768 cells. The p99 is 1109, so 99% of your partition keys have less than (or equal to) 1109 cells. You can see this data for a given sstable with the sstablemetadata tool. Best, Romain

On Monday, September 5, 2016 at 3:17 PM, Joseph Tech wrote: Thanks, Romain. We will try to enable the DEBUG logging (assuming it won't clog the logs much). Regarding the table configs, read_repair_chance must be carried over from older versions - mostly defaults. I think sstable_size_in_mb was set to limit the max SSTable size, though I am not sure of the reason for the 50 MB value.
Re: WriteTimeoutException with LOCAL_QUORUM
1) Is it a typo or did you really make a giant leap from C* 1.x to 3.4 with all the C* 2.0 and C* 2.1 upgrades? (BTW, if I were you, I would use the latest 3.0.x.)

2) Regarding NTR all-time blocked (e.g. 26070160 from the logs), have a look at the patch "max_queued_ntr_property.txt" attached to https://issues.apache.org/jira/browse/CASSANDRA-11363. Then set -Dcassandra.max_queued_native_transport_requests=XXX to a value that works for you.

3) Regarding write timeouts:
- Are your writes idempotent? You can retry when a WriteTimeoutException is caught, see IdempotenceAwareRetryPolicy.
- We can see hints in the logs => Do you monitor the frequency/number of hints? Do you see some UnavailableExceptions at the driver level? It means that some nodes are unreachable, and even if that should trigger an UnavailableException, it may also raise a WriteTimeoutException if the coordinator of a request doesn't know yet that the node is unreachable (see failure detector).
- 4 GB of heap is very small and you have 19 tables. Add ~40 system tables to this and you have 59 tables that share 4 GB.
- You are using batches for one/some table(s), right? Is it really required? Is it the most used table?
- What are the values of memtable_cleanup_threshold and batch_size_warn_threshold_in_kb?
- What is the IO wait status on the nodes? Try to correlate timeout exceptions with IO wait load.
- Commitlog and data are on separate devices?
- What are the values of the following MBean attributes on each node?
  * org.apache.cassandra.metrics:type=CommitLog,name=WaitingOnCommit - Count
  * org.apache.cassandra.metrics:type=CommitLog,name=WaitingOnSegmentAllocation - Mean, 99thPercentile, Max
- Do you see MemtableFlushWriter blocked tasks on the nodes? I see 0 in the logs, but the node may have been restarted (e.g. 18 hours of uptime in the nodetool info).

4) Did you notice that you have tombstone warnings? e.g.: WARN [SharedPool-Worker-48] 2016-09-01 06:53:19,453 ReadCommand.java:481 - Read 5000 live rows and 1 tombstone cells for query SELECT * FROM pc_object_data_beta.vsc_data WHERE rundate, vscid = 1472653906000, 111034565 LIMIT 5000 (see tombstone_warn_threshold). The chances are high that your data model is not optimal. You should *really* fix this. A retry-policy sketch for the idempotent case follows the quoted message below. Best, Romain

On Tuesday, September 6, 2016 at 6:47 AM, adeline@thomsonreuters.com wrote: From: Pan, Adeline (TR Technology & Ops) Sent: Tuesday, September 06, 2016 12:34 PM To: 'user@cassandra.apache.org' Cc: Yang, Ling (TR Technology & Ops) Subject: FW: WriteTimeoutException with LOCAL_QUORUM Hi All, I hope you are doing well today, and I need your help. We were using Cassandra 1 before; now we are upgrading to Cassandra 3.4. During the integration test, we encountered "WriteTimeoutException" very frequently (about every other minute); the exception message is as below. The exception trace is in the attached file.

Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_QUORUM (2 replica were required but only 1 acknowledged the write)

There is some information: 1. It is a six-node cluster, two datacenters, and three nodes for each datacenter. The consistency level we are using is LOCAL_QUORUM. 2.
The node info:

[BETA:@:/local/java/cassandra3/current]$ bin/nodetool -hlocalhost info
ID                      : ad077318-6531-498e-bf5a-14ac339d1a45
Gossip active           : true
Thrift active           : false
Native Transport active : true
Load                    : 23.47 GB
Generation No           : 1473065408
Uptime (seconds)        : 67180
Heap Memory (MB)        : 1679.57 / 4016.00
Off Heap Memory (MB)    : 10.34
Data Center             : dc1
Rack                    : rack1
Exceptions              : 0
Key Cache               : entries 32940, size 3.8 MB, capacity 100 MB, 2124114 hits, 2252348 requests, 0.943 recent hit rate, 14400 save period in seconds
Row Cache               : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache           : entries 0, size 0 bytes, capacity 50 MB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Token                   : (invoke with -T/--tokens to see all 256 tokens)

3. We have increased the write_request_timeout_in_ms to 4, which didn't work. 4. The memtable size is 4 GB. 5. memtable_allocation_type: heap_buffers 6. In the Cassandra server log, we found there are Native-Transport-Requests pending from time to time. (The server log piece is in the attached file.)

INFO [ScheduledTasks:1] 2016-09-01 10:08:47,036 StatusLogger.java:52 - Pool Name  Active  Pending  Completed  Blocked  All Time Blocked
INFO [ScheduledTasks:1] 2016-09-01 10:08:47,043 St
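As referenced above, a minimal sketch of the idempotent-retry suggestion with the Java driver 3.x (contact point, keyspace, and the statement itself are placeholders):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.policies.DefaultRetryPolicy;
    import com.datastax.driver.core.policies.IdempotenceAwareRetryPolicy;

    public class IdempotentRetries {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    // Delegates to the child policy, but never retries a write
                    // unless the statement is explicitly marked idempotent
                    .withRetryPolicy(new IdempotenceAwareRetryPolicy(DefaultRetryPolicy.INSTANCE))
                    .build();
                 Session session = cluster.connect()) {
                SimpleStatement write = new SimpleStatement(
                        "INSERT INTO ks.tbl (id, val) VALUES (1, 'x')");
                write.setIdempotent(true); // safe to replay on a WriteTimeoutException
                session.execute(write);
            }
        }
    }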
Re: Read timeouts on primary key queries
There is nothing special in the two sstablemetadata outputs, but if the timeouts are due to a network split or an overwhelmed node or something like that, you won't see anything here. That said, if you have the keys which produced the timeouts then, yes, you can look for a regular pattern (i.e. always the same keys?). You can find the sstables for a given key with nodetool: nodetool getendpoints. Then you can run the following command on one/each node of the endpoints: nodetool getsstables. If many sstables are shown by the previous command it means that your data is fragmented, but thanks to LCS this number should be low. I think the most useful actions now would be:

1) Enable DEBUG for o.a.c.db.ConsistencyLevel; it won't spam your log file, you will see the following when errors occur: "Local replicas [, ...] are insufficient to satisfy LOCAL_QUORUM requirement of X live nodes in ''". You are using C* 2.1 but you can have a look at the C* 2.2 logback.xml: https://github.com/apache/cassandra/blob/cassandra-2.2/conf/logback.xml I'm using it in production; it's better because it creates a separate debug.log file with an asynchronous appender. Watch out when enabling it, because the default logback configuration sets all of o.a.c at DEBUG: <logger name="org.apache.cassandra" level="DEBUG"/>. Instead you can set: <logger name="org.apache.cassandra.db.ConsistencyLevel" level="DEBUG"/>. Also, if you want to restrict debug.log to the DEBUG level only (instead of DEBUG+INFO+...) you can add a LevelFilter to ASYNCDEBUGLOG in logback.xml:

<filter class="ch.qos.logback.classic.filter.LevelFilter">
  <level>DEBUG</level>
  <onMatch>ACCEPT</onMatch>
  <onMismatch>DENY</onMismatch>
</filter>

Thus, the debug.log file will be empty unless some consistency issues happen.

2) Enable slow queries logging at the driver level with a QueryLogger:

Cluster cluster = ...
// log queries longer than 1 second, see also withDynamicThreshold
QueryLogger queryLogger = QueryLogger.builder(cluster).withConstantThreshold(1000).build();
cluster.register(queryLogger);

Then in your driver logback file: <logger name="com.datastax.driver.core.QueryLogger.SLOW" level="DEBUG"/>

3) And/or: you mentioned that you use DSE, so you can enable slow queries logging in dse.yaml (cql_slow_log_options). Best, Romain

On Monday, September 5, 2016 at 8:05 PM, Joseph Tech wrote: Attached are the sstablemetadata outputs from 2 SSTables of size 28 MB and 52 MB (out2). The records are inserted with different TTLs based on their nature; test records with 1 day, typeA records with 6 months, typeB records with 1 year, etc. There are also explicit DELETEs from this table, though at a much lower rate than the inserts. I am not sure how to interpret this output, or if the right SSTables were picked. Please advise. Is there a way to get the sstables corresponding to the keys that timed out, though they are accessible later?

On Mon, Sep 5, 2016 at 10:58 PM, Anshu Vajpayee wrote: We have seen read timeout issues in Cassandra due to a high droppable tombstone ratio for the repository. Please check for a high droppable tombstone ratio for your repo.

On Mon, Sep 5, 2016 at 8:11 PM, Romain Hardouin wrote: Yes, dclocal_read_repair_chance will reduce the cross-DC traffic and latency, so you can swap the values ( https://issues.apache.org/jira/browse/CASSANDRA-7320 ). I guess sstable_size_in_mb was set to 50 because back in the day (C* 1.0) the default size was way too small: 5 MB. So maybe someone in your company tried "10 * the default", i.e. 50 MB. Now the default is 160 MB. I don't say to change the value, but just keep in mind that you're using a small value here; it could help you someday. Regarding the cells, the histograms show an *estimation* of the min, p50, ..., p99, max of cells based on SSTable metadata. On your screenshot, the max is 4768. So you have a partition key with ~4768 cells.
The p99 is 1109, so 99% of your partition keys have less than (or equal to) 1109 cells. You can see this data for a given sstable with the sstablemetadata tool. Best, Romain

On Monday, September 5, 2016 at 3:17 PM, Joseph Tech wrote: Thanks, Romain. We will try to enable the DEBUG logging (assuming it won't clog the logs much). Regarding the table configs, read_repair_chance must be carried over from older versions - mostly defaults. I think sstable_size_in_mb was set to limit the max SSTable size, though I am not sure of the reason for the 50 MB value.
Re: Is it possible to replay hints after running nodetool drain?
Hi, You don't have to worry about that unless you write with CL = ANY. The sole method to force hint delivery that I know of is to invoke scheduleHintDelivery on "org.apache.cassandra.db:type=HintedHandoffManager" via JMX, but it takes an endpoint as argument. (A sketch follows the quoted thread below.) If you have lots of nodes and several DCs, make sure to properly set hinted_handoff_throttle_in_kb and max_hints_delivery_threads. Best, Romain

On Saturday, September 3, 2016 at 2:59 AM, jerome wrote: Hi Matija, Thanks for your help! The downtime is minimal, usually less than five minutes. Since it is so short, we're not so concerned about the node that's down missing data; we just want to make sure that before it goes down it replays all the hints it holds, so that there won't be any gaps in data on the other nodes while it's down. Thanks, Jerome

From: Matija Gobec Sent: Friday, September 2, 2016 6:05:01 PM To: user@cassandra.apache.org Subject: Re: Is it possible to replay hints after running nodetool drain? Hi Jerome, The node being drained stops listening to requests, but the other nodes, being coordinators for given requests, will store hints for that downed node for a configured period of time (max_hint_window_in_ms is 3 hours by default). If the downed node is back online in this time window, it will receive hints from other nodes in the cluster and eventually catch up. What is your typical maintenance downtime? Regards, Matija

On Fri, Sep 2, 2016 at 10:53 PM, jerome wrote: Hello, As part of routine maintenance for our cluster, my colleagues and I will run a nodetool drain before stopping a Cassandra node, performing maintenance, and bringing it back up. We run maintenance as a cron job with a lock stored in a different cluster to ensure only one node is ever down at a time. We would like to make sure the node has replayed all its hints before bringing it down, to minimize the potential window in which users might read out-of-date data (we read at a consistency level of ONE). Is it possible to replay hints after performing a nodetool drain? The documentation leads me to believe it's not, since Cassandra will stop listening for connections from other nodes, but I was unable to find anything definitive either way. If a node won't replay hints after a nodetool drain, is there perhaps another way to tell Cassandra to stop listening for client connections but continue to replay hints to other nodes? Thanks, Jerome
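As referenced above, a minimal sketch of the JMX call Romain mentions, in plain Java (host, JMX port, and the target endpoint IP are placeholders, and the String parameter type is an assumption based on the endpoint argument he describes):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ForceHintDelivery {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName hhm = new ObjectName(
                        "org.apache.cassandra.db:type=HintedHandoffManager");
                // Schedule delivery of all pending hints for one endpoint
                mbs.invoke(hhm, "scheduleHintDelivery",
                        new Object[]{"10.0.0.12"}, new String[]{"java.lang.String"});
            }
        }
    }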
Re: Read timeouts on primary key queries
Yes, dclocal_read_repair_chance will reduce the cross-DC traffic and latency, so you can swap the values ( https://issues.apache.org/jira/browse/CASSANDRA-7320 ). I guess sstable_size_in_mb was set to 50 because back in the day (C* 1.0) the default size was way too small: 5 MB. So maybe someone in your company tried "10 * the default", i.e. 50 MB. Now the default is 160 MB. I don't say to change the value, but just keep in mind that you're using a small value here; it could help you someday. Regarding the cells, the histograms show an *estimation* of the min, p50, ..., p99, max of cells based on SSTable metadata. On your screenshot, the max is 4768. So you have a partition key with ~4768 cells. The p99 is 1109, so 99% of your partition keys have less than (or equal to) 1109 cells. You can see this data for a given sstable with the sstablemetadata tool. A sketch of the swap follows the quoted messages below. Best, Romain

On Monday, September 5, 2016 at 3:17 PM, Joseph Tech wrote: Thanks, Romain. We will try to enable the DEBUG logging (assuming it won't clog the logs much). Regarding the table configs, read_repair_chance must be carried over from older versions - mostly defaults. I think sstable_size_in_mb was set to limit the max SSTable size, though I am not sure of the reason for the 50 MB value. Does setting dclocal_read_repair_chance help in reducing cross-DC traffic (haven't looked into this parameter, just going by the name)? By the cell count definition: is it incremented based on the number of writes for a given name (key?) and value? This table is heavy on reads and writes. If so, the value should be much higher?

On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin wrote: Hi, Try to put org.apache.cassandra.db.ConsistencyLevel at DEBUG level, it could help to find a regular pattern. By the way, I see that you have set a global read repair chance: read_repair_chance = 0.1, and not the local read repair: dclocal_read_repair_chance = 0.0. Is there any reason to do that or is it just the old (pre-2.0.9) default configuration? The cell count is the number of triplets: (name, value, timestamp). Also, I see that you have set sstable_size_in_mb to 50 MB. What is the rationale behind this? (Yes, I'm curious :-) ). Anyway, your "SSTables per read" numbers are good. Best, Romain

On Monday, September 5, 2016 at 1:32 PM, Joseph Tech wrote: Hi Ryan, Attached are the cfhistograms run within a few minutes of each other. On the surface, I don't see anything which indicates too much skewing (assuming skewing == keys spread across many SSTables). Please confirm. Related to this, what does the "cell count" metric indicate? I didn't find a clear explanation in the documents. Thanks, Joseph

On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla wrote: Have you looked at cfhistograms/tablehistograms? Your data may just be skewed (the most likely explanation is probably the correct one here). Regards, Ryan Svihla

From: Joseph Tech Sent: Wednesday, August 31, 2016 11:16 PM Subject: Re: Read timeouts on primary key queries To: Patrick, The desc table is below (only col names changed):

CREATE TABLE db.tbl ( id1 text, id2 text, id3 text, id4 text, f1 text, f2 map, f3 map, created timestamp, updated timestamp, PRIMARY KEY (id1, id2, id3, id4)) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC) AND bloom_filter_fp_chance = 0.01 AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}' AND comment = '' AND compaction = {'sstable_size_in_mb': '50', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.
compress.LZ4Compressor'} AND dclocal_read_repair_chance = 0.0 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND read_repair_chance = 0.1 AND speculative_retry = '99.0PERCENTILE'; and the query is select * from tbl where id1=? and id2=? and id3=? and id4=? The timeouts happen within ~2s to ~5s, while the successful calls have avg of 8ms and p99 of 15s. These times are seen from app side, the actual query times would be slightly lower. Is there a way to capture traces only when queries take longer than a specified duration? . We can't enable tracing in production given the volume of traffic. We see that the same query which timed out works fine later, so not sure if the trace of a successful run would help. Thanks,Joseph On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin wrote: If you are getting a timeout on one table, then a mismatch of RF and node count doesn't seem as likely. Time to look at your query. You said it was a 'select * from ta
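A quick sketch of the sstablemetadata invocation mentioned above (the path follows the 2.1 on-disk layout and is an assumption, as is the placeholder <cfid>; adjust to your install):

    $ sstablemetadata /var/lib/cassandra/data/db/tbl-<cfid>/db-tbl-ka-1-Data.db

If I remember correctly, its output should include, among other stats, a histogram of partition sizes and cell counts per partition, which is where the min/p50/p99/max estimates above come from.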
Re: Read timeouts on primary key queries
Hi, Try to put org.apache.cassandra.db.ConsistencyLevel at DEBUG level, it could help to find a regular pattern. By the way, I see that you have set a global read repair chance: read_repair_chance = 0.1 and not the local read repair: dclocal_read_repair_chance = 0.0. Is there any reason to do that or is it just the old (pre 2.0.9) default configuration? The cell count is the number of triplets: (name, value, timestamp). Also, I see that you have set sstable_size_in_mb at 50 MB. What is the rationale behind this? (Yes, I'm curious :-) ). Anyway your "SSTables per read" are good. Best, Romain On Monday, September 5, 2016 at 1:32 PM, Joseph Tech wrote: Hi Ryan, Attached are the cfhistograms run within a few minutes of each other. On the surface, I don't see anything which indicates too much skewing (assuming skewing == keys spread across many SSTables). Please confirm. Related to this, what does the "cell count" metric indicate? I didn't find a clear explanation in the documents. Thanks, Joseph On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla wrote: Have you looked at cfhistograms/tablehistograms? Your data may just be skewed (the most likely explanation is probably the correct one here). Regards, Ryan Svihla _ From: Joseph Tech Sent: Wednesday, August 31, 2016 11:16 PM Subject: Re: Read timeouts on primary key queries To: Patrick, The desc table is below (only col names changed): CREATE TABLE db.tbl ( id1 text, id2 text, id3 text, id4 text, f1 text, f2 map, f3 map, created timestamp, updated timestamp, PRIMARY KEY (id1, id2, id3, id4)) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC) AND bloom_filter_fp_chance = 0.01 AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}' AND comment = '' AND compaction = {'sstable_size_in_mb': '50', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND dclocal_read_repair_chance = 0.0 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND read_repair_chance = 0.1 AND speculative_retry = '99.0PERCENTILE'; and the query is select * from tbl where id1=? and id2=? and id3=? and id4=? The timeouts happen within ~2s to ~5s, while the successful calls have an avg of 8ms and a p99 of 15ms. These times are seen from the app side; the actual query times would be slightly lower. Is there a way to capture traces only when queries take longer than a specified duration? We can't enable tracing in production given the volume of traffic. We see that the same query which timed out works fine later, so not sure if the trace of a successful run would help. Thanks, Joseph On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin wrote: If you are getting a timeout on one table, then a mismatch of RF and node count doesn't seem as likely. Time to look at your query. You said it was a 'select * from table where key=?' type query. I would next use the trace facility in cqlsh to investigate further. That's a good way to find hard-to-find issues. You should be looking for a clear ledge where you go from single digit ms to 4 or 5 digit ms times. The other place to look is your data model for that table if you want to post the output from a desc table. Patrick On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech wrote: On further analysis, this issue happens only on 1 table in the KS which has the max reads. @Atul, I will look at system health, but didn't see anything standing out from GC logs (using JDK 1.8_92 with G1GC). @Patrick, could you please elaborate on the "mismatch on node count + RF" part? On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha wrote: There could be many reasons for this if it is intermittent: CPU usage + I/O wait status. As reads are I/O intensive, your IOPS requirements should be met under that load. It could be a heap issue if the CPU is busy with GC only. Network health could be the reason. So it's better to look at system health when it happens. -- -- -- --- Atul Saroha Lead Software Engineer M: +91 8447784271 T: +91 124-415-6069 EXT: 12369 Plot # 362, ASF Centre - Tower A, Udyog Vihar, Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech wrote: Hi Patrick, The nodetool status shows all nodes up and normal now. From OpsCenter "Event Log", there are some nodes reported as being down/up etc. during the timeframe of the timeout, but these are Search workload nodes from the remote (non-local) DC. The RF is 3 and there are 9 nodes per DC. Thanks, Joseph On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin wrote: You aren't achieving quorum on your reads, as the error explains. That means you either have
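For reference, the quorum arithmetic behind Patrick's point: quorum = floor(RF/2) + 1, so with RF=3 a QUORUM (or LOCAL_QUORUM) read needs 2 of the partition's 3 replicas to answer. One replica down is fine; with two down (or slow past the timeout) the read fails, whatever the total node count per DC.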
Re: nodetool repair with -pr and -dc
Hi Jérôme, The code in 2.2.6 allows -local and -pr: https://github.com/apache/cassandra/blob/cassandra-2.2.6/src/java/org/apache/cassandra/service/StorageService.java#L2899 But... the options validation introduced in CASSANDRA-6455 seems to break this feature! https://github.com/apache/cassandra/blob/cassandra-2.2.6/src/java/org/apache/cassandra/repair/messages/RepairOption.java#L211 I suggest opening a ticket: https://issues.apache.org/jira/browse/cassandra/ Best, Romain On Friday, August 19, 2016 at 11:47 AM, Jérôme Mainaud wrote: Hello, I've got a repair command with both -pr and -local rejected on a 2.2.6 cluster. The exact command was: nodetool repair --full -par -pr -local -j 4 The message is "You need to run primary range repair on all nodes in the cluster". Reading the code and the previously cited CASSANDRA-7450, it should have been accepted. Has anyone met this error before? Thanks -- Jérôme Mainaud jer...@mainaud.com 2016-08-12 1:14 GMT+02:00 kurt Greaves: -D does not do what you think it does. I've quoted the relevant documentation from the README: Multiple Datacenters If you have multiple datacenters in your ring, then you MUST specify the name of the datacenter containing the node you are repairing as part of the command-line options (--datacenter=DCNAME). Failure to do so will result in only a subset of your data being repaired (approximately data/number-of-datacenters). This is because nodetool has no way to determine the relevant DC on its own, which in turn means it will use the tokens from every ring member in every datacenter. On 11 August 2016 at 12:24, Paulo Motta wrote: > if we want to use the -pr option (which I suppose we should to prevent duplicate > checks) in 2.0 then if we run the repair on all nodes in a single DC then it > should be sufficient and we should not need to run it on all nodes across > DCs? No, because the primary ranges of the nodes in other DCs will be missing repair, so you should either run with -pr on all nodes in all DCs, or restrict repair to a specific DC with -local (and have duplicate checks). Combined -pr and -local are only supported on 2.1. 2016-08-11 1:29 GMT-03:00 Anishek Agarwal: ok thanks, so if we want to use the -pr option (which I suppose we should to prevent duplicate checks) in 2.0 then if we run the repair on all nodes in a single DC then it should be sufficient and we should not need to run it on all nodes across DCs? On Wed, Aug 10, 2016 at 5:01 PM, Paulo Motta wrote: On 2.0 the repair -pr option is not supported together with -local, -hosts or -dc, since it assumes you need to repair all nodes in all DCs, and it will throw an error if you try to run it with nodetool, so perhaps there's something wrong with range_repair's options parsing. On 2.1 support was added for simultaneous -pr and -local options in CASSANDRA-7450, so if you need that you can either upgrade to 2.1 or backport that to 2.0. 2016-08-10 5:20 GMT-03:00 Anishek Agarwal: Hello, We have a 2.0.17 Cassandra cluster (DC1) with a cross-DC setup with a smaller cluster (DC2). After reading various blogs about scheduling/running repairs, it looks like it's good to run it with the following: -pr for primary range only, -st -et for sub ranges, -par for parallel, -dc to make sure we can schedule repairs independently on each data centre we have. I have configured the above using the repair utility at https://github.com/BrianGallew/cassandra_range_repair.git which leads to the following command: ./src/range_repair.py -k [keyspace] -c [columnfamily name] -v -H localhost -p -D DC1 but it looks like the merkle tree is being calculated on nodes which are part of the other DC (DC2). Why does this happen? I thought it should only look at the nodes in the local DC. However, on nodetool the -pr option cannot be used with -local according to the docs at https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsRepair.html so I may be missing something; can someone help explain this, please? thanks, anishek -- Kurt Greaves k...@instaclustr.com www.instaclustr.com
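Pulling the thread's advice together as concrete commands (a sketch; which flag combinations are accepted depends on the version, as discussed above):

    nodetool repair -par -pr          # 2.0: primary range only; must then run on every node in every DC
    nodetool repair -par -pr -local   # 2.1+ (CASSANDRA-7450): primary range, restricted to the local DC
    nodetool repair -par -local       # 2.0 alternative: one DC at a time, at the cost of duplicate work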
Re: A question to updatesstables
Ok... you said 2.0.10 in the original post ;-) You can't upgrade from 1.2 to 2.1: 2.0.7 is the minimum. So upgrade to 2.0.17 (the latest 2.0.X) first, see https://github.com/apache/cassandra/blob/cassandra-2.1/NEWS.txt#L244 Best, Romain On Friday, August 19, 2016 at 11:41 AM, "Lu, Boying" wrote: Yes, we use Cassandra 2.1.11 in our latest release. From: Romain Hardouin [mailto:romainh...@yahoo.fr] Sent: August 19, 2016 17:36 To: user@cassandra.apache.org Subject: Re: A question to updatesstables ka is the 2.1 format... I don't understand. Did you install C* 2.1? Romain On Friday, August 19, 2016 at 11:32 AM, "Lu, Boying" wrote: Here is the error message in our log file: java.lang.RuntimeException: Incompatible SSTable found. Current version ka is unable to read file: /data/db/1/data/StorageOS/RemoteDirectorGroup/StorageOS-RemoteDirectorGroup-ic-37. Please run upgradesstables. at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:517) at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:494) at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:335) at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:275) at org.apache.cassandra.db.Keyspace.open(Keyspace.java:121) at org.apache.cassandra.db.Keyspace.open(Keyspace.java:98) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:328) at org.apache.cassandra.service.CassandraDaemon.init(CassandraDaemon.java:479) From: Ryan Svihla [mailto:r...@foundev.pro] Sent: August 19, 2016 17:26 To: user@cassandra.apache.org Subject: Re: A question to updatesstables The actual error message could be very useful to diagnose the reason. There are warnings about incompatible formats which are safe to ignore (usually in the cache) and I have one time
Re: A question to updatesstables
ka is the 2.1 format... I don't understand. Did you install C* 2.1? Romain On Friday, August 19, 2016 at 11:32 AM, "Lu, Boying" wrote: Here is the error message in our log file: java.lang.RuntimeException: Incompatible SSTable found. Current version ka is unable to read file: /data/db/1/data/StorageOS/RemoteDirectorGroup/StorageOS-RemoteDirectorGroup-ic-37. Please run upgradesstables. at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:517) at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:494) at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:335) at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:275) at org.apache.cassandra.db.Keyspace.open(Keyspace.java:121) at org.apache.cassandra.db.Keyspace.open(Keyspace.java:98) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:328) at org.apache.cassandra.service.CassandraDaemon.init(CassandraDaemon.java:479) From: Ryan Svihla [mailto:r...@foundev.pro] Sent: August 19, 2016 17:26 To: user@cassandra.apache.org Subject: Re: A question to updatesstables The actual error message could be very useful to diagnose the reason. There are warnings about incompatible formats which are safe to ignore (usually in the cache) and I have one time seen an issue with commit log archiving preventing a startup during upgrade. Usually there is something else broken and the version mismatch is a false signal. Regards, Ryan Svihla On Aug 18, 2016, at 10:18 PM, Lu, Boying wrote: Thanks a lot. I'm a little bit confused. If 'nodetool upgradesstables' doesn't work without the Cassandra server running, and the Cassandra server failed to start due to the incompatible SSTable format, how to resolve this dilemma? From: Carlos Alonso [mailto:i...@mrcalonso.com] Sent: August 18, 2016 18:44 To: user@cassandra.apache.org Subject: Re: A question to updatesstables Replies inline Carlos Alonso | Software Engineer | @calonso On 18 August 2016 at 11:56, Lu, Boying wrote: Hi, All, We use Cassandra in our product. In our early release we used Cassandra 1.2.10 whose SSTable format is 'ic'. We upgraded Cassandra to 2.0.10 in our product release. But the Cassandra server failed to start due to the incompatible SSTable format and the log message told us to use 'nodetool upgradesstables' to upgrade the SSTable files. To make sure there is no negative impact on our data, I want to confirm the following things about this command before trying it: 1. Does it work without the Cassandra server running? No, it won't. 2. Will it cause data loss? It shouldn't if you followed the upgrade instructions properly. 3. What's the best practice to avoid this error occurring again (e.g. when upgrading Cassandra next time)? Upgrading SSTables is required or not depending on the upgrade you're running; basically, if the SSTable layout changes you'll need to run it and not otherwise, so there's nothing you can do to avoid it. Thanks Boying
Re: A question to updatesstables
Hi, There are two ways to upgrade SSTables: - online (C* must be UP): nodetool upgradesstables - offline (when C* is stopped): using the tool called "sstableupgrade". It's located in the bin directory of Cassandra so depending on how you installed Cassandra, it may be on the path. See https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/ToolsSSTableupgrade_t.html A few questions: - Did you check you are not hitting https://github.com/apache/cassandra/blob/cassandra-2.0/NEWS.txt#L162 ? i.e. are you sure that all your data are in the "ic" format? - Why did you choose 2.0.10? (The latest 2.0 release being 2.0.17.) Best, Romain On Friday, August 19, 2016 at 5:18 AM, "Lu, Boying" wrote: Thanks a lot. I'm a little bit confused. If 'nodetool upgradesstables' doesn't work without the Cassandra server running, and the Cassandra server failed to start due to the incompatible SSTable format, how to resolve this dilemma? From: Carlos Alonso [mailto:i...@mrcalonso.com] Sent: August 18, 2016 18:44 To: user@cassandra.apache.org Subject: Re: A question to updatesstables Replies inline Carlos Alonso | Software Engineer | @calonso On 18 August 2016 at 11:56, Lu, Boying wrote: Hi, All, We use Cassandra in our product. In our early release we used Cassandra 1.2.10 whose SSTable format is 'ic'. We upgraded Cassandra to 2.0.10 in our product release. But the Cassandra server failed to start due to the incompatible SSTable format and the log message told us to use 'nodetool upgradesstables' to upgrade the SSTable files. To make sure there is no negative impact on our data, I want to confirm the following things about this command before trying it: 1. Does it work without the Cassandra server running? No, it won't. 2. Will it cause data loss? It shouldn't if you followed the upgrade instructions properly. 3. What's the best practice to avoid this error occurring again (e.g. when upgrading Cassandra next time)? Upgrading SSTables is required or not depending on the upgrade you're running; basically, if the SSTable layout changes you'll need to run it and not otherwise, so there's nothing you can do to avoid it. Thanks Boying
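As a concrete sketch of the two paths (keyspace/table names are placeholders; check the flags for your exact version):

    nodetool upgradesstables -a my_keyspace my_table   # online; -a also rewrites SSTables already in the current format
    sstableupgrade my_keyspace my_table                # offline, with C* stopped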
Re: upgradesstables throws error when migrating from 2.0.14 to 2.1.13
Hi, Try this and check the yaml file path: strace -f -e open nodetool upgradesstables 2>&1 | grep cassandra.yaml How is C* installed (package, tarball)? Do other nodetool commands run fine? Also, did you try an offline SSTable upgrade with the sstableupgrade tool? Best, Romain On Friday, August 12, 2016 at 3:31 PM, Amit Singh F wrote: Hi All, We are in the process of migrating from 2.0.14 to 2.1.13 and we are able to successfully install the binaries and get Cassandra 2.1.13 up and running fine. But an issue comes up when we try to run nodetool upgradesstables: it finishes in a few seconds only, which means it does not find any old SSTables that need to be upgraded, but when I locate the SSTables on disk, I can see them in the old format. Also, when I try running the sstableupgrade command, the below error is thrown: org.apache.cassandra.exceptions.ConfigurationException: Expecting URI in variable: [cassandra.config]. Please prefix the file with file:/// for local files or file://<server>/ for remote files. Aborting. If you are executing this from an external tool, it needs to set Config.setClientMode(true) to avoid loading configuration. at org.apache.cassandra.config.YamlConfigurationLoader.getStorageConfigURL(YamlConfigurationLoader.java:73) ~[apache-cassandra-2.1.13.jar:2.1.13] at org.apache.cassandra.config.YamlConfigurationLoader.loadConfig(YamlConfigurationLoader.java:84) ~[apache-cassandra-2.1.13.jar:2.1.13] at org.apache.cassandra.config.DatabaseDescriptor.loadConfig(DatabaseDescriptor.java:161) ~[apache-cassandra-2.1.13.jar:2.1.13] at org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:136) ~[apache-cassandra-2.1.13.jar:2.1.13] at org.apache.cassandra.tools.StandaloneUpgrader.main(StandaloneUpgrader.java:52) [apache-cassandra-2.1.13.jar:2.1.13] Expecting URI in variable: [cassandra.config]. Please prefix the file with file:/// for local files or file://<server>/ for remote files. Aborting. If you are executing this from an external tool, it needs to set Config.setClientMode(true) to avoid loading configuration. Fatal configuration error; unable to start. See log for stacktrace. I also debugged the code a little and this error is due to an invalid path for cassandra.yaml, but I can skip this as my Cassandra node is in UN state. So can anybody provide some pointers to look into this? Regards Amit Chowdhery
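One workaround for the "Expecting URI" error is to point the standalone tool at the yaml explicitly with a file URI. A sketch only: the path is an assumption, and whether the launcher script picks up JVM_OPTS depends on the version, so you may need to add the -D flag inside the sstableupgrade script itself:

    export JVM_OPTS="-Dcassandra.config=file:///etc/cassandra/cassandra.yaml"
    sstableupgrade my_keyspace my_table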
Re: Question nodetool status
It's a bit more involved than that. C* uses a "Phi accrual failure detector": https://docs.datastax.com/en/cassandra/3.x/cassandra/architecture/archDataDistributeFailDetect.html https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L878 See also https://dspace.jaist.ac.jp/dspace/bitstream/10119/4784/1/IS-RR-2004-010.pdf Best, Romain On Thursday, August 11, 2016 at 5:02 PM, jean paul wrote: Hi, thanks a lot for the answer :) Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about. unreachableNodes = probe.getUnreachableNodes(); ---> i.e. if a node doesn't publish heartbeats for x seconds (using the gossip protocol), it's therefore marked 'DN: down'? Is that it? 2016-08-11 13:51 GMT+01:00 Romain Hardouin: Hi Jean Paul, Yes, the gossiper is used. Example with down nodes: 1. The status command retrieves unreachable nodes from a NodeProbe instance: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/tools/nodetool/Status.java#L64 2. The NodeProbe list comes from a StorageService proxy: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/tools/NodeProbe.java#L438 3. The proxy calls the Gossiper singleton: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L2681 Best, Romain On Thursday, August 11, 2016 at 2:16 PM, jean paul wrote: Hi all, $ nodetool status Datacenter: datacenter1 === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN 127.0.0.1 83.05 KB 256 100.0% 460ddcd9-1ee8-48b8-a618-c076056aad07 rack1 The nodetool command shows the status of the node (UN=up, DN=down). Please, I'd like to know how this command works and whether it is based on the gossip protocol or not? Thank you so much for the explanations. Best regards.
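To make the phi accrual idea concrete, here is a simplified Java sketch — not Cassandra's actual code, which keeps a sliding window of heartbeat inter-arrival times per endpoint — showing why suspicion grows continuously instead of flipping at a fixed x-second cutoff. A node is convicted once phi exceeds phi_convict_threshold (8 by default):

    class PhiSketch {
        // Exponential approximation: phi = elapsed / (meanInterval * ln 10).
        // The longer the silence relative to the usual heartbeat interval,
        // the higher the suspicion level.
        static double phi(long nowMs, long lastHeartbeatMs, double meanIntervalMs) {
            return (nowMs - lastHeartbeatMs) / (meanIntervalMs * Math.log(10.0));
        }

        public static void main(String[] args) {
            double mean = 1000.0; // heartbeats seen roughly every second
            // After 8 seconds of silence, phi is ~3.5: suspicious, not yet convicted.
            System.out.println(phi(10_000, 2_000, mean));
            // After ~18.4 seconds of silence, phi crosses the default threshold of 8.
            System.out.println(phi(20_400, 2_000, mean));
        }
    }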
Re: Question nodetool status
Hi Jean Paul, Yes, the gossiper is used. Example with down nodes: 1. The status command retrieves unreachable nodes from a NodeProbe instance: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/tools/nodetool/Status.java#L64 2. The NodeProbe list comes from a StorageService proxy: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/tools/NodeProbe.java#L438 3. The proxy calls the Gossiper singleton: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L2681 Best, Romain On Thursday, August 11, 2016 at 2:16 PM, jean paul wrote: Hi all, $ nodetool status Datacenter: datacenter1 === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN 127.0.0.1 83.05 KB 256 100.0% 460ddcd9-1ee8-48b8-a618-c076056aad07 rack1 The nodetool command shows the status of the node (UN=up, DN=down). Please, I'd like to know how this command works and whether it is based on the gossip protocol or not? Thank you so much for the explanations. Best regards.
Re: Re : Default values in Cassandra YAML file
Yes. You can even see that some caution is taken in the code: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/config/Config.java#L131 (But if I were you I would not rely on this. It's always better to be explicit.) Best, Romain On Wednesday, August 10, 2016 at 5:50 PM, sai krishnam raju potturi wrote: hi; if there are any missing attributes in the YAML file, will Cassandra pick up default values for those attributes? thanks
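To illustrate the mechanism (a simplified sketch, not the actual class): cassandra.yaml keys are mapped onto public fields of the Config class, so a key absent from the yaml simply keeps its field's Java initializer as the default, while a field with no initializer must be supplied explicitly:

    class Config {
        public Integer num_tokens = 1;  // missing from yaml -> defaults to 1
        public String partitioner;      // no default: startup fails if not set in yaml
    }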
Re: Memory leak and lockup on our 2.2.7 Cassandra cluster.
> Curious why the 2.2 to 3.x upgrade path is risky at best. I guess that upgrade from 2.2 is less tested by DataStax QA because DSE 4 used C* 2.1, not 2.2. I would say the safest upgrade is 2.1 to 3.0.x. Best, Romain
Re: unreachable nodes mystery in describecluster output
That's good news if describecluster shows the same version on each node. Try with a high timeout like 120 seconds to see if it works. Is there a VPN between DCs? Is there room for improvement at the network level? TCP tuning, etc. I'm not saying you won't have unreachable nodes, but it's worth trying if you can. Romain On Wednesday, August 3, 2016 at 3:02 PM, Aleksandr Ivanov wrote: The latency is high... It is, but is it really causing the problem? Latency is high but constant and not higher than ~200ms. Regarding the ALTER, did you try to increase the timeout with "cqlsh --request-timeout=REQUEST_TIMEOUT"? Because the default is 10 seconds. I use a 25-second timeout (--request-timeout 25). Apart from the unreachable nodes, do you know if all nodes have the same schema version? "nodetool gossipinfo" shows the same schema version on all nodes. Best, Romain
Re: unreachable nodes mystery in describecluster output
Hi, The latency is high... Regarding the ALTER, did you try to increase the timeout with "cqlsh --request-timeout=REQUEST_TIMEOUT"? Because the default is 10 seconds. Apart from the unreachable nodes, do you know if all nodes have the same schema version? Best, Romain
Re: duplicate values for primary key
Just to know, did you get some errors during the nodetool upgradesstables? Romain On Tuesday, August 2, 2016 at 8:40 AM, Julien Anguenot wrote: Hey Oskar, I would comment and add all possible information to that Jira issue… J. -- Julien Anguenot (@anguenot) On Aug 2, 2016, at 8:36 AM, Oskar Kjellin wrote: Hi, Ran into the same issue when going to 3.5. Completely killed our cluster. The only way out was to restore a backup. /Oskar On 2 aug. 2016, at 07:54, Julien Anguenot wrote: Hey Jesse, You might wanna check and comment against that issue: https://issues.apache.org/jira/browse/CASSANDRA-11887 J. -- Julien Anguenot (@anguenot) On Aug 2, 2016, at 3:16 AM, Jesse Hodges wrote: Hi, I've got a bit of a conundrum. Recently I upgraded from 2.2.3 to 3.7.0 (ddc distribution). Following the upgrade (though this condition may have existed prior to the upgrade)... I have a table with a simple partition key and multiple clustering keys, and I have duplicates of many of the primary keys! I've tried various repair options and lots of searching, but nothing's really helping. I'm unsure how to troubleshoot further or potentially consolidate these keys. This seems like a bug, but hopefully it's something simple I missed. I'm also willing to troubleshoot further as needed, but I could use a few getting-started pointers. Example output: the primary key is (partition_id, alarm_id, tenant_id, account_id, source, metric) cqlsh:alarms> select * from alarms.last_seen_state where partition_id=10 and alarm_id='59893'; partition_id | alarm_id | tenant_id | account_id | source | metric | last_seen | value --- 10 | 59893 | f50f8413-57bb-4eb5-a37c-7482a63ea9a5 | 10303 | PORTAL | CPU | 2016-08-01 15:27:37.00+ | 1 10 | 59893 | f50f8413-57bb-4eb5-a37c-7482a63ea9a5 | 10303 | PORTAL | CPU | 2016-08-01 15:07:15.00+ | 1 Thanks, Jesse
Re: (C)* stable version after 3.5
DSE 4.8 uses C* 2.1 and DSE 5.0 uses C* 3.0. So I would say that 2.1->3.0 is more tested by DataStax than 2.2->3.0. On Thursday, July 14, 2016 at 11:37 AM, Stefano Ortolani wrote: FWIW, I've recently upgraded from 2.1 to 3.0 without issues of any sort, but admittedly I haven't been using anything too fancy. Cheers, Stefano On Wed, Jul 13, 2016 at 10:28 PM, Alain RODRIGUEZ wrote: Hi Anuj, From https://docs.datastax.com/en/latest-upgrade/upgrade/cassandra/upgrdBestPractCassandra.html: - Employ a continual upgrade strategy for each year. Upgrades are impacted by the version you are upgrading from and the version you are upgrading to. The greater the gap between the current version and the target version, the more complex the upgrade. I could not find it, but historically I am quite sure it was explicitly recommended not to skip a major version (for a rolling upgrade). Anyway, it is clear that the bigger the gap is, the more careful we need to be. On the other hand, I see 2.2 as 2.1 + some features but no real breaking changes (as 3.0 was already in the pipe), and doing a 2.2 was decided because 3.0 was taking a long time to be released and some features had been ready for a while. I might be wrong on some stuff above, but one can only speak from his knowledge and his point of view. So I ended up saying: Also I am not sure if the 2.2 major version is something you can skip while upgrading through a rolling restart. I believe you can, but it is not what is recommended. Note that "I am not sure", "I believe you can"... So it was more a thought, something to explore for Varun :-). And I actually encouraged him to move forward. Now that Tyler Hobbs confirmed it works, you can put a lot more trust in the fact that this upgrade will work :-). I would still encourage people to test it (for client compatibility, corner cases due to models, ...). I hope I am clearer now. C*heers, --- Alain Rodriguez - alain@thelastpickle.com France The Last Pickle - Apache Cassandra Consulting http://www.thelastpickle.com 2016-07-13 18:39 GMT+02:00 Tyler Hobbs: On Wed, Jul 13, 2016 at 11:32 AM, Anuj Wadehra wrote: Why do you think that skipping 2.2 is not recommended when NEWS.txt suggests otherwise? Can you elaborate? We test upgrading from 2.1 -> 3.x and upgrading from 2.2 -> 3.x equivalently. There should not be a difference in terms of how well the upgrade is supported. -- Tyler Hobbs DataStax
Re: Open source equivalents of OpsCenter
Hi Juho, Out of curiosity, which stack did you use to make your dashboard? Romain On Thursday, July 14, 2016 at 10:43 AM, Juho Mäkinen wrote: I'm doing some work on replacing OpsCenter in our setup. I ended up creating a Docker container which contains the following features: - Cassandra 2.2.7 - MX4J (a JMX to REST bridge) as a java-agent - metrics-graphite-3.1.0.jar (exports some but not all JMX metrics to Graphite) - a custom Ruby script which uses MX4J to export to Graphite some JMX metrics which we don't otherwise get. With this I get all our Cassandra instances and their JMX-exposed data into Graphite, which allows us to use Grafana and Graphite to draw pretty dashboards. In addition I started writing some code which currently provides the following features: - A dashboard which provides a ring view similar to what OpsCenter does, with onMouseOver features to display more info on each node. - A simple HTTP GET/POST based API to: - Set up a new non-vnode based cluster - Get a JSON blob of cluster information, all its tokens, machines and so on - An API for new cluster instances so that they can get a token slot from the ring when they boot. - An option to kill a dead node and mark its slot for replacement, so the new booting node can use the cassandra.replace_address option. The code is not yet packaged in any way for distribution and some parts depend on our Chef installation, but if there's interest I can publish at least some parts of it. - Garo On Thu, Jul 14, 2016 at 10:54 AM, Romain Hardouin wrote: Do you run C* on physical machines or in the cloud? If the topology doesn't change too often you can have a look at Zabbix. The downside is that you have to set up all the JMX metrics yourself... but that's also a good point because you can have custom metrics. If you want nice graphs/dashboards you can use Grafana to plot Zabbix data. (We're also using SaaS but that's not open source.) For the rolling restarts and other admin stuff we're using Rundeck. It's a great tool when working in a team. (I think it's time to implement an open source alternative to OpsCenter. If some guys are interested, I'm in.) Best, Romain On Thursday, July 14, 2016 at 12:01 AM, Ranjib Dey wrote: We use Datadog (metrics emitted as raw statsd) for the dashboard. All repair & compaction is done via blender & serf [1]. [1] https://github.com/pagerduty/blender On Wed, Jul 13, 2016 at 2:42 PM, Kevin O'Connor wrote: Now that OpsCenter doesn't work with open source installs, are there any runs at an open source equivalent? I'd be more interested in looking at metrics of a running cluster and doing other tasks like managing repairs/rolling restarts, more so than historical data.
Re: Open source equivalents of OpsCenter
Do you run C* on physical machines or in the cloud? If the topology doesn't change too often you can have a look at Zabbix. The downside is that you have to set up all the JMX metrics yourself... but that's also a good point because you can have custom metrics. If you want nice graphs/dashboards you can use Grafana to plot Zabbix data. (We're also using SaaS but that's not open source.) For the rolling restarts and other admin stuff we're using Rundeck. It's a great tool when working in a team. (I think it's time to implement an open source alternative to OpsCenter. If some guys are interested, I'm in.) Best, Romain On Thursday, July 14, 2016 at 12:01 AM, Ranjib Dey wrote: We use Datadog (metrics emitted as raw statsd) for the dashboard. All repair & compaction is done via blender & serf [1]. [1] https://github.com/pagerduty/blender On Wed, Jul 13, 2016 at 2:42 PM, Kevin O'Connor wrote: Now that OpsCenter doesn't work with open source installs, are there any runs at an open source equivalent? I'd be more interested in looking at metrics of a running cluster and doing other tasks like managing repairs/rolling restarts, more so than historical data.
Re: CPU high load
Did you upgrade from a previous version? Did you make some schema changes like compaction strategy, compression, bloom filter, etc.? What about the R/W requests? SharedPool Workers are... shared ;-) Put the logs in debug to see some examples of what services are using this pool (many, actually). Best, Romain On Wednesday, July 13, 2016 at 6:15 PM, Patrick McFadin wrote: Might be clearer looking at nodetool tpstats. From there you can see all the thread pools and if there are any blocks. Could be something subtle like network. On Tue, Jul 12, 2016 at 3:23 PM, Aoi Kadoya wrote: Hi, I am running a 6-node vnode cluster with DSE 4.8.1, and since a few weeks ago, all of the cluster nodes are hitting avg. 15-20 CPU load. These nodes are running on VMs (VMware vSphere) that have 8 vCPU (1 core/socket) and 16 GB vRAM. (JVM options: -Xms8G -Xmx8G -Xmn800M) At first I thought this was because of CPU iowait, however, iowait is constantly low (in fact it's 0 almost all the time), and CPU steal time is also 0%. When I took a thread dump, I found some of the "SharedPool-Worker" threads are consuming CPU and those threads seem to be waiting for something, so I assume this is the cause of the CPU load. "SharedPool-Worker-1" #240 daemon prio=5 os_prio=0 tid=0x7fabf459e000 nid=0x39b3 waiting on condition [0x7faad7f02000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304) at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:85) at java.lang.Thread.run(Thread.java:745) The thread dump looks like this, but I am not sure what this SharedPool worker is waiting for. Would you please help me with further troubleshooting? I am also reading the thread posted by Yuan, as the situation is very similar to mine, but I didn't get any blocked, dropped or pending counts in my tpstats result. Thanks, Aoi
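For the debug logging suggested above, a runtime toggle should do on 2.1-based versions (a sketch; scope the logger to something narrower than org.apache.cassandra if it gets too chatty, and set it back to INFO afterwards):

    nodetool setlogginglevel org.apache.cassandra.concurrent DEBUG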
Re: Is my cluster normal?
Same behavior here with a very different setup. After an upgrade to 2.1.14 (from 2.0.17) I see a high load and many NTR "all time blocked". Offheap memtables lowered the blocked NTR for me; I put a comment on CASSANDRA-11363. Best, Romain On Wednesday, July 13, 2016 at 8:18 PM, Yuan Fang wrote: Sometimes, the Pending can change from 128 to 129, 125, etc. On Wed, Jul 13, 2016 at 10:32 AM, Yuan Fang wrote: $ nodetool tpstats ... Pool Name Active Pending Completed Blocked All time blocked Native-Transport-Requests 128 128 1420623949 1 142821509 ... What is this? Is it normal? On Tue, Jul 12, 2016 at 3:03 PM, Yuan Fang wrote: Hi Jonathan, Here is the result: ubuntu@ip-172-31-44-250:~$ iostat -dmx 2 10 Linux 3.13.0-74-generic (ip-172-31-44-250) 07/12/2016 _x86_64_ (4 CPU) Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.01 2.13 0.74 1.55 0.01 0.02 27.77 0.00 0.74 0.89 0.66 0.43 0.10 xvdf 0.01 0.58 237.41 52.50 12.90 6.21 135.02 2.32 8.01 3.65 27.72 0.57 16.63 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 7.50 0.00 2.50 0.00 0.04 32.00 0.00 1.60 0.00 1.60 1.60 0.40 xvdf 0.00 0.00 353.50 0.00 24.12 0.00 139.75 0.49 1.37 1.37 0.00 0.58 20.60 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 0.00 0.00 1.00 0.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00 xvdf 0.00 2.00 463.50 35.00 30.69 2.86 137.84 0.88 1.77 1.29 8.17 0.60 30.00 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 0.00 0.00 1.00 0.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00 xvdf 0.00 0.00 99.50 36.00 8.54 4.40 195.62 1.55 3.88 1.45 10.61 1.06 14.40 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 5.00 0.00 1.50 0.00 0.03 34.67 0.00 1.33 0.00 1.33 1.33 0.20 xvdf 0.00 1.50 703.00 195.00 48.83 23.76 165.57 6.49 8.36 1.66 32.51 0.55 49.80 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 0.00 0.00 1.00 0.00 0.04 72.00 0.00 0.00 0.00 0.00 0.00 0.00 xvdf 0.00 2.50 149.50 69.50 10.12 6.68 157.14 0.74 3.42 1.18 8.23 0.51 11.20 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 5.00 0.00 2.50 0.00 0.03 24.00 0.00 0.00 0.00 0.00 0.00 0.00 xvdf 0.00 0.00 61.50 22.50 5.36 2.75 197.64 0.33 3.93 1.50 10.58 0.88 7.40 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 0.00 0.00 0.50 0.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00 xvdf 0.00 0.00 375.00 0.00 24.84 0.00 135.64 0.45 1.20 1.20 0.00 0.57 21.20 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 1.00 0.00 6.00 0.00 0.03 9.33 0.00 0.00 0.00 0.00 0.00 0.00 xvdf 0.00 0.00 542.50 23.50 35.08 2.83 137.16 0.80 1.41 1.15 7.23 0.49 28.00 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 3.50 0.50 1.50 0.00 0.02 24.00 0.00 0.00 0.00 0.00 0.00 0.00 xvdf 0.00 1.50 272.00 153.50 16.18 18.67 167.73 14.32 33.66 1.39 90.84 0.81 34.60 On Tue, Jul 12, 2016 at 12:34 PM, Jonathan Haddad wrote: When you have high system load it means your CPU is waiting for *something*, and in my experience it's usually slow disk. A disk connected over network has been a culprit for me many times.
On Tue, Jul 12, 2016 at 12:33 PM Jonathan Haddad wrote: Can you do: iostat -dmx 2 10 On Tue, Jul 12, 2016 at 11:20 AM Yuan Fang wrote: Hi Jeff, The read being low is because we do not
Re: NoHostAvailableException coming up on our server
Put the driver logs in debug mode to see what happens. Btw, I am surprised by the few requests per connection in your setup: .setConnectionsPerHost(HostDistance.LOCAL, 20, 20) .setMaxRequestsPerConnection(HostDistance.LOCAL, 128) It looks like protocol v2 settings (Cassandra 2.0) because that protocol was limited to 128 requests per connection. You're using C* 3.3, so protocol v4. You can go up to 32K since protocol v3. As a first step I would try to open only 2 connections with 16K in MaxRequestsPerConnection. Then try to fine tune. Best, Romain On Tuesday, July 12, 2016 at 11:57 PM, Abhinav Solan wrote: I am using driver version 3.0.0 over apache-cassandra-3.3 On Tue, Jul 12, 2016 at 2:37 PM Riccardo Ferrari wrote: What driver version are you using? You can look at the LoggingRetryPolicy to have more meaningful messages in your logs. best, On Tue, Jul 12, 2016 at 9:02 PM, Abhinav Solan wrote: Thanks, Johnny. Actually, they were running... it went through a series of reads and writes... and recovered after the error. Are there any settings I can specify in preparing the Session at the Java client driver level? Here are my current settings: PoolingOptions poolingOptions = new PoolingOptions() .setConnectionsPerHost(HostDistance.LOCAL, 20, 20) .setMaxRequestsPerConnection(HostDistance.LOCAL, 128) .setNewConnectionThreshold(HostDistance.LOCAL, 100); Cluster.Builder builder = Cluster.builder() .addContactPoints(cp) .withPoolingOptions(poolingOptions) .withProtocolVersion(ProtocolVersion.NEWEST_SUPPORTED) .withPort(port); On Tue, Jul 12, 2016 at 11:47 AM Johnny Miller wrote: Abhinav - you're getting that as the driver isn't finding any hosts up for your query. You probably need to check if all the nodes in your cluster are running. See: http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/exceptions/NoHostAvailableException.html Johnny On 12 Jul 2016, at 18:46, Abhinav Solan wrote: Hi Everyone, I am getting this error on our server; it comes and goes, seems the connection drops and comes back after a while - Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: :9042 (com.datastax.driver.core.exceptions.ConnectionException: [] Pool is CLOSING)) at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:218) at com.datastax.driver.core.RequestHandler.access$1000(RequestHandler.java:43) at com.datastax.driver.core.RequestHandler$SpeculativeExecution.sendRequest(RequestHandler.java:284) at com.datastax.driver.core.RequestHandler.startNewExecution(RequestHandler.java:115) at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:91) at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:129) Can anyone suggest what can be done to handle this error? Thanks, Abhinav
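A concrete version of the suggestion above, for the 3.0 Java driver (a sketch; the contact point and port are placeholders, tune the numbers from there):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.HostDistance;
    import com.datastax.driver.core.PoolingOptions;

    public class PoolSetup {
        public static void main(String[] args) {
            // Protocol v3+ multiplexes many requests on one connection, so start
            // with few connections and a high per-connection request ceiling.
            PoolingOptions poolingOptions = new PoolingOptions()
                    .setConnectionsPerHost(HostDistance.LOCAL, 2, 2)
                    .setMaxRequestsPerConnection(HostDistance.LOCAL, 16384);

            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1") // placeholder contact point
                    .withPoolingOptions(poolingOptions)
                    .withPort(9042)
                    .build();
        }
    }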
Re: Changing a cluster name
Indeed, when you want to flush the system keyspace you need to specify it; flush without an argument filters out the system keyspace. This behavior is still the same in trunk. If you dig into the sources, look at "nodeProbe.getNonSystemKeyspaces()" when "cmdArgs" is empty: - https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/tools/NodeTool.java#L329 - https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/tools/NodeTool.java#L337 Best, Romain On Wednesday, June 29, 2016 at 7:03 PM, Paul Fife wrote: Thanks Dominik - I was doing a nodetool flush like the instructions said, but it wasn't actually flushing the system keyspace. Using nodetool flush system made it work as expected! Thanks, Paul Fife On Wed, Jun 29, 2016 at 7:37 AM, Dominik Keil wrote: Also you might want to explicitly do "nodetool flush system". I've recently done this in C* 2.2.6 and just "nodetool flush" would not have flushed the system keyspace, leading to the change in cluster name not being persisted across restarts. Cheers On June 29, 2016 at 03:36, Surbhi Gupta wrote: system.local uses the local strategy. You need to update it on all nodes. On 28 June 2016 at 14:51, Tyler Hobbs wrote: First, make sure that you call nodetool flush after modifying the system table. That's probably why it's not surviving the restart. Second, I believe you will have to do this across all nodes and restart them at the same time. Otherwise, cluster name mismatches will prevent the nodes from communicating with each other. On Fri, Jun 24, 2016 at 3:51 PM, Paul Fife wrote: I am following the instructions here to attempt to change the name of a cluster: https://wiki.apache.org/cassandra/FAQ#clustername_mismatch or at least the more up to date advice: http://stackoverflow.com/questions/22006887/cassandra-saved-cluster-name-test-cluster-configured-name I am able to query system.local to verify the cluster name is modified, but when I restart Cassandra it fails, and the value is back to the original cluster name. Is this still possible, or are there changes preventing this from working anymore? I have attempted this several times and it did actually work the first time, but when I moved on to the other nodes it no longer worked. Thanks, Paul Fife -- Tyler Hobbs DataStax -- Dominik Keil Movilizer GmbH Konrad-Zuse-Ring 30 68163 Mannheim Germany movilizer.com
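Putting the thread's steps together (a sketch; run the same sequence on every node and restart them together so the cluster names match):

    cqlsh> UPDATE system.local SET cluster_name = 'New Name' WHERE key = 'local';
    $ nodetool flush system
    # then set cluster_name to the new name in cassandra.yaml and restart the node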
Re: Lightweight Transactions during datacenter outage
> Would you know why the driver doesn't automatically change to LOCAL_SERIAL during a DC outage? I would say because *you* decide, not the driver ;-) This kind of fallback could be achieved with a custom downgrading policy (DowngradingConsistencyRetryPolicy [*] doesn't handle ConsistencyLevel.SERIAL / LOCAL_SERIAL). * https://github.com/datastax/python-driver/blob/2.7.2-cassandra-2.1/cassandra/policies.py#L747 Best, Romain On Wednesday, June 8, 2016 at 3:41 PM, Jeronimo de A. Barros wrote: Tyler, Thank you, it's working now: self.query['online'] = SimpleStatement("UPDATE table USING ttl %s SET l = True WHERE k2 = %s IF l = False;", consistency_level=ConsistencyLevel.LOCAL_QUORUM, serial_consistency_level=ConsistencyLevel.LOCAL_SERIAL) Would you know why the driver doesn't automatically change to LOCAL_SERIAL during a DC outage? Or does the driver already have an option to make this change from SERIAL to LOCAL_SERIAL? Again, thank you very much, the bill for the beers is on me in September during the Cassandra Summit. ;-) Best regards, Jero On Tue, Jun 7, 2016 at 6:39 PM, Tyler Hobbs wrote: You can set the serial_consistency_level to LOCAL_SERIAL to tolerate a DC failure: http://datastax.github.io/python-driver/api/cassandra/query.html#cassandra.query.Statement.serial_consistency_level. It defaults to SERIAL, which ignores DCs. On Tue, Jun 7, 2016 at 12:26 PM, Jeronimo de A. Barros wrote: Hi, I have a cluster spread across 2 datacenters (DC1 and DC2), two servers in each DC, and I have a keyspace with NetworkTopologyStrategy (DC1:2 and DC2:2) with the following table: CREATE TABLE test ( k1 int, k2 timeuuid, PRIMARY KEY ((k1), k2)) WITH CLUSTERING ORDER BY (k2 DESC) During a datacenter outage, as soon as a datacenter goes offline, I get this error during a lightweight transaction: cqlsh:devtest> insert into test (k1,k2) values(1,now()) if not exists; Request did not complete within rpc_timeout. And a short time after the online DC verifies the second DC is offline: cqlsh:devtest> insert into test (k1,k2) values(1,now()) if not exists; Unable to complete request: one or more nodes were unavailable. So, my question is: Is there any way to keep lightweight transactions working during a datacenter outage using the C* Python driver 2.7.2? I was thinking about catching the exception and doing a simple insert (without "IF") when the error occurs, but having lightweight transactions working even during a DC outage/split would be nice. Thanks in advance for any help/hints. Best regards, Jero -- Tyler Hobbs DataStax
Re: Nodetool repair inconsistencies
Hi Jason, It's difficult for the community to help you if you don't share the error ;-) What did the logs say when you ran the major compaction (i.e. the first error you encountered)? Best, Romain On Wednesday, June 8, 2016 at 3:34 AM, Jason Kania wrote: I am running a 3 node cluster of 3.0.6 instances and encountered an error when running nodetool compact. I then ran nodetool repair. No errors were returned. I then attempted to run nodetool compact again, but received the same error, so the repair made no correction and reported no errors. After that, I moved the problematic files out of the directory, restarted Cassandra and attempted the repair again. The repair again completed without errors; however, no files were added to the directory that had contained the corrupt files. So nodetool repair does not seem to be making actual repairs. I started looking around and numerous directories have vastly different amounts of content across the 3 nodes. There are 3 replicas so I would expect to find similar amounts of content in the same data directory on the different nodes. Is there any way to dig deeper into this? I don't want to be caught out because replication/repair is silently failing. I noticed that there is always a "some repair failed" message among the repair output, but that is completely unhelpful and has always been present. Thanks, Jason
Re: How to remove 'compact storage' attribute?
Hi, You can't yet, see https://issues.apache.org/jira/browse/CASSANDRA-10857 Note that secondary indexes don't scale; be aware of their limitations. If you want to change the data model of a CF, a Spark job can do the trick. Best, Romain On Tuesday, June 7, 2016 at 10:51 AM, "Lu, Boying" wrote: Hi, All, Since the Astyanax client has reached EOL, we are considering migrating to the DataStax Java client in our product. One thing I notice is that the CFs created by Astyanax have the 'compact storage' attribute, which prevents us from using some new features provided by CQL such as secondary indexes. Does anyone know how to remove this attribute? "ALTER TABLE" doesn't seem to work, according to the CQL documentation. Thanks Boying
Re: Inconsistent Reads after Restoring Snapshot
Yes, the "Node restart method" with -Dcassandra.join_ring=false. Note that they advise running a repair anyway. But thanks to join_ring=false the node will hibernate and not serve stale data. Tell me if I'm wrong: you assume that server A is still OK, therefore the system keyspace still exists? If not (disk KO), it's not the same procedure (hence the tokens in cassandra.yaml that I mentioned). Actually, I'm not sure what you mean by "node A crashes". You should try on a test cluster or with CCM (https://github.com/pcmanus/ccm) in order to familiarize yourself with the procedure. Romain On Tuesday, April 26, 2016 at 11:02 AM, Anuj Wadehra wrote: Thanks Romain!! So just to clarify, you are suggesting the following steps: 10 AM Daily snapshot taken of node A and moved to backup location. 11 AM A record is inserted such that nodes A and B insert the record but there is a mutation drop on node C. 1 PM Node A crashes. 1 PM Follow these steps to restore the 10 AM snapshot on node A: 1. Restore the data as mentioned in https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html with ONE EXCEPTION >> start node A with -Dcassandra.join_ring=false. 2. Run repair. 3. Restart node A with -Dcassandra.join_ring=true. Please confirm. I was not aware that join_ring can also be used with a normal reboot. I thought it was only an option during autobootstrap :) Thanks Anuj ---- On Tue, 26/4/16, Romain Hardouin wrote: Subject: Re: Inconsistent Reads after Restoring Snapshot To: "user@cassandra.apache.org" Date: Tuesday, 26 April, 2016, 12:47 PM You can make a restore on the new node A (don't forget to set the token(s) in cassandra.yaml), start the node with -Dcassandra.join_ring=false and then run a repair on it. Have a look at https://issues.apache.org/jira/browse/CASSANDRA-6961 Best, Romain On Tuesday, April 26, 2016 at 4:26 AM, Anuj Wadehra wrote: Hi, We have 2.0.14. We use RF=3 and read/write at Quorum. Moreover, we don't use incremental backups. As per the documentation at , if I need to restore a snapshot on a SINGLE node in a cluster, I would run repair at the end. But while the repair is going on, reads may be inconsistent. Consider the following scenario: 10 AM Daily snapshot taken of node A and moved to backup location. 11 AM A record is inserted such that nodes A and B insert the record but there is a mutation drop on node C. 1 PM Node A crashes and data is restored from the latest 10 AM snapshot. Now, only node B has the record. Now, my question is: until the repair is completed on node A, a read at Quorum may return an inconsistent result based on the nodes from which data is read. If data is read from node A and node C, nothing is returned, and if data is read from node A and node B, the record is returned. This is a vital point which is not highlighted anywhere. Please confirm my understanding. If my understanding is right, how do I make sure that my reads are not inconsistent while a node is being repaired after restoring a snapshot? I think autobootstrapping the node without joining the ring till the repair is completed is an alternative option. But snapshots save a lot of streaming as compared to bootstrap. Will incremental backups guarantee that Thanks Anuj Sent from Yahoo Mail on Android
Re: Inconsistent Reads after Restoring Snapshot
You can make a restore on the new node A (don't forget to set the token(s) in cassandra.yaml), start the node with -Dcassandra.join_ring=false and then run a repair on it. Have a look at https://issues.apache.org/jira/browse/CASSANDRA-6961 Best, Romain On Tuesday 26 April 2016 at 4:26, Anuj Wadehra wrote: Hi, We have 2.0.14. We use RF=3 and read/write at Quorum. Moreover, we don't use incremental backups. As per the documentation at https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html, if I need to restore a snapshot on a SINGLE node in a cluster, I would run repair at the end. But while the repair is going on, reads may get inconsistent. Consider the following scenario:
10 AM: Daily snapshot taken of node A and moved to backup location.
11 AM: A record is inserted such that nodes A and B insert the record but there is a mutation drop on node C.
1 PM: Node A crashes and data is restored from the latest 10 AM snapshot. Now, only node B has the record.
Now, my question is: till the repair is completed on node A, a read at Quorum may return an inconsistent result based on the nodes from which data is read. If data is read from nodes A and C, nothing is returned; if data is read from nodes A and B, the record is returned. This is a vital point which is not highlighted anywhere. Please confirm my understanding. If my understanding is right, how do I make sure that my reads are not inconsistent while a node is being repaired after restoring a snapshot? I think autobootstrapping the node without joining the ring till the repair is completed is an alternative option, but snapshots save a lot of streaming compared to bootstrap. Will incremental backups guarantee that Thanks Anuj Sent from Yahoo Mail on Android
Re: Ops Centre Read Requests / TBL: Local Read Requests
Yes, you are right, Anishek. If you write with LOCAL_ONE, the values will be the same.
Re: Keyspaces not found in cqlsh
Would you mind pasting the output for both nodes in gist/paste/whatever? https://gist.github.com http://paste.debian.net On Thursday 11 February 2016 at 11:57, kedar wrote: Thanks for the reply. ls -l cassandra/data/* lists various *.db files This problem is on both nodes. Thanks, Kedar Parikh Ext : 2224 Dir : +91 22 61782224 Mob : +91 9819634734 Email : kedar.par...@netcore.co.in Web : www.netcore.co.in
Re: Keyspaces not found in cqlsh
What is the output on both nodes of the following command? ls -l /var/lib/cassandra/data/system/* If one node seems odd you can try "nodetool resetlocalschema", but the other node must be in a clean state. Best, Romain On Thursday 11 February 2016 at 11:10, kedar wrote: I am using cqlsh 5.0.1 | Cassandra 2.1.2. Recently we have been unable to see / desc keyspaces and query tables through cqlsh on either of the two nodes: cqlsh> desc keyspaces cqlsh> use user_index; cqlsh:user_index> desc table list_1_10; Keyspace 'user_index' not found. cqlsh:user_index> cqlsh> select * from system.schema_keyspaces; Keyspace 'system' not found. cqlsh> We are running a 2 node cluster. The Python - Django app that inserts data is running without any failure and system logs show nothing abnormal. ./nodetool repair on one node hasn't helped ./nodetool cfstats shows all the tables too -- Thanks, Kedar Parikh Ext : 2224 Dir : +91 22 61782224 Mob : +91 9819634734 Email : kedar.par...@netcore.co.in Web : www.netcore.co.in
Re: reducing disk space consumption
As Mohammed said, "nodetool clearsnapshot" will do the trick. Cassandra takes a snapshot by default before a keyspace/table is dropped or truncated. You can disable this feature if it's a dev node (see auto_snapshot in cassandra.yaml), but if it's a production node it's a good thing to keep auto snapshots. Best, Romain
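For example (the keyspace name is hypothetical, and listsnapshots needs a recent enough Cassandra):

nodetool listsnapshots               # see which snapshots are taking space
nodetool clearsnapshot my_keyspace   # remove all snapshots of one keyspace
nodetool clearsnapshot               # remove all snapshots on the node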
Re: missing rows while importing data using sstable loader
> What is the best practice to create sstables? When you run a "nodetool flush", Cassandra persists all the memtables on disk, i.e. it produces sstables. (You can also create sstables yourself thanks to CQLSSTableWriter, but I don't think that was the point of your question.)
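For completeness, a minimal CQLSSTableWriter sketch in Java, with hypothetical keyspace, table and column names; the output directory must already exist and is conventionally named after the keyspace and table:

import java.io.File;
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

public class SSTableWriterExample {
    public static void main(String[] args) throws Exception {
        // Schema and insert statement for a hypothetical table.
        String schema = "CREATE TABLE my_ks.events (id text PRIMARY KEY, payload text)";
        String insert = "INSERT INTO my_ks.events (id, payload) VALUES (?, ?)";

        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory(new File("/tmp/my_ks/events"))  // sstables are written here
                .forTable(schema)
                .using(insert)
                .build();

        writer.addRow("row1", "hello");  // values bound to the two placeholders
        writer.addRow("row2", "world");
        writer.close();  // flushes the sstable files to disk

        // The resulting sstables can then be streamed with sstableloader.
    }
}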
Re: missing rows while importing data using sstable loader
Did you run "nodetool flush" on the source node? If not, the missing rows could be in memtables.
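For example, on the source node (the keyspace/table names are the ones from this thread; paths depend on your install):

nodetool flush mordor things_values_meta
# then stream the sstables to the new cluster:
sstableloader -d <target_node_ip> /var/lib/cassandra/data/mordor/things_values_meta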
Re: missing rows while importing data using sstable loader
Hi, I assume RF > 1, right? What consistency level did you use? cqlsh uses ONE by default. Try:
cqlsh> CONSISTENCY ALL
and run your query again. Best, Romain On Friday 29 January 2016 at 13:45, Arindam Choudhury wrote: Hi Kai, The table schema is: CREATE TABLE mordor.things_values_meta ( thing_id text, key text, bucket_timestamp timestamp, total_rows counter, PRIMARY KEY ((thing_id, key), bucket_timestamp) ) WITH CLUSTERING ORDER BY (bucket_timestamp ASC) AND bloom_filter_fp_chance = 0.01 AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}' AND comment = '' AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND dclocal_read_repair_chance = 0.1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND read_repair_chance = 0.0 AND speculative_retry = '99.0PERCENTILE'; I am just running "select count(*) from things_values_meta;" to get the count. Regards, Arindam On 29 January 2016 at 13:39, Kai Wang wrote: Arindam, what's the table schema and what does your query to retrieve the rows look like? On Fri, Jan 29, 2016 at 7:33 AM, Arindam Choudhury wrote: Hi, I am importing data to a new Cassandra cluster using sstableloader. The sstableloader runs without any warning or error. But I am missing around 1000 rows. Any feedback will be highly appreciated. Kind Regards, Arindam Choudhury
Re: About cassandra's rebalance when adding one or more nodes into an existing cluster?
Hi Dillon, CMIIW, I suspect that you use vnodes and you want to "move one of the 256 tokens to another node". If yes, that's not possible: "nodetool move" is not allowed with vnodes: https://github.com/apache/cassandra/blob/cassandra-2.1.11/src/java/org/apache/cassandra/service/StorageService.java#L3488 *But* if you try "nodetool move" with a token that is already owned by a node, the check is done *before* the vnodes check: https://github.com/apache/cassandra/blob/cassandra-2.1.11/src/java/org/apache/cassandra/service/StorageService.java#L3479 If you use a single token, it seems you are trying to replace one node with another... Maybe you could explain the problem that leads you to do a nodetool move? (along with the nodetool ring output as Alain suggested) Best, Romain
Re: Strategy tools for taking snapshots to load in another cluster instance
My previous answer (sstableloader) allows you to move from a larger to a smaller cluster. Sent from Yahoo Mail on Android On Tue, Nov 24, 2015 at 11:30, Anishek Agarwal wrote: Peer, that talks about having a similar sized cluster, I was wondering if there is a way for moving from a larger to a smaller cluster. I will try a few things as soon as I get time and update here. On Thu, Nov 19, 2015 at 5:48 PM, Peer, Oded wrote: Have you read the DataStax documentation? http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_snapshot_restore_new_cluster.html From: Romain Hardouin [mailto:romainh...@yahoo.fr] Sent: Wednesday, November 18, 2015 3:59 PM To: user@cassandra.apache.org Subject: Re: Strategy tools for taking snapshots to load in another cluster instance You can take a snapshot via nodetool then load sstables on your test cluster with sstableloader: docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsBulkloader_t.html Sent from Yahoo Mail on Android From: "Anishek Agarwal" Date: Wed, Nov 18, 2015 at 11:24 Subject: Strategy tools for taking snapshots to load in another cluster instance Hello, We have a 5 node prod cluster and a 3 node test cluster. Is there a way I can take a snapshot of a table in prod and load it in the test cluster? The Cassandra versions are the same. Even if there is a tool that can help with this it will be great. If not, how do people handle scenarios where data in prod is required in staging/test clusters for testing to make sure things are correct? Does the cluster size have to be the same to allow copying of relevant snapshot data etc? thanks anishek
Re: Strategy tools for taking snapshots to load in another cluster instance
You can take a snapshot via nodetool then load sstables on your test cluster with sstableloader: docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsBulkloader_t.html Sent from Yahoo Mail on Android From: "Anishek Agarwal" Date: Wed, Nov 18, 2015 at 11:24 Subject: Strategy tools for taking snapshots to load in another cluster instance Hello, We have a 5 node prod cluster and a 3 node test cluster. Is there a way I can take a snapshot of a table in prod and load it in the test cluster? The Cassandra versions are the same. Even if there is a tool that can help with this it will be great. If not, how do people handle scenarios where data in prod is required in staging/test clusters for testing to make sure things are correct? Does the cluster size have to be the same to allow copying of relevant snapshot data etc? thanks anishek
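In practice the flow looks like this (a sketch; the snapshot tag, keyspace and paths are examples):

# on each prod node:
nodetool snapshot -t for_test my_keyspace
# copy .../data/my_keyspace/my_table/snapshots/for_test/ to a host that can reach the test cluster, then:
sstableloader -d <test_node_ip> /path/to/my_keyspace/my_table

sstableloader streams each row to whatever nodes own it on the target, so the source and target clusters don't need the same size or token layout.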
Re: keyspace with hundreds of columnfamilies
Cassandra can handle many more columns (e.g. time series), so 100 columns is OK. Best, Romain tommaso barbugli wrote on 03/07/2014 21:55:18: > From: tommaso barbugli > To: user@cassandra.apache.org > Date: 03/07/2014 21:55 > Subject: Re: keyspace with hundreds of columnfamilies > > thank you for the replies; I am rethinking the schema design, one > possible solution is to "implode" one dimension and get N times fewer CFs. > With this approach I would come up with (cql) tables with up to 100 > columns; would that be a problem? > > Thank You, > Tommaso >
Re: keyspace with hundreds of columnfamilies
Arena allocation is an improvement, not a limitation. It was introduced in Cassandra 1.0 in order to lower memory fragmentation (and therefore promotion failures). AFAIK it's not intended to be tweaked, so it might not be a good idea to change it. Best, Romain tommaso barbugli wrote on 02/07/2014 17:40:18: > From: tommaso barbugli > To: user@cassandra.apache.org > Date: 02/07/2014 17:40 > Subject: Re: keyspace with hundreds of columnfamilies > > 1MB per column family sounds pretty bad to me; is this something I > can tweak/work around somehow? > > Thanks > Tommaso > > 2014-07-02 17:21 GMT+02:00 Romain HARDOUIN : > The trap is that each CF will consume 1 MB of memory due to arena allocation. > This might seem harmless but if you plan thousands of CFs it means > thousands of megabytes... > Up to 1,000 CFs I think it could be doable, but not 10,000. > > Best, > > Romain > > > tommaso barbugli wrote on 02/07/2014 10:13:41: > > > From: tommaso barbugli > > To: user@cassandra.apache.org > > Date: 02/07/2014 10:14 > > Subject: keyspace with hundreds of columnfamilies > > > > Hi, > > Are there any known issues or shortcomings about organising data in > > hundreds of column families? > > At present I am running with 300 column families but I expect > > that to get to a couple of thousands. > > Is this something discouraged / unsupported (I am using Cassandra 2.0)? > > > > Thanks > > Tommaso
RE: keyspace with hundreds of columnfamilies
The trap is that each CF will consume 1 MB of memory due to arena allocation. This might seem harmless but if you plan thousands of CFs it means thousands of megabytes... Up to 1,000 CFs I think it could be doable, but not 10,000. Best, Romain tommaso barbugli wrote on 02/07/2014 10:13:41: > From: tommaso barbugli > To: user@cassandra.apache.org > Date: 02/07/2014 10:14 > Subject: keyspace with hundreds of columnfamilies > > Hi, > Are there any known issues or shortcomings about organising data in > hundreds of column families? > At present I am running with 300 column families but I expect > that to get to a couple of thousands. > Is this something discouraged / unsupported (I am using Cassandra 2.0)? > > Thanks > Tommaso
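Back-of-the-envelope math, assuming the 1 MB-per-CF arena figure holds for your version: 300 CFs ≈ 300 MB of heap, 2,000 CFs ≈ 2 GB, and 10,000 CFs ≈ 10 GB, before a single row of data is stored. That overhead alone can dominate a typical 8 GB heap.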
RE: Backup Cassandra to
So you have to install a backup client on each Cassandra node. If the NetBackup client behaves like EMC Networker, beware of the resource utilization (data deduplication, compression). You may have to boost the CPUs and RAM (+2 GB) of each node. Try with one node: make a snapshot with nodetool and configure NetBackup so the backup client sends data to your tape library or virtual tape library. And of course, try to restore ;-) The tricky part is not Cassandra itself, it's to follow NetBackup (or whatever) best practices. "Camacho, Maria (NSN - FI/Espoo)" wrote on 12/06/2014 13:12:18: > From: "Camacho, Maria (NSN - FI/Espoo)" > To: "user@cassandra.apache.org" > Date: 12/06/2014 13:12 > Subject: RE: Backup Cassandra to > > Hi, > Thanks for the quick response Romain. > > We would like to avoid using extra disk space, so no DAS/SAN. > We are more interested in achieving something like what is now being > done with Oracle – Symantec's NetBackup is used to back up directly > to tape, no intermediate storage is needed. > > It could be NetBackup or whatever product supported by Cassandra > that writes the backup on tape without storing it on disk first. > > Regards, > Maria
RE: Backup Cassandra to
Hi Maria, It depends which backup software and hardware you plan to use. Do you store your data on DAS or SAN? A hint regarding Cassandra: either drain the node before the backup, or take a Cassandra snapshot and then back up this snapshot. We back up our data on tape but we also store our data on SAN, so it's pretty vendor specific. Best, Romain "Camacho, Maria (NSN - FI/Espoo)" wrote on 12/06/2014 10:57:06: > From: "Camacho, Maria (NSN - FI/Espoo)" > To: "user@cassandra.apache.org" > Date: 12/06/2014 10:57 > Subject: Backup Cassandra to > > Hi there, > > I'm trying to find information/instructions about backing up and > restoring a Cassandra DB to and from a tape unit. > > I was hoping someone in this forum could help me with this since I > could not find anything useful in Google :( > > Thanks in advance, > Maria >
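The snapshot-based flavor looks like this (the tag and paths are examples):

nodetool snapshot -t nightly my_keyspace
# point the backup client at the snapshot directories, e.g.:
#   /var/lib/cassandra/data/my_keyspace/<table>/snapshots/nightly/
# once the data is on tape, free the disk space:
nodetool clearsnapshot -t nightly

Snapshots are hard links, so they cost almost no extra disk space until compaction rewrites the live sstables.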
RE: Memory issue
Well... you have already changed the limits ;-) Keep in mind that changes in the limits.conf file will not affect processes that are already running. opensaf dev wrote on 21/05/2014 06:59:05: > From: opensaf dev > To: user@cassandra.apache.org > Date: 21/05/2014 07:00 > Subject: Memory issue > > Hi guys, > > I am trying to run Cassandra on CentOS as a user X other than root > or cassandra. When I run as user cassandra, it starts and runs fine. > But when I run under user X, I am getting the below error once > Cassandra starts, and the system freezes totally. > > Insufficient memlock settings: > WARN [main] 2011-06-15 09:58:56,861 CLibrary.java (line 118) Unable > to lock JVM memory (ENOMEM). > This can result in part of the JVM being swapped out, especially > with mmapped I/O enabled. > Increase RLIMIT_MEMLOCK or run Cassandra as root. > > > I have tried the tips available online to change the memlock and > other limits both for users cassandra and X, but it did not solve the problem. > > What else should I consider when I run Cassandra as a user other than > cassandra/root? > > > Any help is much appreciated. > > > Thanks > Dev >
RE: Memory issue
Hi, You have to define limits for the user. Here is an example for the user cassandra: # cat /etc/security/limits.d/cassandra.conf cassandra - memlock unlimited cassandra - nofile 10 Best, Romain opensaf dev wrote on 21/05/2014 06:59:05: > From: opensaf dev > To: user@cassandra.apache.org > Date: 21/05/2014 07:00 > Subject: Memory issue > > Hi guys, > > I am trying to run Cassandra on CentOS as a user X other than root > or cassandra. When I run as user cassandra, it starts and runs fine. > But when I run under user X, I am getting the below error once > Cassandra starts, and the system freezes totally. > > Insufficient memlock settings: > WARN [main] 2011-06-15 09:58:56,861 CLibrary.java (line 118) Unable > to lock JVM memory (ENOMEM). > This can result in part of the JVM being swapped out, especially > with mmapped I/O enabled. > Increase RLIMIT_MEMLOCK or run Cassandra as root. > > > I have tried the tips available online to change the memlock and > other limits both for users cassandra and X, but it did not solve the problem. > > What else should I consider when I run Cassandra as a user other than > cassandra/root? > > > Any help is much appreciated. > > > Thanks > Dev >
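The same applies to whatever account actually runs Cassandra. Say user X is "appuser" (a hypothetical name); a sketch, with a nofile value in the commonly recommended range:

# /etc/security/limits.d/appuser.conf
appuser - memlock unlimited
appuser - nofile 100000

Then open a fresh session for that user (limits.conf is read at login) and start Cassandra again.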
RE: Datacenter understanding question
RF=1 means no replication. You have to set RF=2 in order to set up mirroring. -Romain ng wrote on 13/05/2014 19:37:08: > From: ng > To: "user@cassandra.apache.org" > Date: 14/05/2014 04:37 > Subject: Datacenter understanding question > > If I have a configuration of two data centers with one node each, > and the replication factor is also 1, > will these 2 nodes be mirrored/replicated?
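With one node per datacenter, the CQL equivalent would be something like this (keyspace and DC names are hypothetical; the DC names must match what nodetool status reports):

ALTER KEYSPACE my_ks WITH replication =
  {'class': 'NetworkTopologyStrategy', 'DC1': 1, 'DC2': 1};
-- then run a repair on each node so the new replicas get the existing data

One replica in each DC gives two copies in total, i.e. a mirror.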
RE: Cassandra Disk storage capacity
Hi, See data_file_directories and commitlog_directory in the settings file cassandra.yaml. Cheers, Romain Hari Rajendhran wrote on 07/04/2014 12:56:37: > From: Hari Rajendhran > To: user@cassandra.apache.org > Date: 07/04/2014 12:58 > Subject: Cassandra Disk storage capacity > > Hi Team, > > We have a 3 node Apache Cassandra 2.0.4 setup installed in our lab. > We have set the data directory to /var/lib/cassandra/data. What > would be the maximum > disk storage that will be used for Cassandra data storage? > > Note: the /var partition has a storage capacity of 40GB. > > My question is whether Cassandra will use the entire / directory for > data storage? > If no, how to specify multiple directories for data storage? > > > > > > Best Regards > Hari Krishnan Rajendhran > Hadoop Admin > DESS-ABIM, Chennai BIGDATA Galaxy > Tata Consultancy Services > Cell: 9677985515 > Mailto: hari.rajendh...@tcs.com > Website: http://www.tcs.com > > Experience certainty. IT Services > Business Solutions > Consulting
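For example, in cassandra.yaml (the paths are hypothetical):

data_file_directories:
    - /data1/cassandra/data
    - /data2/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog

Cassandra only writes under the configured directories, so it won't fill the whole / filesystem; with the default settings your 40 GB /var partition is the ceiling.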
RE: Question about rpms from datastax
cassandra*.noarch.rpm -> installs Cassandra only
dsc*.noarch.rpm -> DSC stands for DataStax Community; installs Cassandra + OpsCenter
Donald Smith wrote on 27/03/2014 20:36:57: > From: Donald Smith > To: 'user@cassandra.apache.org' > Date: 27/03/2014 20:37 > Subject: Question about rpms from datastax > > On http://rpm.riptano.com/community/noarch/ what's the difference between > > cassandra20-2.0.6-1.noarch.rpm and dsc20-2.0.6-1.noarch.rpm? > > Thanks, Don > > Donald A. Smith | Senior Software Engineer > P: 425.201.3900 x 3866 > C: (206) 819-5965 > F: (646) 443-2333 > dona...@audiencescience.com
RE: [ANN] pithos is a cassandra-backed S3 compatible object store
It looks like MagnetoDB for OpenStack. Nice Clojure project. Pierre-Yves Ritschard wrote on 27/03/2014 08:12:15: > From: Pierre-Yves Ritschard > To: user > Date: 27/03/2014 08:12 > Subject: [ANN] pithos is a cassandra-backed S3 compatible object store > > Hi, > > If you're already using Cassandra for storing your data, you might > be interested in http://pithos.io which provides S3 compatibility. > The underlying schema splits files into several blocks, themselves > split into chunks. > > I'm looking forward to all your comments on the schema, code and of > course pull requests :-) > > - pyr (https://twitter.com/pyr)
Re: Kernel keeps killing cassandra process - OOM
4 GB is OK for a test cluster. In the past we encountered a similar issue due to VMware ESX's memory overcommit (memory ballooning). When you talk about overcommit, do you mean Linux (vm.overcommit_*) or the hypervisor (like ESX)? prem yadav wrote on 24/03/2014 12:11:31: > From: prem yadav > To: user@cassandra.apache.org > Date: 24/03/2014 12:12 > Subject: Re: Kernel keeps killing cassandra process - OOM > > The nodes die without being under any load. Completely idle. > And 4 GB system memory is not low. Or is it? > I have tried tweaking the overcommit memory settings: disabling, > under-committing and over-committing. > I also reduced the rpc threads min and max. Will try other settings from > the link Michael has given.
Re: Kernel keeps killing cassandra process - OOM
You have to tune Cassandra in order to run it in a low-memory environment. Many settings must be tuned. The link that Michael mentions provides a quick start. There is a point that I haven't understood: *when* do your nodes die? Under load? Or can they be killed by the OOM killer even when they are not loaded? If the nodes are VMs you have to pay attention to hypervisor memory overcommit. "Laing, Michael" wrote on 22/03/2014 22:25:30: > From: "Laing, Michael" > To: user@cassandra.apache.org > Date: 22/03/2014 22:26 > Subject: Re: Kernel keeps killing cassandra process - OOM > > You might want to look at: > > http://www.opensourceconnections.com/2013/08/31/building-the-perfect-cassandra-test-environment/
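As a starting point on a 4 GB box, capping the heap is usually the first step (the values below are examples to experiment with, not recommendations):

# in cassandra-env.sh
MAX_HEAP_SIZE="1G"    # leave room for off-heap structures and the OS page cache
HEAP_NEWSIZE="256M"   # young generation, commonly ~1/4 of MAX_HEAP_SIZE

Reducing memtable sizes and thread pool sizes in cassandra.yaml, as the article Michael linked describes, is the next step.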