LOCAL_QUORUM vs EACH_QUORUM
The following comment in the code describes them very clearly:

* LOCAL_QUORUM: Returns the record with the most recent timestamp once a majority of replicas within the local datacenter have replied.
* EACH_QUORUM: Returns the record with the most recent timestamp once a majority of replicas within each datacenter have replied.

But it seems that my intended use case is not covered by either policy. I have 2 colos, and mostly I want to run my application in primary-backup (or hot/warm) mode, though everything is automated and no human switch-over is needed if one colo fails. I want all writes/reads to get a quorum from the local colo, and then at least make sure that 1 write has propagated to the other colo. So I do not necessarily need a quorum from the remote colo, but I do need at least one write to arrive there.

Does that sound like a common use case? Within the current code, is there a way to achieve that? If not, creating a new policy does not seem too difficult either.

Thanks
Yang
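For concreteness, here is a minimal sketch of the acknowledgement predicate such a "local quorum plus one remote ack" policy would have to wait on. This is not anything in the current code, and all the names are made up:

    # Illustrative only: the condition a hypothetical
    # "LOCAL_QUORUM plus one remote ack" write policy would wait on.
    def enough_acks(acks_by_dc, local_dc, replicas_by_dc):
        """acks_by_dc / replicas_by_dc map datacenter name -> counts."""
        local_quorum = replicas_by_dc[local_dc] // 2 + 1
        if acks_by_dc.get(local_dc, 0) < local_quorum:
            return False  # still waiting on a majority in the local colo
        # at least one write must have landed in some remote colo
        return any(n >= 1 for dc, n in acks_by_dc.items() if dc != local_dc)

    # e.g. RF 3 per colo: quorum of 2 locally, plus 1 ack from the other colo
    print(enough_acks({"DC1": 2, "DC2": 1}, "DC1", {"DC1": 3, "DC2": 3}))  # True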
Re: Data migration between clusters
Hi Rob,

Thank you for your reply. Our scenario is like this: we have 3 clusters, each with 1 or 2 keyspaces in it, and each cluster has 3 nodes. Now we're considering consolidating these 3 clusters of 9 nodes into a single cluster of 9 nodes. This new cluster will contain all the keyspaces and data the former 3 clusters have. The replication factor, which is 3 now, will not be changed during this migration. We tried using sstableloader, which didn't work well; maybe we did it in a wrong way. It looks like the way of migrating data you suggested would solve our problem, and we'll try it out by referring to the link you gave in your mail.

Thanks a lot again for your precious information,
Ray

(12/11/01 2:43), Rob Coli wrote:
On Tue, Oct 30, 2012 at 4:18 AM, 張 睿 chou...@cyberagent.co.jp wrote:
Does anyone here know if there is an efficient way to migrate multiple cassandra clusters' data to a single cassandra cluster without any data loss?

Yes.
1) create a schema which is a superset of all columnfamilies and all keyspaces
2) if all source clusters were the same fixed number of nodes, create a new cluster with the same fixed number of nodes
3) nodetool drain and shut down all nodes on all participating clusters
4) copy sstables from old clusters, maintaining that data from source node [x] ends up on target node [x]
5) start cassandra

However, without more details as to your old clusters, new clusters, and availability requirements, I can't give you a more useful answer. Here's some background on bulk loading, including copy-the-sstables:
http://palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra
=Rob
--
Ray Zhang Cyberagent.co
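As a rough sketch of steps 3 and 4 above (illustrative only: the hostnames, pairings and paths are made up, and you should rehearse this on a throwaway cluster first):

    import subprocess

    # map each source node onto the target node that should receive its data
    PAIRS = [("old-node1", "new-node1"), ("old-node2", "new-node2")]
    DATA = "/var/lib/cassandra/data/"  # keyspace directories live under here

    # step 3: flush memtables and stop accepting writes on every source node
    for old, _ in PAIRS:
        subprocess.check_call(["ssh", old, "nodetool", "drain"])

    # step 4: copy sstables, keeping source node [x]'s data on target node [x]
    for old, new in PAIRS:
        subprocess.check_call(
            ["ssh", old, "rsync", "-av", DATA, new + ":" + DATA])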
Re: ideal drive layout - 4 drives + RAID question
Thanks. Yep, I think OS + CL (2-drive RAID 1) will provide the best balance of reduced headaches / performance. I'll also be pondering 1 drive OS, 1 drive CL as well.

On Wed, Oct 31, 2012 at 9:27 PM, aaron morton aa...@thelastpickle.com wrote:
Good question. There is a comment on the DS blog or docs somewhere that says on EC2 running the commit log on the RAID 0 ephemeral is preferred. I think the recommendation was specifically about how the disks are set up on EC2. While the commit log will be competing with logs and everything else on the OS volume, it would be competing with C* reads, memtable flushing, compacting and repairing on the data volume. The only way to be sure is to test both setups.
Cheers
- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 31/10/2012, at 1:11 PM, Ran User ranuse...@gmail.com wrote:
Is there a concern of a large falloff in commit log write performance (sequential) when sharing 2 drives (RAID 1) with the OS (the OS and services writing their own logs, etc.)? Or do you expect the hit to be marginal?

On Tue, Oct 30, 2012 at 7:58 PM, aaron morton aa...@thelastpickle.com wrote:
We also have 4-disk nodes, and we use the following layout:
2 x OS + Commit in RAID 1
2 x Data disk in RAID 0
+1. You are replicating data at the application level and want the fastest possible IO performance per node. You can already distribute the individual Cassandra column families on different drives by just setting up symlinks to the individual folders. There are some features coming in 1.2 that make using a JBOD setup easier.
Cheers
- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 30/10/2012, at 9:23 PM, Pieter Callewaert pieter.callewa...@be-mobile.be wrote:
We also have 4-disk nodes, and we use the following layout:
2 x OS + Commit in RAID 1
2 x Data disk in RAID 0
This gives us the advantage that we never have to reinstall the node when a drive crashes.
Kind regards,
Pieter

From: Ran User [mailto:ranuse...@gmail.com]
Sent: Tuesday, 30 October 2012 4:33
To: user@cassandra.apache.org
Subject: Re: ideal drive layout - 4 drives + RAID question
Have you considered running RAID 10 for the data drives to improve MTBF? On one hand Cassandra is handling redundancy issues; on the other hand, reducing the frequency of dealing with failed nodes is attractive if cheap (switching RAID levels to 10). We have no experience with software RAID (we have always used hardware RAID with BBU). I'm assuming software RAID 1 or 10 (the mirroring part) is inherently reliable (perhaps minus some edge cases).

On Tue, Oct 30, 2012 at 1:07 AM, Tupshin Harper tups...@tupshin.com wrote:
I would generally recommend 1 drive for OS and commit log and a 3-drive RAID 0 for data. The RAID does give you a good performance benefit, and it can be convenient to have the OS on a side drive for configuration ease and better MTBF.
-Tupshin

On Oct 29, 2012 8:56 PM, Ran User ranuse...@gmail.com wrote:
I was hoping to achieve approx. 2x IO (write and read) performance via RAID 0 (by accepting a higher MTBF). Do you believe the performance gains of RAID 0 are much lower and/or are not worth it vs. the increased server failure rate? From my understanding, RAID 10 would achieve the read performance benefits of RAID 0, but not the write benefits. I'm also considering RAID 10 to maximize server IO performance. Currently, we're working with 1 CF.
Thank you!

On Mon, Oct 29, 2012 at 11:51 PM, Timmy Turner timm.t...@gmail.com wrote:
I'm not sure whether the RAID 0 gets you anything other than headaches should one of the drives fail. You can already distribute the individual Cassandra column families on different drives by just setting up symlinks to the individual folders.

2012/10/30 Ran User ranuse...@gmail.com:
For a server with 4 drive slots only, I'm thinking either:
- OS (1 drive)
- Commit Log (1 drive)
- Data (2 drives, software RAID 0)
vs.
- OS + Data (3 drives, software RAID 0)
- Commit Log (1 drive)
or something else? Also, if I can spare the wasted storage, would RAID 10 for Cassandra data improve read performance and have no effect on write performance?
Thank you!
Re: Cassandra upgrade issues...
The first thing I would check is whether nodetool is using the right jar. It sounds a lot like the server has been correctly updated but nodetool hasn't, and still uses the old classes. Check the nodetool executable (it's a shell script), try echoing the CLASSPATH in there, and check that it correctly points to what it should.
--
Sylvain

On Thu, Nov 1, 2012 at 9:10 AM, Brian Fleming bigbrianflem...@gmail.com wrote:
Hi,

I was testing upgrading from Cassandra v1.0.7 to v1.1.5 yesterday on a single-node dev cluster with ~6.5GB of data. It went smoothly in that no errors were thrown, the data was migrated to the new directory structure, I can still read/write data as expected, etc. However, nodetool commands are behaving strangely – full details below. I couldn't find anything relevant online relating to these exceptions – any help/pointers would be greatly appreciated.

Thanks
Regards,
Brian

'nodetool cleanup' runs successfully.

'nodetool info' produces:
Token: 82358484304664259547357526550084691083
Gossip active: true
Load: 7.69 GB
Generation No: 1351697611
Uptime (seconds): 58387
Heap Memory (MB): 936.91 / 1928.00
Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.cassandra.dht.Token
at org.apache.cassandra.tools.NodeProbe.getEndpoint(NodeProbe.java:546)
at org.apache.cassandra.tools.NodeProbe.getDataCenter(NodeProbe.java:559)
at org.apache.cassandra.tools.NodeCmd.printInfo(NodeCmd.java:313)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:651)

'nodetool repair' produces:
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at $Proxy0.forceTableRepair(Unknown Source)
at org.apache.cassandra.tools.NodeProbe.forceTableRepair(NodeProbe.java:203)
at org.apache.cassandra.tools.NodeCmd.optionalKSandCFs(NodeCmd.java:880)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:719)
Caused by: javax.management.ReflectionException: Signature mismatch for operation forceTableRepair: (java.lang.String, [Ljava.lang.String;) should be (java.lang.String, boolean, [Ljava.lang.String;)
at com.sun.jmx.mbeanserver.PerInterface.noSuchMethod(PerInterface.java:152)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:117)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)
at sun.rmi.transport.Transport$1.run(Transport.java:159)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:255)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:233)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:142)
at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source)
at javax.management.remote.rmi.RMIConnectionImpl_Stub.invoke(Unknown Source)
at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.invoke(RMIConnector.java:993)
at
Re: Cassandra upgrade issues...
Hi Sylvain,

Simple as that!!! Using the 1.1.5 nodetool version works as expected. My mistake.

Many thanks,
Brian

On Thu, Nov 1, 2012 at 8:24 AM, Sylvain Lebresne sylv...@datastax.com wrote:
The first thing I would check is whether nodetool is using the right jar. [snip]
Re: Benefits by adding nodes to the cluster
I've not run it myself, but upgrading is part of the design.
Cheers
- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 1/11/2012, at 10:43 AM, Wei Zhu wz1...@yahoo.com wrote:
I heard about virtual nodes, but they don't come out until 1.2. Is it easy to convert an existing installation to use virtual nodes? Thanks.
-Wei

From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org
Sent: Wednesday, October 31, 2012 2:23 PM
Subject: Re: Benefits by adding nodes to the cluster

> I have been told that it's much easier to scale the cluster by doubling the number of nodes, since no token changes are needed on the existing nodes.
Yup.

> But if the number of nodes is substantial, it's not realistic to double it every time.
See the keynote from Jonathan Ellis or the talk on Virtual Nodes from Sam here: http://www.datastax.com/events/cassandrasummit2012/presentations. Virtual nodes make this sort of thing faster and easier.

> How easy is it to add, let's say, 3 additional nodes to the existing 10 nodes?
In that scenario you would need to move every node. But if you have 10 nodes you probably don't want to scale up by 3; I would guess 5 or 10. Scaling is not something you want to do every day. How easy the process is depends on the level of automation in your environment. For example, Ops Centre can automate rebalancing nodes.
Cheers
- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 31/10/2012, at 7:14 AM, weiz wz1...@yahoo.com wrote:
One follow-up question. I have been told that it's much easier to scale the cluster by doubling the number of nodes, since no token changes are needed on the existing nodes. But if the number of nodes is substantial, it's not realistic to double it every time. How easy is it to add, let's say, 3 additional nodes to the existing 10 nodes? I understand the process of moving around data and deleting unused data. I just want to understand, from the operational point of view, how difficult that is. We are in the process of evaluating a nosql solution, and one important consideration is the operation cost. Any real-world experience is very much appreciated. Thanks.
-Wei
--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Benifits-by-adding-nodes-to-the-cluster-tp7583437p7583466.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
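To make the "move every node" point concrete: with RandomPartitioner, balanced tokens are evenly spaced on a 2**127 ring, and almost none of them survive a change in cluster size. A quick check (illustrative only):

    # Why going from 10 to 13 nodes moves every token: balanced tokens are
    # i * 2**127 // n, and those sets barely overlap for different n.
    RING = 2 ** 127

    def balanced_tokens(n):
        return [i * RING // n for i in range(n)]

    before, after = balanced_tokens(10), balanced_tokens(13)
    unchanged = set(before) & set(after)
    print(sorted(unchanged))  # [0] -- only the first node keeps its token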
Re: Multiple counters value after restart
> What CL are you using? I think this may be what causes the issue. I'm writing and reading at CL ONE. I didn't drain before stopping Cassandra, and this may have produced failures in the current counters (those which were being written when I stopped a server).
My first thought is to use QUORUM. But with only two nodes it's hard to get strong consistency using QUORUM. Can you try it though, or run a repair?

> But isn't Cassandra supposed to handle a server crash? When a server crashes, I guess it doesn't drain first...
I was asking to understand how you did the upgrade.

Cheers
- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 1/11/2012, at 11:39 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

> What version of cassandra are you using?
1.1.2

> Can you explain this further?
I had an unexplained amount of reads (up to 1800 r/s and 90 MB/s) on one server; the other was doing about 200 r/s and 5 MB/s max. I fixed it by rebooting the server. This server is dedicated to Cassandra. I can't tell you more about it because I don't get it... But a simple Cassandra restart wasn't enough.

> Was something writing to the cluster?
Yes, we have some activity and perform about 600 w/s.

> Did you drain for the upgrade?
We upgraded a long time ago, and to 1.1.2. This warning is about version 1.1.6.

> What changes did you make?
In the cassandra.yaml I just changed the compaction_throughput_mb_per_sec property to slow down my compaction a bit. I don't think the problem comes from there.

> Are you saying that a particular counter column is giving different values for different reads?
Yes, this is exactly what I was saying. Sorry if something is wrong with my English, it's not my mother tongue.

> What CL are you using?
I think this may be what causes the issue. I'm writing and reading at CL ONE. I didn't drain before stopping Cassandra, and this may have produced failures in the current counters (those which were being written when I stopped a server). But isn't Cassandra supposed to handle a server crash? When a server crashes, I guess it doesn't drain first...

Thank you for your time Aaron, once again.
Alain

2012/10/31 aaron morton aa...@thelastpickle.com
What version of cassandra are you using?

> I finally restarted Cassandra. It didn't solve the problem, so I stopped Cassandra again on that node and restarted my EC2 server. This solved the issue (1800 r/s to 100 r/s).
Can you explain this further? Was something writing to the cluster? Did you drain for the upgrade? https://github.com/apache/cassandra/blob/cassandra-1.1/NEWS.txt#L17

> Today I changed my cassandra.yaml and restarted this same server to apply my conf. I just noticed that my homepage (which uses a Cassandra counter and refreshes every sec) shows me 4 different values, 2 of them repeatedly (5000 and 4000) and the 2 others some rare times (5500 and 3800).
What changes did you make? Are you saying that a particular counter column is giving different values for different reads? What CL are you using?

Cheers
- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 31/10/2012, at 3:39 AM, Jason Wee peich...@gmail.com wrote:
Maybe enable debug in log4j-server.properties and go through the log to see what actually happens?

On Tue, Oct 30, 2012 at 7:31 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:
Hi, I have an issue with counters: yesterday I had an inexplicable number of reads/sec on one server. I finally restarted Cassandra. It didn't solve the problem, so I stopped Cassandra again on that node and restarted my EC2 server. This solved the issue (1800 r/s to 100 r/s). Today I changed my cassandra.yaml and restarted this same server to apply my conf. I just noticed that my homepage (which uses a Cassandra counter and refreshes every sec) shows me 4 different values, 2 of them repeatedly (5000 and 4000) and the 2 others some rare times (5500 and 3800). Only the counters made today and yesterday are affected. I performed a repair without success. These data are the heart of our business, so if someone has any clue about it I would be really grateful... The sooner the better; I am in production with these random counters.

Alain

INFO: My environment is 2 nodes (EC2 large), RF 2, CL.ONE (R W), Random Partitioner.
xxx.xxx.xxx.241 eu-west 1b Up Normal 151.95 GB 50.00% 0
xxx.xxx.xxx.109 eu-west 1b Up Normal 117.71 GB 50.00% 85070591730234615865843651857942052864
Here is my conf: http://pastebin.com/5cMuBKDt
Re: repair, compaction, and tombstone rows
> Is this a feature or a bug?
Neither, really. Repair doesn't do any gcable-tombstone collection, and it would be really hard to change that (besides, it's not its job). So if, when you run repair, there are sstables with tombstones that could be collected but haven't been yet, then yes, they will be streamed.

Now, the theory is that compaction will run often enough that gcable tombstones will be collected in a reasonably timely fashion, so you will never have lots of such tombstones in general (making the fact that repair streams them largely irrelevant). That being said, in practice I don't doubt that there are a few scenarios like your own where this can still lead to doing too much useless work. I believe the main problem is that size-tiered compaction has a tendency not to compact the largest sstables very often, meaning that you could have large sstables with mostly gcable tombstones sitting around. In the upcoming Cassandra 1.2, https://issues.apache.org/jira/browse/CASSANDRA-3442 will fix that.

Until then, if you are not afraid of a little bit of scripting, one option could be, before running a repair, to run a small script that checks the creation time of your sstables. If an sstable is old enough (for some value of "old enough" that depends on the TTL you use on all your columns), you may want to force a compaction of that sstable (using the JMX call forceUserDefinedCompaction()). The goal is to get rid of a maximum of outdated tombstones before running the repair. (You could alternatively run a major compaction prior to the repair, but major compactions have a lot of nasty effects, so I wouldn't recommend that a priori.)
--
Sylvain
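A minimal sketch of such a pre-repair script (the path and the age threshold are made up; the actual JMX invocation is left to whatever JMX tool you use, and the exact forceUserDefinedCompaction() signature should be checked against your Cassandra version):

    # Find *-Data.db files older than TTL + gc_grace and list them as
    # candidates for forceUserDefinedCompaction() on the CompactionManager
    # MBean. Illustrative only.
    import os, time, glob

    DATA_DIR = "/var/lib/cassandra/data/MyKeyspace"   # hypothetical keyspace
    MAX_AGE = (10 + 1) * 24 * 3600                    # e.g. TTL + gc_grace, in seconds

    now = time.time()
    for path in glob.glob(os.path.join(DATA_DIR, "*-Data.db")):
        if now - os.path.getmtime(path) > MAX_AGE:
            print("compaction candidate: " + path)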
Re: Multiple counters value after restart
> Can you try it though, or run a repair?
Repairing didn't help.

> My first thought is to use QUORUM.
This fixed the problem. However, my data is probably still inconsistent, even if I now always read the same value. The point is that I can't handle a crash with CL.QUORUM; I can't even restart a node... I will add a third server.

> But isn't Cassandra supposed to handle a server crash? When a server crashes, I guess it doesn't drain first...
> I was asking to understand how you did the upgrade.
Ok. On my side I am just concerned about the possibility of using counters with CL.ONE and correctly handling a crash or restart without a drain.

Alain

2012/11/1 aaron morton aa...@thelastpickle.com
[snip]
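The arithmetic behind "I will add a third server": quorum is a majority of replicas, so with RF 2 both replicas must answer, and one node down blocks QUORUM reads and writes.

    # quorum = RF // 2 + 1
    def quorum(rf):
        return rf // 2 + 1

    for rf in (2, 3):
        q = quorum(rf)
        print("RF=%d: quorum=%d, tolerates %d replica(s) down" % (rf, q, rf - q))
    # RF=2: quorum=2, tolerates 0 replica(s) down
    # RF=3: quorum=2, tolerates 1 replica(s) down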
logging servers? any interest in one for Cassandra?
2 questions:

1. What are people using for logging servers for their web tier logging?
2. Would anyone be interested in a new logging server (any programming language) for the web tier to log to your existing Cassandra? (It uses up disk space in proportion to the number of web servers and just has a rolling window of logs, along with a window of threshold dumps.)

Context for the second question: I like fewer systems since that means less maintenance/operations cost, so yesterday I quickly wrote up some logback appenders (supporting the SLF4J/log4j/JDK/commons logging libraries) that send the logs from our client tier into Cassandra. It is simply a rolling window of logs, so the space used in Cassandra is proportional to the number of web servers I have (currently, I have 4 web servers). I am also thinking about adding warning-type logging, such that on a warning, the last N logs of info level and above are flushed along with the warning - so basically two rolling windows.

Then in the GUI, it simply shows the logs, and if you click on a session, it switches to a view with all the logs for that session (no matter which server, since in our cluster the session switches servers on every request - we are stateless; our session id is in the cookie).

Well, let me know if anyone is interested and would actually use such a thing, and if so, we might create a server around it.

Thanks,
Dean
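One cheap way to get the rolling window itself is to write each log column with a TTL and let Cassandra expire it. A sketch using pycassa, a client of this era (the keyspace and column family names here are invented):

    import time
    import pycassa

    pool = pycassa.ConnectionPool("Logs", server_list=["localhost:9160"])
    logs = pycassa.ColumnFamily(pool, "weblogs")

    def append_log(server_id, message, window_days=7):
        # one wide row per server; a timestamp column name keeps entries ordered
        logs.insert(server_id,
                    {"%020d" % int(time.time() * 1e6): message},
                    ttl=window_days * 24 * 3600)

    append_log("web1", "GET /index 200")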
Re: High bandwidth usage between datacenters for cluster
Bryce, did you resolve this? I'm interested in the outcome. When you write, does it help to use CL = LOCAL_QUORUM?

On Mon, Oct 29, 2012 at 12:52 AM, aaron morton aa...@thelastpickle.com wrote:
Outbound messages for other DCs are grouped, and a single instance is sent to a single node in the remote DC. The remote node then forwards the message on to the other recipients in its DC. All remote DC nodes will, however, reply directly to the coordinator.

> Normally this isn't an issue for us, but at times we are writing approximately 1MB a sec of data, and seeing a corresponding 3MB of traffic across the WAN to all the Cassandra DR servers.
Can you break the traffic down by port and direction?

Cheers
- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 28/10/2012, at 12:18 PM, Bryce Godfrey bryce.godf...@azaleos.com wrote:
Network topology with the topology file filled out is already the configuration we are using.

From: sankalp kohli [mailto:kohlisank...@gmail.com]
Sent: Thursday, October 25, 2012 11:55 AM
To: user@cassandra.apache.org
Subject: Re: High bandwidth usage between datacenters for cluster
Use placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and also fill in the topology.properties file. This will tell cassandra that you have two DCs. You can verify that by looking at the output of the ring command. If your DCs are set up properly, only one request will go over the WAN, though the responses from all nodes in the other DC will go over the WAN.

On Thu, Oct 25, 2012 at 10:44 AM, Bryce Godfrey bryce.godf...@azaleos.com wrote:
We have a 5-node cluster, with a matching 5 nodes for DR in another data center. With a replication factor of 3, does the node I send a write to attempt to send it to the 3 servers in the DR also? Or does it send it to 1 and let it replicate locally in the DR environment to save bandwidth across the WAN? Normally this isn't an issue for us, but at times we are writing approximately 1MB a sec of data, and seeing a corresponding 3MB of traffic across the WAN to all the Cassandra DR servers. If my assumptions are right, is this configurable somehow for writing to one node and letting it do local replication? We are on 1.1.5.
Thanks
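A back-of-envelope check of Bryce's numbers under the two behaviours described above (RF 3 in the remote DC; the write rate is from his report):

    # Without DC-aware forwarding the coordinator ships one copy per remote
    # replica; with forwarding it ships one copy and the remote DC fans out
    # locally (replies still cross the WAN, but they are small).
    write_rate_mb = 1.0   # MB/s of writes
    rf_remote = 3

    print("naive: ~%.0f MB/s over WAN" % (write_rate_mb * rf_remote))   # ~3 MB/s, as observed
    print("forwarded: ~%.0f MB/s over WAN" % (write_rate_mb * 1))       # ~1 MB/s expected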
Diagnosing a row caching problem
I'm having a problem diagnosing an issue with row caching. It seems like row caching is not working (very few items stored), despite it being enabled, using JNA, and the key cache being super hot. I assume I'm missing something obvious, but I would expect to have more items stored in the row cache, even after updates. Is there something in the logs I should look for? Are writes invalidating the cache?

Here are the stats. Keys: 100 bytes; values: 200-400 bytes; with a read-and-write pattern. Single node, Cassandra 1.1.5, test node.

CFSTATS
Keyspace: omitted
Read Count: 15967248
Read Latency: 0.4956699828924809 ms.
Write Count: 15967240
Write Latency: 0.027375880803445052 ms.
Pending Tasks: 0
Column Family: omitted
SSTable count: 75
Space used (live): 705364536
Space used (total): 705364536
Number of Keys (estimate): 2591104
Memtable Columns Count: 267840
Memtable Data Size: 129949276
Memtable Switch Count: 192
Read Count: 15967248
Read Latency: NaN ms.
Write Count: 15967240
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 281
Bloom Filter False Ratio: 0.0
Bloom Filter Space Used: 3986944
Compacted row minimum size: 311
Compacted row maximum size: 642
Compacted row mean size: 419

INFO
Load: 676.5 MB
Heap Memory (MB): 1476.10 / 2925.00
Key Cache: size 104857584 (bytes), capacity 104857584 (bytes), 6146639 hits, 14680951 requests, 1.000 recent hit rate, 14400 save period in seconds
Row Cache: size 0 (bytes), capacity 209715200 (bytes), 47 hits, 14400100 requests, NaN recent hit rate, 87000 save period in seconds
Hello
Hello,

My name is Davor Vuković. I am a student in a Specialist Professional Graduate Study of Information Science and Technology in Business Systems in Croatia. I was wondering if you could help me a bit regarding database management in Cassandra. I would be very happy if you could explain these terms to me as they apply to the Cassandra DBMS (how each is done in Cassandra):
- renewal procedures of the database
- optimization of the database in the DBMS
- optimization of the DBMS
- using Codd's rules in the DBMS
- views, triggers, stored procedures
- relation constraints within a schema and between database schemas

Thank you so much and have a nice day!
Davor Vuković
Re: cassandra 1.0.10 : Bootstrapping 7 node cluster to 14 nodes
The other nodes all have copies of the same data. To optimize performance, all of them stream different parts of the data, even though 102 has all the data that 108 needs. (I think. I'm not an expert.)
-Brennan

On Thu, Nov 1, 2012 at 9:31 AM, Ramesh Natarajan rames...@gmail.com wrote:
I am trying to bootstrap a cassandra 1.0.10 cluster from 7 nodes to 14 nodes. My seed nodes are 101, 102, 103 and 104. Here is my initial ring:

Address DC Rack Status State Load Owns Token
145835300108973627198589117470757804908
192.168.1.101 datacenter1 rack1 Up Normal 8.16 GB 14.29% 0
192.168.1.102 datacenter1 rack1 Up Normal 8.68 GB 14.29% 24305883351495604533098186245126300818
192.168.1.103 datacenter1 rack1 Up Normal 8.45 GB 14.29% 48611766702991209066196372490252601636
192.168.1.104 datacenter1 rack1 Up Normal 8.16 GB 14.29% 72917650054486813599294558735378902454
192.168.1.105 datacenter1 rack1 Up Normal 8.33 GB 14.29% 97223533405982418132392744980505203272
192.168.1.106 datacenter1 rack1 Up Normal 8.71 GB 14.29% 121529416757478022665490931225631504090
192.168.1.107 datacenter1 rack1 Up Normal 8.41 GB 14.29% 145835300108973627198589117470757804908

I add a new node, 108, with an initial_token between 101 and 102. After I start bootstrapping, I see the node is placed in the correct place in the ring:

Address DC Rack Status State Load Owns Token
145835300108973627198589117470757804908
192.168.1.101 datacenter1 rack1 Up Normal 8.16 GB 14.29% 0
192.168.1.108 datacenter1 rack1 Up Joining 114.61 KB 7.14% 12152941675747802266549093122563150409
192.168.1.102 datacenter1 rack1 Up Normal 8.68 GB 7.14% 24305883351495604533098186245126300818
192.168.1.103 datacenter1 rack1 Up Normal 8.4 GB 14.29% 48611766702991209066196372490252601636
192.168.1.104 datacenter1 rack1 Up Normal 8.15 GB 14.29% 72917650054486813599294558735378902454
192.168.1.105 datacenter1 rack1 Up Normal 8.33 GB 14.29% 97223533405982418132392744980505203272
192.168.1.106 datacenter1 rack1 Up Normal 8.71 GB 14.29% 121529416757478022665490931225631504090
192.168.1.107 datacenter1 rack1 Up Normal 8.41 GB 14.29% 145835300108973627198589117470757804908

What puzzles me is that when I look at netstats, I see nodes 107, 104 and 103 streaming data to 108. Can someone explain why this happens? I was under the impression that only node 102 needs to split its tokens and send to 108. Am I missing something?

Streaming from: /192.168.1.107
Streaming from: /192.168.1.104
Streaming from: /192.168.1.103

Thanks
Ramesh
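The chosen initial_token checks out as the midpoint of the 101-102 range:

    # halfway between node 101's token and node 102's token
    t101 = 0
    t102 = 24305883351495604533098186245126300818
    print((t101 + t102) // 2)
    # 12152941675747802266549093122563150409 -- matches node 108 above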
Re: repair, compaction, and tombstone rows
On Thu, Nov 1, 2012 at 1:43 AM, Sylvain Lebresne sylv...@datastax.com wrote:
> on all your columns), you may want to force a compaction (using the JMX call forceUserDefinedCompaction()) of that sstable. The goal is to get rid of a maximum of outdated tombstones before running the repair (you could alternatively run a major compaction prior to the repair, but major compactions have a lot of nasty effects, so I wouldn't recommend that a priori).

If sstablesplit (reverse compaction) existed, major compaction would be a simple solution to this case. You'd major compact and then split your One Giant SSTable With No Tombstones into a number of smaller ones. :)
https://issues.apache.org/jira/browse/CASSANDRA-4766
=Rob
--
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb
Re: repair, compaction, and tombstone rows
It seems like CASSANDRA-3442 might be an effective fix for this issue, assuming that I'm reading it correctly. It sounds like the intent is to automatically compact SSTables when a certain percentage of the columns are gcable, either by being deleted or through expired tombstones. Is my understanding correct? Would such sstables be compacted individually (1-1), or are several eligible sstables selected and compacted using the STCS compaction-threshold bounds?
-Bryan

On Thu, Nov 1, 2012 at 9:43 AM, Rob Coli rc...@palominodb.com wrote:
[snip]
Re: Cassandra upgrade issues...
Note that 1.0.7 came out before 1.1, and I know there were some compatibility issues that were fixed in later 1.0.x releases which could affect your upgrade. I think it would be best to first upgrade to the latest 1.0.x release, and then upgrade to 1.1.x from there.
-Bryan

On Thu, Nov 1, 2012 at 1:27 AM, Brian Fleming bigbrianflem...@gmail.com wrote:
Hi Sylvain, Simple as that!!! Using the 1.1.5 nodetool version works as expected. My mistake. Many thanks, Brian
[snip]
Re: Is it bad to put columns with composite or integer names in a CF with a BytesType comparator/validator?
Thoughts, please?

On Thu, Nov 1, 2012 at 7:12 PM, Ertio Lew ertio...@gmail.com wrote:
Would it do any harm, or are there any downsides, if I store columns with composite names or Integer-type names in a column family with a BytesType comparator/validator? I have observed that the BytesType comparator sorts integer-named columns in a similar fashion to the IntegerType comparator, so why should I lock my CF into storing only Integer- or composite-named columns? It would be good if I could just mix different datatypes in the same column family, no?
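One thing to check before relying on that observation: raw byte ordering only matches integer ordering when all values are encoded at the same width. IntegerType, as far as I know, stores variable-length big-endian values, and those can sort "wrong" under a plain byte comparison. A small demonstration:

    # Compare integer ordering under minimal-length vs fixed-width
    # big-endian byte encodings (illustrative only).
    def be(n, width=None):
        # big-endian encoding of a non-negative int, minimal or fixed width
        b = b"" if n else b"\x00"
        while n:
            b = bytes([n & 0xFF]) + b
            n >>= 8
        return b.rjust(width, b"\x00") if width else b

    vals = [2, 300]
    print(sorted(vals, key=lambda n: be(n)))      # [300, 2] -- minimal-length bytes
    print(sorted(vals, key=lambda n: be(n, 4)))   # [2, 300] -- fixed 4-byte width

So mixing works, but only if every client encodes integers at a fixed width (as many clients do for 4- or 8-byte ints); with variable-length encodings the BytesType order diverges from numeric order.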
Re: distribution of token ranges with virtual nodes
> it will migrate you to virtual nodes by splitting the existing partition 256 ways.
Out of curiosity, is that for the purpose of avoiding streaming?

> the former would require you to perform a shuffle to achieve that.
Is there a nodetool option, or are there other ways the shuffle could be done automatically?

On Thu, Nov 1, 2012 at 2:17 AM, Eric Evans eev...@acunu.com wrote:
On Wed, Oct 31, 2012 at 11:38 AM, John Sanda john.sa...@gmail.com wrote:
> Can/should I assume that I will get even range distribution, or close to it, with random token selection?
The short answer is: if you're using virtual nodes, random token selection will give you even range distribution.

The somewhat longer answer is that this is really a function of the total number of tokens. The more randomly generated tokens a cluster has, the more the distribution will even out. The reason this can work for virtual nodes where it has not for the older 1-token-per-node model is that (assuming a reasonable num_tokens value) virtual nodes give you a much higher token count for a given number of nodes.

That wiki page you cite wasn't really intended to be documentation (expect some of that soon though), but what that section was trying to convey was that while random distribution is quite good, it may not be 100% perfect, especially when the number of nodes is low (remember, the number of tokens scales with the number of nodes). I think this is (or may be) a problem for some. If you're forced to manually calculate tokens then you are quite naturally going to calculate a perfect distribution, and if you've grown accustomed to this, seeing the ownership values off by a few percent could really bring out your inner OCD. :)

> For the sake of discussion, what is a reasonable default to start with for num_tokens, assuming nodes are homogeneous? That wiki page mentions a default of 256, which I see commented out in cassandra.yaml; however, Config.num_tokens is set to 1.
The (unconfigured) default is 1. That is to say that virtual nodes are not enabled. The current recommendation when setting this (documented in the config) is 256.

> Maybe I missed where the default of 256 is used. From some initial testing though, it looks like 1 token per node is being used. Using defaults in cassandra.yaml, I see this in my logs.
Right. And it's worth noting that if you uncomment num_tokens *after* starting a node with it commented (i.e. num_tokens: 1), then it will migrate you to virtual nodes by splitting the existing partition 256 ways. This is *not* the equivalent of starting a node with num_tokens = 256 for the first time. The latter would leave you with randomized placement; the former would require you to perform a shuffle to achieve that.
--
Eric Evans Acunu | http://www.acunu.com | @acunu
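Eric's point that distribution evens out with more tokens is easy to simulate. A toy ownership calculation, not Cassandra code:

    import random

    RING = 2 ** 127
    random.seed(1)

    def max_ownership(nodes, tokens_per_node):
        # assign random tokens, then measure the largest node's share of the
        # ring relative to its fair share (1.0 == perfectly even)
        toks = sorted((random.randrange(RING), n)
                      for n in range(nodes) for _ in range(tokens_per_node))
        owned = [0] * nodes
        prev = toks[-1][0] - RING   # wrap-around: first token owns the tail arc
        for tok, n in toks:
            owned[n] += tok - prev
            prev = tok
        return max(owned) * nodes / RING

    print(max_ownership(6, 1))     # typically well above 1.0 (lopsided)
    print(max_ownership(6, 256))   # much closer to 1.0 (even)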
Re: distribution of token ranges with virtual nodes
On Thu, Nov 1, 2012 at 10:05 PM, Manu Zhang owenzhang1...@gmail.com wrote:
> it will migrate you to virtual nodes by splitting the existing partition 256 ways.
> Out of curiosity, is that for the purpose of avoiding streaming?
It splits into contiguous ranges, because truly upgrading to vnode functionality is another step.

> the former would require you to perform a shuffle to achieve that.
> Is there a nodetool option, or are there other ways the shuffle could be done automatically?
There's a shuffle command in bin/ that was recently committed; we'll document this process in NEWS.txt shortly.
-Brandon
Re: distribution of token ranges with virtual nodes
> It splits into contiguous ranges, because truly upgrading to vnode functionality is another step.
That confuses me. As I understand it, there is no point in having 256 tokens on the same node if I don't perform the shuffle.

On Fri, Nov 2, 2012 at 11:10 AM, Brandon Williams dri...@gmail.com wrote:
[snip]