Re: avro + cassandra + ruby
Cassandra.new(keyspace, server, {:protocol => Thrift::BinaryProtocolAccelerated}) On Tue, Nov 16, 2010 at 5:13 PM, Ryan King r...@twitter.com wrote: On Tue, Nov 16, 2010 at 10:25 AM, Jonathan Ellis jbel...@gmail.com wrote: On Tue, Sep 28, 2010 at 6:35 PM, Ryan King r...@twitter.com wrote: One thing you should try is to make thrift use BinaryProtocolAccelerated, rather than the pure-ruby implementation (we should change the default). Dumb question time: how do you do this? $ find . -name '*.rb' | xargs grep -i binaryprotocol in the fauna cassandra gem repo turns up no hits. I believe we're relying on the default from thrift_client (which defaults to BinaryProtocol): https://github.com/fauna/thrift_client/ -ryan
Re: How to Retrieve all the rows from a ColumnFamily
http://wiki.apache.org/cassandra/FAQ#iter_world On Sun, Sep 26, 2010 at 11:51 PM, sekhar kosuru kosurusek...@gmail.com wrote: Hi, I am new to the Cassandra database. I want to know how to retrieve all the records from a column family; is this different on clustered servers vs. a single server? Please suggest a piece of code if possible. /Regards Sekhar.
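The paging pattern behind the FAQ's "iterate the world" entry can be sketched as follows. This is a pure-Python simulation over an in-memory dict, not live get_range_slices calls; the function and parameter names are illustrative, and the real call pages by token order rather than key order, but the skip-the-last-key mechanic is the same.

```python
def iter_all_rows(store, page_size=100):
    """Yield every (key, row) by paging through key ranges: the last
    key of one page becomes the start key of the next, and is skipped
    to avoid returning it twice."""
    keys = sorted(store)
    start = None
    while True:
        if start is None:
            page = keys[:page_size]
        else:
            nxt = keys.index(start) + 1  # skip the last-seen key
            page = keys[nxt:nxt + page_size]
        if not page:
            return
        for k in page:
            yield k, store[k]
        start = page[-1]
```

Forgetting to skip the shared boundary key is the classic bug with this pattern: each page after the first would repeat one row.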
Re: 0.7 memory usage problem
On Mon, Sep 27, 2010 at 12:59 PM, Alaa Zubaidi alaa.zuba...@pdf.com wrote: Thanks for the help. We have 2 drives using basic configurations, commitlog on one drive and data on another. And yes, the CL for writes is 3; however, the CL for reads is 1. It is simply not possible that you are inserting at CL.ALL (which is what I assume you mean by CL for writes is 3) given how frequently you are flushing memtables. Flushing every 1.7 seconds with 300,000 ops and your 60 columns per row indicates you are inserting 3000 rows/sec, not 600. The behavior shown in those logs is almost certainly from inserting with CL.ZERO. The code you provided does not include the definition of _writeConsistencyLevel. Where is that set and what is it set to? b
Re: 0.7 memory usage problem
On Mon, Sep 27, 2010 at 2:51 PM, Benjamin Black b...@b3k.us wrote: On Mon, Sep 27, 2010 at 12:59 PM, Alaa Zubaidi alaa.zuba...@pdf.com wrote: Thanks for the help. We have 2 drives using basic configurations, commitlog on one drive and data on another. And yes, the CL for writes is 3; however, the CL for reads is 1. It is simply not possible that you are inserting at CL.ALL (which is what I assume you mean by CL for writes is 3) given how frequently you are flushing memtables. Flushing every 1.7 seconds with 300,000 ops and your 60 columns per row indicates you are inserting 3000 rows/sec, Sorry, that should be _5000_ rows/sec, not 3000. b
Re: UnavailableException when data grows
Your ring is wildly unbalanced and you are almost certainly out of I/O on one or more nodes. You should be monitoring via JMX and common systems tools to know when you are starting to have issues. It is going to take you some effort to get out of this situation now. b On Mon, Sep 27, 2010 at 2:55 PM, Rana Aich aichr...@gmail.com wrote: Hi Peter, Thanks for your detailed query. I have an 8-machine cluster: KVSHIGH1,2,3,4 and KVSLOW1,2,3,4. As the names suggest, the KVSLOWs have low disk space (~350 GB), whereas the KVSHIGHs have 1.5 terabytes. Yet my nodetool shows the following:

192.168.202.202  Down  319.94 GB  7200044730783885730400843868815072654
192.168.202.4    Up    382.39 GB  23719654286404067863958492664769598669
192.168.202.2    Up    106.81 GB  36701505058375526444137310055285336988
192.168.202.3    Up    149.81 GB  65098486053779167479528707238121707074
192.168.202.201  Up    154.72 GB  79420606800360567885560534277526521273
192.168.202.204  Up    72.91 GB   85219217446418416293334453572116009608
192.168.202.1    Up    29.78 GB   87632302962564279114105239858760976120
192.168.202.203  Up    9.35 GB    87790520647700936489181912967436646309

As you can see, one of our KVSLOW boxes is already down; it is 100% full. Whereas a box with 1.5 terabytes holds only 29.78 GB (192.168.202.1)! I'm using RandomPartitioner. When I run the client program the Cassandra daemon takes around 85-130% CPU. Regards, Rana On Mon, Sep 27, 2010 at 2:31 PM, Peter Schuller peter.schul...@infidyne.com wrote: How can I handle this kind of situation? In terms of surviving the problem, a re-try on the client side might help, assuming the problem is temporary. However, the fact that you're seeing an issue to begin with is certainly interesting, and the way to avoid it depends on what the problem is. My understanding is that the UnavailableException indicates that the node you are talking to was unable to read from/write to a sufficient number of nodes to satisfy your consistency level.
Presumably either because individual requests failed to return in time, or because the node considers other nodes to be flat out down. Can you correlate these issues with server-side activity on the nodes, such as background compaction, commitlog rotation or memtable flushing? Do you see your nodes saying that other nodes in the cluster are DOWN and UP (flapping)? How large is the data set in total (in terms of sstable size on disk), and how much memory do you have in your machines (going to page cache)? Have you observed the behavior of your nodes during compaction; in particular whether compaction is CPU bound or I/O bound? (That would tend to depend on data; generally the larger the individual values the more disk bound you'd tend to be.) Just trying to zero in on what the likely root cause is in this case. -- / Peter Schuller
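The imbalance in the ring output above can be quantified directly from the tokens. With RandomPartitioner the token space is 0..2**127, and each node owns the range from its predecessor's token (exclusive) to its own (inclusive), wrapping around. The tokens below are copied from the nodetool output; the computation is a sketch of the ownership math, not anything nodetool itself exposes in 0.6.

```python
RING = 2 ** 127  # RandomPartitioner token space

# Tokens as reported by nodetool ring in the thread above.
tokens = {
    "192.168.202.202": 7200044730783885730400843868815072654,
    "192.168.202.4":   23719654286404067863958492664769598669,
    "192.168.202.2":   36701505058375526444137310055285336988,
    "192.168.202.3":   65098486053779167479528707238121707074,
    "192.168.202.201": 79420606800360567885560534277526521273,
    "192.168.202.204": 85219217446418416293334453572116009608,
    "192.168.202.1":   87632302962564279114105239858760976120,
    "192.168.202.203": 87790520647700936489181912967436646309,
}

def ownership(tokens):
    """Fraction of the ring each node owns: the arc from its
    predecessor's token to its own, wrapping at the smallest token."""
    ordered = sorted(tokens.items(), key=lambda kv: kv[1])
    shares = {}
    for i, (ip, tok) in enumerate(ordered):
        prev = ordered[i - 1][1]  # i == 0 wraps to the largest token
        shares[ip] = ((tok - prev) % RING) / RING
    return shares

shares = ownership(tokens)
```

Running this shows the Down node (192.168.202.202) owns over half the ring, which is why it filled up first despite being a low-disk box: the tokens cluster in a narrow band, so nearly all of the keyspace hashes into one node's arc.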
Re: 0.7 memory usage problem
What is your RF? On Mon, Sep 27, 2010 at 3:13 PM, Alaa Zubaidi alaa.zuba...@pdf.com wrote: Sorry, 3 means QUORUM. On 9/27/2010 2:55 PM, Benjamin Black wrote: On Mon, Sep 27, 2010 at 2:51 PM, Benjamin Black b...@b3k.us wrote: On Mon, Sep 27, 2010 at 12:59 PM, Alaa Zubaidi alaa.zuba...@pdf.com wrote: Thanks for the help. We have 2 drives using basic configurations, commitlog on one drive and data on another. And yes, the CL for writes is 3; however, the CL for reads is 1. It is simply not possible that you are inserting at CL.ALL (which is what I assume you mean by CL for writes is 3) given how frequently you are flushing memtables. Flushing every 1.7 seconds with 300,000 ops and your 60 columns per row indicates you are inserting 3000 rows/sec, Sorry, that should be _5000_ rows/sec, not 3000. b -- Alaa Zubaidi PDF Solutions, Inc. 333 West San Carlos Street, Suite 700 San Jose, CA 95110 USA Tel: 408-283-5639 (or 408-280-7900 x5639) fax: 408-938-6479 email: alaa.zuba...@pdf.com
Re: 0.7 memory usage problem
Does that mean you are doing 600 rows/sec per process, or 600/sec total across all processes? On Mon, Sep 27, 2010 at 3:14 PM, Alaa Zubaidi alaa.zuba...@pdf.com wrote: It's actually split across 8 different processes that are doing the insertion. Thanks On 9/27/2010 2:03 PM, Peter Schuller wrote: [note: i put user@ back on CC but I'm not quoting the source code] Here is the code I am using (this is only for testing Cassandra; it is not going to be used in production). I am new to Java, but I tested this and it seems to work fine when running for a short amount of time: If you mean to ask how to distribute writes - the general recommendation is to use a high-level Cassandra client (such as Hector at http://github.com/rantav/hector or Pelops at http://github.com/s7/scale7-pelops) rather than using the Thrift API directly. This is probably an especially good idea if you're new to Java, as you say. But in any case, if you're having performance issues w.r.t. the write speed - are you in fact doing writes concurrently, or is a single sequential client doing the insertions? If you are maxing out without being disk bound, make sure that in addition to spreading writes across all nodes in the cluster, you are submitting writes with sufficient concurrency to allow Cassandra to scale to use available CPU across all cores. -- Alaa Zubaidi PDF Solutions, Inc. 333 West San Carlos Street, Suite 700 San Jose, CA 95110 USA Tel: 408-283-5639 (or 408-280-7900 x5639) fax: 408-938-6479 email: alaa.zuba...@pdf.com
Re: 0.7 memory usage problem
On Mon, Sep 27, 2010 at 3:48 PM, Alaa Zubaidi alaa.zuba...@pdf.com wrote: RF=2 With RF=2, QUORUM and ALL are the same. Again, your logs show you are attempting to insert about 180,000 columns/sec. The only way that is possible with your hardware is if you are using CL.ZERO. The available information does not add up. b
Re: Curious as to how Cassandra handles the following
On Sun, Sep 26, 2010 at 11:04 AM, Lucas Nodine lucasnod...@gmail.com wrote: I'm looking at a design where multiple clients will connect to Cassandra and get/mutate resources, possibly concurrently. After planning a bit, I ran into the following scenario, for which I have not been able to find a sufficient answer. I have found where others have recommended Zookeeper for such tasks, but I want to determine if there is a simpler solution before including another product in my design. Make the following assumptions for all situations below: there are multiple clients, where a client is someone accessing Cassandra using thrift, and all reads and writes are performed at the QUORUM consistency level. Situation 1: Client A (A) connects to Cassandra and requests a QUORUM consistency level get of an entire row. At or very shortly thereafter (before A's request completes), Client B (B) connects to Cassandra and inserts (or mutates) a column (or multiple columns) within the row. Does A receive the new data saved by B, or does A receive the data prior to B's save? Depends on the exact order of operations across several nodes. Since you can't know what that ordering will be (or what it was), you can't predict whether you see the pre- or post-update version. Situation 2: B connects and mutates multiple columns within a row. A requests some data therein while B is processing. Result? Which call was used to make the changes? Situation 3: B mutates multiple columns within multiple rows. A requests some data therein while B is processing. Result? Undefined, as in situation 1. Justification: at certain points I want to essentially lock a resource (row) in Cassandra for exclusive write access (think checking out a resource) by setting a flag value of a column within that row. I'm just considering race conditions. If you really can't fix your design to avoid locks, then you need a system to permit locking. That usually means Zookeeper. b
Re: Curious as to how Cassandra handles the following
On Sun, Sep 26, 2010 at 4:01 PM, Lucas Nodine lucasnod...@gmail.com wrote: Ok, so based on everyone's input it seems that I need to put some sort of server in front of Cassandra to handle locking and exclusive access. I am planning on building a system (DMS) that will store resources (document, images, media, etc) using Cassandra for data. As my target user is going to be someone without any understanding of a 'diff' I have elected for locking instead of conflict resolution in versions. Good thing, as versioned conflict resolution is not available in Cassandra. b
Re: 0.7 memory usage problem
Looking further, I would expect your 36000 writes/sec to trigger a memtable flush every 8-9 seconds (which is already crazy), but you are actually flushing every ~1.7 seconds, leading me to believe you are writing a _lot_ faster than you think you are.

INFO [ROW-MUTATION-STAGE:21] 2010-09-24 13:13:23,203 ColumnFamilyStore.java (line 422) switching in a fresh Memtable for HiFreq at CommitLogContext(file='C:\Cassandra\Cass07\commitlog\CommitLog-1285358848765.log', position=13796967)
INFO [ROW-MUTATION-STAGE:4] 2010-09-24 13:13:25,171 ColumnFamilyStore.java (line 422) switching in a fresh Memtable for HiFreq at CommitLogContext(file='C:\Cassandra\Cass07\commitlog\CommitLog-1285358848765.log', position=29372124)
INFO [ROW-MUTATION-STAGE:8] 2010-09-24 13:13:26,937 ColumnFamilyStore.java (line 422) switching in a fresh Memtable for HiFreq at CommitLogContext(file='C:\Cassandra\Cass07\commitlog\CommitLog-1285358848765.log', position=44950820)

b On Sat, Sep 25, 2010 at 7:53 PM, Benjamin Black b...@b3k.us wrote: The log posted shows _10_ pending in the MEMTABLE-POST-FLUSHER stage, and the errors show repeated failures trying to flush memtables at all:

INFO [GC inspection] 2010-09-24 13:16:11,281 GCInspector.java (line 156) MEMTABLE-POST-FLUSHER 1 10

You are also flushing _really_ small memtables to disk (looks to be triggered by the default ops threshold):

INFO [FLUSH-WRITER-POOL:1] 2010-09-24 12:55:27,296 Memtable.java (line 150) Writing memtable-hif...@741540175(15105576 bytes, 314640 operations)

Based on what you said initially: 600 rows (60 columns per row) per second, ~3K-size rows. If that is so, you are writing 36000 columns per second to a single machine (why are you not distributing the client load across the cluster, as is best practice?). If your RF is 3 on your 3-node cluster, every node is taking every write, so you are trying to sustain 36000 writes per second per node. Even with a dedicated (spinning-media) commitlog drive, you can't possibly keep up with that.
What is your disk setup? What CL are you using for these writes? Can you post your client code for doing the writes? It is odd that you are able to do 36000/sec _at all_ unless you are using CL.ZERO, which would quickly lead to OOM. b
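The flush-interval argument above can be checked with a few lines of arithmetic. The 300,000-operation memtable threshold is an assumption (it was the era's default; check MemtableOperationsInMillions in your config), and the 1.7 s figure is read off the log timestamps in the thread.

```python
# Back-of-envelope check of the flush math in the thread.
ops_threshold = 300_000     # assumed default memtable operations threshold
columns_per_row = 60

# What the client believes it is doing:
claimed_rows_per_sec = 600
claimed_cols_per_sec = claimed_rows_per_sec * columns_per_row   # 36,000
expected_flush_interval = ops_threshold / claimed_cols_per_sec  # ~8.3 s

# What the log timestamps actually show (flushes every ~1.7-2.0 s):
observed_flush_interval = 1.7
implied_cols_per_sec = ops_threshold / observed_flush_interval
implied_rows_per_sec = implied_cols_per_sec / columns_per_row
```

The implied rate comes out to several thousand rows per second, not 600, which is the gap the thread is pointing at: the cluster is being written to far faster than the client believes.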
Re: Backporting Data Center Shard Strategy
You might be confusing the RackAware strategy (which puts 1 replica in a remote DC) and the DatacenterShard strategy (which puts M of N replicas in remote DCs). Both are in 0.6.5. https://svn.apache.org/repos/asf/cassandra/tags/cassandra-0.6.5/src/java/org/apache/cassandra/locator/DatacenterShardStategy.java On Tue, Sep 21, 2010 at 10:23 PM, rbukshin rbukshin rbuks...@gmail.com wrote: The one in 0.6 doesn't allow controlling number of replicas to place in other DC. Atmost 1 copy of data can be placed in other DC. What are other differences between the implementation in 0.6 vs 0.7? On Tue, Sep 21, 2010 at 10:03 PM, Benjamin Black b...@b3k.us wrote: DCShard is in 0.6. It has been rewritten in 0.7. On Tue, Sep 21, 2010 at 10:02 PM, rbukshin rbukshin rbuks...@gmail.com wrote: Is there any plan to backport DataCenterShardStrategy to 0.6.x from 0.7? It will be very useful for those who don't want to make drastic changes in their code and get the benefits of this replica placement strategy. -- Thanks, -rbukshin -- Thanks, -rbukshin
Re: Backporting Data Center Shard Strategy
DCShard is in 0.6. It has been rewritten in 0.7. On Tue, Sep 21, 2010 at 10:02 PM, rbukshin rbukshin rbuks...@gmail.com wrote: Is there any plan to backport DataCenterShardStrategy to 0.6.x from 0.7? It will be very useful for those who don't want to make drastic changes in their code and get the benefits of this replica placement strategy. -- Thanks, -rbukshin
Re: timestamp parameter for Thrift insert API ??
On Mon, Sep 20, 2010 at 7:25 PM, Kuan(謝冠生) lakersg...@mail2000.com.tw wrote: By using the cassandra-cli tool, we don't have to input a timestamp on insertion. Does that mean Cassandra has time synchronization built in already? No, it means the cassandra-cli program is inserting a timestamp, which it then provides to the cluster via thrift, just like any other client. Since Cassandra depends heavily on the timestamp parameter (for both reads and writes), the most ideal way to deal with timestamps would be within Cassandra itself, considering data safety and consistency. This doesn't fix anything, unfortunately. Time synchronization/event ordering in distributed systems is a notoriously hard problem. Having Cassandra nodes (remember, there are many in a cluster) assign timestamps just means their clocks need to be tightly synchronized, exactly as is the case for having clients insert timestamps. They will never be in sync enough to deal with badly designed apps attempting to simultaneously write to the same cell. Further, as jbellis mentioned, there are other reasons to not want the _current_ time used as the timestamp. The end result is that it is neither advantageous nor desirable to have cluster nodes assign timestamps. If that is the requirement, you need to a) fix your application, b) use a locking service like Zookeeper, or c) use an ACID database. b
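The reason clock skew matters regardless of who supplies the timestamp: Cassandra reconciles divergent column versions with last-write-wins on the timestamp (with ties broken by comparing values). A pure-Python illustration of that rule, not client code:

```python
def reconcile(*versions):
    """versions: (timestamp, value) pairs for the same column.
    The winner is the highest timestamp; on a timestamp tie the
    greater value wins (tuple comparison gives both behaviors)."""
    return max(versions)

# Normal case: the later timestamp wins.
assert reconcile((100, "a"), (200, "b")) == (200, "b")

# A writer with a slow clock silently "loses" a wall-clock-later write:
assert reconcile((200, "old"), (150, "newer-by-wall-clock")) == (200, "old")
```

The second assertion is the failure mode the reply is describing: moving timestamp assignment onto the server nodes just relocates this problem, since the nodes' clocks are no better synchronized than the clients'.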
Re: Cassandra performance
It appears you are doing several things that assure terrible performance, so I am not surprised you are getting it. On Tue, Sep 14, 2010 at 3:40 PM, Kamil Gorlo kgs4...@gmail.com wrote: My main tool was stress.py for benchmarks (or an equivalent written in C++ to deal with python2.5's lack of multiprocessing). I will focus only on reads (random with normal distribution, which is the default in stress.py) because writes were /quite/ good. I have 8 machines (Xen guests with a dedicated pair of 2TB SATA disks combined in RAID-0 for every guest). Every machine has 4 individual cores of 2.4 GHz and 4GB RAM. First problem: I/O in Xen is very poor and Cassandra is generally very sensitive to I/O performance. Cassandra commitlog and data dirs were on the same disk, This is not recommended if you want best performance. You should have a dedicated commitlog drive. I gave 2.5GB of heap to Cassandra; key and row caches were disabled (standard Keyspace1 schema, all tests use the Standard1 CF). All other options were defaults. I disabled the caches because I was testing random (or semi-random - normal distribution) reads, so they wouldn't help much (and also because 4GB of RAM is not a lot). Disabling the row cache in this case makes sense, but disabling the key cache is probably hurting your performance quite a bit. If you wrote 20GB of data per node, with narrow rows as you describe, and had default memtable settings, you now have a huge number of sstables on disk. You did not indicate you use nodetool compact to trigger a major compaction, so I'm assuming you did not. For the first test I installed Cassandra on only one machine, to test it and record results for later comparison with a large cluster and other DBs. 1) RF was set to 1. I inserted ~20GB of data (this is the number reported in the load column of nodetool ring output) using stress.py (100 columns per row). Then I tested reads and got 200 rows/second (reading 100 columns per row, CL=ONE; disks were the bottleneck, util was 100%).
There was no other operation pending during reads (compaction, insertion, etc.). This is normal behavior under random reads for _any_ database. If the dataset can't fit in RAM, you are I/O bound. I don't know why you would expect anything else. You did not indicate your disk access mode, but if it is mmap and you are not using code that calls mlockall, then with that size dataset you are almost certainly swapping, as well. You can check that with vmstat. Given the combination of very little RAM in comparison to the data set, very little disk I/O, key caching disabled, a large number of sstables, and likely mmap I/O without mlockall, you have created about the worst possible setup. If you are _actually_ dealing with that much data AND random reads, then you either need enough RAM to hold it all, or you need SSDs. And that is not specific to Cassandra. If you are saying you have similarly misconfigured MySQL and still gotten better performance, then kudos. You are very lucky. b
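The "huge number of sstables" claim is easy to sanity-check with rough arithmetic. The ~15 MB flush size is an assumption borrowed from the memtable logs quoted in the 0.7 memory thread elsewhere in this archive (the default op-count threshold firing first on small columns); minor compaction merges some files, but without a major compaction many remain. This is an order-of-magnitude sketch, not a measurement.

```python
GB = 1024 ** 3
MB = 1024 ** 2

data_per_node = 20 * GB   # load reported by nodetool ring
flush_size = 15 * MB      # assumed per-flush sstable size (see lead-in)

# Upper bound on sstables written per node before any compaction:
sstables_written = data_per_node // flush_size
```

That comes out to well over a thousand flushes per node. With the key cache disabled, a random read may have to consult the bloom filter and index of many surviving sstables, each a candidate disk seek, which is consistent with the 200 reads/sec at 100% disk utilization reported above.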
Re: questions on cassandra (repair and multi-datacenter)
On Thu, Sep 16, 2010 at 3:19 PM, Gurpreet Singh gurpreet.si...@gmail.com wrote: 1. I was looking to increase the RF to 3. This process entails changing the config and calling repair on the keyspace one at a time, right? So, I started with one node at a time, changed the config file on the first node for the keyspace, restarted the node. And then called a nodetool repair on the node. You need to change the RF on _all_ nodes in the cluster _before_ running repair on _any_ of them. If nodes disagree on which nodes should have replicas for keys, repair will not work correctly. Different RF for the same keyspace creates that disagreement. b
Re: Connect to localhost is ok,but the ip fails.
Do you mean you are changing the yaml file? Does 'netstat -an | grep 9160' indicate cassandra is bound to ipv4 or ipv6 (tcp vs tcp6 in the netstat output)? b On Thu, Sep 9, 2010 at 1:06 AM, Ying Tang ivytang0...@gmail.com wrote: I'm using cassandra 0.7, and in storage-conf:

# The address to bind the Thrift RPC service to
rpc_address: localhost
# port for Thrift to listen on
rpc_port: 9160

In my client, the code below works successfully:

TSocket socket = new TSocket("localhost", 9160);
TTransport trans = Boolean.valueOf(System.getProperty("cassandra.framed", "true")) ? new TFramedTransport(socket) : socket;
trans.open();

But if I change "localhost" to the localhost's IP, it throws java.net.ConnectException: Connection refused. Connecting to any other IP also fails. -- Best regards, Ivy Tang
Re: Connect to localhost is ok,but the ip fails.
when you say localhost's ip do you mean 127.0.0.1 or do you mean an IP on its local interface? On Thu, Sep 9, 2010 at 1:29 AM, Ying Tang ivytang0...@gmail.com wrote: Oh, solved it. Changed the rpc_address to my localhost's IP; then in the client code the TSocket can connect to that IP. On Thu, Sep 9, 2010 at 4:14 AM, Ying Tang ivytang0...@gmail.com wrote: No, I didn't change the yaml file. On Thu, Sep 9, 2010 at 4:10 AM, Benjamin Black b...@b3k.us wrote: Do you mean you are changing the yaml file? Does 'netstat -an | grep 9160' indicate cassandra is bound to ipv4 or ipv6 (tcp vs tcp6 in the netstat output)? b -- Best regards, Ivy Tang
Re: ganglia plugin
Nice! On Wed, Sep 8, 2010 at 6:45 PM, Scott Dworkis s...@mylife.com wrote: in case the community is interested, my gmetric collector: http://github.com/scottnotrobot/gmetric/tree/master/database/cassandra/ note i have only tested with a special csv mode of gmetric... you can bypass this mode and use vanilla gmetric with --nocsv, but beware it will generate over 100 forks on a trivial cassandra schema. the patch for the csv enabled gmetric is here: http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=273 -scott
Re: Connect to localhost is ok,but the ip fails.
correct, 0.0.0.0 is a wildcard. On Thu, Sep 9, 2010 at 1:19 PM, Aaron Morton aa...@thelastpickle.com wrote: I set this to 0.0.0.0. I think the original storage_config.xml had a comment saying it would make thrift respond on all interfaces. Aaron On 09 Sep, 2010, at 08:37 PM, Benjamin Black b...@b3k.us wrote: when you say localhost's ip do you mean 127.0.0.1 or do you mean an IP on its local interface?
Re: Azure Cloud Storage - Tables
They are not copying Cassandra with that, as it was in development for some time before Cassandra was released (possibly even before Cassandra development started). The BigTable-esque aspects, if they are 'copied' from anywhere, are copied from BigTable, just as they are in Cassandra. The underlying storage originated with their Cosmos project, and the layers with various semantics came about at least 3 years ago (possibly more) when various things were consolidated under the Azure umbrella. Definitely a lot of interesting technology in there. b public reference for some of that - http://www.zdnet.com/blog/microsoft/a-microsoft-code-name-a-day-cosmos/632?tag=mantle_skin;content On Wed, Sep 8, 2010 at 2:20 PM, Peter Harrison cheetah...@gmail.com wrote: Microsoft has essentially copied the Cassandra approach for its Table Storage. See here: http://www.codeproject.com/KB/azure/AzureStorage.aspx It is, I believe, a compliment of sorts, in the sense that it is a validation of the Cassandra approach. The reason I know about this is that I attended a presentation about Azure last week, and one of the Azure team told us all about it. The Table Storage is essentially Cassandra, although I guess reimplemented. That said, it would have been better for them to actually use Cassandra and commit development effort to help the project than simply reimplement it. Given their desire to promote Azure as a platform for open source applications, I would have thought this would be a no-brainer.
Re: Azure Cloud Storage - Tables
And having said all that: the Azure Table storage model doesn't look like Cassandra. There is a schema; there are partition keys. It more resembles something like VoltDB than the map of maps (of maps) of Cassandra (and BigTable, and HBase). b On Wed, Sep 8, 2010 at 2:20 PM, Peter Harrison cheetah...@gmail.com wrote: Microsoft has essentially copied the Cassandra approach for its Table Storage. See here: http://www.codeproject.com/KB/azure/AzureStorage.aspx It is, I believe, a compliment of sorts, in the sense that it is a validation of the Cassandra approach. The reason I know about this is that I attended a presentation about Azure last week, and one of the Azure team told us all about it. The Table Storage is essentially Cassandra, although I guess reimplemented. That said, it would have been better for them to actually use Cassandra and commit development effort to help the project than simply reimplement it. Given their desire to promote Azure as a platform for open source applications, I would have thought this would be a no-brainer.
Re: Few questions regarding cassandra deployment on windows
This does not sound like a good application for Cassandra at all. Why are you using it? On Tue, Sep 7, 2010 at 3:42 PM, kannan chandrasekaran ckanna...@yahoo.com wrote: Hi All, We are currently considering Cassandra for our application. Platform:

* a single-node cluster
* Windows '08
* 64-bit JVM

For the sake of brevity, let Cassandra service = a single-node cassandra server running as an embedded service inside a JVM. My use cases:

1) Start with a schema (keyspace and a set of column families under it) in a Cassandra service.
2) Need to be able to replicate the same schema structure (add new keyspaces/column families with different names, of course).
3) Because of some existing limitations in my application, I need to be able to write to the keyspace/column families from one Cassandra service and read the written changes from a different Cassandra service. Both the write and the read Cassandra services share the same data directory. I understand that the application has to take care of any naming collisions.

A couple of questions related to the above-mentioned use cases:

1) I want to spawn a new JVM and launch Cassandra as an embedded service programmatically instead of using startup.bat. I would like to know if that is possible, and any pointers in that direction would be really helpful. (use case 1)
2) I understand that there are provisions for live schema changes in 0.7 (thank you guys!!!), but since I can't use a beta version in production, I am restricted to 0.6 for now. Is it possible to support use case 2 in 0.6.5? More specifically, I am planning to make runtime changes to the storage-conf.xml file followed by a Cassandra service restart.
3) Can I switch the data directory at run-time? (use case 3)
In order not to disrupt reads while the writes are in progress, I am thinking of something like: copy the existing data dir into a new location; write to a new data directory; once the write is complete, switch pointers and restart the Cassandra service to read from the new directory and pick up the updated changes. Any help is greatly appreciated. Thanks Kannan
Re: 4k keyspaces... Maybe we're doing it wrong?
On Mon, Sep 6, 2010 at 12:41 AM, Janne Jalkanen janne.jalka...@ecyrd.com wrote: So if I read this right, using lots of CF's is also a Bad Idea(tm)? Yes - lots of keyspaces being bad means lots of CFs is also bad.
Re: question about Cassandra error
You seem to be typing 0.7 commands at a 0.6 cli. Please follow the README in the version you are using, e.g.:

set Keyspace1.Standard2['jsmith']['first'] = 'John'

On Thu, Sep 2, 2010 at 5:35 PM, Simon Chu simonchu@gmail.com wrote: I downloaded cassandra 0.6.5 and ran it, got this:

bin/cassandra -f
INFO 16:46:06,198 JNA not found. Native methods will be disabled.
INFO 16:46:06,875 DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap

is this an issue? When I tried to run the cassandra cli from the example, I got the following errors:

cassandra> use Keyspace1 sc 'blah$'
line 1:0 no viable alternative at input 'use'
Invalid Statement (Type: 0)
cassandra> set Standard2['jsmith']['first'] = 'John';
line 1:13 mismatched input '[' expecting DOT

is this a setup issue? Simon
Re: the process of reading and writing
On Thu, Sep 2, 2010 at 8:19 PM, Ying Tang ivytang0...@gmail.com wrote: Recently I read the paper about Cassandra again, and now I have some questions about reading and writing. We all know Cassandra uses NWR. When reading: the request goes to a random node in Cassandra. This node acts as a proxy, and it routes the request. Here: 1. Does the proxy node route the request to the key's coordinator, which then routes the request to the other N-1 nodes, OR does the proxy route the read request to all N nodes? The coordinator node is the proxy node. 2. If it is the former situation, does the read repair occur on the key's coordinator? If it is the latter, does the read repair occur on the proxy node? Depends on the CL requested. QUORUM and ALL cause the RR to be performed by the coordinator. ANY and ONE cause RR to be delegated to one of the replicas for the key. When writing: the request goes to a random node in Cassandra. This node acts as a proxy, and it routes the request. Here: 3. Does the proxy node route the request to the key's coordinator, which then routes the request to the other N-1 nodes, OR does the proxy route the request to all N nodes? For writes, the coordinator sends the writes directly to the replicas regardless of CL (rather than delegating for weakly consistent CLs). 4. The N isn't the data's copy number, it's just a range. In this N range, there must be W copies, so W is the copy number. So in this N range, R+W>N can guarantee the data's validity. Right? Sorry, I can't even parse this. b
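The rule question 4 seems to be reaching for, stated cleanly: with replication factor N, a write acknowledged by W replicas and a read that consults R replicas must overlap in at least one replica whenever R + W > N, so the read is guaranteed to see the latest acknowledged write. A minimal sketch of that check:

```python
def strongly_consistent(n, w, r):
    """True when a read of R replicas must intersect the W replicas
    that acknowledged the most recent write (pigeonhole: R + W > N)."""
    return r + w > n

assert strongly_consistent(3, 2, 2)       # QUORUM writes + QUORUM reads, RF=3
assert not strongly_consistent(3, 1, 1)   # ONE + ONE, RF=3: stale reads possible
assert strongly_consistent(2, 2, 1)       # ALL writes + ONE reads, RF=2
```

This matches the earlier "0.7 memory usage" thread's observation that with RF=2, QUORUM and ALL are the same: QUORUM is floor(N/2)+1, which is 2 when N=2.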
Re: Cassandra on AWS across Regions
On Thu, Sep 2, 2010 at 5:52 AM, Phil Stanhope stanh...@gmail.com wrote: Ben, can you elaborate on some infrastructure topology issues that would break this approach? As noted, the naive approach results in nodes behind the same NAT having to communicate with each other through that NAT rather than directly. You can use different property files for the property snitch on different nodes, as that is directly encoding topology. You could do the same with /etc/hosts. You could do the same with DNS. The problem is that in all these cases you have a different view of the world depending on where you are. Does this node have the right information for connecting to local nodes and remote nodes? Is it failing to connect to some other node because of a hostname resolution failure, or because it has the wrong topology information, or ...? And this only assumes 1:1 NAT. What is the solution for PAT (which is quite common)? It's a deep dark hole of edge cases. I would rather have a dead simple 80% solution than a 100% solution with dynamics I can't understand. b
Re: Data Center Move
You will likely need to rename some of the files to avoid collisions (they are only unique per node). Otherwise, yes, this can work. On Thu, Sep 2, 2010 at 11:09 AM, Anthony Molinaro antho...@alumni.caltech.edu wrote: Hi, We're running cassandra 0.6.4, and need to do a data center move of a cluster (from EC2 to our own data center). Because of the way the networks are set up we can't actually connect these boxes directly, so the original plan of add some nodes in the new colo, let them bootstrap then decommission nodes in the old colo until the data is all transfered will not work. So I'm wondering if the following will work 1. take a snapshot on the source cluster 2. rsync all the files from the old machines to the new machines (we'd most likely be reducing the total number of machines, so would do things like take 4-5 machines worth of data and put it onto 1 machine) 3. bring up the new machines in the new colo 4. run cleanup on all new nodes? 5. run repair on all new nodes? So will this work? If so, are steps 4 and 5 correct? I realize we will miss any new data that happens between the snapshot and turning on writes on the new cluster, but I think we might be able to just tune compaction such that it doesn't happen, then just sync the files that change while the data transfers happen? Thanks, -Anthony -- Anthony Molinaro antho...@alumni.caltech.edu
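The renaming caveat above is because SSTable generation numbers restart on each node, so merging several nodes' data directories onto one machine can collide. A hypothetical sketch of a rename plan, assuming 0.6-style component names like `Users-1-Data.db`; the `merge_names` helper is illustrative, not part of any shipped tool:

```python
import itertools
import re

# 0.6-era SSTable components are named <CF>-<generation>-Data.db, with
# matching -Index.db and -Filter.db files; generations restart at 1 on
# every node, so merging data dirs from several nodes can collide.
SSTABLE = re.compile(r"^(?P<cf>.+)-(?P<gen>\d+)-(?P<comp>Data|Index|Filter)\.db$")

def merge_names(per_node_files):
    """Return {(node, old_name): new_name}, giving every SSTable a fresh,
    unique generation while keeping its Data/Index/Filter triplet together."""
    fresh = itertools.count(1)
    new_gen = {}   # (node, cf, old generation) -> new generation
    renames = {}
    for node, files in enumerate(per_node_files):
        for name in files:
            m = SSTABLE.match(name)
            if not m:
                continue  # ignore snapshots, tmp files, etc.
            key = (node, m["cf"], m["gen"])
            if key not in new_gen:
                new_gen[key] = next(fresh)
            renames[(node, name)] = "%s-%d-%s.db" % (m["cf"], new_gen[key], m["comp"])
    return renames

plan = merge_names([
    ["Users-1-Data.db", "Users-1-Index.db"],  # node 0
    ["Users-1-Data.db", "Users-2-Data.db"],   # node 1: generation 1 collides
])
assert len(set(plan.values())) == len(plan)   # no target-name collisions
```

The actual renames would then be applied while rsyncing, before starting the new nodes.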
Re: Cassandra on AWS across Regions
It's not gossiping hostnames, it's gossiping IP addresses. The purpose of Peter's patch is to have the system gossip its external address (so other nodes can connect), but bind its internal address. As Edward notes, it helps with NAT in general, not just EC2. Not perfect, but a great start. b On Wed, Sep 1, 2010 at 2:57 PM, Andres March ama...@qualcomm.com wrote: Is it not possible to put the external host name in cassandra.yaml and add a host entry in /etc/hosts for that name to resolve to the local interface? On 09/01/2010 01:24 PM, Benjamin Black wrote: The issue is this: The IP address by which an EC2 instance is known _externally_ is not actually on the instance itself (the address being translated), and the _internal_ address is not accessible across regions. Since you can't bind a specific address that is not on one of your local interfaces, and Cassandra nodes don't have a notion of internal vs external you need a mechanism by which a node is told to bind one IP (the internal one), while it gossips another (the external one). I like what this patch does conceptually, but would prefer configuration options to cause it to happen (obviously a much larger patch). Very cool, Peter! b On Wed, Sep 1, 2010 at 1:10 PM, Andres March ama...@qualcomm.com wrote: Could you explain this point further? Was there an exception? On 09/01/2010 09:26 AM, Peter Fales wrote: that doesn't quite work with the stock Cassandra, as it will try to bind and listen on those addresses and give up because they don't appear to be valid network addresses. -- Andres March ama...@qualcomm.com Qualcomm Internet Services -- Andres March ama...@qualcomm.com Qualcomm Internet Services
Re: Cassandra on AWS across Regions
On Wed, Sep 1, 2010 at 3:18 PM, Andres March ama...@qualcomm.com wrote: I thought you might say that. Is there some reason to gossip IP addresses vs hostnames? I thought that layer of indirection could be useful in more than just this use case. The trade-off for that flexibility is that nodes are now dependent on name resolution during normal operation, rather than only at startup. The opportunities for horribly confusing failure scenarios are numerous and frightening. Other than NAT (which can clearly be dealt with without gossiping hostnames), what do you think this would enable? b
Re: Cassandra on AWS across Regions
On Wed, Sep 1, 2010 at 4:16 PM, Andres March ama...@qualcomm.com wrote: I didn't have anything specific in mind. I understand all the issues around DNS and not advocating only supporting hostnames (just thought it would be a nice option). I also wouldn't expect name resolution to be done all the time, only when the node is first being started or during initial discovery. All nodes would have to resolve whenever topology changed. One use case might be when nodes are spread out over multiple networks as the poster describes, nodes on the same network on a private interface could incur less network overhead than if they go out through the public interface. I'm not sure that this is even possible given that cassandra binds to only one interface. This case is not actually solved more simply by gossiping hostnames. It requires much more in-depth understanding of infrastructure topology. b
Re: column family names
Exactly. On Mon, Aug 30, 2010 at 11:39 PM, Janne Jalkanen janne.jalka...@ecyrd.com wrote: I've been doing it for years with no technical problems. However, using % as the escape char tends to, in some cases, confuse a certain operating system whose name may or may not begin with W, so using something else makes sense. However, it does require an extra cognitive step for the maintainer, since the mapping between filenames and logical names is no longer immediately obvious. Especially with multiple files this can be a pain (e.g. Chinese logical names which map to pretty incomprehensible sequences that are laborious to look up). So my experience suggests to avoid it for ops reasons, and just go with simplicity. /Janne On Aug 31, 2010, at 08:39 , Terje Marthinussen wrote: Beyond aesthetics, specific reasons? Terje On Tue, Aug 31, 2010 at 11:54 AM, Benjamin Black b...@b3k.us wrote: URL encoding.
Re: column family names
This is not the Unix way for good reason: it creates all manner of operational challenges for no benefit. This is how Windows does everything and automation and operations for large-scale online services is _hellish_ because of it. This horse is sufficiently beaten, though. b On Mon, Aug 30, 2010 at 11:55 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Another option would of course be to store a mapping between dir/filenames and Keyspace/columns familes together with other info related to keyspaces and column families. Just add API/command line tools to look up the filenames and maybe store the values in the files as well for recovery purposes. Terje On Tue, Aug 31, 2010 at 3:39 PM, Janne Jalkanen janne.jalka...@ecyrd.com wrote: I've been doing it for years with no technical problems. However, using % as the escape char tends to, in some cases, confuse a certain operating system whose name may or may not begin with W, so using something else makes sense. However, it does require an extra cognitive step for the maintainer, since the mapping between filenames and logical names is no longer immediately obvious. Especially with multiple files this can be a pain (e.g. Chinese logical names which map to pretty incomprehensible sequences that are laborious to look up). So my experience suggests to avoid it for ops reasons, and just go with simplicity. /Janne On Aug 31, 2010, at 08:39 , Terje Marthinussen wrote: Beyond aesthetics, specific reasons? Terje On Tue, Aug 31, 2010 at 11:54 AM, Benjamin Black b...@b3k.us wrote: URL encoding.
Re: column family names
Then make a CF in which you store the mappings from UTF8 (or byte[]!) names to CFs. Now all clients can read the same mappings. Problem solved. Still not solved because you have arbitrary, uncontrolled clients doing arbitrary, uncontrolled things in the same Cassandra cluster? You're doing it wrong. On Tue, Aug 31, 2010 at 7:26 AM, Terje Marthinussen tmarthinus...@gmail.com wrote: Sure, but as I am likely to have multiple clients (which I may not control) accessing a single store, I would prefer to keep such custom mappings out of the client for consistency reasons (much bigger problem than any of the operational issues highlighted so far). Terje On 31 Aug 2010, at 23:03, David Boxenhorn da...@lookin2.com wrote: It's not so hard to implement your mapping suggestion in your application, rather than in Cassandra, if you really want it. On Tue, Aug 31, 2010 at 1:05 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: No benefit? Making it easier to use column families as part of your data model is a fairly good benefit, at least given the somewhat special data model cassandra offers. Much more of a benefit than the disadvantages I can imagine. fileprefix=`sometool -fileprefix tablename` is something I would say is a lot more unixy than windows like. Sorry, I don't share your concern for large scale operations here, but sure, '_' does the trick for me now so thanks to Aaron for reminding me about that. Some day I am sure there will be realized that unicode strings/byte arrays are useful here like most other places in Cassandra (\w is a bit limited for some of us living in the non-ascii part of the world...), but what is the XXX way are not the type of topics I find interesting, so another time. Terje On Tue, Aug 31, 2010 at 5:30 PM, Benjamin Black b...@b3k.us wrote: This is not the Unix way for good reason: it creates all manner of operational challenges for no benefit. 
This is how Windows does everything and automation and operations for large-scale online services is _hellish_ because of it. This horse is sufficiently beaten, though. b On Mon, Aug 30, 2010 at 11:55 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Another option would of course be to store a mapping between dir/filenames and Keyspace/columns familes together with other info related to keyspaces and column families. Just add API/command line tools to look up the filenames and maybe store the values in the files as well for recovery purposes. Terje On Tue, Aug 31, 2010 at 3:39 PM, Janne Jalkanen janne.jalka...@ecyrd.com wrote: I've been doing it for years with no technical problems. However, using % as the escape char tends to, in some cases, confuse a certain operating system whose name may or may not begin with W, so using something else makes sense. However, it does require an extra cognitive step for the maintainer, since the mapping between filenames and logical names is no longer immediately obvious. Especially with multiple files this can be a pain (e.g. Chinese logical names which map to pretty incomprehensible sequences that are laborious to look up). So my experience suggests to avoid it for ops reasons, and just go with simplicity. /Janne On Aug 31, 2010, at 08:39 , Terje Marthinussen wrote: Beyond aesthetics, specific reasons? Terje On Tue, Aug 31, 2010 at 11:54 AM, Benjamin Black b...@b3k.us wrote: URL encoding.
Re: get_slice sometimes returns previous result on php
On Mon, Aug 30, 2010 at 6:05 AM, Juho Mäkinen juho.maki...@gmail.com wrote: The application is using the same cassandra thrift connection (it doesn't close it in between) and everything is happening inside same php process. This is why you are seeing this problem (and is specific to connection reuse in certain languages, not a general problem with connection reuse). b
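The failure mode described here (a reused connection returning the previous call's result) typically happens when a client abandons a request, e.g. on timeout, without draining the reply, leaving the stream out of sync. A toy model of that framing problem, not the PHP Thrift implementation:

```python
from collections import deque

class ReusedConnection:
    """Toy model of one reused client connection: replies queue up in
    request order, just like bytes sitting unread on a TCP socket."""
    def __init__(self):
        self._replies = deque()
    def send(self, request):
        self._replies.append("reply-to-" + request)  # server answers in order
    def recv(self):
        return self._replies.popleft()

conn = ReusedConnection()
conn.send("get_slice-1")
# The client times out on call 1 and abandons it WITHOUT reading the
# reply, then reuses the same connection for the next call:
conn.send("get_slice-2")
assert conn.recv() == "reply-to-get_slice-1"  # stale: the previous result
```

Closing and reopening the connection after an abandoned call avoids the desync.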
Re: column family names
URL encoding. On Mon, Aug 30, 2010 at 5:55 PM, Aaron Morton aa...@thelastpickle.com wrote: under scores or URL encoding ? Aaron On 31 Aug, 2010,at 12:27 PM, Benjamin Black b...@b3k.us wrote: Please don't do this. On Mon, Aug 30, 2010 at 5:22 AM, Terje Marthinussen tmarthinus...@gmail.com wrote: Ah, sorry, I forgot that underscore was part of \w. That will do the trick for now. I do not see the big issue with file names though. Why not expand the allowed characters a bit and escape the file names? Maybe some sort of URL like escaping. Terje On Mon, Aug 30, 2010 at 6:29 PM, Aaron Morton aa...@thelastpickle.com wrote: Moving to the user list. The new restrictions were added as part of CASSANDRA-1377 for 0.6.5 and 0.7, AFAIK it's to ensure the file names created for the CFs can be correctly parsed. So it's probably not going to change. The names have to match the \w reg ex class, which includes the underscore character. Aaron On 30 Aug 2010, at 21:01, Terje Marthinussen tmarthinus...@gmail.com wrote: Hi, Now that we can make columns families on the fly, it gets interesting to use column families more as part of the data model (can reduce diskspace quite a bit vs. super columns in some cases). However, currently, the column family name validator is pretty strict allowing only word characters and in some cases it is pretty darned nice to be able to put something like a - inbetweenallthewords. Any reason to be this strict or could it be loosened up a little bit? Terje
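The reversible escaping discussed in this thread can be illustrated with percent-encoding; here Python's `quote`/`unquote` stand in for whatever escaping Cassandra might use. A sketch of the idea, not proposed code:

```python
from urllib.parse import quote, unquote

def cf_to_filename(cf_name):
    """Reversible, filesystem-safe encoding of an arbitrary CF name.
    safe='' forces every reserved character (including '/') to be escaped;
    '%' itself is always escaped, so decoding is unambiguous."""
    return quote(cf_name, safe="")

name = "user-profiles/中文"
encoded = cf_to_filename(name)
assert "/" not in encoded        # nothing that breaks a path
assert unquote(encoded) == name  # lossless round trip
```

Janne's caveat applies: '%' in filenames confuses some platforms, which is why a different escape character was mooted.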
Re: Cassandra HAProxy
On Sun, Aug 29, 2010 at 11:04 AM, Anthony Molinaro antho...@alumni.caltech.edu wrote: I don't know it seems to tax our setup of 39 extra large ec2 nodes, its also closer to 24000 reqs/sec at peak since there are different tables (2 tables for each read and 2 for each write) Could you clarify what you mean here? On the face of it, this performance seems really poor given the number and size of nodes. b
Re: RowMutationVerbHandler.java (line 78) Error in row mutation
Have you tried with beta1 and is there a repro you can put in a bug report in jira? On Sat, Aug 28, 2010 at 11:28 AM, Todd Burruss bburr...@real.com wrote: Trunk -Original Message- From: Benjamin Black [...@b3k.us] Received: 8/28/10 10:05 AM To: user@cassandra.apache.org [u...@cassandra.apache.org] Subject: Re: RowMutationVerbHandler.java (line 78) Error in row mutation Todd, Are you using beta1 or trunk code? b On Fri, Aug 27, 2010 at 3:58 PM, B. Todd Burruss bburr...@real.com wrote: i got the latest code this morning. i'm testing with 0.7
ERROR [ROW-MUTATION-STAGE:388] 2010-08-27 15:54:58,053 RowMutationVerbHandler.java (line 78) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1002
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:113)
        at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:372)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:382)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:340)
        at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:46)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:50)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Re: Cassandra HAProxy
Because you create a bottleneck at the HAProxy and because the presence of the proxy precludes clients properly backing off from nodes returning errors. The proper approach is to have clients maintain connection pools with connections to multiple nodes in the cluster, and then to spread requests across those connections. Should a node begin returning errors (for example, because it is overloaded), clients can remove it from rotation. On Sat, Aug 28, 2010 at 11:27 AM, Mark static.void@gmail.com wrote: On 8/28/10 11:20 AM, Benjamin Black wrote: no and no. On Sat, Aug 28, 2010 at 10:28 AM, Markstatic.void@gmail.com wrote: I will be loadbalancing between nodes using HAProxy. Is this recommended? Also is there a some sort of ping/health check uri available? Thanks any reason on why loadbalancing client connections using HAProxy isnt recommended?
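The client-side pooling described above (maintain connections to multiple nodes, eject ones returning errors) can be sketched like this; a toy illustration, real clients such as thrift_client implement richer policies:

```python
import random

class NodePool:
    """Minimal client-side pool: pick among healthy nodes, and eject a
    node from rotation after a few consecutive errors."""
    def __init__(self, nodes, max_errors=3):
        self.healthy = list(nodes)
        self.errors = {n: 0 for n in nodes}
        self.max_errors = max_errors

    def pick(self):
        if not self.healthy:
            raise RuntimeError("no healthy nodes")
        return random.choice(self.healthy)

    def report(self, node, ok):
        if ok:
            self.errors[node] = 0
        else:
            self.errors[node] += 1
            if self.errors[node] >= self.max_errors and node in self.healthy:
                self.healthy.remove(node)  # back off from the failing node

pool = NodePool(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
for _ in range(3):
    pool.report("10.0.0.2", ok=False)  # node starts returning errors
assert "10.0.0.2" not in pool.healthy  # removed from rotation
assert pool.pick() in ("10.0.0.1", "10.0.0.3")
```

A proxy that only checks TCP connectivity cannot make this per-request decision, which is the core of the objection.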
Re: Cassandra HAProxy
munin is the simplest thing. There are numerous JMX stats of interest. As a symmetric distributed system, you should not expect to monitor Cassandra like you would a web server. Intelligent clients use connection pools and react to current node behavior in making choices of where to send requests, including using describe_ring to discover nodes and open new connections as needed. On Sat, Aug 28, 2010 at 11:29 AM, Mark static.void@gmail.com wrote: On 8/28/10 11:20 AM, Benjamin Black wrote: no and no. On Sat, Aug 28, 2010 at 10:28 AM, Markstatic.void@gmail.com wrote: I will be loadbalancing between nodes using HAProxy. Is this recommended? Also is there a some sort of ping/health check uri available? Thanks Also, what would be a good way of monitoring the health of the cluster?
Re: Cassandra HAProxy
On Sat, Aug 28, 2010 at 2:34 PM, Anthony Molinaro antho...@alumni.caltech.edu wrote: I think maybe he thought you meant put a layer between cassandra internal communication. No, I took the question to be about client connections. There's no problem balancing client connections with haproxy, we've been pushing several billion requests per month through haproxy to cassandra. Can it be done: yes. Is it best practice: no. Even 10 billion requests/month is an average of less than 4000 reqs/sec. Just not that many for a distributed database like Cassandra. we use mode tcp balance leastconn server local 127.0.0.1:12350 check so basically just a connect based check, and it works fine Cassandra can, and does, fail in ways that do not stop it from answering TCP connection requests. Are you saying it works fine because you have seen numerous types of node failures and this was sufficient? I would be quite surprised if that were so. Using an LB for service discovery is a fine thing (connect to a VIP, call describe_ring, open direct connections to cluster nodes). Relying on an LB to do the right thing when it is totally ignorant of what is going across those client connections (as is implied by simply checking for connectivity) is asking for trouble. Doubly so when you use a leastconn policy (a failing node can spit out an error and close a connection with impressive speed, sucking all the traffic to itself; common problem with HTTP servers giving back errors). b
Re: Benchmarking Cassandra 0.6.5 with YCSB client ... drags to a halt
cassandra.in.sh? storage-conf.xml? output of iostat -x while this is going on? turn GC log level to debug? On Sat, Aug 28, 2010 at 2:02 PM, Fernando Racca fra...@gmail.com wrote: Hi, I'm currently executing some benchmarks against 0.6.5, which i plan to compare against 0.7-beta1, using the YCSB client I'm experiencing some strange behaviour when running a small 2 nodes cluster using OrderPreservingPartitioner. Does anybody have any experience on using the client to generate load? It's the first benchmark that i try so i'm probably doing something dumb. A detailed post with screenshots of the VM and CPU history can be seen in this post.http://quantleap.blogspot.com/2010/08/cassandra-065-benchmarking-first-run.html I would very much appreciate your help since i'm doing this benchmarks as part of my master's dissertation A previous official benchmark is documented here http://research.yahoo.com/files/ycsb-v4.pdf Thanks! Fernando Racca
Re: Benchmarking Cassandra 0.6.5 with YCSB client ... drags to a halt
MESSAGE-STREAMING-POOL 0 0 INFO 23:56:20,618 LOAD-BALANCER-STAGE 0 0 INFO 23:56:20,625 FLUSH-SORTER-POOL 0 0 the problem seems to be with the second node... any ideas? On 28 August 2010 22:49, Benjamin Black b...@b3k.us wrote: cassandra.in.sh? storage-conf.xml? output of iostat -x while this is going on? turn GC log level to debug? On Sat, Aug 28, 2010 at 2:02 PM, Fernando Racca fra...@gmail.com wrote: Hi, I'm currently executing some benchmarks against 0.6.5, which i plan to compare against 0.7-beta1, using the YCSB client I'm experiencing some strange behaviour when running a small 2 nodes cluster using OrderPreservingPartitioner. Does anybody have any experience on using the client to generate load? It's the first benchmark that i try so i'm probably doing something dumb. A detailed post with screenshots of the VM and CPU history can be seen in this post.http://quantleap.blogspot.com/2010/08/cassandra-065-benchmarking-first-run.html I would very much appreciate your help since i'm doing this benchmarks as part of my master's dissertation A previous official benchmark is documented here http://research.yahoo.com/files/ycsb-v4.pdf Thanks! Fernando Racca
Re: Follow-up post on cassandra configuration with some experiments on GC tuning
ecapriolo's testing seemed to indicate it _did_ change the behavior. wonder what the difference is? On Fri, Aug 27, 2010 at 6:23 AM, Mikio Braun mi...@cs.tu-berlin.de wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Dear all, thanks for your comments, and I'm glad that you found my post helpful. Concerning the incremental CMS, I've recently updated my post and added the experiments repeated on one of our cluster nodes, and for some reason incremental CMS doesn't look that different anymore. So I guess it's ok to stick with the non-incremental CMS for now. - -M On 08/27/2010 09:12 AM, Peter Schuller wrote: Whether or not this is likely to happen with Cassandra I don't know. I don't know much about the incremental duty cycles are scheduled and it may be the case that Cassandra is not even remotely close to having a problem with incremental mode. I should further weaken my statement by pointing out that I never did any exhaustive tweaking to get around the problem (other than disabling incremental mode, since my primary goal has tended to be ensure low pause times and not so much even GC activity). It may be the case that even in stressful cases where it fails by default it is simply a matter of tweaking. So, I guess I should re-phrase: In terms of just turning on incremental mode without at least application specific tweaking (if not deployment specific testing), I would suggest caution. - -- Dr. Mikio Braun email: mi...@cs.tu-berlin.de TU Berlin web: ml.cs.tu-berlin.de/~mikio Franklinstr. 28/29 tel: +49 30 314 78627 10587 Berlin, Germany -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkx3vFUACgkQtnXKX8rQtgDUlgCfWb/euA2mgVJAWDY2tBSyAN+I 604AoKVua1+5bYK2yF9CWwFQmLHDt0Fn =CIal -END PGP SIGNATURE-
Re: is it my cassandra cluster ok?
No, it means manually assign tokens to evenly distribute ring range to the existing nodes. On Wed, Aug 25, 2010 at 7:29 PM, john xie shanfengg...@gmail.com wrote: load balancing? is it means add more nodes? 2010/8/26 Ryan King r...@twitter.com Looks like you need to do some load balancing. -ryan On Wed, Aug 25, 2010 at 12:33 AM, john xie shanfengg...@gmail.com wrote: /opt/apache-cassandra-0.6.4/bin/nodetool --host 192.168.123.100 ring Address Status Load Range Ring 162027259805094200094770502377853667196 192.168.123.101Up 183.43 GB 26404162423947656621914545677405489813 |--| 192.168.123.5 Up 196.18 GB 97646479029625162367516203572215570207 | | 192.168.123.100Up 302.86 GB 150826772797302282411816801037163789836| | 192.168.123.102Up 235.83 GB 162027259805094200094770502377853667196|--| /opt/apache-cassandra-0.6.4/bin/nodetool --host 192.168.123.5 tpstats Pool NameActive Pending Completed FILEUTILS-DELETE-POOL 0 0610 STREAM-STAGE 0 0 0 RESPONSE-STAGE0 0 18879316 ROW-READ-STAGE0 0 0 LB-OPERATIONS 0 0 0 MESSAGE-DESERIALIZER-POOL 0 0 86925654 GMFD 0 0 168769 LB-TARGET 0 0 0 CONSISTENCY-MANAGER 0 0 0 ROW-MUTATION-STAGE0 0 66657550 MESSAGE-STREAMING-POOL0 0 0 LOAD-BALANCER-STAGE 0 0 0 FLUSH-SORTER-POOL 0 0 0 MEMTABLE-POST-FLUSHER 0 0 1125 FLUSH-WRITER-POOL 0 0 1125 AE-SERVICE-STAGE 0 0 0 HINTED-HANDOFF-POOL 1 4 0 /opt/apache-cassandra-0.6.4/bin/nodetool --host 192.168.123.5 info 97646479029625162367516203572215570207 Load : 196.01 GB Generation No: 1282656715 Uptime (seconds) : 64437 Heap Memory (MB) : 2245.43 / 5111.69 /opt/apache-cassandra-0.6.4/bin/nodetool --host 192.168.123.100 tpstats Pool NameActive Pending Completed FILEUTILS-DELETE-POOL 0 0950 STREAM-STAGE 0 0 0 RESPONSE-STAGE0 0 88290400 ROW-READ-STAGE0 0 0 LB-OPERATIONS 0 0 0 MESSAGE-DESERIALIZER-POOL 0 0 149317269 GMFD 0 0 187571 LB-TARGET 0 0 0 CONSISTENCY-MANAGER 0 0 0 ROW-MUTATION-STAGE0 0 104055920 MESSAGE-STREAMING-POOL0 0 0 LOAD-BALANCER-STAGE 0 0 0 FLUSH-SORTER-POOL 0 0 0 MEMTABLE-POST-FLUSHER 0 0 1749 
FLUSH-WRITER-POOL 0 0 1749 AE-SERVICE-STAGE 0 0 0 HINTED-HANDOFF-POOL 1 4 17 /opt/apache-cassandra-0.6.4/bin/nodetool --host 192.168.123.100 info 150826772797302282411816801037163789836 Load : 302.79 GB Generation No: 1282656138 Uptime (seconds) : 65439 Heap Memory (MB) : 1854.20 / 6135.69 /opt/apache-cassandra-0.6.4/bin/nodetool --host 192.168.123.101 tpstats Pool NameActive Pending Completed FILEUTILS-DELETE-POOL 0 0594 STREAM-STAGE 0 0 0 RESPONSE-STAGE0 0 120024993 ROW-READ-STAGE0 0 0 LB-OPERATIONS 0 0 0 MESSAGE-DESERIALIZER-POOL 0 0 154158111 GMFD 0 0 193471 LB-TARGET 0 0 0 CONSISTENCY-MANAGER 0 0 0 ROW-MUTATION-STAGE0 0 64174075 MESSAGE-STREAMING-POOL0 0 0 LOAD-BALANCER-STAGE 0 0 0 FLUSH-SORTER-POOL 0 0 0 MEMTABLE-POST-FLUSHER 0 0 1091 FLUSH-WRITER-POOL 0 0 1091 AE-SERVICE-STAGE 0 0 0 HINTED-HANDOFF-POOL 1
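For the manual token assignment recommended above: evenly spaced tokens for the 0.6 RandomPartitioner (token space [0, 2**127)) can be computed as below. A sketch, assuming RandomPartitioner; each node would then be given its token via `nodetool move` (or InitialToken at bootstrap).

```python
def balanced_tokens(n):
    """Evenly spaced initial tokens for the 0.6 RandomPartitioner,
    whose token space is the integers [0, 2**127)."""
    return [i * (2 ** 127) // n for i in range(n)]

tokens = balanced_tokens(4)
assert tokens[0] == 0
# Every node owns an equal arc of the ring:
gaps = {(tokens[(i + 1) % 4] - tokens[i]) % 2 ** 127 for i in range(4)}
assert len(gaps) == 1
```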
Re: Repair help
recommend testing the waters on release software (0.6.x), not beta. On Thu, Aug 26, 2010 at 2:53 PM, Mark static.void@gmail.com wrote: I have a 2 node cluster (testing the waters) w/ a replication factor of 2. One node got completely screwed up (see any of my previous messages from today) so I deleted the commit log and data directory. I restarted the node and ran nodetool repair as described in http://wiki.apache.org/cassandra/Operations. I waited for over an hour and checked my ring only to find that nothing was repaired/replicated??? I only have a mere 7gigs of data so I would have thought this would have been fairly quick? Address Status State Load Token 129447565151094499156612104441060791022 x.x.x.x Up Normal 7.31 GB 12949228055906550350782255148181029323 x.x.x.x Up Normal 30.01 MB 129447565151094499156612104441060791022 I tried the alternative method of manually removing the token and then bootstrapping; however, when I tried to remove the token via nodetool removetoken an IllegalStateException was thrown... replication factor (2) exceeds number of endpoints (1) What should I do in this situation to get my node back up to where it should be? Is there anywhere I can check that the repair is actually running? Thanks for any suggestions ps I'm using 0.7.0 beta 1
Re: Follow-up post on cassandra configuration with some experiments on GC tuning
imo, these should be part of the defaults. On Tue, Aug 24, 2010 at 8:29 AM, Mikio Braun mi...@cs.tu-berlin.de wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Dear all, thanks again for all the comments I got on my last post. I've played a bit with different GC settings and got my Cassandra instance to run very nicely with 8GB of heap. I summarized my experiences with GC tuning in this follow-up post: http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html - -M - -- Dr. Mikio Braun email: mi...@cs.tu-berlin.de TU Berlin web: ml.cs.tu-berlin.de/~mikio Franklinstr. 28/29 tel: +49 30 314 78627 10587 Berlin, Germany -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkxz5WcACgkQtnXKX8rQtgDiiwCeLknuTcr65eehwIcsivInjv4W LaQAn3RY9pH19r8SuUhVBvtE6LeyFUvB =MYsY -END PGP SIGNATURE-
Re: get_slice slow
Todd, This is a really bad idea. What you are likely doing is spreading that single row across a large number of sstables. The more columns you insert, the more sstables you are likely inspecting, the longer the get_slice operations will take. You can test whether this is so by running nodetool compact when things start slowing down. If it speeds up, that is likely the problem. If you are deleting that much, you should also tune GCGraceSeconds way down (from the default of 10 days) so the space is reclaimed on major compaction and, again, there are fewer things to inspect. Long rows written over long periods of time are almost certain to give worse read performance, even far worse, than rows written all at once. b On Tue, Aug 24, 2010 at 10:17 PM, B. Todd Burruss bburr...@real.com wrote: thx artie, i haven't used a super CF because i thought it has more trouble doing slices because the entire row must be deserialized to get to the subcolumn you want? iostat is nothing, 0.0. i have plenty of RAM and the OS is I/O caching nicely i haven't used the key cache, because i only have one key, the row of the queue ;) i haven't used row cache because i need the row to grow quite large, millions of columns. and the size of data could be arbitrary - right now i am testing with 32 byte values per column. i do need quorum consistency. i have read previous that some folks are using a single row with millions of columns. is anyone using get_slice to pick off the first or the last column in the row? On 08/24/2010 09:25 PM, Artie Copeland wrote: Have you tried using a super column, it seems that having a row with over 100K columns and growing would be alot for cassandra to deserialize? what is iostat and jmeter telling you? it would be interesting to see that data. also what are you using for you key or row caching? do you need to use a quorum consistency as that can slow down reads as well, can you use a lower consistency level? Artie On Tue, Aug 24, 2010 at 9:14 PM, B. 
Todd Burruss bburr...@real.com wrote: i am using get_slice to pull columns from a row to emulate a queue. column names are TimeUUID and the values are small, 32 bytes. simple ColumnFamily. i am using SlicePredicate like this to pull the first (oldest) column in the row: SlicePredicate predicate = new SlicePredicate(); predicate.setSlice_range(new SliceRange(new byte[] {}, new byte[] {}, false, 1)); get_slice(rowKey, colParent, predicate, QUORUM); once i get the column i remove it. so there are a lot of gets and mutates, leaving lots of deleted columns. get_slice starts off performing just fine, but then falls off dramatically as the number of columns grows. at its peak there are 100,000 columns and get_slice is taking over 100ms to return. i am running a single instance of cassandra 0.7 on localhost, default config. i've done some googling and can't find any tweaks or tuning suggestions specific to get_slice. i already know about separating commitlog and data, watching iostat, GC, etc. any low hanging tuning fruit anyone can think of? in 0.6 i recall an index for columns, maybe that is what i need? thx -- http://yeslinux.org http://yestech.org
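The explanation above (deleted columns linger as tombstones until GCGraceSeconds elapses and compaction runs, so a first-column slice must step over the whole backlog) can be modeled in a few lines. Illustrative only, not Cassandra's read path:

```python
def first_live_column(row):
    """Return (name, cells_examined): a forward get_slice must step over
    every tombstone that precedes the first live column."""
    scanned = 0
    for name, deleted in row:
        scanned += 1
        if not deleted:
            return name, scanned
    return None, scanned

# A queue emulated as one row: consumed entries become tombstones that
# linger until GCGraceSeconds passes and a compaction removes them.
row = [("col-%05d" % i, True) for i in range(10000)] + [("col-10000", False)]
head, cost = first_live_column(row)
assert head == "col-10000"
assert cost == 10001  # slice work grows with the tombstone backlog
```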
Re: Does the scan speed with CL.ALL is faster than CL.QUORUM and CL.ONE?
Did you run the tests in this order without changing anything but CL? You may be seeing the effects of OS page caching. Run them in the reverse order and see if the difference persists. On Tue, Aug 24, 2010 at 11:52 PM, ring_ayumi_king ring_ayumi_k...@yahoo.com.tw wrote: Hi all, I ran my benchmark (OPP via get_range_slices) and found the following: Why is the scan speed with CL.ALL faster than CL.QUORUM and CL.ONE? CL.ONE (1k per row, Count:5) scan :11095 ms scan per:0.2219 ms scan thput:4506.5347 ops/sec CL.QUORUM scan :11072 ms scan per:0.22144 ms scan thput:4515.896 ops/sec CL.ALL scan :7869 ms scan per:0.15738 ms scan thput:6354.0474 ops/sec Thanks. Shen
Re: Cassandra and Lucene
Please put your storage-conf.xml and cassandra.in.sh files on pastie/dpaste/gist and send the link. (moving it back to the user list again) On Sun, Jul 25, 2010 at 11:51 PM, Michelan Arendse miche...@hermanus.cc wrote: I have 2 seeds in my cluster, with a replication of 2. I am using cassandra 0.6.2. It keeps running out of memory so I don't know if there are some memory leaks. This is what is in the log: at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) ... 2 more ERROR [GC inspection] 2010-07-22 18:41:57,157 CassandraDaemon.java (line 78) Fatal exception in thread Thread[GC inspection,5,main] java.lang.OutOfMemoryError: Java heap space at java.util.AbstractList.iterator(AbstractList.java:273) at org.apache.cassandra.service.GCInspector.logIntervalGCStats(GCInspector.java:82) at org.apache.cassandra.service.GCInspector.access$000(GCInspector.java:38) at org.apache.cassandra.service.GCInspector$1.run(GCInspector.java:74) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) On Mon, Jul 26, 2010 at 2:14 AM, Aaron Morton aa...@thelastpickle.comwrote: You may need to provide a some more information. What's the cluster configuration, what version, what's in the logs etc. Aaron On 24 Jul, 2010,at 03:40 AM, Michelan Arendse miche...@hermanus.cc wrote: Hi I have recently started working on Cassandra as I need to make a distribute Lucene index and found that Lucandra was the best for this. Since then I have configured everything and it's working ok. Now the problem comes in when I need to write this Lucene index to Cassandra or convert it so that Cassandra can read it. The test index is 32 gigs and i find that Cassandra times out alot. What happens can't Cassandra take that load? Please any help will be great. Kind Regards,
Re: Node OOM Problems
How much storage do you need? 240G SSDs quite capable of saturating a 3Gbps SATA link are $600. Larger ones are also available with similar performance. Perhaps you could share a bit more about the storage and performance requirements. How needing SSDs to sustain 10k writes/sec PER NODE WITH LINEAR SCALING breaks down the commodity server concept eludes me. b On Sat, Aug 21, 2010 at 11:27 PM, Wayne wav...@gmail.com wrote: Thank you for the advice, I will try these settings. I am running defaults right now. The disk subsystem is one SATA disk for commitlog and 4 SATA disks in raid 0 for the data. From your email you are implying this hardware cannot handle this level of sustained writes? That kind of breaks down the commodity server concept for me. I have never used anything but a 15k SAS disk (fastest disk money could buy until SSD) ALWAYS with a database. I have tried to throw out that mentality here, but are you saying nothing has really changed? Spindles, spindles, spindles, as fast as you can afford, is what I have always known... I guess that applies here? Do I need to spend $10k per node instead of $3.5k to get SUSTAINED 10k writes/sec per node? On Sat, Aug 21, 2010 at 11:03 PM, Benjamin Black b...@b3k.us wrote: My guess is that you have (at least) 2 problems right now: You are writing 10k ops/sec to each node, but have default memtable flush settings. This is resulting in memtable flushing every 30 seconds (default ops flush setting is 300k). You thus have a proliferation of tiny sstables and are seeing minor compactions triggered every couple of minutes. You have started a major compaction which is now competing with those near constant minor compactions for far too little I/O (3 SATA drives in RAID0, perhaps?). Normally, this would result in a massive ballooning of your heap use as all sorts of activities (like memtable flushes) backed up, as well. 
I suggest you increase the memtable flush ops to at least 10 (million) if you are going to sustain that many writes/sec, along with an increase in the flush MB to match, based on your typical bytes/write op. Long term, this level of write activity demands a lot faster storage (iops and bandwidth). b On Sat, Aug 21, 2010 at 2:18 AM, Wayne wav...@gmail.com wrote: I am already running with those options. I thought maybe that is why they never get completed, as they keep getting pushed down in priority? I am getting timeouts now and then, but for the most part the cluster keeps running. Is it normal/ok for the repair and compaction to take so long? It has been over 12 hours since they were submitted. On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis jbel...@gmail.com wrote: yes, the AES is the repair. if you are running linux, try adding the options to reduce compaction priority from http://wiki.apache.org/cassandra/PerformanceTuning On Sat, Aug 21, 2010 at 3:17 AM, Wayne wav...@gmail.com wrote: I could tell from munin that the disk utilization was getting crazy high, but the strange thing is that it seemed to stall. The utilization went way down and everything seemed to flatten out. Requests piled up and the node was doing nothing. It did not crash but was left in a useless state. I do not have access to the tpstats when that occurred. Attached is the munin chart, and you can see the flat line after Friday at noon. I have reduced the writers from 10 per node to 8 per node and they seem to be still running, but I am afraid they are barely hanging on. I ran nodetool repair after rebooting the failed node and I do not think the repair ever completed. I also later ran compact on each node; on some it finished but on some it did not. Below is the tpstats currently for the node I had to restart. Is the AE-SERVICE-STAGE the repair and compaction queued up? It seems several nodes are not getting enough free cycles to keep up. 
They are not timing out (30 sec timeout) for the most part but they are also not able to compact. Is this normal? Do I just give it time? I am migrating 2-3 TB of data from Mysql so the load is constant and will be for days and it seems even with only 8 writer processes per node I am maxed out. Thanks for the advice. Any more pointers would be greatly appreciated.
Pool Name                    Active   Pending    Completed
FILEUTILS-DELETE-POOL             0         0         1868
STREAM-STAGE                      1         1            2
RESPONSE-STAGE                    0         2    769158645
ROW-READ-STAGE                    0         0       140942
LB-OPERATIONS                     0         0            0
MESSAGE-DESERIALIZER-POOL         1         0   1470221842
GMFD                              0         0       169712
LB-TARGET                         0         0            0
CONSISTENCY-MANAGER
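In 0.6, the memtable thresholds Benjamin refers to live in storage-conf.xml. A sketch of the direction he suggests (the element names are the 0.6 directives; the values here are illustrative assumptions, not tested recommendations):

```xml
<!-- storage-conf.xml (0.6.x): flush memtables far less often under a
     sustained 10k writes/sec load. Tune throughput to your typical
     bytes per write op. -->
<MemtableOperationsInMillions>10</MemtableOperationsInMillions>
<MemtableThroughputInMB>1024</MemtableThroughputInMB>
```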
Re: Node OOM Problems
I see no reason to make that assumption. Cassandra currently has no mechanism to alternate in that manner. At the update rate you require, you just need more disk io (bandwidth and iops). Alternatively, you could use a bunch more, smaller nodes with the same SATA RAID setup so they each take many fewer writes/sec, and so can keep up with compaction. On Sun, Aug 22, 2010 at 12:00 AM, Wayne wav...@gmail.com wrote: Due to compaction being so expensive in terms of disk resources, does it make more sense to have 2 data volumes instead of one? We have 4 data disks in raid 0; would this make more sense as 2 x 2 disks in raid 0? That way the reader and writer, I assume, would always be on different sets of spindles? On Sun, Aug 22, 2010 at 8:27 AM, Wayne wav...@gmail.com wrote: Thank you for the advice, I will try these settings. I am running defaults right now. The disk subsystem is one SATA disk for commitlog and 4 SATA disks in raid 0 for the data. From your email you are implying this hardware cannot handle this level of sustained writes? That kind of breaks down the commodity server concept for me. I have never used anything but a 15k SAS disk (fastest disk money could buy until SSD) ALWAYS with a database. I have tried to throw out that mentality here, but are you saying nothing has really changed? Spindles, spindles, spindles, as fast as you can afford, is what I have always known... I guess that applies here? Do I need to spend $10k per node instead of $3.5k to get SUSTAINED 10k writes/sec per node? On Sat, Aug 21, 2010 at 11:03 PM, Benjamin Black b...@b3k.us wrote: My guess is that you have (at least) 2 problems right now: You are writing 10k ops/sec to each node, but have default memtable flush settings. This is resulting in memtable flushing every 30 seconds (default ops flush setting is 300k). You thus have a proliferation of tiny sstables and are seeing minor compactions triggered every couple of minutes. 
You have started a major compaction which is now competing with those near constant minor compactions for far too little I/O (3 SATA drives in RAID0, perhaps?). Normally, this would result in a massive ballooning of your heap use as all sorts of activities (like memtable flushes) backed up, as well. I suggest you increase the memtable flush ops to at least 10 (million) if you are going to sustain that many writes/sec, along with an increase in the flush MB to match, based on your typical bytes/write op. Long term, this level of write activity demands a lot faster storage (iops and bandwidth). b On Sat, Aug 21, 2010 at 2:18 AM, Wayne wav...@gmail.com wrote: I am already running with those options. I thought maybe that is why they never get completed as they keep pushed pushed down in priority? I am getting timeouts now and then but for the most part the cluster keeps running. Is it normal/ok for the repair and compaction to take so long? It has been over 12 hours since they were submitted. On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis jbel...@gmail.com wrote: yes, the AES is the repair. if you are running linux, try adding the options to reduce compaction priority from http://wiki.apache.org/cassandra/PerformanceTuning On Sat, Aug 21, 2010 at 3:17 AM, Wayne wav...@gmail.com wrote: I could tell from munin that the disk utilization was getting crazy high, but the strange thing is that it seemed to stall. The utilization went way down and everything seemed to flatten out. Requests piled up and the node was doing nothing. It did not crash but was left in a useless state. I do not have access to the tpstats when that occurred. Attached is the munin chart, and you can see the flat line after Friday at noon. I have reduced the writers from 10 per to 8 per node and they seem to be still running, but I am afraid they are barely hanging on. I ran nodetool repair after rebooting the failed node and I do not think the repair ever completed. 
I also later ran compact on each node and some it finished but some it did not. Below is the tpstats currently for the node I had to restart. Is the AE-SERVICE-STAGE the repair and compaction queued up? It seems several nodes are not getting enough free cycles to keep up. They are not timing out (30 sec timeout) for the most part but they are also not able to compact. Is this normal? Do I just give it time? I am migrating 2-3 TB of data from Mysql so the load is constant and will be for days and it seems even with only 8 writer processes per node I am maxed out. Thanks for the advice. Any more pointers would be greatly appreciated. Pool Name Active Pending Completed FILEUTILS-DELETE-POOL 0 0 1868 STREAM-STAGE 1 1 2 RESPONSE-STAGE 0
Re: Node OOM Problems
Is the need for 10k/sec/node just for bulk loading of data, or is it how your app will operate normally? Those are very different things. On Sun, Aug 22, 2010 at 4:11 AM, Wayne wav...@gmail.com wrote: Currently each node has 4x1TB SATA disks. In MySQL we have 15 TB currently with no replication. To move this to Cassandra at replication factor 3 we need 45 TB, assuming the space usage is the same, but it is probably more. We had assumed a 30 node cluster with 4 TB per node would suffice, with head room for compaction and for growth (120 TB). SSD drives for 30 nodes in this size range are not cost feasible for us. We can try to use 15k SAS drives and have more spindles, but then our per node cost goes up. I guess I naively thought Cassandra would do its magic and a few commodity SATA hard drives would be fine. Our performance requirement does not need 10k writes/node/sec 24 hours a day, but if we cannot get really good performance the switch from MySQL becomes harder to rationalize. We can currently restore from a MySQL dump a 2.5 terabyte backup (plain old insert statements) in 4-5 days. I expect as much or more from Cassandra, and I feel years away from simply loading 2+ TB into Cassandra without so many issues. What is really required in hardware for a 100+ TB cluster with near 10k/sec write performance sustained? If the answer is SSD, what can be expected from 15k SAS drives, and what from SATA? Thank you for your advice, I am struggling with how to make this work. Any insight you can provide would be greatly appreciated. On Sun, Aug 22, 2010 at 8:58 AM, Benjamin Black b...@b3k.us wrote: How much storage do you need? 240G SSDs quite capable of saturating a 3Gbps SATA link are $600. Larger ones are also available with similar performance. Perhaps you could share a bit more about the storage and performance requirements. How needing SSDs to sustain 10k writes/sec PER NODE WITH LINEAR SCALING breaks down the commodity server concept eludes me. 
b On Sat, Aug 21, 2010 at 11:27 PM, Wayne wav...@gmail.com wrote: Thank you for the advice, I will try these settings. I am running defaults right now. The disk subsystem is one SATA disk for commitlog and 4 SATA disks in raid 0 for the data. From your email you are implying this hardware can not handle this level of sustained writes? That kind of breaks down the commodity server concept for me. I have never used anything but a 15k SAS disk (fastest disk money could buy until SSD) ALWAYS with a database. I have tried to throw out that mentality here but are you saying nothing has really changed/ Spindles spindles spindles as fast as you can afford is what I have always known...I guess that applies here? Do I need to spend $10k per node instead of $3.5k to get SUSTAINED 10k writes/sec per node? On Sat, Aug 21, 2010 at 11:03 PM, Benjamin Black b...@b3k.us wrote: My guess is that you have (at least) 2 problems right now: You are writing 10k ops/sec to each node, but have default memtable flush settings. This is resulting in memtable flushing every 30 seconds (default ops flush setting is 300k). You thus have a proliferation of tiny sstables and are seeing minor compactions triggered every couple of minutes. You have started a major compaction which is now competing with those near constant minor compactions for far too little I/O (3 SATA drives in RAID0, perhaps?). Normally, this would result in a massive ballooning of your heap use as all sorts of activities (like memtable flushes) backed up, as well. I suggest you increase the memtable flush ops to at least 10 (million) if you are going to sustain that many writes/sec, along with an increase in the flush MB to match, based on your typical bytes/write op. Long term, this level of write activity demands a lot faster storage (iops and bandwidth). b On Sat, Aug 21, 2010 at 2:18 AM, Wayne wav...@gmail.com wrote: I am already running with those options. 
I thought maybe that is why they never get completed as they keep pushed pushed down in priority? I am getting timeouts now and then but for the most part the cluster keeps running. Is it normal/ok for the repair and compaction to take so long? It has been over 12 hours since they were submitted. On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis jbel...@gmail.com wrote: yes, the AES is the repair. if you are running linux, try adding the options to reduce compaction priority from http://wiki.apache.org/cassandra/PerformanceTuning On Sat, Aug 21, 2010 at 3:17 AM, Wayne wav...@gmail.com wrote: I could tell from munin that the disk utilization was getting crazy high, but the strange thing is that it seemed to stall. The utilization went way down and everything seemed to flatten out. Requests piled up and the node was doing nothing. It did not crash
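The sizing discussed in this thread can be sanity-checked with simple arithmetic (a sketch; the 15 TB data set, RF=3, 4 TB/node, and 30-node figures are taken from the thread, everything else follows from them):

```shell
# Back-of-envelope capacity check for the cluster discussed above.
RAW_TB=15        # current MySQL data set
RF=3             # Cassandra replication factor
NODE_TB=4        # 4x1TB SATA per node
NODES=30

echo "replicated data:  $((RAW_TB * RF)) TB"      # 45 TB
echo "cluster capacity: $((NODES * NODE_TB)) TB"  # 120 TB
# Headroom matters: a major compaction can transiently need up to ~2x
# a node's data size on disk, so usable capacity is well under 120 TB.
```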
Re: Node OOM Problems
Wayne, Bulk loading this much data is a very different prospect from needing to sustain that rate of updates indefinitely. As was suggested earlier, you likely need to tune things differently, including disabling minor compactions during the bulk load, to make this work efficiently. b On Sun, Aug 22, 2010 at 12:40 PM, Wayne wav...@gmail.com wrote: Has anyone loaded 2+ terabytes of real data in one stretch into a cluster without bulk loading and without any problems? How long did it take? What kind of nodes were used? How many writes/sec/node can be sustained for 24+ hours? On Sun, Aug 22, 2010 at 8:22 PM, Peter Schuller peter.schul...@infidyne.com wrote: I only sifted recent history of this thread (for time reasons), but: You have started a major compaction which is now competing with those near constant minor compactions for far too little I/O (3 SATA drives in RAID0, perhaps?). Normally, this would result in a massive ballooning of your heap use as all sorts of activities (like memtable flushes) backed up, as well. AFAIK memtable flushing is unrelated to compaction in the sense that they occur concurrently and don't block each other (except to the extent that they truly do compete for e.g. disk or CPU resources). While small memtables do indeed mean more compaction activity in total, the expensiveness of any given compaction should not be severely affected. As far as I can tell, the two primary effects of small memtable sizes are:
* An increase in the total amount of compaction work done for a given database size.
* An increase in the number of sstables that may accumulate while larger compactions are running.
** That in turn is particularly relevant because it can generate a lot of seek-bound activity; consider for example range queries that end up spanning 10 000 files on disk. 
If memtable flushes are not able to complete fast enough to cope with write activity, even if that is the case only during concurrent compaction (for whatever reason), that suggests to me that write activity is too high. Increasing memtable sizes may help on average due to decreased compaction work, but I don't see why it would significantly affect the performance once compactions *do* in fact run. With respect to timeouts on writes: I make no claims as to whether it is expected, because I have not yet investigated, but I definitely see sporadic slowness when benchmarking high-throughput writes on a cassandra trunk snapshot somewhere between 0.6 and 0.7. This occurs even when writing to a machine where the commit log and data directories are both on separate RAID volumes that are battery backed and should have no trouble eating write bursts (and the data is such that one is CPU bound rather than disk bound on average; so it only needs to eat bursts). I've had to add re-try to the benchmarking tool (or else up the timeout) because the default was not enough. I have not investigated exactly why this happens, but it's an interesting effect that as far as I can tell should not be there. Have other people done high-throughput writes (to the point of CPU saturation) over extended periods of time while consistently seeing low latencies (consistency meaning never exceeding hundreds of ms over several days)? -- / Peter Schuller
Re: Node OOM Problems
On Sun, Aug 22, 2010 at 2:03 PM, Wayne wav...@gmail.com wrote:
> From a testing whether cassandra can take the load long term I do not see it as different. Yes bulk loading can be made faster using very different methods, but my purpose is to test cassandra with a large volume of writes (and not to bulk load as efficiently as possible).
Then you need far more IO, whether it comes from faster drives or more nodes. If you can achieve 10k writes/sec/node and linear scaling without sharding in MySQL on cheap, commodity hardware, then I am impressed.
> I have scaled back to 5 writer threads per node and still see 8k writes/sec/node. With the larger memory table settings we shall see how it goes. I have no idea how to change a JMX setting and prefer to use std options, to be frank. For us this is after all an evaluation of whether Cassandra can replace MySQL. I thank everyone for their help.
If you want best performance, you must tune the system appropriately. If you want to use the base settings (which are intended for the 1G max heap, which is way too small for anything interesting), expect suboptimal performance for your application.
On Sun, Aug 22, 2010 at 10:37 PM, Benjamin Black b...@b3k.us wrote: Wayne, Bulk loading this much data is a very different prospect from needing to sustain that rate of updates indefinitely. As was suggested earlier, you likely need to tune things differently, including disabling minor compactions during the bulk load, to make this work efficiently. b On Sun, Aug 22, 2010 at 12:40 PM, Wayne wav...@gmail.com wrote: Has anyone loaded 2+ terabytes of real data in one stretch into a cluster without bulk loading and without any problems? How long did it take? What kind of nodes were used? How many writes/sec/node can be sustained for 24+ hours? 
On Sun, Aug 22, 2010 at 8:22 PM, Peter Schuller peter.schul...@infidyne.com wrote: I only sifted recent history of this thread (for time reasons), but: You have started a major compaction which is now competing with those near constant minor compactions for far too little I/O (3 SATA drives in RAID0, perhaps?). Normally, this would result in a massive ballooning of your heap use as all sorts of activities (like memtable flushes) backed up, as well. AFAIK memtable flushing is unrelated to compaction in the sense that they occur concurrently and don't block each other (except to the extent that they truly do compete for e.g. disk or CPU resources). While small memtables do indeed mean more compaction activity in total, the expensiveness of any given compaction should not be severely affecting. As far as I can tell, the two primary effects of small memtable sizes are: * An increase in total amount of compaction work done in total for a given database size. * An increase in the number of sstables that may accumulate while larger compactions are running. ** That in turn is particularly relevant because it can generate a lot of seek-bound activity; consider for example range queries that end up spanning 10 000 files on disk. If memtable flushes are not able to complete fast enough to cope with write activity, even if that is the case only during concurrenct compaction (for whatever reason), that suggests to me that write activity is too high. Increasing memtable sizes may help on average due to decreased compaction work, but I don't see why it would significantly affect the performance one compactions *do* in fact run. With respect to timeouts on writes: I make no claims as to whether it is expected, because I have not yet investigated, but I definitely see sporadic slowness when benchmarking high-throughput writes on a cassandra trunk snapshot somewhere between 0.6 and 0.7. 
This occurs even when writing to a machine where the commit log and data directories are both on separate RAID volumes that are battery backed and should have no trouble eating write bursts (and the data is such that one is CPU bound rather than diskbound on average; so it only needs to eat bursts). I've had to add re-try to the benchmarking tool (or else up the timeout) because the default was not enough. I have not investigated exactly why this happens but it's an interesting effect that as far as I can tell should not be there. Haver other people done high-throughput writes (to the point of CPU saturation) over extended periods of time while consistently seeing low latencies (consistencty meaning never exceeding hundreds of ms over several days)? -- / Peter Schuller
Re: Cassandra Nodes Freeze/Down for ConcurrentMarkSweep GC?
http://riptano.blip.tv/file/4012133/ On Sun, Aug 22, 2010 at 12:11 PM, Moleza Moleza mole...@gmail.com wrote: Hi, I am setting up a cluster on a linux box. Everything seems to be working great and I am watching the ring with: watch -d -n 2 nodetool -h localhost ring Suddenly, I see that one of the nodes just went down (at 14:07): Status changed from Up to Down. 13 minutes later (without any intervention) the node comes back Up (by itself). I check the logs (see at end of text) on that node and see that there is nothing in the log from 14:07 until 14:20 (13 minutes later). I also notice the GC ConcurrentMarkSweep took 13 minutes. Here are my questions:
[1] Is this behavior normal?
[2] Has it been observed by someone else before?
[3] The node being down means that nodetool, and any other client, won't be able to connect to it (clients should use other nodes in the cluster to write data). Correct?
[4] Is GC ConcurrentMarkSweep a Stop-The-World situation, where the JVM cannot do anything else? Hence the node is technically Down? Correct?
[5] Why is this GC taking such a long time? (see JVM args posted below).
[6] Any JVM args (switches) I can use to prevent this? 
--
JVM_OPTS=" \
        -Dprog=Cassandra \
        -ea \
        -Xms12G \
        -Xmx12G \
        -XX:+UseParNewGC \
        -XX:+UseConcMarkSweepGC \
        -XX:+CMSParallelRemarkEnabled \
        -XX:SurvivorRatio=8 \
        -XX:MaxTenuringThreshold=1 \
        -XX:+HeapDumpOnOutOfMemoryError \
        -Dcom.sun.management.jmxremote.port=8080 \
        -Dcom.sun.management.jmxremote.ssl=false \
        -Dcom.sun.management.jmxremote.authenticate=false"
Log Extract ##
INFO [GC inspection] 2010-08-22 14:06:48,622 GCInspector.java (line 116) GC for ParNew: 235 ms, 134504976 reclaimed leaving 12721498296 used; max is 13005881344
INFO [FLUSH-TIMER] 2010-08-22 14:19:45,429 ColumnFamilyStore.java (line 357) HintsColumnFamily has reached its threshold; switching in a fresh Memtable at CommitLogContext(file='/var/nes/data1/cassandra_commitlog/CommitLog-1282500306160.log', position=55517352)
INFO [FLUSH-TIMER] 2010-08-22 14:19:45,429 ColumnFamilyStore.java (line 609) Enqueuing flush of memtable-hintscolumnfam...@1935604258(3147 bytes, 433 operations)
INFO [FLUSH-WRITER-POOL:1] 2010-08-22 14:19:45,430 Memtable.java (line 148) Writing memtable-hintscolumnfam...@1935604258(3147 bytes, 433 operations)
INFO [GC inspection] 2010-08-22 14:19:45,917 GCInspector.java (line 116) GC for ParNew: 215 ms, 130254256 reclaimed leaving 12742982208 used; max is 13005881344
INFO [GC inspection] 2010-08-22 14:19:45,973 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 775679 ms, 12685881488 reclaimed leaving 196692400 used; max is 13005881344
--
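A commonly tried adjustment (an assumption on my part, not something stated in this thread) is to make CMS start earlier: with roughly 12.7 of 13 GB used, the collector is kicking in far too late, and a 775-second "concurrent" collection is also a classic sign the box may be swapping:

```shell
# Illustrative additions to JVM_OPTS. Start CMS when the old gen is
# 75% full instead of letting the JVM decide, so it can finish before
# the heap fills and forces a long stop-the-world collection.
JVM_OPTS="$JVM_OPTS \
        -XX:CMSInitiatingOccupancyFraction=75 \
        -XX:+UseCMSInitiatingOccupancyOnly"
# Also check swap usage; paging during GC can turn a pause of seconds
# into minutes.
```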
Re: Node OOM Problems
Perhaps I missed it in one of the earlier emails, but what is your disk subsystem config? On Sat, Aug 21, 2010 at 2:18 AM, Wayne wav...@gmail.com wrote: I am already running with those options. I thought maybe that is why they never get completed, as they keep getting pushed down in priority? I am getting timeouts now and then, but for the most part the cluster keeps running. Is it normal/ok for the repair and compaction to take so long? It has been over 12 hours since they were submitted. On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis jbel...@gmail.com wrote: yes, the AES is the repair. if you are running linux, try adding the options to reduce compaction priority from http://wiki.apache.org/cassandra/PerformanceTuning On Sat, Aug 21, 2010 at 3:17 AM, Wayne wav...@gmail.com wrote: I could tell from munin that the disk utilization was getting crazy high, but the strange thing is that it seemed to stall. The utilization went way down and everything seemed to flatten out. Requests piled up and the node was doing nothing. It did not crash but was left in a useless state. I do not have access to the tpstats when that occurred. Attached is the munin chart, and you can see the flat line after Friday at noon. I have reduced the writers from 10 per node to 8 per node and they seem to be still running, but I am afraid they are barely hanging on. I ran nodetool repair after rebooting the failed node and I do not think the repair ever completed. I also later ran compact on each node; on some it finished but on some it did not. Below is the tpstats currently for the node I had to restart. Is the AE-SERVICE-STAGE the repair and compaction queued up? It seems several nodes are not getting enough free cycles to keep up. They are not timing out (30 sec timeout) for the most part, but they are also not able to compact. Is this normal? Do I just give it time? 
I am migrating 2-3 TB of data from Mysql so the load is constant and will be for days and it seems even with only 8 writer processes per node I am maxed out. Thanks for the advice. Any more pointers would be greatly appreciated.
Pool Name                    Active   Pending    Completed
FILEUTILS-DELETE-POOL             0         0         1868
STREAM-STAGE                      1         1            2
RESPONSE-STAGE                    0         2    769158645
ROW-READ-STAGE                    0         0       140942
LB-OPERATIONS                     0         0            0
MESSAGE-DESERIALIZER-POOL         1         0   1470221842
GMFD                              0         0       169712
LB-TARGET                         0         0            0
CONSISTENCY-MANAGER               0         0            0
ROW-MUTATION-STAGE                0         1    865124937
MESSAGE-STREAMING-POOL            0         0            6
LOAD-BALANCER-STAGE               0         0            0
FLUSH-SORTER-POOL                 0         0            0
MEMTABLE-POST-FLUSHER             0         0         8088
FLUSH-WRITER-POOL                 0         0         8088
AE-SERVICE-STAGE                  1        34           54
HINTED-HANDOFF-POOL               0         0            7
On Fri, Aug 20, 2010 at 11:56 PM, Bill de hÓra b...@dehora.net wrote: On Fri, 2010-08-20 at 19:17 +0200, Wayne wrote:
WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602 MessageDeserializationTask.java (line 47) dropping message (1,078,378ms past timeout)
WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602 MessageDeserializationTask.java (line 47) dropping message (1,078,378ms past timeout)
MESSAGE-DESERIALIZER-POOL usually backs up when other stages are bogged downstream (eg here's Ben Black describing the symptom when the underlying cause is running out of disk bandwidth, well worth a watch http://riptano.blip.tv/file/4012133/). Can you send all of nodetool tpstats? Bill -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Node OOM Problems
My guess is that you have (at least) 2 problems right now: You are writing 10k ops/sec to each node, but have default memtable flush settings. This is resulting in memtable flushing every 30 seconds (default ops flush setting is 300k). You thus have a proliferation of tiny sstables and are seeing minor compactions triggered every couple of minutes. You have started a major compaction which is now competing with those near constant minor compactions for far too little I/O (3 SATA drives in RAID0, perhaps?). Normally, this would result in a massive ballooning of your heap use as all sorts of activities (like memtable flushes) backed up, as well. I suggest you increase the memtable flush ops to at least 10 (million) if you are going to sustain that many writes/sec, along with an increase in the flush MB to match, based on your typical bytes/write op. Long term, this level of write activity demands a lot faster storage (iops and bandwidth). b On Sat, Aug 21, 2010 at 2:18 AM, Wayne wav...@gmail.com wrote: I am already running with those options. I thought maybe that is why they never get completed, as they keep getting pushed down in priority? I am getting timeouts now and then, but for the most part the cluster keeps running. Is it normal/ok for the repair and compaction to take so long? It has been over 12 hours since they were submitted. On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis jbel...@gmail.com wrote: yes, the AES is the repair. if you are running linux, try adding the options to reduce compaction priority from http://wiki.apache.org/cassandra/PerformanceTuning On Sat, Aug 21, 2010 at 3:17 AM, Wayne wav...@gmail.com wrote: I could tell from munin that the disk utilization was getting crazy high, but the strange thing is that it seemed to stall. The utilization went way down and everything seemed to flatten out. Requests piled up and the node was doing nothing. It did not crash but was left in a useless state. 
I do not have access to the tpstats when that occurred. Attached is the munin chart, and you can see the flat line after Friday at noon. I have reduced the writers from 10 per to 8 per node and they seem to be still running, but I am afraid they are barely hanging on. I ran nodetool repair after rebooting the failed node and I do not think the repair ever completed. I also later ran compact on each node and some it finished but some it did not. Below is the tpstats currently for the node I had to restart. Is the AE-SERVICE-STAGE the repair and compaction queued up? It seems several nodes are not getting enough free cycles to keep up. They are not timing out (30 sec timeout) for the most part but they are also not able to compact. Is this normal? Do I just give it time? I am migrating 2-3 TB of data from Mysql so the load is constant and will be for days and it seems even with only 8 writer processes per node I am maxed out. Thanks for the advice. Any more pointers would be greatly appreciated. 
Pool Name                    Active   Pending    Completed
FILEUTILS-DELETE-POOL             0         0         1868
STREAM-STAGE                      1         1            2
RESPONSE-STAGE                    0         2    769158645
ROW-READ-STAGE                    0         0       140942
LB-OPERATIONS                     0         0            0
MESSAGE-DESERIALIZER-POOL         1         0   1470221842
GMFD                              0         0       169712
LB-TARGET                         0         0            0
CONSISTENCY-MANAGER               0         0            0
ROW-MUTATION-STAGE                0         1    865124937
MESSAGE-STREAMING-POOL            0         0            6
LOAD-BALANCER-STAGE               0         0            0
FLUSH-SORTER-POOL                 0         0            0
MEMTABLE-POST-FLUSHER             0         0         8088
FLUSH-WRITER-POOL                 0         0         8088
AE-SERVICE-STAGE                  1        34           54
HINTED-HANDOFF-POOL               0         0            7
On Fri, Aug 20, 2010 at 11:56 PM, Bill de hÓra b...@dehora.net wrote: On Fri, 2010-08-20 at 19:17 +0200, Wayne wrote:
WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602 MessageDeserializationTask.java (line 47) dropping message (1,078,378ms past timeout)
WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602 MessageDeserializationTask.java (line 47) dropping message (1,078,378ms past timeout)
MESSAGE-DESERIALIZER-POOL usually backs up when other stages are bogged downstream (eg here's Ben Black describing the symptom when the underlying cause is running out of disk bandwidth, well worth a watch http://riptano.blip.tv/file/4012133/). Can you send all of nodetool tpstats? Bill -- Jonathan Ellis Project Chair, Apache Cassandra
Re: Privileges
No. On Sat, Aug 21, 2010 at 4:19 PM, Mark static.void@gmail.com wrote: Is there any way to remove drop column family/keyspace privileges?
Re: Privileges
My mistake, the access levels in 0.7 do now distinguish these operations (at access level FULL). On Sat, Aug 21, 2010 at 4:19 PM, Mark static.void@gmail.com wrote: Is there any way to remove drop column family/keyspace privileges?
Re: Privileges
For reference, I learned this from reading the source: thrift/CassandraServer.java On Sat, Aug 21, 2010 at 4:19 PM, Mark static.void@gmail.com wrote: Is there any way to remove drop column family/keyspace privileges?
Re: questions regarding read and write in cassandra
More recent. Newest timestamp always wins. And I am moving this to the user list (again) so it can be with all its friendly threads on the exact same topic. On Thu, Aug 19, 2010 at 10:22 AM, Maifi Khan maifi.k...@gmail.com wrote: Hi David Thanks for your reply. But what happens if I read and get 2 nodes having value 10 with an older timestamp and a third node having 20 with a more recent timestamp? Will cassandra return 10 (majority) or 20 (more recent)? thanks Maifi On Thu, Aug 19, 2010 at 1:20 PM, David Timothy Strauss da...@fourkitchens.com wrote: The quorum write would fail, but the data would not be rolled back. Assuming the offline nodes recover, the data would eventually replicate. This question belongs on the user list, though. -Original Message- From: Maifi Khan maifi.k...@gmail.com Date: Thu, 19 Aug 2010 13:00:47 To: d...@cassandra.apache.org Reply-To: d...@cassandra.apache.org Subject: questions regarding read and write in cassandra Hi I have a question about the following scenario. Say we have 10 nodes and the replication factor is 5. Now, say, for row X, column Y, data is replicated to nodes 1,2,3,4,5 and the current value is 10. Say I issue a write with value “20” to row X, column Y with quorum (n/2+1 = 3 nodes). Say it updated nodes 1 and 2 and failed to update any other node. So it failed to write to 3 nodes. What happens in such a scenario? Q: Will the user get a failure back? Now, assuming that the write failed, what value will I see if I read the same cell with quorum? Say the read hits nodes 1, 4, 5 and sees that node 1 has the most recent data (“20”, which is still there as cassandra does not roll back). Will it return “20” to the user, or will it return the earlier value 10 as returned by nodes 4 and 5? Also, does read repair try to propagate 20 to all the replicas even though cassandra returned failure to the user? thanks
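The rule above (newest timestamp wins, no matter how many replicas still hold the older value) can be sketched in a few lines of Ruby; the reply hashes below are made up purely for illustration:

```ruby
# Toy sketch of Cassandra's read reconciliation: among the replica
# replies, the value with the newest timestamp wins. Two stale
# replicas "outvoting" one fresh replica is not how it works.
replies = [
  { node: 4, value: 10, timestamp: 100 },  # stale
  { node: 5, value: 10, timestamp: 100 },  # stale
  { node: 1, value: 20, timestamp: 200 }   # most recent write
]

winner = replies.max_by { |r| r[:timestamp] }
puts winner[:value]  # prints 20, not the "majority" value 10
```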
Re: Cassandra gem
great, thanks! On Tue, Aug 17, 2010 at 11:30 PM, Mark static.void@gmail.com wrote: On 8/17/10 5:44 PM, Benjamin Black wrote: Updated code is now in my master branch, with the reversion to 10.0.0. Please let me know of further trouble. b On Tue, Aug 17, 2010 at 8:31 AM, Markstatic.void@gmail.com wrote: On 8/16/10 11:37 PM, Benjamin Black wrote: I'm testing with the default cassandra.yaml. I cannot reproduce the output in that gist, however: thrift_client = client.instance_variable_get(:@client) = nil Also, the Thrift version for 0.7 is 11.0.0, according to the code I have. Can someone comment on whether 0.7 beta1 is at Thrift interface version 10.0.0 or 11.0.0? b On Mon, Aug 16, 2010 at 9:03 PM, Markstatic.void@gmail.com wrote: On 8/16/10 8:51 PM, Mark wrote: On 8/16/10 6:19 PM, Benjamin Black wrote: client = Cassandra.new('system', '127.0.0.1:9160') Brand new download of beta-0.7.0-beta1 http://gist.github.com/528357 Which thrift/thrift_client versions are you using? FYI also tested similar setup on another machine and same results. Is there any configuration change I need in cassandra.yaml or something? thrift_client = client.instance_variable_get(:@client) The above client will only be instantiated after making (or attempting in my case) a request. Works like a charm. Thanks
Re: Cassandra gem
thrift (0.2.0.4) thrift_client (0.4.6, 0.4.3) On Mon, Aug 16, 2010 at 8:51 PM, Mark static.void@gmail.com wrote: On 8/16/10 6:19 PM, Benjamin Black wrote: client = Cassandra.new('system', '127.0.0.1:9160') Brand new download of beta-0.7.0-beta1 http://gist.github.com/528357 Which thrift/thrift_client versions are you using?
Re: Cassandra gem
I'm testing with the default cassandra.yaml. I cannot reproduce the output in that gist, however: thrift_client = client.instance_variable_get(:@client) = nil Also, the Thrift version for 0.7 is 11.0.0, according to the code I have. Can someone comment on whether 0.7 beta1 is at Thrift interface version 10.0.0 or 11.0.0? b On Mon, Aug 16, 2010 at 9:03 PM, Mark static.void@gmail.com wrote: On 8/16/10 8:51 PM, Mark wrote: On 8/16/10 6:19 PM, Benjamin Black wrote: client = Cassandra.new('system', '127.0.0.1:9160') Brand new download of beta-0.7.0-beta1 http://gist.github.com/528357 Which thrift/thrift_client versions are you using? FYI also tested similar setup on another machine and same results. Is there any configuration change I need in cassandra.yaml or something?
Re: Cassandra gem
Then this may be the issue. I'll see if I can regenerate something with 10.0.0 version tomorrow. On Mon, Aug 16, 2010 at 11:45 PM, Thorvaldsson Justus justus.thorvalds...@svenskaspel.se wrote: Using beta, made a describe_version(), got 10.0.0 as reply, aint using gem though, just thrift from java /Justus -Ursprungligt meddelande- Från: Benjamin Black [mailto:b...@b3k.us] Skickat: den 17 augusti 2010 08:37 Till: user@cassandra.apache.org Ämne: Re: Cassandra gem I'm testing with the default cassandra.yaml. I cannot reproduce the output in that gist, however: thrift_client = client.instance_variable_get(:@client) = nil Also, the Thrift version for 0.7 is 11.0.0, according to the code I have. Can someone comment on whether 0.7 beta1 is at Thrift interface version 10.0.0 or 11.0.0? b On Mon, Aug 16, 2010 at 9:03 PM, Mark static.void@gmail.com wrote: On 8/16/10 8:51 PM, Mark wrote: On 8/16/10 6:19 PM, Benjamin Black wrote: client = Cassandra.new('system', '127.0.0.1:9160') Brand new download of beta-0.7.0-beta1 http://gist.github.com/528357 Which thrift/thrift_client versions are you using? FYI also tested similar setup on another machine and same results. Is there any configuration change I need in cassandra.yaml or something?
Re: move data between clusters
without answering your whole question, just fyi: there is a matching json2sstable command for going the other direction. On Tue, Aug 17, 2010 at 10:48 AM, Artie Copeland yeslinux@gmail.com wrote: what is the best way to move data between clusters. we currently have a 4 node prod cluster with 80G of data and want to move it to a dev env with 3 nodes. we have plenty of disk. we were looking into nodetool snapshot, but it looks like that won't work because of the system tables. sstable2json does look like it would work, though it would miss the index files. am i missing something? have others tried to do the same and been successful. thanx artie -- http://yeslinux.org http://yestech.org
Re: indexing rows ordered by int
http://code.google.com/p/redis/wiki/SortedSets On Tue, Aug 17, 2010 at 12:33 PM, S Ahmed sahmed1...@gmail.com wrote: So when using Redis, how do you go about updating the index? Do you serialize changes to the index i.e. when someone votes, you then update the index? Little confused as to how to go about updating a huge index. Say you have 1 million stores, and you want to order by the top votes, how would you maintain such an index since they are being constantly voted on. On Sun, Aug 15, 2010 at 10:48 PM, Chris Goffinet c...@chrisgoffinet.com wrote: Digg is using redis for such a feature as well. We use it on the MyNews - Top in 24 hours. Since we need timestamp ordering + sorting by how many friends touch a story. -Chris On Aug 15, 2010, at 7:34 PM, Benjamin Black wrote: http://code.google.com/p/redis/ On Sat, Aug 14, 2010 at 11:51 PM, S Ahmed sahmed1...@gmail.com wrote: For CF that I need to perform range scans on, I create separate CF that have custom ordering. Say a CF holds comments on a story (like comments on a reddit or digg story post) So if I need to order comments by votes, it seems I have to re-index every time someone votes on a comment (or batch it every x minutes). Right now I think I have to pull all the comments into memory, then sort by votes, then re-write the index. Are there any best-practises for this type of index?
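To make the sorted-set suggestion concrete: a vote updates a single score, and the "top N" read comes back already ordered, so there is no whole-index rebuild on every vote. Here is the shape of it in plain Ruby as a stand-in for redis ZINCRBY/ZREVRANGE (the names are illustrative, not a real redis client):

```ruby
# Minimal stand-in for a redis sorted set: each vote bumps one score,
# and "top N" is a sort over (member, score) pairs. With redis you
# would call ZINCRBY on a vote and ZREVRANGE to read, instead of
# re-sorting every comment in the application.
votes = Hash.new(0)  # member => score

def upvote(votes, comment_id)
  votes[comment_id] += 1  # ZINCRBY comments 1 comment_id
end

def top(votes, n)
  votes.sort_by { |_, score| -score }.first(n).map(&:first)  # ZREVRANGE
end

upvote(votes, "c1")
upvote(votes, "c2")
upvote(votes, "c2")
puts top(votes, 2).inspect  # prints ["c2", "c1"]
```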
Re: Cassandra gem
Updated code is now in my master branch, with the reversion to 10.0.0. Please let me know of further trouble. b On Tue, Aug 17, 2010 at 8:31 AM, Mark static.void@gmail.com wrote: On 8/16/10 11:37 PM, Benjamin Black wrote: I'm testing with the default cassandra.yaml. I cannot reproduce the output in that gist, however: thrift_client = client.instance_variable_get(:@client) = nil Also, the Thrift version for 0.7 is 11.0.0, according to the code I have. Can someone comment on whether 0.7 beta1 is at Thrift interface version 10.0.0 or 11.0.0? b On Mon, Aug 16, 2010 at 9:03 PM, Markstatic.void@gmail.com wrote: On 8/16/10 8:51 PM, Mark wrote: On 8/16/10 6:19 PM, Benjamin Black wrote: client = Cassandra.new('system', '127.0.0.1:9160') Brand new download of beta-0.7.0-beta1 http://gist.github.com/528357 Which thrift/thrift_client versions are you using? FYI also tested similar setup on another machine and same results. Is there any configuration change I need in cassandra.yaml or something? thrift_client = client.instance_variable_get(:@client) The above client will only be instantiated after making (or attempting in my case) a request.
Re: cassandra for a inbox search with high reading qps
On Tue, Aug 17, 2010 at 7:55 PM, Chen Xinli chen.d...@gmail.com wrote: Hi, We are going to use cassandra for searching purposes like inbox search. The reading qps is very high, so we'd like to use ConsistencyLevel.ONE for reading and disable read repair at the same time. In 0.7 you can set a probability for read repair, but disabling it is a spectacularly bad idea. Any write problems on a node will result in persistent inconsistency. For reading consistency in this condition, the writing should use ConsistencyLevel.ALL. But the writing will fail if one node fails. You are free to read and write with consistency levels where R+W < N, it just means you have weaker consistency guarantees. We want such a ConsistencyLevel for writing/reading that: 1. writing will succeed if there is a node alive for this key 2. reading will not be forwarded to a node that has just recovered and is still receiving hinted handoff So that, if some node fails, the other replica nodes will receive the data and serve reads successfully; when the failed node recovers, it will receive hinted handoff from other nodes and will not serve reads until hinted handoff is done. Does cassandra support these cases already? or should I modify the code to meet our requirements? You are phrasing these requirements in terms of a specific implementation. What are your actual consistency goals? If node failure is such a common occurrence in your system, you are going to have _numerous_ problems. b
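The trade-off Benjamin refers to is the usual R + W vs. N arithmetic: a read is only guaranteed to overlap the latest acknowledged write when the read set and write set must intersect, i.e. R + W > N. A tiny illustrative helper (not part of any client API):

```ruby
# A read is guaranteed to observe the latest acknowledged write only
# when the read replica set and write replica set must intersect.
def strongly_consistent?(r, w, n)
  r + w > n
end

n = 3  # replication factor
puts strongly_consistent?(2, 2, n)  # QUORUM read / QUORUM write: true
puts strongly_consistent?(1, 3, n)  # ONE read / ALL write:       true
puts strongly_consistent?(1, 1, n)  # ONE read / ONE write:       false
```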
Re: data deleted came back after 9 days.
On Tue, Aug 17, 2010 at 7:49 PM, Zhong Li z...@voxeo.com wrote: Those data were inserted on one node, then deleted on a remote node less than 2 seconds later. So it is very possible some node lost the tombstone when the connection was lost. My question: can a ConsistencyLevel.ALL read retrieve the lost tombstone, instead of running repair? No. Read repair does not replay operations. You must run nodetool repair. b
Re: File write errors but cassandra isn't crashing
Useful config option, perhaps? On Mon, Aug 16, 2010 at 8:51 AM, Jonathan Ellis jbel...@gmail.com wrote: That's a tough call -- you can also come up with scenarios where you'd rather have it read-only than completely dead. On Wed, Aug 11, 2010 at 12:38 PM, Ran Tavory ran...@gmail.com wrote: Due to an administrative error, one of the hosts in the cluster lost permission to write to its data directory. So I started seeing errors in the log; however, the server continued serving traffic. It wasn't able to compact and do other write operations, but it didn't crash. I was wondering whether that's by design and, if so, whether it is a good one... I guess I want to know if really bad things happen to my cluster... The logs look like this:

INFO [FLUSH-TIMER] 2010-08-11 07:53:14,683 ColumnFamilyStore.java (line 357) KvAds has reached its threshold; switching in a fresh Memtable at CommitLogContext(file='/outbrain/cassandra/commitlog/CommitLog-1281505164614.log', position=88521163)
INFO [FLUSH-TIMER] 2010-08-11 07:53:14,683 ColumnFamilyStore.java (line 609) Enqueuing flush of Memtable(KvAds)@851225759
INFO [FLUSH-WRITER-POOL:1] 2010-08-11 07:53:14,684 Memtable.java (line 148) Writing Memtable(KvAds)@851225759
ERROR [FLUSH-WRITER-POOL:1] 2010-08-11 07:53:14,688 DebuggableThreadPoolExecutor.java (line 94) Error in executor futuretask
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.FileNotFoundException: /outbrain/cassandra/data/outbrain_kvdb/KvAds-tmp-249-Data.db (Permission denied)
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
        at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by:
java.lang.RuntimeException: java.io.FileNotFoundException: /outbrain/cassandra/data/outbrain_kvdb/KvAds-tmp-249-Data.db (Permission denied)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        ... more

-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Cassandra gem
If you pulled before a couple hours ago and did not use the 'trunk' branch, then you don't have current code. I merged the trunk branch to master earlier today and sent a pull request for the fauna repo to get the changes, as well. Also fixed a bug another user found when running with Ruby 1.9. Summary: pull again, use master, have fun. If it still doesn't work, please open an issue to me. b On Mon, Aug 16, 2010 at 2:13 PM, Mark static.void@gmail.com wrote: Just upgraded my cassandra gem today to b/cassandra fork and noticed that the transport changed. I re-enabled TFramedTransport in cassandra.yml but my client no longer works. I keep receiving the following error. Thrift::ApplicationException: describe_keyspace failed: unknown result from workspace/vendor/plugins/cassandra/lib/../vendor/0.7/gen-rb/cassandra.rb:346:in `recv_describe_keyspace' from workspace/vendor/plugins/cassandra/lib/../vendor/0.7/gen-rb/cassandra.rb:335:in `describe_keyspace' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:67:in `send' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:67:in `send_rpc' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:164:in `send_rpc' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:63:in `proxy' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:154:in `proxy' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:53:in `handled_proxy' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:150:in `handled_proxy' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:23:in `describe_keyspace' from (irb):14 Any clues?
Re: Cassandra gem
can you gist the code? On Mon, Aug 16, 2010 at 5:46 PM, Mark static.void@gmail.com wrote: On 8/16/10 3:58 PM, Benjamin Black wrote: If you pulled before a couple hours ago and did not use the 'trunk' branch, then you don't have current code. I merged the trunk branch to master earlier today and sent a pull request for the fauna repo to get the changes, as well. Also fixed a bug another user found when running with Ruby 1.9. Summary: pull again, use master, have fun. If it still doesn't work, please open an issue to me. b On Mon, Aug 16, 2010 at 2:13 PM, Markstatic.void@gmail.com wrote: Just upgraded my cassandra gem today to b/cassandra fork and noticed that the transport changed. I re-enabled TFramedTransport in cassandra.yml but my client no longer works. I keep receiving the following error. Thrift::ApplicationException: describe_keyspace failed: unknown result from workspace/vendor/plugins/cassandra/lib/../vendor/0.7/gen-rb/cassandra.rb:346:in `recv_describe_keyspace' from workspace/vendor/plugins/cassandra/lib/../vendor/0.7/gen-rb/cassandra.rb:335:in `describe_keyspace' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:67:in `send' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:67:in `send_rpc' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:164:in `send_rpc' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:63:in `proxy' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:154:in `proxy' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:53:in `handled_proxy' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:150:in `handled_proxy' from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:23:in `describe_keyspace' from (irb):14 Any clues? Still getting the same error. 
describe_keyspaces doesnt work. Does it work for you? I am using: apache-cassandra-0.7.0-beta1-bin, thrift (0.2.0.4) thrift_client (0.4.6) Any clue? Thanks!
Re: Cassandra gem
$ irb require lib/cassandra/0.7 = true client = Cassandra.new('system', '127.0.0.1:9160') = #Cassandra:2160486220, @keyspace=system, @schema={}, @servers=[127.0.0.1:9160] client.keyspaces = [system] client.partitioner = org.apache.cassandra.dht.RandomPartitioner On Mon, Aug 16, 2010 at 5:46 PM, Mark static.void@gmail.com wrote: On 8/16/10 3:58 PM, Benjamin Black wrote: If you pulled before a couple hours ago and did not use the 'trunk' branch, then you don't have current code. I merged the trunk branch to master earlier today and sent a pull request for the fauna repo to get the changes, as well. Also fixed a bug another user found when running with Ruby 1.9. Summary: pull again, use master, have fun. If it still doesn't work, please open an issue to me. b On Mon, Aug 16, 2010 at 2:13 PM, Markstatic.void@gmail.com wrote: Just upgraded my cassandra gem today to b/cassandra fork and noticed that the transport changed. I re-enabled TFramedTransport in cassandra.yml but my client no longer works. I keep receiving the following error. 
Thrift::ApplicationException: describe_keyspace failed: unknown result
        from workspace/vendor/plugins/cassandra/lib/../vendor/0.7/gen-rb/cassandra.rb:346:in `recv_describe_keyspace'
        from workspace/vendor/plugins/cassandra/lib/../vendor/0.7/gen-rb/cassandra.rb:335:in `describe_keyspace'
        from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:67:in `send'
        from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:67:in `send_rpc'
        from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:164:in `send_rpc'
        from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:63:in `proxy'
        from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:154:in `proxy'
        from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:53:in `handled_proxy'
        from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:150:in `handled_proxy'
        from workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:23:in `describe_keyspace'
        from (irb):14

Any clues? Still getting the same error. describe_keyspaces doesnt work. Does it work for you? I am using: apache-cassandra-0.7.0-beta1-bin, thrift (0.2.0.4), thrift_client (0.4.6) Any clue? Thanks!
Re: indexing rows ordered by int
http://code.google.com/p/redis/ On Sat, Aug 14, 2010 at 11:51 PM, S Ahmed sahmed1...@gmail.com wrote: For CF that I need to perform range scans on, I create separate CF that have custom ordering. Say a CF holds comments on a story (like comments on a reddit or digg story post) So if I need to order comments by votes, it seems I have to re-index every time someone votes on a comment (or batch it every x minutes). Right now I think I have to pull all the comments into memory, then sort by votes, then re-write the index. Are there any best-practises for this type of index?
Re: Data Distribution / Replication
On Fri, Aug 13, 2010 at 9:48 AM, Oleg Anastasjev olega...@gmail.com wrote: Benjamin Black b at b3k.us writes: 3. I waited for the data to replicate, which didn't happen. Correct, you need to run nodetool repair because the nodes were not present when the writes came in. You can also use a higher consistency level to force read repair before returning data, which will incrementally repair things. Alternatively you could configure new nodes to bootstrap in storage-conf.xml. If bootstrap is enabled you'll get data replicated onto them as soon as they join the cluster. My recommendation is to leave AutoBootstrap disabled, copy the datafiles over, and then run cleanup. It is faster and more reliable than streaming, in my experience. b
Re: Data Distribution / Replication
Number of bugs I've hit doing this with scp: 0 Number of bugs I've hit with streaming: 2 (and others found more) Also easier to monitor progress, manage bandwidth, etc. I just prefer using specialized tools that are really good at specific things. This is such a case. b On Fri, Aug 13, 2010 at 2:05 PM, Bill de hÓra b...@dehora.net wrote: On Fri, 2010-08-13 at 09:51 -0700, Benjamin Black wrote: My recommendation is to leave Autobootstrap disabled, copy the datafiles over, and then run cleanup. It is faster and more reliable than streaming, in my experience. What is less reliable about streaming? Bill
Re: Data Distribution / Replication
On Thu, Aug 12, 2010 at 8:30 AM, Stefan Kaufmann sta...@gmail.com wrote: Hello again, over the last days I started several tests with Cassandra and learned quite some facts. However, of course, there are still enough things I need to understand. One thing is how the data replication works. For my testing: 1. I set the replication factor to 3, started with 1 active node (the seed) and I inserted some test keys. This is not a correct concept of what a seed is. I suggest you not use the word 'seed' for it. 2. I started 2 more nodes, which joined the cluster. 3. I waited for the data to replicate, which didn't happen. Correct, you need to run nodetool repair because the nodes were not present when the writes came in. You can also use a higher consistency level to force read repair before returning data, which will incrementally repair things. 4. I inserted more keys, and it looked like they were distributed to all three nodes. Correct, they were up at the time and received the write operations directly. Seems like you might benefit from reading the operations wiki: http://wiki.apache.org/cassandra/Operations b
Re: Growing commit log directory.
what does the io load look like on those nodes? On Mon, Aug 9, 2010 at 1:50 PM, Edward Capriolo edlinuxg...@gmail.com wrote: I have a 16 node 6.3 cluster and two nodes from my cluster are giving me major headaches.

10.71.71.56  Up    58.19 GB  10827166220211678382926910108067277      |   ^
10.71.71.61  Down  67.77 GB  123739042516704895804863493611552076888  v   |
10.71.71.66  Up    43.51 GB  127605887595351923798765477786913079296  |   ^
10.71.71.59  Down  90.22 GB  139206422831293007780471430312996086499  v   |
10.71.71.65  Up    22.97 GB  148873535527910577765226390751398592512  |   ^

The symptoms I am seeing are nodes 61 and nodes 59 have huge 6 GB + commit log directories. They keep growing, along with memory usage, eventually the logs start showing GCInspection errors and then the nodes will go OOM

INFO 14:20:01,296 Creating new commitlog segment /var/lib/cassandra/commitlog/CommitLog-1281378001296.log
INFO 14:20:02,199 GC for ParNew: 327 ms, 57545496 reclaimed leaving 7955651792 used; max is 9773776896
INFO 14:20:03,201 GC for ParNew: 443 ms, 45124504 reclaimed leaving 8137412920 used; max is 9773776896
INFO 14:20:04,314 GC for ParNew: 438 ms, 54158832 reclaimed leaving 8310139720 used; max is 9773776896
INFO 14:20:05,547 GC for ParNew: 409 ms, 56888760 reclaimed leaving 8480136592 used; max is 9773776896
INFO 14:20:06,900 GC for ParNew: 441 ms, 58149704 reclaimed leaving 8648872520 used; max is 9773776896
INFO 14:20:08,904 GC for ParNew: 462 ms, 59185992 reclaimed leaving 8816581312 used; max is 9773776896
INFO 14:20:09,973 GC for ParNew: 460 ms, 57403840 reclaimed leaving 8986063136 used; max is 9773776896
INFO 14:20:11,976 GC for ParNew: 447 ms, 59814376 reclaimed leaving 9153134392 used; max is 9773776896
INFO 14:20:13,150 GC for ParNew: 441 ms, 61879728 reclaimed leaving 9318140296 used; max is 9773776896
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid10913.hprof ...
INFO 14:22:30,620 InetAddress /10.71.71.66 is now dead.
INFO 14:22:30,621 InetAddress /10.71.71.65 is now dead.
INFO 14:22:30,621 GC for ConcurrentMarkSweep: 44862 ms, 261200 reclaimed leaving 9334753480 used; max is 9773776896
INFO 14:22:30,621 InetAddress /10.71.71.64 is now dead.
Heap dump file created [12730501093 bytes in 253.445 secs]
ERROR 14:28:08,945 Uncaught exception in thread Thread[Thread-2288,5,main]
java.lang.OutOfMemoryError: Java heap space
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
ERROR 14:28:08,948 Uncaught exception in thread Thread[Thread-2281,5,main]
java.lang.OutOfMemoryError: Java heap space
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
INFO 14:28:09,017 GC for ConcurrentMarkSweep: 33737 ms, 85880 reclaimed leaving 9335215296 used; max is 9773776896

Does anyone have any ideas what is going on?
Re: TokenRange contains endpoints without any port information?
On Sun, Aug 8, 2010 at 5:21 AM, Carsten Krebs carsten.kr...@gmx.net wrote: I'm wondering why a TokenRange returned by describe_ring(keyspace) of the thrift API just returns endpoints consisting only of an address but omits any port information? My first thought was, this method could be used to expose some information about the ring structure to the client, i.e. to do some client side load balancing. But now, I'm not sure about this anymore. Additionally, when looking into the code, I guess the address returned as part of the TokenRange is the address of the storage service which could differ from the thrift address, which in turn would make the returned endpoint useless for the client. Not just _could_ differ, is _guaranteed_ to differ. The inter-node protocol is not Thrift. The returned endpoint is not useless for the client: you had to connect to the RPC port to even make the call. Use the same port when connecting to the other nodes. It is bad practice to have RPC ports differ between nodes in the same cluster. What is the purpose of this method or respectively why is the port information omitted? Discovering which nodes are in the ring and which node claims each range. b
Re: Columns limit
Right, this is an index row per time interval (your previous email was not). On Sat, Aug 7, 2010 at 11:43 AM, Mark static.void@gmail.com wrote: On 8/7/10 11:30 AM, Mark wrote: On 8/7/10 4:22 AM, Thomas Heller wrote: Ok, I think the part I was missing was the concatenation of the key and partition to do the look ups. Is this the preferred way of accomplishing needs such as this? Are there alternative ways? Depending on your needs you can concat the row key or use super columns. How would one then query over multiple days? Same question for all days. Should I use range_slice or multiget_slice? And if it's range_slice, does that mean I need OrderPreservingPartitioner? The last 3 days is pretty simple: ['2010-08-07', '2010-08-06', '2010-08-05'], as is 7, 31, etc. Just generate the keys in your app and use multiget_slice. If you want to get all days where a specific ip address had some requests you'll just need another CF where the row key is the addr and column names are the days (values optional again). Pretty much the same all over again, just add another CF and insert the data you need. get_range_slice in my experience is better used for offline tasks where you really want to process every row there is. /thomas Ok... as an example, looking up logs by ip for a certain timeframe/range, would this work?

<ColumnFamily Name="SearchLog"/>
<ColumnFamily Name="IPSearchLog" ColumnType="Super" CompareWith="UTF8Type" CompareSubcolumnsWith="TimeUUIDType"/>

Resulting in a structure like:

{
  127.0.0.1 : {
    2010080711 : { uuid1: uuid2: uuid3: }
    2010080712 : { uuid1: uuid2: uuid3: }
  }
  some.other.ip : {
    2010080711 : { uuid1: }
  }
}

Whereas each uuid is the key used for SearchLog. Is there anything wrong with this? I know there is a 2 billion column limit, but in this case that would never be exceeded because each column represents an hour. However, does the above schema imply that for any certain IP there can only be a maximum of 2GB of data stored? Or should I invert the ip with the time slices?
The limitation of this seems to be that there can only be 2 billion unique ips per hour, which is more than enough for our application :)

{
  2010080711 : {
    127.0.0.1 : { uuid1: uuid2: uuid3: }
    some.other.ip : { uuid1: uuid2: uuid3: }
  }
  2010080712 : {
    127.0.0.1 : { uuid1: }
  }
}
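Thomas's "generate the keys in your app and use multiget_slice" step, applied to hourly buckets shaped like the row keys in this thread, looks roughly like this (a sketch; the actual multiget_slice call against the client is omitted):

```ruby
# Build row keys for the last N hourly buckets ("YYYYMMDDHH", matching
# the index rows above). The resulting keys are what you would hand to
# multiget_slice. The fixed `now` is only for a reproducible example.
def hour_bucket_keys(last_n, now = Time.utc(2010, 8, 7, 12))
  (0...last_n).map { |i| (now - i * 3600).strftime("%Y%m%d%H") }
end

puts hour_bucket_keys(3).inspect
# prints ["2010080712", "2010080711", "2010080710"]
```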
Re: Columns limit
certainly it matters: your previous version is not bounded on time, so will grow without bound. ergo, it is not a good fit for cassandra. On Sat, Aug 7, 2010 at 2:51 PM, Mark static.void@gmail.com wrote: On 8/7/10 2:33 PM, Benjamin Black wrote: Right, this is an index row per time interval (your previous email was not). On Sat, Aug 7, 2010 at 11:43 AM, Markstatic.void@gmail.com wrote: On 8/7/10 11:30 AM, Mark wrote: On 8/7/10 4:22 AM, Thomas Heller wrote: Ok, I think the part I was missing was the concatenation of the key and partition to do the look ups. Is this the preferred way of accomplishing needs such as this? Are there alternatives ways? Depending on your needs you can concat the row key or use super columns. How would one then query over multiple days? Same question for all days. Should I use range_slice or multiget_slice? And if its range_slice does that mean I need OrderPreservingPartitioner? The last 3 days is pretty simple: ['2010-08-07', '2010-08-06', '2010-08-05'], as is 7, 31, etc. Just generate the keys in your app and use multiget_slice. If you want to get all days where a specific ip address had some requests you'll just need another CF where the row key is the addr and column names are the days (values optional again). Pretty much the same all over again, just add another CF and insert the data you need. get_range_slice in my experience is better used for offline tasks where you really want to process every row there is. /thomas Ok... as an example using looking up logs by ip for a certain timeframe/range would this work? ColumnFamily Name=SearchLog/ ColumnFamily Name=IPSearchLog ColumnType=Super CompareWith=UTF8Type CompareSubcolumnsWith=TimeUUIDType/ Resulting in a structure like: { 127.0.0.1 : { 2010080711 : { uuid1 : uuid2: uuid3: } 2010080712 : { uuid1 : uuid2: uuid3: } } some.other.ip : { 2010080711 : { uuid1 : } } } Whereas each uuid is the key used for SearchLog. Is there anything wrong with this? 
I know there is a 2 billion column limit but in this case that would never be exceeded because each column represents an hour. However, does the above schema imply that for any certain IP there can only be a maximum of 2GB of data stored? Or should I invert the ip with the time slices? The limitation of this seems like there can only be 2 billion unique ips per hour, which is more than enough for our application :) { 2010080711 : { 127.0.0.1 : { uuid1:, uuid2:, uuid3: } some.other.ip : { uuid1:, uuid2:, uuid3: } } 2010080712 : { 127.0.0.1 : { uuid1: } } } In the end does it really matter which one to go with? I kind of like the previous version so I don't have to build up all the keys for the multi_get and instead I can just provide a start/finish for the columns (time frames).
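The trade-off above — building up row keys for a multi_get versus passing a column start/finish range — comes down to how the app generates hour buckets. A minimal pure-Ruby sketch (key format assumed to match the `2010080711` buckets in the example; no Cassandra client involved):

```ruby
BUCKET_FMT = '%Y%m%d%H'

# Hour-bucket row keys covering a time range, oldest first.
# With the row-per-hour layout, these become the keys passed to
# multiget_slice; with the row-per-IP layout, the same strings are
# instead the start/finish bounds of a single column range.
def hour_buckets(from, to)
  keys = []
  t = Time.utc(from.year, from.month, from.day, from.hour)
  while t <= to
    keys << t.strftime(BUCKET_FMT)
    t += 3600  # advance one hour
  end
  keys
end

hour_buckets(Time.utc(2010, 8, 7, 11), Time.utc(2010, 8, 7, 13))
# => ["2010080711", "2010080712", "2010080713"]
```

Either layout reads the same buckets; the difference is only whether the bucket string is a row key or a column name.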
Re: Columns limit
Certainly. There is also a performance penalty to unbounded row sizes. That penalty is your nodes OOMing. I strongly recommend you abandon that direction. On Sat, Aug 7, 2010 at 9:06 PM, Mark static.void@gmail.com wrote: On 8/7/10 7:04 PM, Benjamin Black wrote: certainly it matters: your previous version is not bounded on time, so will grow without bound. ergo, it is not a good fit for cassandra. On Sat, Aug 7, 2010 at 2:51 PM, Mark static.void@gmail.com wrote: On 8/7/10 2:33 PM, Benjamin Black wrote: Right, this is an index row per time interval (your previous email was not). On Sat, Aug 7, 2010 at 11:43 AM, Mark static.void@gmail.com wrote: On 8/7/10 11:30 AM, Mark wrote: On 8/7/10 4:22 AM, Thomas Heller wrote: Ok, I think the part I was missing was the concatenation of the key and partition to do the look ups. Is this the preferred way of accomplishing needs such as this? Are there alternative ways? Depending on your needs you can concat the row key or use super columns. How would one then query over multiple days? Same question for all days. Should I use range_slice or multiget_slice? And if it's range_slice does that mean I need OrderPreservingPartitioner? The last 3 days is pretty simple: ['2010-08-07', '2010-08-06', '2010-08-05'], as is 7, 31, etc. Just generate the keys in your app and use multiget_slice. If you want to get all days where a specific ip address had some requests you'll just need another CF where the row key is the addr and column names are the days (values optional again). Pretty much the same all over again, just add another CF and insert the data you need. get_range_slice in my experience is better used for offline tasks where you really want to process every row there is. /thomas Ok... as an example, looking up logs by ip for a certain timeframe/range, would this work?
<ColumnFamily Name="SearchLog"/> <ColumnFamily Name="IPSearchLog" ColumnType="Super" CompareWith="UTF8Type" CompareSubcolumnsWith="TimeUUIDType"/> Resulting in a structure like: { 127.0.0.1 : { 2010080711 : { uuid1:, uuid2:, uuid3: } 2010080712 : { uuid1:, uuid2:, uuid3: } } some.other.ip : { 2010080711 : { uuid1: } } } Where each uuid is the key used for SearchLog. Is there anything wrong with this? I know there is a 2 billion column limit but in this case that would never be exceeded because each column represents an hour. However, does the above schema imply that for any certain IP there can only be a maximum of 2GB of data stored? Or should I invert the ip with the time slices? The limitation of this seems like there can only be 2 billion unique ips per hour, which is more than enough for our application :) { 2010080711 : { 127.0.0.1 : { uuid1:, uuid2:, uuid3: } some.other.ip : { uuid1:, uuid2:, uuid3: } } 2010080712 : { 127.0.0.1 : { uuid1: } } } In the end does it really matter which one to go with? I kind of like the previous version so I don't have to build up all the keys for the multi_get and instead I can just provide a start/finish for the columns (time frames). Is there any performance penalty for a multi_get that includes x keys versus a get on 1 key with a start/finish range of x? Using your gem, multi_get(SearchLog, [20090101...20100807], 127.0.0.1) vs get(SearchLog, 127.0.0.1, :start => 20090101, :finish => 20100807) Thanks
Re: Question on load balancing in a cluster
Yes, imo, it should be renamed. On Fri, Aug 6, 2010 at 10:10 AM, Bill Au bill.w...@gmail.com wrote: If nodetool loadbalance does not do what its name implies, should it be renamed or maybe even removed altogether, since the recommendation is to _never_ use it in production? Bill On Thu, Aug 5, 2010 at 6:41 AM, aaron morton aa...@thelastpickle.com wrote: This comment from Ben Black may help... I recommend you _never_ use nodetool loadbalance in production because it will _not_ result in balanced load. The correct process is manual calculation of tokens (the algorithm for RP is on the Operations wiki page) and nodetool move. http://www.mail-archive.com/user@cassandra.apache.org/msg04933.html So the recommendation is to manually set initial tokens and then manually move them. As for the need to decommission, I'm guessing it's for reasons such as making it easier to avoid overlapping tokens and to avoid accepting writes that will soon be moved. Others may be able to add more. Aaron On 5 Aug 2010, at 14:49, anand_s wrote: Hi, Have some thoughts on load balancing on current / new nodes. I have come across some posts around this, but not sure of what is being finally proposed, so.. From what I have read, a nodetool loadbalance on a node does a decommission and bootstrap of that node. Is there a reason why it is that way (decommission and bootstrap) and not just a simple look at my next neighbor and split the load with it? As in, if the ring has nodes A, B, C and D with load (in GB) on these respectively 100, 70, 100, 80, then a nodetool loadbalance on B should result in 100, 85, 85, 80 (some tokens move from C to B). It is still manual but data movement is only what is needed – 15 GB instead of the 100+GB (decommission and bootstrap). The idea is not to get a perfect balance, but an acceptable balance with less data movement. Also when a new node is added, it takes 50% from the most loaded node.
Don't we want to rebalance such that the load is more or less evenly distributed across the cluster? Would it not help if I could just specify the % load as a parameter to the rebalance command, so that I can optimize the movement of data for rebalancing? E.g. A, B, C, E is a cluster with load being 80, 78, 83, 84. Now I add a new node D (position will be before E), so eventually after all the rebalance activity I want the load to be ~65 (325/5). Now to minimize the movement of data and still get a good balance, we move only what is needed (so data sort of flows from more to less loaded nodes until balanced). This could be a manual process (I am basically suggesting a similar approach as in paragraph one). Another thought: instead of using pure current usage on a node to determine load, shouldn't there be a higher-level concept like node weight to handle heterogeneous nodes, or is the expectation that all nodes are more or less equal? Thanks Anand -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Question-on-load-balancing-in-a-cluster-tp5375140p5375140.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
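The "manual calculation of tokens" Ben Black recommends (for the RandomPartitioner, per the Operations wiki) is a one-liner: evenly spaced points in the 2**127 token space. A sketch:

```ruby
# Evenly spaced RandomPartitioner initial tokens for an n-node ring:
# token_i = i * 2**127 / n. Assign each to a node via the InitialToken
# config setting or nodetool move. Ruby integers are arbitrary
# precision, so the 127-bit arithmetic is exact.
def balanced_tokens(n)
  (0...n).map { |i| i * (2**127) / n }
end

balanced_tokens(4)
# => [0, 2**125, 2**126, 3 * 2**125]
```

This is why manual token assignment beats nodetool loadbalance: the even spacing is computed up front rather than discovered through repeated decommission/bootstrap cycles.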
Re: one question about cassandra write
On Fri, Aug 6, 2010 at 12:51 PM, Maifi Khan maifi.k...@gmail.com wrote: Hi I have a question about the internals of a cassandra write. Say, I already have the following in the database - (row_x,col_y,val1) Now if I try to insert (row_x,col_y,val100), what will happen? Will it overwrite the old data? I mean, will it overwrite the data physically or will it keep both the old version and the new version of the data? Assuming the old version is already on disk in an SSTable, the new version will not overwrite it, and both versions will be in the system. A compaction will remove the old version, however. If the latter is the case, can I retrieve the old version of the data? No. And no, there is no plan to add that functionality. If it is needed it is simple to emulate in a variety of ways with the current feature set. This is recommended reading: http://maxgrinev.com/2010/07/12/update-idempotency-why-it-is-important-in-cassandra-applications-2/ b
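The reconciliation rule behind that answer can be sketched in a few lines. This is an illustrative model, not Cassandra's actual code: `Column` is a hypothetical stand-in, and real reconciliation also compares values to break timestamp ties.

```ruby
# Model of last-write-wins reconciliation: when a read (or compaction)
# sees two versions of the same column, the one with the higher
# client-supplied timestamp survives; the other is discarded.
Column = Struct.new(:value, :timestamp)

def reconcile(a, b)
  a.timestamp >= b.timestamp ? a : b
end

old_version = Column.new('val1',   1)  # already flushed to an SSTable
new_version = Column.new('val100', 2)  # freshly written
reconcile(old_version, new_version).value  # => "val100"
```

Until compaction runs, both versions sit on disk; reads simply never return the losing one.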
Re: set ReplicationFactor and Token at Column Family/SuperColumn level.
Additional keyspaces have very little overhead (unlike CFs). On Fri, Aug 6, 2010 at 9:42 AM, Zhong Li z...@voxeo.com wrote: If I create 3-4 keyspaces, will this impact performance and resources (esp. memory and disk I/O) too much? Thanks, Zhong On Aug 5, 2010, at 4:52 PM, Benjamin Black wrote: On Thu, Aug 5, 2010 at 12:59 PM, Zhong Li z...@voxeo.com wrote: The big thing bothering me is the initial ring token. We have some Column Families. It is very hard to choose one token suitable for all CFs. Also some Column Families need a higher Consistency Level and some don't. If we set Consistency Level is set by clients, per request. If you require different _Replication Factors_ for different CFs, then just put them in different keyspaces. Additional keyspaces have very little overhead (unlike CFs). ReplicationFactor too high, it is too costly for crossing datacenters, especially on the other side of the world. I know we can set up multiple rings, but it costs more hardware. If Cassandra could implement Ring, Token and RF at the CF level, or even the SuperColumn level, it would make design much easier and more efficient. Is it possible? The approach I described above is what you can do. The rest of what you asked is not happening. b
Re: How to migrate any relational database to Cassandra
http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/ http://www.slideshare.net/benjaminblack/cassandra-basics-indexing On Fri, Aug 6, 2010 at 11:42 AM, sonia gehlot sonia.geh...@gmail.com wrote: Thanks for the reply, I am sorry, it seems my question came out wrong.. * My question is: what are the considerations I should keep in mind to migrate to Cassandra? * Like we do in ETL: to extract data from a source we write a query and then load it into our database after applying the desired transformations. How can we do this if we want to extract data from MySQL and load it into Cassandra? * I think I can write a script for this kind of stuff but does anyone have an example script? * What kind of setup do I need to do this? -Sonia On Fri, Aug 6, 2010 at 11:24 AM, Michael Dürgner mich...@duergner.de wrote: In my opinion it's the wrong approach to ask how to migrate from MySQL to Cassandra from a database-level view. The lack of joins in NoSQL should lead you to think about what you want to get out of your persistent storage, and afterwards think about how to migrate and, most of the time, how to denormalize the data you have in order to insert it into a NoSQL storage like Cassandra. Simply migrating the data and moving the joins up to the application level might work in the beginning but most of the time doesn't scale in the end. On 06.08.2010 at 20:00, sonia gehlot wrote: Hi All, A little background about myself: I am an ETL engineer who has worked only with relational databases. I have been reading about and trying Cassandra for 3-4 weeks. I kind of understood the Cassandra data model, its structure, nodes etc. I also installed Cassandra and played around with it, like cassandra set Keyspace1.Standard2['jsmith']['first'] = 'John' Value inserted. cassandra get Keyspace1.Standard2['jsmith'] (column=first, value=John; timestamp=1249930053103) Returned 1 rows. But don't know what to do next?
Like if someone says to me, "this is a MySQL database, migrate it to Cassandra," then I don't know what my next step should be. Can you please help me move forward? How should I do all the setup for this? Any help is appreciated. Thanks, Sonia
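The denormalization step Michael describes — resolving joins at load time rather than query time — can be sketched without any Cassandra client at all. A hypothetical example (the `users`/`orders` tables and the `order:<id>` column-name convention are invented for illustration); the resulting hash-of-hashes is shaped like what a client library's batch insert would take:

```ruby
# Flatten joined relational rows into one wide row per entity, keyed
# by primary key. Each order becomes an extra column on its user's
# row, so reads need no application-level join.
def denormalize(users, orders)
  users.each_with_object({}) do |u, out|
    row = { 'first' => u[:first], 'last' => u[:last] }
    orders.select { |o| o[:user_id] == u[:id] }.each do |o|
      row["order:#{o[:id]}"] = o[:total].to_s
    end
    out[u[:id].to_s] = row
  end
end

users  = [{ id: 1, first: 'John', last: 'Smith' }]
orders = [{ id: 7, user_id: 1, total: 42 }]
denormalize(users, orders)
# => { "1" => { "first" => "John", "last" => "Smith", "order:7" => "42" } }
```

An ETL script would page through MySQL with a cursor, feed each batch through a transform like this, and insert the resulting rows.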
Re: Cassandra 0.7 Ruby/Thrift Bindings
Ryan, I believe my branch was merged into fauna some time ago by jmhodges. However, 0.7 support must be explicitly enabled by require 'cassandra/0.7' as it currently defaults to 0.6. b On Fri, Aug 6, 2010 at 10:02 AM, Ryan King r...@twitter.com wrote: On Fri, Aug 6, 2010 at 9:57 AM, Mark static.void@gmail.com wrote: Wow.. fast answer AND correct. In Cassandra.yml # Frame size for thrift (maximum field length). # 0 disables TFramedTransport in favor of TSocket. thrift_framed_transport_size_in_mb: 15 I just had to change that value to 0 and everything worked. Now for my follow up question :) What is the difference between these two and why does 0.7 default to true while earlier versions default to false? Thanks again! Ah, you're using 0.7. fauna-cassandra has not been updated for 0.7. There's an experimental branch for it here: http://github.com/b/cassandra/tree/0.7 -ryan
Re: Columns limit
Yes, it is common to create distinct CFs for indices. On Fri, Aug 6, 2010 at 4:40 PM, Software Dev static.void@gmail.com wrote: Thanks for the suggestion. I somewhat understand all that; the point where my head begins to explode is when I want to figure out something like, continuing with your example: over the last X days give me all the logs for remote_addr:XXX. I'm guessing I would need to create a separate index ColumnFamily??? On Fri, Aug 6, 2010 at 4:32 PM, Thomas Heller i...@zilence.net wrote: Howdy, thought I'd jump in here. I did something similar, meaning I had lots of items coming in per day and wanted to somehow partition them to avoid running into the column limit (it was also logging related). Solution was pretty simple: log data is immutable, so no SuperColumn needed. ColumnFamily Standard: LogRecords, CompareWith=TimeUUIDType Row Key 20100806: Column Name: TimeUUID.new Value: JSON({'remote_addr': ..., 'user_agent': ..., 'url': ...}) ..., more Columns In my case I chose to partition by day; if you are getting too many columns per day, just get hours in there. If you want an extra separation level (foo, bar in your example) you could either go for a SuperColumn or just adjust your row key accordingly (e.g. foo:20100806) HTH, /thomas
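Thomas's partitioning scheme boils down to deriving the row key from the record's timestamp (plus an optional category prefix). A minimal sketch of that key construction; the uuid column name is a placeholder for a real version-1 TimeUUID from your client library:

```ruby
require 'json'

# Row key for day-partitioned log records: "20100806", or
# "foo:20100806" when an extra separation level is wanted.
def log_row_key(time, category = nil)
  day = time.utc.strftime('%Y%m%d')
  category ? "#{category}:#{day}" : day
end

record = { 'remote_addr' => '127.0.0.1', 'url' => '/search' }
key    = log_row_key(Time.utc(2010, 8, 6), 'foo')  # => "foo:20100806"
value  = JSON.generate(record)  # immutable, so stored as-is
```

Switching from day to hour partitioning is just a change of format string ('%Y%m%d%H'), which is the knob to turn if rows grow too wide.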
Re: Columns limit
Same answer as on the other thread just now about how to index: http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/ http://www.slideshare.net/benjaminblack/cassandra-basics-indexing On Fri, Aug 6, 2010 at 6:18 PM, Mark static.void@gmail.com wrote: On 8/6/10 4:50 PM, Thomas Heller wrote: Thanks for the suggestion. I somewhat understand all that; the point where my head begins to explode is when I want to figure out something like, continuing with your example: over the last X days give me all the logs for remote_addr:XXX. I'm guessing I would need to create a separate index ColumnFamily??? Depending on your needs you can either insert them directly or pull them out later in some map/reduce fashion. What you want is another ColumnFamily and a similar structure. ColumnFamily Standard LogByRemoteAddrAndDate CompareWith: TimeUUID Row: 127.0.0.1:20100806 Column TimeUUID/JSON as usual. If you want to link to the actual log record (to avoid writing it multiple times) just insert the same timeuuid you inserted into the other CF and leave the value empty. So you have your index, aka list of column names, and you can look up the actual values using get_slice with column_names. Confusing at first, but really quite simple once you get used to the idea. Just a lot more work than letting SQL do it for you. ;) HTH, /thomas Ok, I think the part I was missing was the concatenation of the key and partition to do the look ups. Is this the preferred way of accomplishing needs such as this? Are there alternative ways? How would one then query over multiple days? Same question for all days. Should I use range_slice or multiget_slice? And if it's range_slice does that mean I need OrderPreservingPartitioner?
Re: when should new nodes be added to a cluster
you have insufficient i/o bandwidth and are seeing reads suffer due to competition from memtable flushes and compaction. adding additional nodes will help some, but i recommend increasing the disk i/o bandwidth, regardless. b On Mon, Aug 2, 2010 at 11:47 AM, Artie Copeland yeslinux@gmail.com wrote: I have a question on what the signs from cassandra are that new nodes should be added to the cluster. We are currently seeing long read times from the one node that has about 70GB of data, with 60GB in one column family. We are using a replication factor of 3. I have tracked the slowness down to times when either row-read-stage or message-deserializer-pool is high, at least 4000. My systems are 16-core, 3 TB, 48GB mem servers. We would like to be able to use more of the server than just 70GB. The system is a realtime system that needs to scale quite large. Our current heap size is 25GB and we are getting at least 50% row cache hit rates. Does it seem strange that cassandra is not able to handle the work load? We perform multislice gets when reading, similar to what twissandra does; this is to cut down on the network ops. Looking at iostat it doesn't appear to have a lot of queued reads. What are others seeing when they have to add new nodes? What data sizes are they seeing? This is needed so we can plan our growth and server purchase strategy. thanx Artie -- http://yeslinux.org http://yestech.org
Re: Cassandra cookbook for Chef
Correct, it is on its own branch. On Mon, Aug 2, 2010 at 9:08 AM, Sal Fuentes fuente...@gmail.com wrote: I'm guessing it's the cassandra branch from that repo. You can find it here: http://github.com/b/cookbooks/tree/cassandra On Mon, Aug 2, 2010 at 1:49 AM, Boris Shulman shulm...@gmail.com wrote: I can't find this cookbook anymore at the specified URL. Where can I find it? On Tue, Mar 16, 2010 at 6:40 AM, Benjamin Black b...@b3k.us wrote: I've just pushed a rough but useful chef cookbook for Cassandra: http://github.com/b/cookbooks/tree/master/cassandra It is lacking in documentation and assumes you have a Cassandra package handy to install. I'd really appreciate it if folks could try it out and give feedback (or, even better, patches to improve it). b -- Salvador Fuentes Jr.
Re: Columns limit
The proper way to handle this is to have a row per time interval such that the number of columns per row is constrained. On Thu, Jul 29, 2010 at 2:39 PM, Mark static.void@gmail.com wrote: Are there any limitations on the number of columns a row can have? Does all the data for a single key need to reside on a single host? If so, wouldn't that mean there is an implicit limit on the number of columns one can have... i.e. the disk size of that machine. What is the proper way to handle timelines in this matter? For example, let's say I wanted to store all user searches in a super column. <ColumnFamily Name="SearchLogs" ColumnType="Super" CompareWith="TimeUUIDType" CompareSubcolumnsWith="BytesType"/> Which results in a structure as follows { SearchLogs : { foo : { timeuuid_1 : { metadata goes here} timeuuid_2: { metadata goes here} }, bar : { timeuuid_1 : { metadata goes here} timeuuid_2: { metadata goes here} } } } Couldn't this theoretically run out of columns for the same search term, because for each unique term there can (and will) be many timeuuid columns? Thanks for clearing this up for me.
Re: Unreliable transport layer
Because it is extremely well-understood, handles a lot of the reliability needs itself, and nothing more is required for the application. On Thu, Jul 29, 2010 at 7:02 PM, ChingShen chingshenc...@gmail.com wrote: Why? What reasons did you choose TCP? Shen On Sat, Mar 6, 2010 at 9:15 AM, Jonathan Ellis jbel...@gmail.com wrote: In 0.6 gossip is over TCP. On Fri, Mar 5, 2010 at 6:54 PM, Ashwin Jayaprakash ashwin.jayaprak...@gmail.com wrote: Hey guys! I have a simple question. I'm a casual observer, not a real Cassandra user yet. So, excuse my ignorance. I see that the Gossip feature uses UDP. I was curious to know if you guys faced issues with unreliable transports in your production clusters? Like faulty switches, dropped packets etc during heavy network loads? If I'm not mistaken are all client reads/writes doing point-to-point over TCP? Thanks, Ashwin.
Re: Columns limit
Have the TimeUUID as the key, and then index rows named for the time intervals, each containing columns with TimeUUID names giving the data in those intervals. On Sat, Jul 31, 2010 at 5:13 PM, Mark static.void@gmail.com wrote: So have the TimeUUID as the key? SearchLogs : { TimeUUID_1 : { metadata goes here}, TimeUUID_2 : { metadata goes here}, TimeUUID_3 : { metadata goes here}, ... } On 7/31/10 3:42 PM, Benjamin Black wrote: The proper way to handle this is to have a row per time interval such that the number of columns per row is constrained. On Thu, Jul 29, 2010 at 2:39 PM, Mark static.void@gmail.com wrote: Are there any limitations on the number of columns a row can have? Does all the data for a single key need to reside on a single host? If so, wouldn't that mean there is an implicit limit on the number of columns one can have... i.e. the disk size of that machine. What is the proper way to handle timelines in this matter? For example, let's say I wanted to store all user searches in a super column. <ColumnFamily Name="SearchLogs" ColumnType="Super" CompareWith="TimeUUIDType" CompareSubcolumnsWith="BytesType"/> Which results in a structure as follows { SearchLogs : { foo : { timeuuid_1 : { metadata goes here} timeuuid_2: { metadata goes here} }, bar : { timeuuid_1 : { metadata goes here} timeuuid_2: { metadata goes here} } } } Couldn't this theoretically run out of columns for the same search term, because for each unique term there can (and will) be many timeuuid columns? Thanks for clearing this up for me.
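Reading "the last n days" from interval-named index rows then reduces to generating day keys in the app and handing them to multiget_slice, as suggested earlier in the thread. A pure-Ruby sketch of the key generation (the 'YYYY-MM-DD' key format is assumed from the ['2010-08-07', '2010-08-06', '2010-08-05'] example above):

```ruby
require 'date'

# Index-row keys for the last n days, newest first. Each such row
# holds TimeUUID-named columns pointing at the data rows written
# that day; multiget over these keys covers the whole window.
def day_keys(n, today = Date.today)
  (0...n).map { |i| (today - i).strftime('%Y-%m-%d') }
end

day_keys(3, Date.new(2010, 8, 7))
# => ["2010-08-07", "2010-08-06", "2010-08-05"]
```

Because the keys are generated client-side, this works under RandomPartitioner — no OrderPreservingPartitioner or range scan needed.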
Re: Consequences of Cassandra key NOT unique
You are both confusing columns with rows. Columns have timestamps, row keys do not. On Wed, Jul 28, 2010 at 11:37 PM, Thorvaldsson Justus justus.thorvalds...@svenskaspel.se wrote: You insert 500 rows with key “x” and 1000 rows with key “y”. You make a query getting all rows. It will only show two rows, the ones with the latest timestamps. /Justus From: Rana Aich [mailto:aichr...@gmail.com] Sent: 29 July 2010 08:23 To: user@cassandra.apache.org Subject: Re: Consequences of Cassandra key NOT unique Thanks for your reply! I thought in that case a new row would be inserted with a new timestamp and cassandra would report the new row. But how will this affect my range query? It would not affect it. On Wed, Jul 28, 2010 at 7:03 PM, Benjamin Black b...@b3k.us wrote: If you write new data with a key that is already present, the existing columns are overwritten or new columns are added. There is no way to cause a duplicate key to be inserted. On Wed, Jul 28, 2010 at 6:16 PM, Rana Aich aichr...@gmail.com wrote: Hello, I was wondering what may be the pitfalls in Cassandra when the key value is not UNIQUE? Will it affect the range query performance? Thanks and regards, raich
Re: Cassandra vs MongoDB
They have approximately nothing in common. And, no, Cassandra is definitely not dying off. On Tue, Jul 27, 2010 at 8:14 AM, Mark static.void@gmail.com wrote: Can someone quickly explain the differences between the two? Other than the fact that MongoDB supports ad-hoc querying, I don't know what's different. It also appears (using google trends) that MongoDB seems to be growing while Cassandra is dying off. Is this the case? Thanks for the help
Re: Consequences of Cassandra key NOT unique
If you write new data with a key that is already present, the existing columns are overwritten or new columns are added. There is no way to cause a duplicate key to be inserted. On Wed, Jul 28, 2010 at 6:16 PM, Rana Aich aichr...@gmail.com wrote: Hello, I was wondering what may be the pitfalls in Cassandra when the key value is not UNIQUE? Will it affect the range query performance? Thanks and regards, raich
Re: Quick Poll: Server names
[role][sequence].[airport code][sequence].[domain].[tld]