Re: Keyspace and table/cf limits
We had a similar problem - multi-tenancy and multiple DC support. But we did not really have a strict requirement of one keyspace per tenant. Our row keys allow us to put any number of tenants per keyspace. So, on one side, we could put all data in a single keyspace for all tenants and size the cluster for it; in the end the total amount of data would be the same :)

However, we wanted a different replication strategy for different customers, and the replication strategy is a keyspace setting. Thus, it would be simpler to have one keyspace per customer. The cost, as it was mentioned, is per CF: the more keyspaces we have, the more CFs we have, so we did not want this number to be too high.

The decision we've made was to have something in between. We'd define a number of keyspaces with different replication strategies (possibly even duplicate ones) and map tenants to these keyspaces. Thus, there would be a couple of tenants in one keyspace, all sharing the same properties (replication strategy in our case). We could even create a keyspace that groups tenants that currently share the same replication requirements and that may be moved/replicated to a specific DC in the future.

On Wed, Dec 3, 2014 at 4:54 PM, Raj N raj.cassan...@gmail.com wrote:

The question is more from a multi-tenancy point of view. We wanted to see if we can have a keyspace per client. Each keyspace may have 50 column families, but if we have 200 clients, that would be 10,000 column families. Do you think that's reasonable to support? I know that key cache capacity is still reserved on heap. Any plans to move it off-heap?

-Raj

On Tue, Nov 25, 2014 at 3:10 PM, Robert Coli rc...@eventbrite.com wrote:

On Tue, Nov 25, 2014 at 9:07 AM, Raj N raj.cassan...@gmail.com wrote: What's the latest on the maximum number of keyspaces and/or tables that one can have in Cassandra 2.1.x?

Most relevant changes lately would be:

https://issues.apache.org/jira/browse/CASSANDRA-6689 and https://issues.apache.org/jira/browse/CASSANDRA-6694

Which should meaningfully reduce the amount of heap memtables consume. That heap can then be used to support more heap-persistent structures associated with many CFs. I have no idea how to estimate the scale of the improvement.

As a general/meta statement, Cassandra is very multi-threaded and consumes file handles like crazy. How many different query cases do you really want to put on one cluster/node? ;D

=Rob

--
Nikolai Grigoriev
(514) 772-5178
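To make the "something in between" approach at the top of this thread concrete, here is a minimal sketch of mapping tenants to a small fixed set of shared keyspaces. The keyspace names, the group count and the hash-based assignment are all hypothetical; the real mapping could just as well be an explicit per-tenant lookup table.

import java.util.Arrays;
import java.util.List;

// Sketch: route each tenant to one of a few shared keyspaces, each keyspace
// carrying the replication strategy its tenant group needs, instead of one
// keyspace per tenant.
public class TenantKeyspaceMapper {

    // Hypothetical keyspace names; each would be created with the
    // NetworkTopologyStrategy settings (DC placement, RF) for its group.
    private static final List<String> KEYSPACES = Arrays.asList(
            "tenants_dc1_rf3", "tenants_dc1_dc2_rf3", "tenants_dc2_rf3");

    // Deterministically map a tenant id to one of the shared keyspaces.
    public static String keyspaceFor(String tenantId) {
        int bucket = Math.floorMod(tenantId.hashCode(), KEYSPACES.size());
        return KEYSPACES.get(bucket);
    }

    public static void main(String[] args) {
        // The row key inside the chosen keyspace would still carry the tenant id.
        System.out.println(keyspaceFor("customer-42"));
    }
}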
Re: opscenter: 0 of 0 agents connected, but /nodes/all gives 3 results
I have observed this kind of situation with 0 agents connected. Restarting the agents has always helped so far. By the way, check the agent's logs and the opscenterd logs - there may be some clues there.

On Tue, Dec 2, 2014 at 4:59 PM, Ian Rose ianr...@fullstory.com wrote:

Hi all - Just getting started setting up OpsCenter today. I have a 3 node cassandra cluster and (afaict) the agent installed and running happily on all 3 nodes. I also have OpsCenter up and running on a 4th node. I do not have SSL enabled between these nodes.

In the OpsCenter interface, I see 0 of 0 agents connected and restarting OpsCenter does not change this. However, in my cluster settings I have the following entered in the "at least one host / IP in the cluster" portion, and these hostnames are all resolvable/pingable from the OpsCenter machine.

cassandra-db-1
cassandra-db-2
cassandra-db-3

Also, from this post http://www.datastax.com/support-forums/topic/agents-running-without-error-but-not-connecting-to-opscenter I found a suggestion to curl cluster/nodes/all and when I do that I (think I) get results showing 3 nodes. Any suggestions? Thanks!

- Ian

curl http://localhost:/FullStory/nodes/all

[{load: 0.0, has_jna: false, vnodes: true, devices: {saved_caches: sdb, commitlog: sdb, other: [sda], data: [sdb]}, task_progress: {}, node_ip: 10.240.167.187, network_interfaces: [eth0, lo], ec2: {instance-type: null, placement: null, ami-id: null, instance-id: null}, node_version: {search: null, jobtracker: null, tasktracker: null, spark: {master: null, version: null, worker: null}, dse: null, cassandra: 2.0.9}, dc: us-central1, node_name: cassandra-db-3.c.fs-staging.internal, num_procs: 1, streaming: {}, token: -3220529141950148793, data_held: 54064133.0, mode: normal, rpc_ip: 10.240.167.187, partitions: {saved_caches: /dev/sdb, commitlog: /dev/sdb, other: [/dev/disk/by_uuid/121fd70f_e625_4394_89ce_5e5eff3d25e0, rootfs], data: [/dev/sdb]}, os: linux, rack: f, last_seen: 0},
{load: 0.0, has_jna: false, vnodes: true, devices: {saved_caches: sdb, commitlog: sdb, other: [sda], data: [sdb]}, task_progress: {}, node_ip: 10.240.111.79, network_interfaces: [eth0, lo], ec2: {instance-type: null, placement: null, ami-id: null, instance-id: null}, node_version: {search: null, jobtracker: null, tasktracker: null, spark: {master: null, version: null, worker: null}, dse: null, cassandra: 2.0.9}, dc: us-central1, node_name: cassandra-db-1.c.fs-staging.internal, num_procs: 1, streaming: {}, token: 6048217978730786970, data_held: 54319361.0, mode: normal, rpc_ip: 10.240.111.79, partitions: {saved_caches: /dev/sdb, commitlog: /dev/sdb, other: [/dev/disk/by_uuid/2795e34f_4e9b_46a9_9e39_e6cc0f1f1478, rootfs], data: [/dev/sdb]}, os: linux, rack: a, last_seen: 0},
{load: 0.04, has_jna: false, vnodes: true, devices: {saved_caches: sdb, commitlog: sdb, other: [sda], data: [sdb]}, task_progress: {}, node_ip: 10.240.129.185, network_interfaces: [eth0, lo], ec2: {instance-type: null, placement: null, ami-id: null, instance-id: null}, node_version: {search: null, jobtracker: null, tasktracker: null, spark: {master: null, version: null, worker: null}, dse: null, cassandra: 2.0.9}, dc: us-central1, node_name: cassandra-db-2.c.fs-staging.internal, num_procs: 1, streaming: {}, token: 4393271482730462700, data_held: 51388793.0, mode: normal, rpc_ip: 10.240.129.185, partitions: {saved_caches: /dev/sdb, commitlog: /dev/sdb, other: [/dev/disk/by_uuid/2795e34f_4e9b_46a9_9e39_e6cc0f1f1478, rootfs], data: [/dev/sdb]}, os: linux, rack: b, last_seen: 0}]

--
Nikolai Grigoriev
(514) 772-5178
Re: does safe cassandra shutdown require disable binary?
Personally I believe that you do not have to do these steps just to perform a restart. I know the node will start faster if drained before shutdown, but in my experience these steps make the restart process (stop + start phases, in total) slightly longer. So if it is really about a rolling restart to apply some JVM or C* settings, I would simply kill it and start it again.

On Mon, Dec 1, 2014 at 12:02 AM, Kevin Burton bur...@spinn3r.com wrote:

I'm trying to figure out a safe way to do a rolling restart.

http://devblog.michalski.im/2012/11/25/safe-cassandra-shutdown-and-restart/

It has the following commands, which make sense:

root@cssa01:~# nodetool -h cssa01.michalski.im disablegossip
root@cssa01:~# nodetool -h cssa01.michalski.im disablethrift
root@cssa01:~# nodetool -h cssa01.michalski.im drain

... but I don't think this takes CQL into consideration. So you would first disablethrift, then disablebinary. Anything else needed in modern Cassandra?

--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
... or check out my Google+ profile https://plus.google.com/102718274791889610666/posts
http://spinn3r.com

--
Nikolai Grigoriev
(514) 772-5178
Re: Compaction Strategy guidance
Hi Jean-Armel,

I am using the latest and greatest DSE 4.5.2 (4.5.3 in another cluster, but there are no relevant changes between 4.5.2 and 4.5.3) - thus, Cassandra 2.0.10.

I have about 1.8 TB of data per node now in total, which falls into that range. As I said, it is really a problem with a large amount of data in a single CF, not the total amount of data. Quite often the nodes are idle yet have quite a few pending compactions. I have discussed it with other members of the C* community and DataStax guys, and they have confirmed my observation. I believe that increasing the sstable size won't help at all and will probably make things worse - everything else being equal, of course. But I would like to hear from Andrei when he is done with his test.

Regarding the last statement - yes, C* clearly likes many small servers more than fewer large ones. But it is all relative - and it can all be recalculated to $$$ :) C* is all about partitioning of everything - storage, traffic... Less data per node and more nodes give you lower latency, lower heap usage etc, etc. I think I have learned this with my project. Somewhat the hard way, but still, nothing is better than personal experience :)

On Tue, Nov 25, 2014 at 3:23 AM, Jean-Armel Luce jaluc...@gmail.com wrote:

Hi Andrei, Hi Nicolai,

Which version of C* are you using?

There are some recommendations about the max storage per node: http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 "For 1.0 we recommend 300-500GB. For 1.2 we are looking to be able to handle 10x (3-5TB)."

I have the feeling that those recommendations are sensitive to many criteria such as:
- your hardware
- the compaction strategy
- ...

It looks like LCS lowers those limitations. Increasing the size of sstables might help if you have enough CPU and you can put more load on your I/O system (@Andrei, I am interested in the results of your experimentation with large sstable files).

From my point of view, there are some usage patterns where it is better to have many small servers than a few large servers. Probably, it is better to have many small servers if you need LCS for large tables.

Just my 2 cents.

Jean-Armel

2014-11-24 19:56 GMT+01:00 Robert Coli rc...@eventbrite.com:

On Mon, Nov 24, 2014 at 6:48 AM, Nikolai Grigoriev ngrigor...@gmail.com wrote: One of the obvious recommendations I have received was to run more than one instance of C* per host. Makes sense - it will reduce the amount of data per node and will make better use of the resources.

This is usually a Bad Idea to do in production.

=Rob

--
Nikolai Grigoriev
(514) 772-5178
Re: Rule of thumb for concurrent asynchronous queries?
I think it all depends on how many machines will be involved in the query (read consistency is also a factor) and how large a typical response is in bytes. Large responses will put more pressure on the GC, which will result in more time spent in GC and possibly long(er) GC pauses. Cassandra can tolerate many things - but at a cost for other queries, and all the way up to the health of the individual node. From the original question it is not clear if all these rows are coming from the same or a few nodes (token ranges), or whether these are really 10K distinct primary keys spread more or less evenly across the cluster. The node disk I/O may also be a concern - especially if the data is not in the OS cache (or row cache, if applicable).

I think it is a tough question to get a precise answer to. If I had such a problem I would try to determine the peak speed I can achieve first. I.e. find the limiting factor (CPU or disk I/O most likely), then shoot as many requests in as many threads as practical for the client app. Measure the load to prove that you've determined the limiting factor correctly (either CPU or I/O, I doubt it will be the network). Then measure the latency and decide what kind of latency you can tolerate for your use case. And then go down from that peak load you've created by a certain factor (i.e. limit yourself to XX% of the peak load you have achieved).

On Tue, Nov 25, 2014 at 11:34 AM, Jack Krupansky j...@basetechnology.com wrote:

Great question. The safe answer is to do a proof of concept implementation and try various rates to determine where the bottleneck is. It will also depend on the row size. Hard to say if you will be limited by the cluster load or network bandwidth.

Is there only one client talking to your cluster? Or are you asking what each of, say, one million clients can be simultaneously requesting? The rate of requests will matter as well, particularly if the cluster has a non-trivial load.

My ultimate rule of thumb is simple: moderation. Not too many threads, not too frequent a request rate. It would be nice if we had a way to calculate this number (both numbers) for you, so that a client (driver) could ping for it from the cluster, as well as for the cluster to return a suggested wait interval before sending another request based on actual load.

-- Jack Krupansky

-----Original Message----- From: Robert Wille Sent: Tuesday, November 25, 2014 10:57 AM To: user@cassandra.apache.org Subject: Rule of thumb for concurrent asynchronous queries?

Suppose I have the primary keys for 10,000 rows and I want them all. Is there a rule of thumb for the maximum number of concurrent asynchronous queries I should execute?

--
Nikolai Grigoriev
(514) 772-5178
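For the "limit yourself to a fixed number of in-flight requests" part, a common pattern with the DataStax Java driver is to gate executeAsync() behind a semaphore. This is only a sketch under assumed names - the keyspace, table and key range are made up, and the permit count of 128 is an arbitrary placeholder for exactly the number you would have to find empirically as described above.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import java.util.concurrent.Semaphore;

public class ThrottledReads {
    public static void main(String[] args) throws InterruptedException {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_ks");                                   // hypothetical keyspace
        PreparedStatement ps = session.prepare("SELECT * FROM my_table WHERE id = ?"); // hypothetical table

        // Cap the number of queries in flight; tune this against the measured peak.
        final Semaphore inFlight = new Semaphore(128);

        for (long id = 0; id < 10000; id++) {
            inFlight.acquire();                               // block while 128 queries are outstanding
            ResultSetFuture future = session.executeAsync(ps.bind(id));
            Futures.addCallback(future, new FutureCallback<ResultSet>() {
                public void onSuccess(ResultSet rs) { inFlight.release(); /* consume rows here */ }
                public void onFailure(Throwable t) { inFlight.release(); /* log / retry here */ }
            });
        }

        inFlight.acquire(128);   // wait for the remaining queries to finish
        cluster.close();
    }
}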
Re: Compaction Strategy guidance
Andrei,

Oh, yes, I had scanned the top of your previous email but overlooked the last part. I am using SSDs, so I prefer to put in the extra work to keep my system performing and to save expensive disk space. So far I've been able to size the system more or less correctly, so these LCS limitations do not cause too much trouble. But I do keep the CF sharding option as a backup - for me it will be relatively easy to implement.

On Tue, Nov 25, 2014 at 1:25 PM, Andrei Ivanov aiva...@iponweb.net wrote:

Nikolai,

Just in case you've missed my comment in the thread (guess you have) - increasing the sstable size does nothing (in our case at least). That is, it's not worse, but the load pattern is still the same - doing nothing most of the time. So, I switched to STCS and we will have to live with the extra storage cost - storage is way cheaper than CPU etc. anyhow :-)

On Tue, Nov 25, 2014 at 5:53 PM, Nikolai Grigoriev ngrigor...@gmail.com wrote: Hi Jean-Armel, I am using the latest and greatest DSE 4.5.2 [...]

--
Nikolai Grigoriev
(514) 772-5178
Re: Compaction Strategy guidance
Jean-Armel,

I have only two large tables, the rest are super-small. In the test cluster of 15 nodes the largest table has about 110M rows. Its total size is about 1.26 Gb per node (total disk space used per node for that CF). It's got about 5K sstables per node - the sstable size is 256Mb.

cfstats on a healthy node look like this:

Read Count: 8973748
Read Latency: 16.130059053251774 ms.
Write Count: 32099455
Write Latency: 1.6124713938912671 ms.
Pending Tasks: 0
Table: wm_contacts
SSTable count: 5195
SSTables in each level: [27/4, 11/10, 104/100, 1053/1000, 4000, 0, 0, 0, 0]
Space used (live), bytes: 1266060391852
Space used (total), bytes: 1266144170869
SSTable Compression Ratio: 0.32604853410787327
Number of keys (estimate): 25696000
Memtable cell count: 71402
Memtable data size, bytes: 26938402
Memtable switch count: 9489
Local read count: 8973748
Local read latency: 17.696 ms
Local write count: 32099471
Local write latency: 1.732 ms
Pending tasks: 0
Bloom filter false positives: 32248
Bloom filter false ratio: 0.50685
Bloom filter space used, bytes: 20744432
Compacted partition minimum bytes: 104
Compacted partition maximum bytes: 3379391
Compacted partition mean bytes: 172660
Average live cells per slice (last five minutes): 495.0
Average tombstones per slice (last five minutes): 0.0

Another table of similar structure (same number of rows) is about 4x smaller. That table does not suffer from those issues - it compacts well and efficiently.

On Mon, Nov 24, 2014 at 2:30 AM, Jean-Armel Luce jaluc...@gmail.com wrote:

Hi Nikolai,

Please could you clarify a little bit what you call "a large amount of data"?

How many tables? How many rows in your largest table? How many GB in your largest table? How many GB per node?

Thanks.

2014-11-24 8:27 GMT+01:00 Jean-Armel Luce jaluc...@gmail.com: Hi Nikolai, Thanks for the information. Please could you clarify a little bit what you call

2014-11-24 4:37 GMT+01:00 Nikolai Grigoriev ngrigor...@gmail.com: Just to clarify - when I was talking about the large amount of data I really meant a large amount of data per node in a single CF (table). [...]
Re: Compaction Strategy guidance
Andrei,

Oh, Monday mornings... Tb, not Gb :)

On Mon, Nov 24, 2014 at 9:12 AM, Andrei Ivanov aiva...@iponweb.net wrote:

Nikolai,

Are you sure about 1.26Gb? Like it doesn't look right - 5195 sstables with 256Mb sstable size...

Andrei

On Mon, Nov 24, 2014 at 5:09 PM, Nikolai Grigoriev ngrigor...@gmail.com wrote: Jean-Armel, I have only two large tables, the rest are super-small. [...]
Re: Compaction Strategy guidance
I was thinking about that option and I would be curious to find out how this change helps you. I suspected that increasing the sstable size won't help too much because the compaction throughput (per task/thread) is still the same. So, it will simply take 4x longer to finish a compaction task. It is possible that because of that the CPU will be under-used for even longer.

My data model, unfortunately, requires this amount of data. And I suspect that regardless of how it is organized I won't be able to optimize it - I do need this data to be in one row so I can read it quickly.

One of the obvious recommendations I have received was to run more than one instance of C* per host. Makes sense - it will reduce the amount of data per node and will make better use of the resources. I would go for it myself, but it may be a challenge for the people in operations: without a VM it would be trickier for them to operate such a setup, and I do not want any VMs there.

Another option is probably to simply shard my data between several identical tables in the same keyspace. I could also think about different keyspaces, but I prefer not to spread the data for the same logical tenant across multiple keyspaces. Take my primary key's hash, do something like mod 4, and add that to the table name :) This would effectively reduce the number of sstables and the amount of data per table (CF). I kind of like this idea more - yes, a bit more challenge at the coding level (sketched below), but obvious benefits without extra operational complexity.

On Mon, Nov 24, 2014 at 9:32 AM, Andrei Ivanov aiva...@iponweb.net wrote:

Nikolai,

This is more or less what I'm seeing on my cluster then. Trying to switch to bigger sstables right now (1Gb).

On Mon, Nov 24, 2014 at 5:18 PM, Nikolai Grigoriev ngrigor...@gmail.com wrote: Andrei, Oh, Monday mornings... Tb :) [...]
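A minimal sketch of the hash-mod-N table sharding idea mentioned above. The shard count of 4 and the hashing of the partition key string are illustrative assumptions (the base table name is borrowed from the cfstats earlier in the thread); the real scheme would have to hash the same key representation consistently everywhere the table name is built. All shards share one schema, and each ends up with roughly 1/N of the rows and sstables.

// Sketch: spread one logical CF over N identical tables by hashing the partition key.
public class TableSharding {

    private static final int SHARDS = 4;                    // assumed shard count
    private static final String BASE_TABLE = "wm_contacts"; // table name taken from the thread

    // e.g. wm_contacts_0 .. wm_contacts_3
    public static String tableFor(String partitionKey) {
        int shard = Math.floorMod(partitionKey.hashCode(), SHARDS);
        return BASE_TABLE + "_" + shard;
    }

    public static void main(String[] args) {
        // The application builds its statements against the shard-specific table.
        System.out.println("SELECT * FROM " + tableFor("user-12345") + " WHERE id = ?");
    }
}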
Re: Compaction Strategy guidance
Just to clarify - when I was talking about the large amount of data I really meant a large amount of data per node in a single CF (table). LCS does not seem to like it when it gets thousands of sstables (it makes 4-5 levels).

When bootstrapping a new node you'd better enable that option from CASSANDRA-6621 (the one that disables STCS in L0). But it will still be a mess - I have a node that I bootstrapped ~2 weeks ago. Initially it had 7.5K pending compactions; now it has almost stabilized at 4.6K and does not go down. The number of sstables at L0 is over 11K and it is only slowly building the upper levels. The total number of sstables is 4x the normal amount. Now I am not entirely sure if this node will ever get back to normal life. And believe me - this is not because of I/O: I have SSDs everywhere and 16 physical cores. This machine is barely using 1-3 cores most of the time.

The problem is that allowing the STCS fallback is not a good option either - it will quickly result in a few 200Gb+ sstables in my configuration, and then these sstables will never be compacted. Plus, it will require close to 2x disk space on EVERY disk in my JBOD configuration... this will kill the node sooner or later. This is all because all sstables after bootstrap end up at L0 and then the process slowly moves them to other levels. If you have write traffic to that CF then the number of sstables at L0 will grow quickly - like it is happening in my case now. Once something like https://issues.apache.org/jira/browse/CASSANDRA-8301 is implemented it may be better.

On Sun, Nov 23, 2014 at 4:53 AM, Andrei Ivanov aiva...@iponweb.net wrote:

Stephane,

We are having a somewhat similar C* load profile. Hence some comments in addition to Nikolai's answer.

1. Fallback to STCS - you can actually disable it.
2. Based on our experience, if you have a lot of data per node, LCS may work just fine. That is, until the moment you decide to join another node - chances are that the newly added node will not be able to compact what it gets from the old nodes. In your case, if you switch strategy the same thing may happen. This is all due to the limitations mentioned by Nikolai.

Andrei

On Sun, Nov 23, 2014 at 8:51 AM, Servando Muñoz G. smg...@gmail.com wrote: [...]

--
Nikolai Grigoriev
(514) 772-5178
Re: Compaction Strategy guidance
Stephane,

Like everything good, LCS comes at a certain price. LCS will put the most load on your I/O system (if you use spindles you may need to be careful about that) and on the CPU. Also, LCS (by default) may fall back to STCS if it is falling behind (which is very possible with heavy write activity), and this will result in higher disk space usage.

Also, LCS has a certain limitation I have discovered lately. Sometimes LCS may not be able to use all of your node's resources (algorithm limitations), and this reduces the overall compaction throughput. This may happen if you have a large column family with lots of data per node. STCS won't have this limitation.

By the way, the primary goal of LCS is to reduce the number of sstables C* has to look at to find your data. With LCS properly functioning, this number will most likely be between something like 1 and 3 for most of the reads. But if you do few reads and are not concerned about the latency today, most likely LCS may only save you some disk space.

On Sat, Nov 22, 2014 at 6:25 PM, Stephane Legay sle...@looplogic.com wrote:

Hi there, use case:

- Heavy write app, few reads.
- Lots of updates of rows / columns.
- Current performance is fine, for both writes and reads.
- Currently using SizedCompactionStrategy

We're trying to limit the amount of storage used during compaction. Should we switch to LeveledCompactionStrategy?

Thanks

--
Nikolai Grigoriev
(514) 772-5178
Re: high context switches
How do the clients connect, which protocol is used, and do they use keep-alive connections? Is it possible that the clients use Thrift and the server type is sync? It is just my guess, but in this scenario, with a high number of clients connecting and disconnecting often, there may be a high amount of context switching.

On Fri, Nov 21, 2014 at 4:21 AM, Jan Karlsson jan.karls...@ericsson.com wrote:

Hello,

We are running a 3 node cluster with RF=3 and 5 clients in a test environment. The C* settings are mostly default. We noticed quite high context switching during our tests. On 100 000 000 keys/partitions we averaged around 260 000 cs (with a max of 530 000). We were running ~12 000 transactions per second: 10 000 reads and 2 000 updates.

Nothing really wrong with that, however I would like to understand why these numbers are so high. Have others noticed this behavior? How much context switching is expected and why? What are the variables that affect this?

/J

--
Nikolai Grigoriev
(514) 772-5178
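If the clients do turn out to open a connection per request, the usual remedy is to keep one long-lived session with pooled connections. A sketch with the DataStax Java driver, under assumptions: the contact point is a placeholder, and the query is just a harmless example; the point is that the Cluster and Session are created once at startup and shared by every request instead of connecting and disconnecting each time.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class LongLivedSession {
    // One Cluster/Session per application, created once and shared by all threads.
    private static final Cluster CLUSTER =
            Cluster.builder().addContactPoint("10.0.0.1").build();   // placeholder contact point
    private static final Session SESSION = CLUSTER.connect();

    public static Session session() {
        return SESSION;   // reuse pooled, keep-alive connections instead of reconnecting
    }

    public static void main(String[] args) {
        // Every request goes through the same pooled session.
        session().execute("SELECT release_version FROM system.local");
        CLUSTER.close();
    }
}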
coordinator selection in remote DC
Hi,

There is something odd I have observed when testing a configuration with two DCs for the first time. I wanted to do a simple functional test to prove to myself (and my pessimistic colleagues ;) ) that it works.

I have a test cluster of 6 nodes, 3 in each DC, and a keyspace that is replicated as follows:

CREATE KEYSPACE xxx WITH replication = { 'class': 'NetworkTopologyStrategy', 'DC2': '3', 'DC1': '3' };

I have disabled the traffic compression between DCs to get more accurate numbers. I have set up a bunch of IP accounting rules on each node so they count the outgoing traffic from this node to each other node. I had rules for different ports but, of course, it is mostly about port 7000 (or 7001) when talking about inter-node traffic. Anyway, I have a table that shows the traffic from any node to any other node's port 7000.

I have run a test with DCAwareRoundRobinPolicy and the client talking only to DC1 nodes. Everything looks fine - the client has sent an identical amount of data to each of the 3 nodes in DC1. The nodes inside DC1 (I was writing with LOCAL_ONE consistency) have sent each other a similar amount of data that represents exactly the two extra replicas. However, when I look at the traffic from the nodes in DC1 to the nodes in DC2, the picture is different:

10.3.45.156 10.3.45.159 dpt:7000 117,273,075
10.3.45.156 10.3.45.160 dpt:7000 228,326,091
10.3.45.156 10.3.45.161 dpt:7000 46,924,339
10.3.45.157 10.3.45.159 dpt:7000 118,978,269
10.3.45.157 10.3.45.160 dpt:7000 230,444,929
10.3.45.157 10.3.45.161 dpt:7000 47,394,179
10.3.45.158 10.3.45.159 dpt:7000 113,969,248
10.3.45.158 10.3.45.160 dpt:7000 225,844,838
10.3.45.158 10.3.45.161 dpt:7000 46,338,939

Nodes 10.3.45.156-158 are in DC1, .159-161 are in DC2. As you can see, each of the nodes in DC1 has sent a different amount of traffic to the remote nodes: 117Mb, 228Mb and 46Mb respectively. Both DCs have one rack.

So, here is my question: how does a node select the node in the remote DC to send the message to? I did a quick sweep through the code and I could only find the sorting by proximity (checking the rack and DC). So, considering that for each request I fire the targets are all 3 nodes in the remote DC, the list will contain all 3 nodes in DC2. And, if I understood correctly, the first node from the list is picked to send the message to. So, it seems to me that no round-robin-type logic is applied when selecting, from the list of targets in the remote DC, the node to forward the write to. If this is true (and the numbers kind of show it is, right?), then probably the list of nodes with equal proximity should be shuffled randomly? Or, instead of picking the first target, a random one should be picked?

--
Nikolai Grigoriev
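For reference, this is roughly how a test client is pointed at DC1 only; a sketch with the DataStax Java driver, where the contact point, DC name and consistency level come from the description above, and the table/statement is only a placeholder.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

public class Dc1OnlyClient {
    public static void main(String[] args) {
        // Route all client requests to coordinators in DC1 only.
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.3.45.156")                        // a DC1 node from the test
                .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("DC1"))
                .build();
        Session session = cluster.connect();

        // LOCAL_ONE writes: one local ack, remote DC replicas updated asynchronously.
        SimpleStatement stmt = new SimpleStatement(
                "INSERT INTO xxx.t (id, val) VALUES (1, 'x')");        // placeholder table
        stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
        session.execute(stmt);

        cluster.close();
    }
}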
Re: any way to get nodetool proxyhistograms data for an entire cluster?
DSE can manage an existing cluster (even vanilla Cassandra). And you are not required to use OpsCenter for DSE, of course.

As for the graphs... I do think that OpsCenter is not necessarily the ideal graphing tool for Cassandra. If you want nice and detailed (and independent) dashboards, you can feed Cassandra metrics to anything - Graphite etc. I use both - OpsCenter and Graphite. There are things that you get out of the box from OpsCenter, there are very nice things you can only do with Graphite :)

On Thu, Nov 20, 2014 at 11:10 AM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Nov 19, 2014 at 7:20 PM, Clint Kelly clint.ke...@gmail.com wrote: We have DSE so I can use opscenter. I was just looking for something more precise than the graphs that I get from opscenter.

To be clear, I don't believe DSE is required to use OpsCenter. My understanding is that it can be used with vanilla Apache Cassandra, but I have never actually tried to do so.

=Rob

--
Nikolai Grigoriev
(514) 772-5178
Re: coordinator selection in remote DC
Hmmm... I am using:

endpoint_snitch: com.datastax.bdp.snitch.DseDelegateSnitch

which in turn uses:

delegated_snitch: org.apache.cassandra.locator.PropertyFileSnitch

(for this specific test cluster). I did not check the code - is the dynamic snitch on by default and, maybe, used as a wrapper for the configured endpoint_snitch? That would explain the difference in the inter-DC traffic for sure. Also, it would not affect the local DC traffic, as all local nodes are replicas for the data anyway.

On Thu, Nov 20, 2014 at 12:03 PM, Tyler Hobbs ty...@datastax.com wrote:

The difference is likely due to the DynamicEndpointSnitch (aka dynamic snitch), which picks replicas to send messages to based on recently observed latency and self-reported load (accounting for compactions, repair, etc). If you want to confirm this, you can disable the dynamic snitch by adding this line to cassandra.yaml: dynamic_snitch: false.

On Thu, Nov 20, 2014 at 9:52 AM, Nikolai Grigoriev ngrigor...@gmail.com wrote: Hi, There is something odd I have observed when testing a configuration with two DCs for the first time. [...]

--
Tyler Hobbs
DataStax
http://datastax.com/

--
Nikolai Grigoriev
(514) 772-5178