Re: Coprocessors/Triggers in C*
I understood it as "run a trigger when a column gets deleted due to TTL", so - as you said - it doesn't sound like something that can be done. Gareth, TTL'd columns in Cassandra are not really removed when the TTL expires - they are just ignored from that point on (so they're not returned by queries), but they still exist until they're turned into tombstones and then removed after the gc grace period. Cassandra doesn't know the exact moment they become outdated due to TTL. It might be doable to do something when they get converted to tombstones, but I don't think that's the use case you're looking for. M. I do not understand what feature you are suggesting. Columns can already have a TTL. Are you speaking of a TTL'd column that could delete something besides itself? That does not sound easy, because a TTL'd column is dormant until it is read or compacted. On Tuesday, June 11, 2013, Gareth Collins gareth.o.coll...@gmail.com wrote: Hello Edward, I am curious - what about triggering on a TTL timeout delete (something I am most interested in doing - perhaps it doesn't make sense?)? Would you say that is something the user should implement themselves? Would you see intravert being able to do something with this at some later point (somehow)? thanks, Gareth On Tue, Jun 11, 2013 at 2:34 PM, Edward Capriolo edlinuxg...@gmail.com wrote: This is arguably something you should do yourself. I have been investigating integrating vertx and cassandra together for a while to accomplish this type of work, mainly to move processing close to the data and eliminate large batches that can be computed from a single map of data. https://github.com/zznate/intravert-ug/wiki/Service-Processor-for-trigger-like-functionality On Tue, Jun 11, 2013 at 5:06 AM, Tanya Malik sonichedg...@gmail.com wrote: Thanks Romain. On Tue, Jun 11, 2013 at 1:44 AM, Romain HARDOUIN romain.hardo...@urssaf.fr wrote: Not yet, but Cassandra 2.0 will provide experimental triggers: https://issues.apache.org/jira/browse/CASSANDRA-1311 Tanya Malik sonichedg...@gmail.com wrote on 11/06/2013 04:12:44: From: Tanya Malik sonichedg...@gmail.com To: user@cassandra.apache.org Date: 11/06/2013 04:13 Subject: Coprocessors/Triggers in C* Hi, Does C* support something like co-processor functionality/triggers to run client-supplied code in the address space of the server?
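[Editor's note] To make the lifecycle Michal describes concrete, here is a minimal sketch using a recent DataStax Java driver against a hypothetical, pre-existing table demo.events (id text PRIMARY KEY, payload text); the keyspace, table and contact point are assumptions, not taken from the thread. It only demonstrates the read-visible part: once the TTL expires the column simply stops being returned by queries, while physical removal happens later, at compaction time after gc_grace_seconds.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TtlVisibilitySketch {
    public static void main(String[] args) throws InterruptedException {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Write a column that expires after 5 seconds (table assumed to already exist).
        session.execute("INSERT INTO demo.events (id, payload) VALUES ('e1', 'hello') USING TTL 5");

        Row before = session.execute("SELECT payload FROM demo.events WHERE id = 'e1'").one();
        System.out.println("before expiry: " + (before == null ? "not found" : before.getString("payload")));

        Thread.sleep(6000);

        // After the TTL the column is merely ignored by reads; no delete event fires, and the
        // data is only purged during compaction once gc_grace_seconds has passed.
        Row after = session.execute("SELECT payload FROM demo.events WHERE id = 'e1'").one();
        System.out.println("after expiry: " + (after == null ? "not found" : after.getString("payload")));

        cluster.close();
    }
}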
Re: Coprocessors/Triggers in C*
Edward, Michal, Thanks very much for the answers. I hadn't really thought before about how Cassandra would implement the TTL feature. I had foolishly assumed that it would be like a delete (which I would eventually be able to trigger on to execute another action), but it makes sense how it is really implemented. I will need to find another way outside of Cassandra to implement my "do something if not deleted before TTL" requirement (ugh). Anyway, thanks again for the clarification. Gareth
Re: Coprocessors/Triggers in C*
On 13.6.2013 8:19, Michal Michalski wrote: It might be doable to do something when they get converted to tombstones, but I don't think that's the use case you're looking for. Actually, this would be good enough for me.
Scaling a cassandra cluster with auto_bootstrap set to false
Hi Cassandra community, we are currently experimenting with different Cassandra scaling strategies. We observed that Cassandra performance decreases drastically when we insert more data into the cluster (say, going from 60GB to 600GB in a 3-node cluster). So we want to find out how to deal with this problem. One scaling strategy seems interesting, but we don't fully understand what is going on yet. The strategy works like this: add new nodes to a Cassandra cluster with auto_bootstrap = false to avoid streaming to the new nodes. We were a bit surprised that this strategy improved performance considerably and that it worked much better than other strategies we tried before, both in terms of scaling speed and performance impact during scaling. Let me share our little experiment with you: In an initial setup S1 we have 4 nodes, where each node is similar to the Amazon EC2 large instance type, i.e., 4 cores, 15GB memory, 700GB free disk space, with a Cassandra replication factor of 2. Each node is loaded with 10 million 1KB rows into a single column family, i.e., ~20 GB data/node, using the Yahoo Cloud Serving Benchmark (YCSB) tool. All Cassandra settings are defaults. In setup S1 we achieved an average throughput of ~800 ops/s. The workload is a 95/5 read/update mix with a Zipfian request distribution (= YCSB workload B). Setup S2: We then added two empty nodes to our 4-node cluster with auto_bootstrap set to false. The throughput that we observed thereafter tripled from 800 ops/s to 2,400 ops/s. We looked at various outputs from nodetool commands to understand this effect. On the new nodes, $ nodetool info tells us that the key cache is empty; $ nodetool cfstats clearly shows write and read requests coming in. The memtable column count and data size are multiple times larger compared to the other four nodes. We are wondering: what exactly gets stored on the two new nodes in setup S2, and where (cache, memtable, disk)? Would it be necessary (in a production environment) to stream the old SSTables from the other four nodes at some point in time? Or can we simply be happy with the performance improvement and leave it like this? Are we missing something here? Can you advise us on specific monitoring data to look at to better understand the observed effect? Thanks, Markus Klems
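[Editor's note] For anyone trying to reproduce a setup like this, here is a rough sketch of the YCSB core-workload properties that correspond to the description above (~1 KB records, 95/5 read/update, Zipfian); the concrete values and the helper class are illustrative assumptions, not taken from Markus's runs.

import java.util.Properties;

public class WorkloadBProperties {
    // Builds a property set for a YCSB "workload B"-style run as described above.
    public static Properties build(String cassandraHosts) {
        Properties p = new Properties();
        p.setProperty("hosts", cassandraHosts);        // contact points for the Cassandra binding
        p.setProperty("recordcount", "40000000");      // e.g. 10M rows per node x 4 nodes; adjust to your cluster
        p.setProperty("fieldcount", "10");             // 10 fields x 100 bytes = ~1 KB per row
        p.setProperty("fieldlength", "100");
        p.setProperty("readproportion", "0.95");       // workload B: 95% reads
        p.setProperty("updateproportion", "0.05");     // 5% updates
        p.setProperty("requestdistribution", "zipfian");
        return p;
    }
}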
Billions of counters
We want to precalculate counts for some common usage metrics. We have events, locations, products, etc. The problem is we have millions of events/day, thousands of locations, and millions of products. We're trying to precalculate counts for some common queries like 'how many times was product X purchased in location Y last week'. It seems like we'll end up with trillions of counters for even these basic permutations. Is this a cause for concern? TIA -- Darren
cql-rb, the CQL3 driver for Ruby has reached v1.0
After a few months of development and many preview releases, cql-rb, the pure Ruby CQL3 driver, has finally reached v1.0. You can find the code and examples on GitHub: https://github.com/iconara/cql-rb T#
Re: Billions of counters
Hi! We have a similar situation of millions of events on millions of items - it turns out that this isn't really a problem, because there tends to be a very strong power-law distribution: very few of the items get a lot of hits, some get some, and the majority gets almost no hits (though most of them do get hits every now and then). So it's basically a sparse multidimensional array, and it turns out that Cassandra is pretty good at storing those. We just treat a missing counter column as zero, and add a counter only when necessary. To avoid I/O, we also do some statistical sampling for certain counters where we don't need an exact figure. YMMV, of course, but I'd look at the likelihood of all the products being purchased from the same location during one week at least once, and start the modeling from there. :) /Janne
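[Editor's note] A rough sketch of the sparse-counter approach Janne describes, using the DataStax Java driver and a hypothetical counter table (metrics.purchases_by_week); all names here are illustrative assumptions. A counter row is only ever created when a purchase actually happens, and a missing row is read back as zero by the application.

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class PurchaseCounters {
    // Assumed schema (not from the thread):
    // CREATE TABLE metrics.purchases_by_week (
    //   product_id bigint, location_id int, week int, hits counter,
    //   PRIMARY KEY ((product_id, location_id), week));
    private final Session session;
    private final PreparedStatement increment;
    private final PreparedStatement read;

    public PurchaseCounters(Session session) {
        this.session = session;
        this.increment = session.prepare(
            "UPDATE metrics.purchases_by_week SET hits = hits + 1 "
          + "WHERE product_id = ? AND location_id = ? AND week = ?");
        this.read = session.prepare(
            "SELECT hits FROM metrics.purchases_by_week "
          + "WHERE product_id = ? AND location_id = ? AND week = ?");
    }

    // Called only when an actual purchase event arrives, so counters stay sparse.
    public void recordPurchase(long productId, int locationId, int week) {
        session.execute(increment.bind(productId, locationId, week));
    }

    // A missing counter row simply means zero; nothing is ever pre-created.
    public long count(long productId, int locationId, int week) {
        Row row = session.execute(read.bind(productId, locationId, week)).one();
        return row == null ? 0L : row.getLong("hits");
    }
}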
Re: Scaling a cassandra cluster with auto_bootstrap set to false
On Thu, Jun 13, 2013 at 10:47 AM, Markus Klems markuskl...@gmail.com wrote: One scaling strategy seems interesting but we don't fully understand what is going on, yet. The strategy works like this: add new nodes to a Cassandra cluster with auto_bootstrap = false to avoid streaming to the new nodes. If you set auto_bootstrap to false, new nodes take over responsibility for a range of the ring but do not receive the data for that range from the old nodes. If you read the new node at CL.ONE, you will get the answer that data you wrote to the old node does not exist, because the new node did not receive it as part of bootstrap. This is probably not what you expect. We were a bit surprised that this strategy improved performance considerably and that it worked much better than other strategies that we tried before, both in terms of scaling speed and performance impact during scaling. CL.ONE requests for rows which do not exist are very fast. Would it be necessary (in a production environment) to stream the old SSTables from the other four nodes at some point in time? Bootstrapping is necessary for consistency and durability, yes. If you were to: 1) start a new node without bootstrapping it, and 2) run a cleanup compaction on the old node, you would permanently delete the copy of the data that is no longer supposed to live on the old node. With an RF of 1, that data would be permanently gone. With an RF > 1 you have other copies, but if you never bootstrap while adding new nodes you are relatively likely to lose access to those copies over time. =Rob
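[Editor's note] A small sketch of how one might observe the effect Robert describes from a client, assuming the DataStax Java driver and hypothetical keyspace/table/key names (ycsb.usertable, y_id): a read at ONE may be answered entirely by the new, empty node, while a read at QUORUM has to involve more replicas and is more likely to surface the copy still held by the old nodes (and to trigger a read repair).

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ConsistencyLevelCheck {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint(args[0]).build();
        Session session = cluster.connect("ycsb");                       // keyspace name is an assumption

        String query = "SELECT * FROM usertable WHERE y_id = 'user1'";   // table/key names are assumptions

        // May be served by a single replica - possibly the new node that has no data yet.
        ResultSet one = session.execute(
                new SimpleStatement(query).setConsistencyLevel(ConsistencyLevel.ONE));
        System.out.println("rows at ONE:    " + one.all().size());

        // Must consult a majority of replicas, so an old replica that still holds the row is read too.
        ResultSet quorum = session.execute(
                new SimpleStatement(query).setConsistencyLevel(ConsistencyLevel.QUORUM));
        System.out.println("rows at QUORUM: " + quorum.all().size());

        cluster.close();
    }
}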
Re: Scaling a cassandra cluster with auto_bootstrap set to false
Robert, thank you for your explanation. I think you are right. YCSB probably does not correctly interpret the missing-record response. We will look into it and report our results here in the next few days. Thanks, Markus
RE: Looking for a fully working AWS multi DC configuration.
For the ones that need access by public IP we have not found a way to automate it. Would be curious to know if anyone else has been able to do that. In the case of access by private IP we just specify the security group as the source. From: Alain RODRIGUEZ [mailto:arodr...@gmail.com] Sent: Wednesday, June 05, 2013 5:45 PM To: user@cassandra.apache.org Subject: Re: Looking for a fully working AWS multi DC configuration. Do you open all these nodes one by one in every Security Group in each region every time you add a node, or did you manage to automate it somehow? 2013/6/5 Dan Kogan d...@iqtell.com Hi, We are using a very similar configuration. From our experience, Cassandra nodes in the same DC need access over both public and private IP on the storage port (7000/7001). Nodes from the other DC will need access over public IP on the storage port. All Cassandra nodes also need access over the public IP on the Thrift port (9160). Dan From: Alain RODRIGUEZ [mailto:arodr...@gmail.com] Sent: Wednesday, June 05, 2013 9:49 AM To: user@cassandra.apache.org Subject: Looking for a fully working AWS multi DC configuration. Hi, We used to work on a single DC (EC2Snitch / SimpleStrategy). For latency reasons we had to open a new DC in the US (us-east). We run C* 1.2.2. We don't use VPC. Now we use: - 2 DCs (eu-west, us-east) - EC2MultiRegionSnitch / NTS - public IPs as broadcast_address and seeds - private IPs as listen_address Yet we are experiencing some issues (node can't reach itself, "Could not start register mbean in JMX"...), mainly because of the use of public IPs and the AWS inter-region communication. If someone has successfully set up this kind of cluster, I would like to know if our configuration is correct and if I am missing something. I also would like to know what ports I have to open and where I have to open them from. Any insight would be greatly appreciated.
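[Editor's note] For reference, a minimal cassandra.yaml sketch of the kind of per-node settings this EC2 multi-region, non-VPC layout uses; the IP addresses and seed list below are placeholders, not Alain's or Dan's actual values.

# cassandra.yaml fragment (one node, eu-west)
endpoint_snitch: Ec2MultiRegionSnitch

listen_address: 10.0.1.12          # this node's PRIVATE IP (intra-region traffic)
broadcast_address: 54.217.10.98    # this node's PUBLIC IP, used by nodes in the other region
rpc_address: 0.0.0.0               # clients connect on the Thrift port (9160)

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "54.217.10.98,54.242.33.17"   # PUBLIC IPs of seed nodes from both regions

# Security groups then need the storage port 7000 (7001 with SSL) open between nodes
# (public + private IPs within a region, public IPs across regions) and 9160 open for clients.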
Re: Scaling a cassandra cluster with auto_bootstrap set to false
"CL.ONE requests for rows which do not exist are very fast." http://adrianotto.com/2010/08/dev-null-unlimited-scale/
Re: Scaling a cassandra cluster with auto_bootstrap set to false
On Thu, Jun 13, 2013 at 11:20 PM, Edward Capriolo edlinuxg...@gmail.com wrote: "CL.ONE requests for rows which do not exist are very fast." http://adrianotto.com/2010/08/dev-null-unlimited-scale/ Yep, /dev/null is a mighty force ;-) I took a look at the YCSB source code and spotted the line of code that caused our confusion. It is in the file https://github.com/brianfrankcooper/YCSB/blob/master/core/src/main/java/com/yahoo/ycsb/workloads/CoreWorkload.java, in the public boolean doTransaction(DB db, Object threadstate) method, at line 497. No matter what the result of a YCSB transaction operation is, the method always returns true. Not sure if this is desirable behavior for a benchmarking tool; it makes it difficult to spot these kinds of mistakes. The problem can also be observed by running this piece of code:

public static void main(String[] args) {
  CassandraClient10 cli = new CassandraClient10();

  Properties props = new Properties();
  props.setProperty("hosts", args[0]);
  cli.setProperties(props);

  try {
    cli.init();
  } catch (Exception e) {
    e.printStackTrace();
    System.exit(0);
  }

  // Insert a row with three columns.
  HashMap<String, ByteIterator> vals = new HashMap<String, ByteIterator>();
  vals.put("age", new StringByteIterator("57"));
  vals.put("middlename", new StringByteIterator("bradley"));
  vals.put("favoritecolor", new StringByteIterator("blue"));
  int res = cli.insert("usertable", "BrianFrankCooper", vals);
  System.out.println("Result of insert: " + res);

  // Read the row back (passing null for the field set reads all fields).
  HashMap<String, ByteIterator> result = new HashMap<String, ByteIterator>();
  HashSet<String> fields = new HashSet<String>();
  fields.add("middlename");
  fields.add("age");
  fields.add("favoritecolor");
  res = cli.read("usertable", "BrianFrankCooper", null, result);
  System.out.println("Result of read: " + res);
  for (String s : result.keySet()) {
    System.out.println("[" + s + "]=[" + result.get(s) + "]");
  }

  // Delete the row, then read it again.
  res = cli.delete("usertable", "BrianFrankCooper");
  System.out.println("Result of delete: " + res);

  res = cli.read("usertable", "BrianFrankCooper", null, result);
  System.out.println("Result of read: " + res);
  for (String s : result.keySet()) {
    System.out.println("[" + s + "]=[" + result.get(s) + "]");
  }
}

which results in:

Result of insert: 0
Result of read: 0
[middlename]=[bradley]
[favoritecolor]=[blue]
[age]=[57]
Result of delete: 0
Result of read: 0
[middlename]=[]
[favoritecolor]=[]
[age]=[]

The second read should not report success (0). @Robert, @Edward: thanks for your help, -Markus
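[Editor's note] To close the loop on the line-497 observation, here is a rough illustration (not the actual YCSB code) of the kind of change that would have made the failure visible: doTransaction would need the per-operation helpers to hand back the status code instead of discarding it, and then return false on anything other than 0.

// Illustrative sketch only. In the real CoreWorkload the doTransactionRead()/
// doTransactionUpdate() helpers do not currently return a status, and
// doTransaction() returns true unconditionally (the line 497 mentioned above).
public boolean doTransaction(DB db, Object threadstate) {
    String op = operationchooser.nextString();
    int status;
    if (op.compareTo("READ") == 0) {
        status = doTransactionRead(db);        // assumed to return the DB status code
    } else if (op.compareTo("UPDATE") == 0) {
        status = doTransactionUpdate(db);      // assumed to return the DB status code
    } else {
        status = doTransactionInsert(db);      // other operation types handled the same way
    }
    return status == 0;                        // 0 is the OK status; anything else counts as a failure
}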