Recommended way of data migration
Hello,

Let's say we have a simple CQL3 table:

CREATE TABLE example (
    id UUID PRIMARY KEY,
    timestamp TIMESTAMP,
    data ASCII
);

and I need to mutate (for example, encrypt) the values in the data column for all rows. What's the recommended approach to performing such a migration programmatically?

My general approach would be:

1. Create another column family.
2. Extract a batch of records.
3. For each extracted record, perform the mutation, insert it into the new CF, and delete it from the old one.
4. Repeat until the source CF is empty.

Is this a correct approach, and if so, how do I implement some kind of paging for step 2?
Re: Recommended way of data migration
I would do something like you are suggesting, but I would not do the delete until all the rows are moved. Since writes in Cassandra are idempotent, you can even run the migration process multiple times without harm.

On Sat, Sep 7, 2013 at 5:31 PM, Renat Gilfanov gren...@mail.ru wrote:
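The batched copy loop described above can be sketched as follows. For step 2's paging, a common approach on Cassandra 1.2 is token-range paging: order a full-table walk by token(id) and request each page with WHERE token(id) > last_token LIMIT n. This is a runnable sketch with the driver mocked by an in-memory dict; the table names, the encrypt() placeholder, and the batch size are all assumptions, not anything from the thread.

```python
def encrypt(value):
    """Placeholder mutation; a real migration would use an actual cipher."""
    return value[::-1]

def migrate(source_rows, batch_size=100):
    """Copy every row from `source_rows` into a new table, mutating `data`.

    `source_rows` maps id -> row dict; `token(id)` is simulated with hash().
    The real CQL per page would be roughly:
        SELECT id, timestamp, data FROM example
        WHERE token(id) > ? LIMIT <batch_size>
    """
    target = {}
    # Order rows by their (simulated) token, as the partitioner would.
    ordered = sorted(source_rows.items(), key=lambda kv: hash(kv[0]))
    last_token = None
    while True:
        # "WHERE token(id) > last_token LIMIT batch_size"
        page = [kv for kv in ordered
                if last_token is None or hash(kv[0]) > last_token][:batch_size]
        if not page:
            break  # source CF exhausted
        for row_id, row in page:
            mutated = dict(row, data=encrypt(row["data"]))
            target[row_id] = mutated  # idempotent: re-running overwrites harmlessly
        last_token = hash(page[-1][0])
    return target
```

Per the advice above, deletes from the source CF are deliberately left out of the loop; they would run as a separate pass after the copy is verified, and the whole process can be re-run safely because each insert is idempotent.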
[ANN] Cassaforte 1.2.0 is released
Cassaforte [1] is a Clojure client for Apache Cassandra 1.2+. It is built around CQL 3 and focuses on ease of use. You will likely find that using Cassandra from Clojure has never been so easy.

1.2.0 is a minor release that introduces one minor feature, fixes a couple of bugs, and makes Cassaforte compatible with Cassandra 2.0.

Release notes: http://blog.clojurewerkz.org/blog/2013/09/07/cassaforte-1-dot-2-0-is-released/

1. http://clojurecassandra.info/ http://clojurememcached.info/

--
Alex P
https://github.com/ifesdjeen
https://twitter.com/ifesdjeen
Re: row cache
I have found the row cache to be more trouble than benefit. The term "fool's gold" comes to mind. Using the key cache and leaving more free main memory seems stable and does not have as many complications.

On Wednesday, September 4, 2013, S C as...@outlook.com wrote:

Thank you all for your valuable comments and information.
-SC

Date: Tue, 3 Sep 2013 12:01:59 -0400
From: chris.burrou...@gmail.com
To: user@cassandra.apache.org
CC: fsareshw...@quantcast.com
Subject: Re: row cache

On 09/01/2013 03:06 PM, Faraaz Sareshwala wrote:

Yes, that is correct. The SerializingCacheProvider stores row cache contents off heap. I believe you need JNA enabled for this, though. Someone please correct me if I am wrong here. The ConcurrentLinkedHashCacheProvider stores row cache contents on the Java heap itself.

Naming things is hard. Both caches are in memory and are backed by a ConcurrentLinkedHashMap. In the case of the SerializingCacheProvider, the *values* are stored in off-heap buffers. Both must store a half dozen or so objects (on heap) per entry (org.apache.cassandra.cache.RowCacheKey, com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue, java.util.concurrent.ConcurrentHashMap$HashEntry, etc.). It would probably be better to call this a mixed-heap rather than an off-heap cache. You may find the number of entries you can hold without GC problems to be surprisingly low (relative to, say, memcached, or physical memory on modern hardware).

Invalidating a column with SerializingCacheProvider invalidates the entire row, while with ConcurrentLinkedHashCacheProvider it does not. SerializingCacheProvider does not require JNA.

Both also use memory estimation of the size (of the values only) to determine the total number of entries retained. Estimating the size of the totally on-heap ConcurrentLinkedHashCacheProvider has historically been dicey since we switched from sizing in entries, and it has been removed in 2.0.0.
As said elsewhere in this thread, the utility of the row cache varies from absolutely essential to a source of numerous problems, depending on the specifics of the data model and request distribution.
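For reference, in the Cassandra 1.2 era discussed here, the row cache was sized globally in cassandra.yaml and switched on per table. A sketch of the relevant settings (property names as of 1.2; the sizes are illustrative, not recommendations):

```yaml
# cassandra.yaml (Cassandra 1.2-era settings; values illustrative)
row_cache_size_in_mb: 0                        # 0 disables the row cache (the default)
row_cache_provider: SerializingCacheProvider   # the "off-heap" (mixed-heap) provider discussed above
key_cache_size_in_mb:                          # empty = auto (min(5% of heap, 100MB))
```

Per table, caching is then selected in CQL3 with something like `ALTER TABLE example WITH caching = 'keys_only';` (the 1.2/2.0 values being 'all', 'keys_only', 'rows_only', and 'none'), which matches the advice above to prefer the key cache.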
w00tw00t.at.ISC.SANS.DFind not found
Hey all,

I'm seeing this exception in my Cassandra logs:

Exception during http request
mx4j.tools.adaptor.http.HttpException: file mx4j/tools/adaptor/http/xsl/w00tw00t.at.ISC.SANS.DFind:) not found
    at mx4j.tools.adaptor.http.XSLTProcessor.notFoundElement(XSLTProcessor.java:314)
    at mx4j.tools.adaptor.http.HttpAdaptor.findUnknownElement(HttpAdaptor.java:800)
    at mx4j.tools.adaptor.http.HttpAdaptor$HttpClient.run(HttpAdaptor.java:976)

Do I need to be concerned about the security of this server? How can I correct/eliminate this error message? I've just upgraded to Cassandra 2.0, and this is the first time I've seen this error.

Thanks!
Tim

--
GPG me!! gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B
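For context: "w00tw00t.at.ISC.SANS.DFind:)" is the request signature of the DFind web-vulnerability scanner, so this log line means the optional mx4j HTTP adaptor port is reachable by an Internet scanner; it is a probe hitting the JMX-over-HTTP interface, not a Cassandra bug. A common mitigation is to bind mx4j to loopback (or firewall the port). A sketch of the cassandra-env.sh settings, assuming the variable names used by the mx4j support shipped with Cassandra:

```shell
# cassandra-env.sh: bind the optional mx4j HTTP adaptor to loopback only,
# so scanners like DFind can no longer reach it from outside the host.
MX4J_ADDRESS="-Dmx4jaddress=127.0.0.1"
MX4J_PORT="-Dmx4jport=8081"
```

If the HTTP adaptor isn't needed at all, removing the mx4j-tools jar from the classpath disables it entirely.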
Re: row cache
I agree. We've had a similar experience.

Sent from my iPhone

On Sep 7, 2013, at 6:05 PM, Edward Capriolo edlinuxg...@gmail.com wrote: