Re: column with TTL of 10 seconds lives very long...
If you do that same get again, is the column still being returned? (days later) -Jeremiah On Thu, May 23, 2013 at 6:16 AM, Tamar Fraenkel ta...@tok-media.com wrote: Hi! TTL was set: [default@HLockingManager] get HLocks['/LockedTopic/31a30c12-652d-45b3-9ac2-0401cce85517']; => (column=69b057d4-3578-4326-a9d9-c975cb8316d2, value=36396230353764342d333537382d343332362d613964392d633937356362383331366432, timestamp=1369307815049000, ttl=10) Also, all other lock columns expire as expected. Thanks, Tamar *Tamar Fraenkel* Senior Software Engineer, TOK Media ta...@tok-media.com Tel: +972 2 6409736 Mob: +972 54 8356490 Fax: +972 2 5612956 On Thu, May 23, 2013 at 1:58 PM, moshe.kr...@barclays.com wrote: Maybe you didn't set the TTL correctly. Check the TTL of the column using CQL, e.g.: SELECT TTL (colName) FROM colFamilyName WHERE condition; From: Felipe Sere [mailto:felipe.s...@1und1.de] Sent: Thursday, May 23, 2013 1:28 PM To: user@cassandra.apache.org Subject: RE: column with TTL of 10 seconds lives very long... This is interesting as it might affect me too :) I have been observing deadlocks with HLockManagerImpl which don't get resolved for a long time, even though the columns with the locks should only live for about 5-10 seconds. Any ideas how to investigate this further from the Cassandra side? -- From: Tamar Fraenkel [ta...@tok-media.com] Sent: Thursday, May 23, 2013 11:58 To: user@cassandra.apache.org Subject: Re: column with TTL of 10 seconds lives very long... Thanks for the response. Running date simultaneously on all nodes (using parallel ssh) shows that they are synced. Tamar On Thu, May 23, 2013 at 12:29 PM, Nikolay Mihaylov n...@nmmm.nu wrote: Did you synchronize the clocks between servers?
On Thu, May 23, 2013 at 9:32 AM, Tamar Fraenkel ta...@tok-media.com wrote: Hi! I have a Cassandra cluster with 3 nodes running version 1.0.11. I am using Hector HLockManagerImpl, which creates a keyspace named HLockManagerImpl and CF HLocks. For some reason I have a row with a single column that should have expired yesterday that is still there. I tried deleting it using cli, but it is stuck... Any ideas how to delete it? Thanks, Tamar
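As a sanity check on the stuck column above, the value and timestamp in the cli output can be decoded offline: the hex value is just the ASCII bytes of the lock-holder UUID, and column timestamps are microseconds since the epoch. A quick illustrative sketch (plain Python, not Cassandra code):

```python
import binascii
from datetime import datetime, timezone

# Hex value from the cli output of the stuck lock column.
hex_value = ("36396230353764342d333537382d343332362d"
             "613964392d633937356362383331366432")

# The value is just the ASCII-encoded holder UUID.
holder = binascii.unhexlify(hex_value).decode("ascii")
print(holder)  # 69b057d4-3578-4326-a9d9-c975cb8316d2

# Column timestamps are microseconds since the epoch, so the write
# (and the 10-second TTL countdown) started here:
ts_micros = 1369307815049000
written = datetime.fromtimestamp(ts_micros / 1e6, tz=timezone.utc)
print(written.isoformat())
```

The decoded timestamp lands on May 23, 2013, i.e. the day of the thread, so the write itself carried a sane timestamp and the column should indeed have expired 10 seconds later.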
Re: remove DC
If you have any data that you wrote to DC2 since the last time you ran repair, you should probably run repair to make sure that data made it over to DC1; if you never wrote data directly to DC2, then you are correct, you don't need to run repair. You should just need to update the schema, and then decommission the node. -Jeremiah On Nov 12, 2012, at 2:25 PM, William Oberman ober...@civicscience.com wrote: There is a great guide here on how to add resources: http://www.datastax.com/docs/1.1/operations/cluster_management#adding-capacity What about deleting resources? I'm thinking of removing a data center. Clearly I'd need to change strategy options, which are currently something like this: {DC1:3,DC2:1} to: {DC1:3}. But after that change, I'm wondering if anything else needs to happen? All of the data in DC1 is already in the correct spots, so I don't think I have to run repair or cleanup... will
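The sequence Jeremiah describes could look roughly like this in the cli and nodetool (keyspace name and host are placeholders; syntax is the 1.0/1.1-era cli form used elsewhere in this thread):

```
[default@unknown] update keyspace MyKeyspace
    with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
    and strategy_options = {DC1:3};

$ nodetool -h <dc2-node> repair        # only if DC2 ever took writes directly
$ nodetool -h <dc2-node> decommission  # then retire the DC2 node
```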
Re: CREATE COLUMNFAMILY
That is fine. You just have to be careful that you haven't already inserted data which would be rejected by the type you update to, as a client will have issues reading that data back. -Jeremiah On Nov 11, 2012, at 4:09 PM, Kevin Burton rkevinbur...@charter.net wrote: What happens when you are mainly concerned about the human readable formats? Say initially you don't supply metadata for a key like foo in the column family, but you get tired of seeing binary data displayed for the values, so you update the column family to get a more human readable format by adding metadata for foo. Will this work? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Sunday, November 11, 2012 3:39 PM To: user@cassandra.apache.org Subject: Re: CREATE COLUMNFAMILY Also most idiomatic clients use the information so they can return the appropriate type to you. Can the metadata be applied after the fact? If so how? UPDATE COLUMN FAMILY in the CLI will let you change it. Note that we do not update the existing data. This can be a problem if you do something like change a variable length integer to a fixed length one. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 12/11/2012, at 8:06 AM, Kevin Burton rkevinbur...@charter.net wrote: Thank you, this helps with my understanding. So the goal here is to supply as many name/type pairs as can reasonably be foreseen when the column family is created? Can the metadata be applied after the fact? If so how? -----Original Message----- From: Edward Capriolo [mailto:edlinuxg...@gmail.com] Sent: Sunday, November 11, 2012 9:37 AM To: user@cassandra.apache.org Subject: Re: CREATE COLUMNFAMILY If you supply metadata cassandra can use it for several things.
1) It validates data on insertion 2) Helps display the information in human readable formats in tools like the CLI and sstable2json 3) If you add a built-in secondary index the type information is needed; strings sort differently than integers 4) Columns in rows are sorted by the column name; strings sort differently than integers On Sat, Nov 10, 2012 at 11:55 PM, Kevin Burton rkevinbur...@charter.net wrote: I am sure this has been asked before, but what is the purpose of entering key/value or, more correctly, key name/data type values on the CREATE COLUMNFAMILY command.
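Concretely, the after-the-fact change Aaron describes looks like this in the cli (column family and column names here are hypothetical):

```
[default@Keyspace1] update column family Users
    with column_metadata = [{column_name: foo, validation_class: UTF8Type}];
```

Existing values of foo are not rewritten; only new writes are validated, which is why data that doesn't match the new type can still break reads.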
Re: NetworkTopologyStrategy with 1 node
What is the output of nodetool ring? Does the cluster actually think your node is in DC1? -Jeremiah On May 26, 2012, at 6:36 AM, Cyril Auburtin wrote: I get the same issue on Cassandra 1.1: create keyspace ks with strategy_class = 'NetworkTopologyStrategy' AND strategy_options = {DC1:1}; then for example [default@ks] create column family rr WITH key_validation_class=UTF8Type and comparator = UTF8Type and column_metadata = [{column_name: boo, validation_class: UTF8Type}]; 5c6d0b86-86f2-3444-8335-fe4bdaa4745d Waiting for schema agreement... ... schemas agree across the cluster [default@ks] set rr['1']['boo'] = '1'; null UnavailableException() at org.apache.cassandra.thrift.Cassandra$insert_result.read(Cassandra.java:15898) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.cassandra.thrift.Cassandra$Client.recv_insert(Cassandra.java:788) at org.apache.cassandra.thrift.Cassandra$Client.insert(Cassandra.java:772) at org.apache.cassandra.cli.CliClient.executeSet(CliClient.java:896) at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:213) at org.apache.cassandra.cli.CliMain.processStatementInteractive(CliMain.java:219) at org.apache.cassandra.cli.CliMain.main(CliMain.java:346) 2012/5/26 Cyril Auburtin cyril.aubur...@gmail.com thx, but it still does not work. I did: update keyspace ks with strategy_options = [{DC1:1}] and placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'; then in cassandra-cli: [default@ks] list Position; Using default limit of 100 Internal error processing get_range_slices and in the cassandra console: INFO 11:10:52,680 Keyspace updated. Please perform any manual operations.
ERROR 11:13:37,565 Internal error processing get_range_slices java.lang.IllegalStateException: datacenter (DC1) has no more endpoints, (1) replicas still needed at org.apache.cassandra.locator.NetworkTopologyStrategy.calculateNaturalEndpoints(NetworkTopologyStrategy.java:118) at org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:101) at org.apache.cassandra.service.StorageService.getLiveNaturalEndpoints(StorageService.java:1538) How do I have to set cassandra-topology.properties for a single node in this DC? I will try to do the same thing with C1.1, it could work 2012/5/26 Edward Capriolo edlinuxg...@gmail.com replication_factor = 1 and strategy_options = [{DC1:0}] You should not be setting both of these. All you should need is: strategy_options = [{DC1:1}] On Fri, May 25, 2012 at 1:47 PM, Cyril Auburtin cyril.aubur...@gmail.com wrote: I was using a single node, on cassandra 0.7.10 with placement strategy = SimpleStrategy and replication factor = 1; everything was fine. I was using a consistency level of ONE for reading/writing. I updated the keyspace with: update keyspace Mymed with replication_factor = 1 and strategy_options = [{DC1:0}] and placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'; with conf/cassandra-topology.properties having just this for the moment: default=DC1:r1 The keyspace could update, and I could use ks; also, but I can't read anything, even from Thrift using ConsistencyLevel.ONE; it will complain that this strategy requires Quorum. I tried with ConsistencyLevel.LOCAL_QUORUM, but get an exception like: org.apache.thrift.TApplicationException: Internal error processing get_slice and in the cassandra console: DEBUG 19:45:02,013 Command/ConsistencyLevel is SliceFromReadCommand(table='Mymed', key='637972696c2e617562757274696e40676d61696c2e636f6d', column_parent='QueryPath(columnFamilyName='Authentication',
superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100)/LOCAL_QUORUM ERROR 19:45:02,014 Internal error processing get_slice java.lang.NullPointerException at org.apache.cassandra.locator.NetworkTopologyStrategy.getReplicationFactor(NetworkTopologyStrategy.java:139) at org.apache.cassandra.service.DatacenterReadCallback.determineBlockFor(DatacenterReadCallback.java:83) at org.apache.cassandra.service.ReadCallback.init(ReadCallback.java:77) at org.apache.cassandra.service.DatacenterReadCallback.init(DatacenterReadCallback.java:48) at org.apache.cassandra.service.StorageProxy.getReadCallback(StorageProxy.java:461) at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:326) at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:291) So I guess NetworkTopologyStrategy can't work with just one node? thx for any feedback
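On the cassandra-topology.properties question: with PropertyFileSnitch, the node's own IP has to map to the datacenter named in strategy_options, otherwise DC1 really does have "no more endpoints". A minimal sketch for one node (the IP is a placeholder):

```
# cassandra-topology.properties - assumed layout for a single node in DC1
192.168.1.10=DC1:r1
# fallback for any node not listed above
default=DC1:r1
```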
RE: understanding of native indexes: limitations, potential side effects,...
The limitation is because the number of columns could be equal to the number of rows. If the number of rows is large this can become an issue. -Jeremiah From: David Vanderfeesten [feest...@gmail.com] Sent: Wednesday, May 16, 2012 6:58 AM To: user@cassandra.apache.org Subject: understanding of native indexes: limitations, potential side effects,... Hi, I'd like to better understand the limitations of native indexes, potential side effects, and the scenarios where they are required. My understanding so far: - Indexes on each node store indexes for data held locally on the node itself. - Indexes do not return values in a sorted way (hashes of the indexed row keys define the order). - Given the design referred to in the first bullet, a coordinator node receiving a read of a native index needs to spawn a read to multiple nodes (a set of nodes together covering at least the complete key space + potentially more to assure the read consistency level). - Each write to an indexed column leads to an additional local read of the index to update the index (kind of obvious but easily forgotten when tuning your system for a write-only workload). - When using a WHERE clause in CQL you need to specify at least an equality condition on a natively indexed column. Additional conditions in the WHERE clause are filtered out by the coordinator node receiving the CQL query. - Native indexes do not support very well columns with a high number of discrete values throughout the entire CF. Is the above understanding correct and complete? Some doubts: - About the limitation of indexing columns with a high number of discrete values: I assume native indexes are implemented with an internally managed CF per index. With high-cardinality values, in the worst case, the number of rows in the index is identical to the number of rows of the indexed CF. Or are there other reasons for the limitation, and if that's the case, is there a guideline on the max cardinality that is still reasonable?
- Are column updates and the update of the indexes (read + write action) atomic and isolated from concurrent updates? Txs! David
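David's index-as-a-CF assumption can be pictured with a toy model (plain Python, purely illustrative): each node's native index is roughly a hidden local CF mapping indexed value to row keys, which is why cardinality close to the row count makes the index about as big as the data it indexes.

```python
# Toy model of a per-node native index: value -> set of local row keys.
data = {
    "row1": {"country": "US"},
    "row2": {"country": "UK"},
    "row3": {"country": "US"},
}

index = {}
for row_key, columns in data.items():
    index.setdefault(columns["country"], set()).add(row_key)

# An equality lookup on the indexed column is one index read per node.
print(sorted(index["US"]))  # ['row1', 'row3']

# High-cardinality worst case: every value unique, so the index ends up
# with as many entries as the indexed CF has rows.
unique_index = {cols["country"]: {rk} for rk, cols in data.items()}
```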
Re: Does or will Cassandra support OpenJDK ?
OpenJDK is Java 1.7. Once Cassandra supports Java 1.7 it would most likely work on OpenJDK, as the 1.7 OpenJDK really is the same thing as Oracle JDK 1.7 without some licensed stuff. -Jeremiah On May 11, 2012, at 10:02 PM, ramesh wrote: I've had problems downloading the Sun (Oracle) JDK and found this thread where the Oracle official is insisting, or rather forcing, Linux users to move to OpenJDK. Here is the thread: https://forums.oracle.com/forums/thread.jspa?threadID=2365607 I need this because I run Cassandra. Just curious to know if I would be able to avoid the pain of using the Sun JDK in future for production Cassandra? regards Ramesh
Re: DELETE from table with composite keys
Slice deletes are not supported currently. It is being worked on: https://issues.apache.org/jira/browse/CASSANDRA-3708 -Jeremiah On May 14, 2012, at 12:18 PM, Roland Mechler wrote: I have a table with a 3-part composite key and I want to delete rows based on the first 2 parts of the key. SELECT works using 2 parts of the key, but DELETE fails with the error: Bad Request: Missing mandatory PRIMARY KEY part part3 (see details below). Is there a reason why deleting based on the first 2 parts should not work? I.e., is it just currently not supported, or is it a permanent limitation? Note that deleting based on just the first part of the key will work… it deletes all matching rows. cqlsh:Keyspace1> CREATE TABLE MyTable (part1 text, part2 text, part3 text, data text, PRIMARY KEY(part1, part2, part3)); cqlsh:Keyspace1> INSERT INTO MyTable (part1, part2, part3, data) VALUES ('a', 'b', 'c', 'd'); cqlsh:Keyspace1> SELECT * FROM MyTable WHERE part1 = 'a' AND part2 = 'b'; part1 | part2 | part3 | data -------+-------+-------+------ a | b | c | d cqlsh:Keyspace1> DELETE FROM MyTable WHERE part1 = 'a' AND part2 = 'b'; Bad Request: Missing mandatory PRIMARY KEY part part3 cqlsh:Keyspace1> DELETE data FROM MyTable WHERE part1 = 'a' AND part2 = 'b'; Bad Request: Missing mandatory PRIMARY KEY part part3 cqlsh:Keyspace1> DELETE FROM MyTable WHERE part1 = 'a'; cqlsh:Keyspace1> SELECT * FROM MyTable WHERE part1 = 'a' AND part2 = 'b'; cqlsh:Keyspace1> -Roland
RE: Initial token - newbie question (version 1.0.8)
You have to use nodetool move to change the token after the node has started for the first time. The value in the config file is only used on first startup. Unless you were using RF=3 on your 3-node ring, you can't just start with a new token without using nodetool. You have to do the move so that the data gets put in the right place. How you would do it without nodetool (the dangerous, not smart, easy-to-shoot-yourself-in-the-foot-and-lose-your-data way), if you were RF=3: If you used RF=3, then all nodes should have all data, and you can stop all nodes, remove the system keyspace data, and start up the new cluster with the right stuff in the yaml file (blowing away system means this is like starting a brand new cluster). Then re-create all of your keyspaces/column families and they will pick up the already existing data. Though, if you are RF=3, nodetool move shouldn't be moving anything anyway, so you should just do it the right way and use nodetool. From: Jay Parashar [jparas...@itscape.com] Sent: Wednesday, April 11, 2012 1:44 PM To: user@cassandra.apache.org Subject: Initial token - newbie question (version 1.0.8) I created a 3-node ring with initial_token blank. Of course, as expected, Cassandra generated its own tokens on startup (e.g. tokens X, Y and Z). The nodes of course were not properly balanced, so I did the following steps: 1) stopped all the 3 nodes 2) assigned initial_tokens (A, B, C) respectively 3) restarted the nodes What I find is that the nodes were still using the original tokens (X, Y and Z). Log messages for node 1 show Using saved token X. I could rebalance using nodetool and now the nodes are using the correct tokens. But the question is, why were the new tokens not read from the cassandra.yaml file? Without using nodetool, how do I make it get the token from the yaml file? Where is it saved? Another question: I could not find auto_bootstrap in the yaml file as per the documentation. Where is this param located? Appreciate it.
Thanks in advance Jay
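For reference, the balanced initial_token values for a RandomPartitioner ring are just even slices of the 2**127 token space; a small sketch of the usual calculation:

```python
def balanced_tokens(node_count):
    """Evenly spaced initial tokens for RandomPartitioner (0 .. 2**127)."""
    step = 2 ** 127 // node_count
    return [i * step for i in range(node_count)]

# For the 3-node ring in the question:
for i, tok in enumerate(balanced_tokens(3)):
    print("node %d: initial_token: %d" % (i, tok))
```

Those values go into each node's cassandra.yaml before first start, or into nodetool move for an already-running ring.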
Re: Resident size growth
He says he disabled JNA. You can't mmap without JNA, can you? On Apr 9, 2012, at 4:52 AM, aaron morton wrote: see http://wiki.apache.org/cassandra/FAQ#mmap Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 9/04/2012, at 5:09 AM, ruslan usifov wrote: mmap sstables? It's normal 2012/4/5 Omid Aladini omidalad...@gmail.com Hi, I'm experiencing a steady growth in resident size of the JVM running Cassandra 1.0.7. I disabled JNA and the off-heap row cache, tested with and without mlockall disabling paging, and upgraded to JRE 1.6.0_31 to prevent this bug [1] from leaking memory. Still the JVM's resident set size grows steadily. A process with Xmx=2048M has grown to 6GB resident size and one with Xmx=8192M to 16GB in a few hours, and increasing. Has anyone experienced this? Any idea how to deal with this issue? Thanks, Omid [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7066129
RE: Write performance compared to Postgresql
So Cassandra may or may not be faster than your current system when you have a couple of connections. Where it is faster, and scales, is when you get hundreds of clients across many nodes. See: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html With 60 clients running 200 threads each they were able to get 10K writes per second per server, and as you added servers from 48-288 you still got 10K writes per second per server, so the aggregate writes per second went from 48*10K to 288*10K. -Jeremiah From: Jeff Williams [je...@wherethebitsroam.com] Sent: Tuesday, April 03, 2012 10:09 AM To: user@cassandra.apache.org Subject: Re: Write performance compared to Postgresql Vitalii, Yep, that sounds like a good idea. Do you have any more information about how you're doing that? Which client? Because even with 3 concurrent client nodes, my single postgresql server is still outperforming my 2-node cassandra cluster, although the gap is narrowing. Jeff On Apr 3, 2012, at 4:08 PM, Vitalii Tymchyshyn wrote: Note that having tons of TCP connections is not good. We are using an async client to issue multiple calls over a single connection at the same time. You can do the same. Best regards, Vitalii Tymchyshyn. 03.04.12 16:18, Jeff Williams wrote: Ok, so you think the write speed is limited by the client and protocol, rather than the cassandra backend? This sounds reasonable, and fits with our use case, as we will have several servers writing. However, a bit harder to test! Jeff On Apr 3, 2012, at 1:27 PM, Jake Luciani wrote: Hi Jeff, Writing serially over one connection will be slower. If you run many threads hitting the server at once you will see throughput improve. Jake On Apr 3, 2012, at 7:08 AM, Jeff Williams je...@wherethebitsroam.com wrote: Hi, I am looking at cassandra for a logging application. We currently log to a Postgresql database. I set up 2 cassandra servers for testing.
I did a benchmark where I had 100 hashes representing log entries, read from a json file. I then looped over these to do 10,000 log inserts. I repeated the same writing to a postgresql instance on one of the cassandra servers. The script is attached. The cassandra writes appear to perform a lot worse. Is this expected? jeff@transcoder01:~$ ruby cassandra-bm.rb cassandra 3.17 0.48 3.65 ( 12.032212) jeff@transcoder01:~$ ruby cassandra-bm.rb postgres 2.14 0.33 2.47 ( 7.002601) Regards, Jeff cassandra-bm.rb
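The serial-loop bottleneck Jake describes has a simple shape: one connection issuing writes one at a time versus many concurrent writers. The sketch below fakes the actual insert (insert_log is a stand-in for a real client call, not any particular Cassandra API), but the thread-pool pattern is the point:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

written = 0
lock = threading.Lock()

def insert_log(entry):
    # Stand-in for a real client insert; a real benchmark would issue a
    # Thrift/CQL write here, each worker on its own connection.
    global written
    with lock:
        written += 1

entries = [{"id": i, "msg": "log line %d" % i} for i in range(10000)]

# Many concurrent writers instead of one serial loop over 10,000 inserts.
with ThreadPoolExecutor(max_workers=50) as pool:
    list(pool.map(insert_log, entries))

print(written)  # 10000
```

With real network round-trips, the serial loop pays per-request latency 10,000 times in a row, while the pool overlaps those waits, which is where the throughput gap comes from.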
RE: Counter Column
Right, it affects every version of Cassandra from 0.8 beta 1 until the Fix Version, which right now is None, so it isn't fixed yet... From: Avi-h [avih...@gmail.com] Sent: Tuesday, April 03, 2012 5:23 AM To: cassandra-u...@incubator.apache.org Subject: Re: Counter Column this bug is for 0.8 beta 1, is it also relevant for 1.0.8? -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Counter-Column-tp7432010p7432450.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
RE: Compression on client side vs server side
The server-side compression can compress across columns/rows so it will most likely be more efficient. Whether you are CPU bound or IO bound depends on your application and node setup. Unless your working set fits in memory you will be IO bound, and in that case server-side compression helps because there is less to read from disk. In many cases it is actually faster to read a compressed file from disk and decompress it than to read an uncompressed file from disk. See Ed's post: Cassandra compression is like more servers for free! http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/cassandra_compression_is_like_getting From: benjamin.j.mcc...@gmail.com [benjamin.j.mcc...@gmail.com] on behalf of Ben McCann [b...@benmccann.com] Sent: Monday, April 02, 2012 10:42 AM To: user@cassandra.apache.org Subject: Compression on client side vs server side Hi, I was curious, if I compress my data on the client side with Snappy, whether there's any difference between doing that and doing it on the server side? The wiki said that compression works best where each row has the same columns. Does this mean the compression will be more efficient on the server side, since it can look at multiple rows at once instead of only the row being inserted? The reason I was thinking about possibly doing it client side was that it would save CPU on the datastore machine. However, does this matter? Is CPU typically the bottleneck on a machine or is it some other resource? (Of course this will vary for each person, but wondering if there's a rule of thumb. I'm making a web app, which hopefully will store about 5TB of data and have 10s of millions of page views per month.) Thanks, Ben
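The cross-row advantage can be sketched with zlib (standing in for Snappy here): compressing many similar rows in one block beats compressing each row alone on the client, because the shared column names and repeated values only have to be encoded once per block:

```python
import zlib

# 100 rows that share the same column names, as in a typical CF.
rows = [
    ('{"user_id": %d, "country": "US", "status": "active"}' % i).encode()
    for i in range(100)
]

# Client-side style: each row compressed on its own before insert.
per_row_total = sum(len(zlib.compress(r)) for r in rows)

# Server-side style: one compression context across many rows,
# roughly what happens inside an sstable block.
whole_block = len(zlib.compress(b"".join(rows)))

print(per_row_total, whole_block)
```

Running this, the single cross-row block comes out far smaller than the sum of the per-row outputs, which is the wiki's "works best where each row has the same columns" point in miniature.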
Re: data size difference between supercolumn and regular column
Is that 80% with compression? If not, the first thing to do is turn on compression. Cassandra doesn't behave well when it runs out of disk space. You really want to try to stay around 50%; 60-70% works, but only if it is spread across multiple column families, and even then you can run into issues when doing repairs. -Jeremiah On Apr 1, 2012, at 9:44 PM, Yiming Sun wrote: Thanks Aaron. Well, I guess it is possible the data files from supercolumns could have been reduced in size after compaction. This brings yet another question. Say I am on a shoestring budget and can only put together a cluster with very limited storage space. The first iteration of pushing data into cassandra would drive the disk usage up into the 80% range. As time goes by, there will be updates to the data, and many columns will be overwritten. If I just push the updates in, the disks will run out of space on all of the cluster nodes. What would be the best way to handle such a situation if I cannot buy larger disks? Do I need to delete the rows/columns that are going to be updated, do a compaction, and then insert the updates? Or is there a better way? Thanks -- Y. On Sat, Mar 31, 2012 at 3:28 AM, aaron morton aa...@thelastpickle.com wrote: does cassandra 1.0 perform some default compression? No. The on-disk size depends to some degree on the workload. If there are a lot of overwrites or deletes you may have rows/columns that need to be compacted. You may have some big old SSTables that have not been compacted for a while. There is some overhead involved in the super columns: the super col name, length of the name and the number of columns. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 29/03/2012, at 9:47 AM, Yiming Sun wrote: Actually, after I read an article on cassandra 1.0 compression just now (http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression), I am more puzzled.
In our schema, we didn't specify any compression options -- does cassandra 1.0 perform some default compression? Or is the data reduction purely because of the schema change? Thanks. -- Y. On Wed, Mar 28, 2012 at 4:40 PM, Yiming Sun yiming@gmail.com wrote: Hi, We are trying to estimate the amount of storage we need for a production cassandra cluster. While I was doing the calculation, I noticed a very dramatic difference in terms of storage space used by cassandra data files. Our previous setup consists of a single-node cassandra 0.8.x with no replication, and the data is stored using supercolumns, and the data files total about 534GB on disk. A few weeks ago, I put together a cluster consisting of 3 nodes running cassandra 1.0 with a replication factor of 2, and the data is flattened out and stored using regular columns. And the aggregated data file size is only 488GB (would be 244GB if no replication). This is a very dramatic reduction in terms of storage needs, and is certainly good news in terms of how much storage we need to provision. However, because of the dramatic reduction, I also would like to make sure it is absolutely correct before submitting it - and also get a sense of why there was such a difference. -- I know cassandra 1.0 does data compression, but does the schema change from supercolumn to regular column also help reduce storage usage? Thanks. -- Y.
RE: Any improvements in Cassandra JDBC driver ?
There is no such thing as a pure insert which will give an error if the row already exists. Everything is really UPDATE OR INSERT. Whether you say UPDATE or INSERT, it will all act like UPDATE OR INSERT: if the thing is there it gets overwritten, if it isn't there it gets inserted. -Jeremiah From: Dinusha Dilrukshi [sdddilruk...@gmail.com] Sent: Wednesday, March 28, 2012 11:41 PM To: user@cassandra.apache.org Subject: Any improvements in Cassandra JDBC driver ? Hi, We are using the Cassandra JDBC driver (found in [1]) to call to the Cassandra server using CQL and JDBC calls. One of the main disadvantages is that this driver is not available in a maven repository where people can publicly access it. Currently we have to check out the source and build it ourselves. Is there any possibility to host this driver in a maven repository? And one of the other limitations of the driver is that it does not support the insert query. If we need to do an insert, then it can be done using the update statement. So basically the same query is used for both UPDATE and INSERT. As an example, if you execute the following query: update USER set 'username'=?, 'password'=? where key = ? and if the provided KEY already exists in the column family then it will do an update of the existing columns. If the provided KEY does not already exist, then it will do an insert. Is the INSERT query option now available in the latest driver? Are there any other improvements/supports added to this driver recently? Is this driver compatible with Cassandra-1.1.0, and will the changes done for the driver be backward compatible with older Cassandra versions (1.0.0)? [1]. http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/ Regards, ~Dinusha~
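The upsert semantics Jeremiah describes behave like plain dictionary assignment; a tiny analogy (plain Python, not driver code):

```python
# Cassandra writes act like dict assignment: there is no "already exists"
# error, and the same statement covers both cases.
table = {}

table["key1"] = {"username": "alice"}  # key absent  -> acts as INSERT
table["key1"] = {"username": "bob"}    # key present -> acts as UPDATE

print(table["key1"]["username"])  # bob
```

This is why the driver can route both INSERT and UPDATE through one code path: the server resolves which case applies, last write wins.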
Re: copy data for dev
If you have the disk space you can just copy all the data files from the snapshot onto the dev node, renaming any with conflicting names. Then bring up the dev node and it should see the data. You can then compact to merge and drop all the duplicate data. You can also use the sstable loader tool to send the snapshot files to the dev node. -Jeremiah On Mar 26, 2012, at 2:13 PM, Deno Vichas wrote: all, is there an easy way to take a 4-node snapshot and restore it on my single node dev cluster? thanks, deno
RE: Network, Compaction, Garbage collection and Cache monitoring in cassandra
You can also use any network/server monitoring tool which can talk to JMX. We are currently using vFabric Hyperic's JMX plugin for this. IIRC there are some cacti and nagios scripts on github for getting the data into those. -Jeremiah From: R. Verlangen [ro...@us2.nl] Sent: Wednesday, March 21, 2012 10:40 AM To: user@cassandra.apache.org Subject: Re: Network, Compaction, Garbage collection and Cache monitoring in cassandra Hi Rishabh, Please take a look at OpsCenter: http://www.datastax.com/products/opscenter It provides most of the details you request for. Good luck! 2012/3/21 Rishabh Agrawal rishabh.agra...@impetus.co.in Hello, Can someone help me with how to proactively monitor Network, Compaction, Garbage collection and Cache use in Cassandra. Regards Rishabh
RE: repair broke TTL based expiration
You need to create the tombstone in case the data was inserted without a timestamp at some point. -Jeremiah From: Radim Kolar [h...@filez.com] Sent: Monday, March 19, 2012 4:48 PM To: user@cassandra.apache.org Subject: Re: repair broke TTL based expiration On 19.3.2012 20:28, i...@4friends.od.ua wrote: Hello Datasize should decrease during minor compactions. Check logs for compaction results. They do, but not as much as I expect. Look at the sizes and file dates: -rw-r--r-- 1 root wheel 5.4G Feb 23 17:03 resultcache-hc-27045-Data.db -rw-r--r-- 1 root wheel 6.4G Feb 23 17:11 resultcache-hc-27047-Data.db -rw-r--r-- 1 root wheel 5.5G Feb 25 06:40 resultcache-hc-27167-Data.db -rw-r--r-- 1 root wheel 2.2G Mar 2 05:03 resultcache-hc-27323-Data.db -rw-r--r-- 1 root wheel 2.0G Mar 5 09:15 resultcache-hc-27542-Data.db -rw-r--r-- 1 root wheel 2.2G Mar 12 23:24 resultcache-hc-27791-Data.db -rw-r--r-- 1 root wheel 468M Mar 15 03:27 resultcache-hc-27822-Data.db -rw-r--r-- 1 root wheel 483M Mar 16 05:23 resultcache-hc-27853-Data.db -rw-r--r-- 1 root wheel 53M Mar 17 05:33 resultcache-hc-27901-Data.db -rw-r--r-- 1 root wheel 485M Mar 17 09:37 resultcache-hc-27930-Data.db -rw-r--r-- 1 root wheel 480M Mar 19 00:45 resultcache-hc-27961-Data.db -rw-r--r-- 1 root wheel 95M Mar 19 09:35 resultcache-hc-27967-Data.db -rw-r--r-- 1 root wheel 98M Mar 19 17:04 resultcache-hc-27973-Data.db -rw-r--r-- 1 root wheel 19M Mar 19 18:23 resultcache-hc-27974-Data.db -rw-r--r-- 1 root wheel 19M Mar 19 19:50 resultcache-hc-27975-Data.db -rw-r--r-- 1 root wheel 19M Mar 19 21:17 resultcache-hc-27976-Data.db -rw-r--r-- 1 root wheel 19M Mar 19 22:05 resultcache-hc-27977-Data.db I insert everything with a 7-day TTL + 10-day tombstone expiration. This means that in the ideal case there should be nothing older than Mar 2. These 3 x 5 GB files wait to be compacted.
Because they contain only tombstones, Cassandra should make some optimizations: mark the sstable as tombstone-only, remember the time of its latest tombstone, and delete the entire sstable without needing to merge it first. 1. Question: why create a tombstone after row expiration at all, when the row will expire on all cluster nodes at the same time without needing to be deleted? 2. It's a super column family. When I dump the oldest sstable, I wonder why it looks like this: { 772c61727469636c65736f61702e636f6d: {}, 7175616b652d34: {1: {deletedAt: -9223372036854775808, subColumns: [[crc32,4f34455c,1328220892597002,d], [id,4f34455c,1328220892597000,d], [name,4f34455c,1328220892597001,d], [size,4f34455c,1328220892597003,d]]}, 2: {deletedAt: -9223372036854775808, subColumns: [[crc32,4f34455c,1328220892597007,d], [id,4f34455c,1328220892597005,d], [name,4f34455c,1328220892597006,d], [size,4f34455c,1328220892597008,d]]}, 3: {deletedAt: -9223372036854775808, subColumns: * All subcolumns are deleted, so why keep their names in the table? Isn't marking the column as deleted, i.e. 1: {deletedAt: -9223372036854775808}, enough? * Another question: why was the entire row not tombstoned, since all its members had expired?
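The expiry timeline the poster expects can be sketched with simple date arithmetic, using his stated numbers (a 7-day TTL plus a 10-day tombstone grace period) as assumptions:

```python
from datetime import datetime, timedelta

# A minimal sketch of the expiry timeline described above, using the poster's
# numbers as assumptions: 7-day TTL plus 10-day tombstone grace period.
TTL = timedelta(days=7)
GC_GRACE = timedelta(days=10)

def earliest_purge(insert_time: datetime) -> datetime:
    # A column written at insert_time expires at insert_time + TTL; its
    # tombstone only becomes purgeable GC_GRACE later, and even then only at
    # the next compaction that includes the sstable holding it.
    return insert_time + TTL + GC_GRACE

written = datetime(2012, 2, 23)
print(earliest_purge(written))  # 2012-03-11 00:00:00
```

This matches the complaint in the thread: anything written before late February should be purgeable by mid-March, provided compaction actually visits those sstables.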
RE: Hector counter question
No, Cassandra doesn't support atomic counters. IIRC it is on the list of things for 1.2. -Jeremiah From: Tamar Fraenkel [ta...@tok-media.com] Sent: Monday, March 19, 2012 1:26 PM To: cassandra-u...@incubator.apache.org Subject: Hector counter question Hi! Is there a way to read and increment counter column atomically, something like incrementAndGet (Hector)? Thanks, Tamar Fraenkel Senior Software Engineer, TOK Media ta...@tok-media.com Tel: +972 2 6409736 Mob: +972 54 8356490 Fax: +972 2 5612956
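Why a client-side read-then-increment is not an atomic incrementAndGet can be shown with a deterministic interleaving over a toy in-memory store (the store and its read/write helpers are hypothetical, not Hector API):

```python
# Deterministic illustration of why client-side read-then-increment is not
# atomic: two clients read the same value before either writes back, and one
# increment is lost. The "store" here is a hypothetical stand-in, not Hector.
store = {"counter": 0}

def read(col):
    return store[col]

def write(col, value):
    store[col] = value

# Interleaving: both clients read before either writes.
a = read("counter")      # client A sees 0
b = read("counter")      # client B also sees 0
write("counter", a + 1)  # client A writes 1
write("counter", b + 1)  # client B also writes 1
print(store["counter"])  # 1, not 2: one increment was lost
```

Cassandra's counter columns avoid the lost update, but as the reply says, they do not return the new value atomically the way incrementAndGet would.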
RE: 0.8.1 Vs 1.0.7
I would guess more aggressive compaction settings, did you update rows or insert some twice? If you run major compaction a couple times on the 0.8.1 cluster does the data size get smaller? You can use the describe command to check if compression got turned on. -Jeremiah From: Ravikumar Govindarajan [ravikumar.govindara...@gmail.com] Sent: Thursday, March 15, 2012 4:41 AM To: user@cassandra.apache.org Subject: 0.8.1 Vs 1.0.7 Hi, I ran some data import tests for cassandra 0.8.1 and 1.0.7. The results were a little bit surprising 0.8.1, SimpleStrategy, Rep_Factor=3,QUORUM Writes, RP, SimpleSnitch XXX.XXX.XXX.A datacenter1 rack1 Up Normal 140.61 GB 12.50% XXX.XXX.XXX.B datacenter1 rack1 Up Normal 139.92 GB 12.50% XXX.XXX.XXX.C datacenter1 rack1 Up Normal 138.81 GB 12.50% XXX.XXX.XXX.D datacenter1 rack1 Up Normal 139.78 GB 12.50% XXX.XXX.XXX.E datacenter1 rack1 Up Normal 137.44 GB 12.50% XXX.XXX.XXX.F datacenter1 rack1 Up Normal 138.48 GB 12.50% XXX.XXX.XXX.G datacenter1 rack1 Up Normal 140.52 GB 12.50% XXX.XXX.XXX.H datacenter1 rack1 Up Normal 145.24 GB 12.50% 1.0.7, NTS, Rep_Factor{DC1:3, DC2:2}, LOCAL_QUORUM writes, RP [DC2 m/c yet to join ring], PropertyFileSnitch XXX.XXX.XXX.A DC1 RAC1 Up Normal 48.72 GB 12.50% XXX.XXX.XXX.B DC1 RAC1 Up Normal 51.23 GB 12.50% XXX.XXX.XXX.C DC1 RAC1 Up Normal 52.4GB 12.50% XXX.XXX.XXX.D DC1 RAC1 Up Normal 49.64 GB 12.50% XXX.XXX.XXX.E DC1 RAC1 Up Normal 48.5GB 12.50% XXX.XXX.XXX.F DC1 RAC1 Up Normal53.38 GB 12.50% XXX.XXX.XXX.G DC1 RAC1 Up Normal 51.11 GB 12.50% XXX.XXX.XXX.H DC1 RAC1 Up Normal 53.36 GB 12.50% There seems to be 3X savings in size for the same dataset running 1.0.7. I have not enabled compression for any of the CFs. Will it be enabled by default when creating a new CF in 1.0.7? cassandra.yaml is also mostly identical. Thanks and Regards, Ravi
RE: Composite keys and range queries
Right, so until the new CQL stuff exists to actually query with something smart enough to know about composite keys , You have to define and query on your own. Row Key = UUID Column = CompositeColumn(string, string) You want to then use COLUMN slicing, not row ranges to query the data. Where you slice in priority as the first part of a Composite Column Name. See the Under the hood and historical notes section of the blog post. You want to layout your data per the Physical representation of the denormalized timeline rows diagram. Where your UUID is the user_id from the example, and your priority is the tweet_id -Jeremiah From: John Laban [j...@pagerduty.com] Sent: Wednesday, March 14, 2012 12:37 PM To: user@cassandra.apache.org Subject: Re: Composite keys and range queries Hmm, now I'm really confused. This may be of use to you http://www.datastax.com/dev/blog/schema-in-cassandra-1-1 This article is what I actually used to come up with my schema here. In the Clustering, composite keys, and more section they're using a schema very similarly to how I'm trying to use it. They define a composite key with two parts, expecting the first part to be used as the partition key and the second part to be used for ordering. The hash for (uuid-1 , p1) may be 100 and the hash for (uuid-1, p2) may be 1 . Why? Shouldn't only uuid-1 be used as the partition key? (So shouldn't those two hash to the same location?) I'm thinking of using supercolumns for this instead as I know they'll work (where the row key is the uuid and the supercolumn name is the priority), but aren't composite row keys supposed to essentially replace the need for supercolumns? Thanks, and sorry if I'm getting this all wrong, John On Wed, Mar 14, 2012 at 12:52 AM, aaron morton aa...@thelastpickle.commailto:aa...@thelastpickle.com wrote: You are seeing this http://wiki.apache.org/cassandra/FAQ#range_rp The hash for (uuid-1 , p1) may be 100 and the hash for (uuid-1, p2) may be 1 . You cannot do what you want to. 
Even if you passed a start of (uuid1,empty) and no finish, you would not only get rows where the key starts with uuid1. This may be of use to you http://www.datastax.com/dev/blog/schema-in-cassandra-1-1 Or you can store all the priorities that are valid for an ID in another row. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 14/03/2012, at 1:05 PM, John Laban wrote: Forwarding to the Cassandra mailing list as well, in case this is more of an issue on how I'm using Cassandra. Am I correct to assume that I can use range queries on composite row keys, even when using a RandomPartitioner, if I make sure that the first part of the composite key is fixed? Any help would be appreciated, John On Tue, Mar 13, 2012 at 12:15 PM, John Laban j...@pagerduty.commailto:j...@pagerduty.com wrote: Hi, I have a column family that uses a composite key: (ID, priority) - ... Where the ID is a UUID and the priority is an integer. I'm trying to perform a range query now: I want all the rows where the ID matches some fixed UUID, but within a range of priorities. This is supported even if I'm using a RandomPartitioner, right? (Because the first key in the composite key is the partition key, and the second part of the composite key is automatically ordered?) So I perform a range slices query: val rangeQuery = HFactory.createRangeSlicesQuery(keyspace, new CompositeSerializer, StringSerializer.get, BytesArraySerializer.get) rangeQuery.setColumnFamily(RouteColumnFamilyName). setKeys( new Composite(id, priorityStart), new Composite(id, priorityEnd) ). setRange( null, null, false, Int.MaxValue ) But I get this error: me.prettyprint.hector.api.exceptions.HInvalidRequestException: InvalidRequestException(why:start key's md5 sorts after end key's md5. this is not allowed; you probably should not specify end key at all, under RandomPartitioner) Shouldn't they have the same md5, since they have the same partition key? 
Am I using the wrong query here, or does Hector not support composite range queries, or am I making some mistake in how I think Cassandra's composite keys work? Thanks, John
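The layout Jeremiah describes (row key = UUID, column name = a composite whose first component is the priority) can be sketched with an in-memory sorted column list; the point is that the range query is a column slice inside one row, not a row range across the ring:

```python
import bisect

# Sketch of the layout described above: one row per UUID, columns named by a
# (priority, id) composite, kept in sorted order as Cassandra would store them.
# Row key, priorities, and ids here are illustrative assumptions.
row = {"uuid-1": [((1, "a"), "v1"), ((2, "b"), "v2"),
                  ((2, "c"), "v3"), ((5, "d"), "v4")]}

def slice_by_priority(key, lo, hi):
    # A column slice: all columns whose first composite component falls in
    # [lo, hi]. Tuple ordering makes (lo,) sort before any (lo, id) pair.
    cols = row[key]
    names = [c[0] for c in cols]
    start = bisect.bisect_left(names, (lo,))
    end = bisect.bisect_left(names, (hi + 1,))
    return cols[start:end]

print(slice_by_priority("uuid-1", 2, 5))
# [((2, 'b'), 'v2'), ((2, 'c'), 'v3'), ((5, 'd'), 'v4')]
```

This is why the Hector call fails: setKeys() describes a row range, which RandomPartitioner orders by md5, while the priority range has to be expressed as a column slice within the single row keyed by the UUID.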
Re: Schema change causes exception when adding data
That is the best one I have found. On 03/01/2012 03:12 PM, Tharindu Mathew wrote: There are 2. I'd like to wait till there are one, when I insert the value. Going through the code, calling client.describe_schema_versions() seems to give a good answer to this. And I discovered that if I wait till there is only 1 version, I will not get this error. Is this the best practice if I want to check this programatically? On Thu, Mar 1, 2012 at 11:15 PM, aaron morton aa...@thelastpickle.com mailto:aa...@thelastpickle.com wrote: use describe cluster in the CLI to see how many schema versions there are. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 2/03/2012, at 12:25 AM, Tharindu Mathew wrote: On Thu, Mar 1, 2012 at 11:47 AM, Tharindu Mathew mcclou...@gmail.com mailto:mcclou...@gmail.com wrote: Jeremiah, Thanks for the reply. This is what we have been doing, but it's not reliable as we don't know a definite time that the schema would get replicated. Is there any way I can know for sure that changes have propagated? [Edit: corrected to a question] Then I can block the insertion of data until then. On Thu, Mar 1, 2012 at 4:33 AM, Jeremiah Jordan jeremiah.jor...@morningstar.com mailto:jeremiah.jor...@morningstar.com wrote: The error is that the specified colum family doesn’t exist. If you connect with the CLI and describe the keyspace does it show up? Also, after adding a new column family programmatically you can’t use it immediately, you have to wait for it to propagate. You can use calls to describe schema to do so, keep calling it until every node is on the same schema. -Jeremiah *From:*Tharindu Mathew [mailto:mcclou...@gmail.com mailto:mcclou...@gmail.com] *Sent:* Wednesday, February 29, 2012 8:27 AM *To:* user *Subject:* Schema change causes exception when adding data Hi, I have a 3 node cluster and I'm dynamically updating a keyspace with a new column family. 
Then, when I try to write records to it I get the following exception shown at [1]. How do I avoid this. I'm using Hector and the default consistency level of QUORUM is used. Cassandra version 0.7.8. Replication Factor is 1. How can I solve my problem? [1] - me.prettyprint.hector.api.exceptions.HInvalidRequestException: InvalidRequestException(why:unconfigured columnfamily proxySummary) at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) at me.prettyprint.cassandra.service.KeyspaceServiceImpl$10.execute(KeyspaceServiceImpl.java:397) at me.prettyprint.cassandra.service.KeyspaceServiceImpl$10.execute(KeyspaceServiceImpl.java:383) at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:101) at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:156) at me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:129) at me.prettyprint.cassandra.service.KeyspaceServiceImpl.multigetSlice(KeyspaceServiceImpl.java:401) at me.prettyprint.cassandra.model.thrift.ThriftMultigetSliceQuery$1.doInKeyspace(ThriftMultigetSliceQuery.java:67) at me.prettyprint.cassandra.model.thrift.ThriftMultigetSliceQuery$1.doInKeyspace(ThriftMultigetSliceQuery.java:59) at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20) at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:72) at me.prettyprint.cassandra.model.thrift.ThriftMultigetSliceQuery.execute(ThriftMultigetSliceQuery.java:58) -- Regards, Tharindu blog: http://mackiemathew.com/ -- Regards, Tharindu blog: http://mackiemathew.com/ -- Regards, Tharindu blog: http://mackiemathew.com/ -- Regards, Tharindu blog: http://mackiemathew.com/
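The "keep calling describe_schema_versions until every node is on the same schema" advice from the thread can be sketched as a polling loop; the client object below is a hypothetical stand-in for a Thrift client whose describe_schema_versions() maps schema-version ids to endpoint lists:

```python
import time

# Sketch of the schema-agreement wait described above. The client is a
# hypothetical stand-in; with a real Thrift client, describe_schema_versions()
# returns a map of schema-version id -> list of node endpoints.
def wait_for_schema_agreement(client, timeout=30.0, poll=0.5):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        versions = client.describe_schema_versions()
        if len(versions) == 1:  # every node reports the same schema version
            return True
        time.sleep(poll)
    return False

class FakeClient:
    """Stand-in cluster that converges on the third poll."""
    def __init__(self):
        self.calls = 0
    def describe_schema_versions(self):
        self.calls += 1
        if self.calls < 3:
            return {"v1": ["10.0.0.1"], "v2": ["10.0.0.2", "10.0.0.3"]}
        return {"v2": ["10.0.0.1", "10.0.0.2", "10.0.0.3"]}

print(wait_for_schema_agreement(FakeClient(), timeout=5.0, poll=0.01))  # True
```

Blocking inserts until this returns True is exactly the fix Tharindu arrived at in the thread.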
Re: Adding a second datacenter
You need to make sure your clients are reading using LOCAL_* settings so that they don't try to get data from the other data center. But you shouldn't get errors while replication_factor is 0. Once you change the replication factor to 4, you should get missing data if you are using LOCAL_* for reading. What version are you using? See the IRC logs at the beginning of this JIRA discussion thread for some info: https://issues.apache.org/jira/browse/CASSANDRA-3483 But you should be able to: 1. Set dc2:0 in the replication_factor. 2. Set bootstrap to false on the new nodes. 3. Start all of the new nodes. 4. Change replication_factor to dc2:4. 5. Run repair on the nodes in dc2. Once the repairs finish you should be able to start using DC2. You are still going to need a bunch of extra space because the repair is going to get you a couple copies of the data. Once 1.1 comes out it will have new nodetool commands for making this a little nicer per CASSANDRA-3483 -Jeremiah On 03/05/2012 09:42 AM, David Koblas wrote: Everything that I've read about data centers focuses on setting things up at the beginning of time. I have the following situation: 10 machines in a datacenter (DC1), with a replication factor of 2. I want to set up a second data center (DC2) with the following configuration: 20 machines with a replication factor of 4. What I've found is that if I initially start adding things, the first machine to join the network attempts to replicate all of the data from DC1 and fills up its disk drive. I've played with setting the storage_options to have a replication factor of 0; then I can bring up all 20 machines in DC2, but then I start getting a huge number of read errors from reads on DC1. Is there a simple cookbook on how to add a second DC? I'm currently trying to set the replication factor to 1 and do a repair, but that doesn't feel like the right approach. Thanks,
Re: Rationale behind incrementing all tokens by one in a different datacenter (was: running two rings on the same subnet)
There is a requirement that all nodes have a unique token. There is still one global cluster/ring that each node needs to be unique on. The logically separate rings that NetworkTopologyStrategy puts them into are hidden from the rest of the code. -Jeremiah On 03/05/2012 05:13 AM, Hontvári József Levente wrote: I am thinking about the frequent example: dc1 - node1: 0 dc1 - node2: large...number dc2 - node1: 1 dc2 - node2: large...number + 1 In theory using the same tokens in dc2 as in dc1 does not significantly affect key distribution; specifically, the two keys on the border will move to the next node, but that is not much. However it seems that there is an unexplained requirement (at least I could not find an explanation) that all nodes must have a unique token, even if they are put into a different circle by NetworkTopologyStrategy. On 2012.03.05. 11:48, aaron morton wrote: Moreover all tokens must be unique (even across datacenters), although - from pure curiosity - I wonder what is the rationale behind this. Otherwise data is not evenly distributed.
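The "increment by one per datacenter" scheme from the example can be sketched directly: evenly spaced RandomPartitioner tokens in DC1, the same tokens plus 1 in DC2, so every token in the single global ring stays unique while per-DC balance is essentially unchanged:

```python
# Sketch of the token-offset scheme from the example above: evenly spaced
# RandomPartitioner tokens in DC1, the same tokens offset by 1 in DC2, so the
# global ring (which is what the uniqueness requirement applies to) has no
# duplicates. Node counts here are illustrative assumptions.
RING = 2 ** 127  # RandomPartitioner token space

def tokens(num_nodes, dc_offset):
    return [i * RING // num_nodes + dc_offset for i in range(num_nodes)]

dc1 = tokens(2, 0)  # [0, 2**126]
dc2 = tokens(2, 1)  # [1, 2**126 + 1]
all_tokens = dc1 + dc2
assert len(set(all_tokens)) == len(all_tokens)  # unique across the whole cluster
print(dc1, dc2)
```

The offset of 1 moves only the keys that hash exactly onto the boundary tokens, which is the "not much" the original poster observed.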
Re: unidirectional communication/replication
You might check out some of the stuff Netflix does with their Cassandra backup, and Cassandra ETL tools.: http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html http://techblog.netflix.com/2012/02/announcing-priam.html -Jeremiah On 02/29/2012 11:04 AM, Alexandru Sicoe wrote: On Sun, Feb 26, 2012 at 8:24 PM, aaron morton aa...@thelastpickle.com mailto:aa...@thelastpickle.com wrote: All nodes in the cluster need two way communication. Nodes need to talk to Gossip to each other so they know they are alive. If you need to dump a lot of data consider the Hadoop integration. http://wiki.apache.org/cassandra/HadoopSupport It can run a bit faster than going through the thrift api. Thanks for the suggestion, I will look into it. Copying sstables may be another option depending on the data size. The problem with this is that the SSTable, from what I understand, is per CF, Since I will want to do a semi real time replication of just the latest data added this won't work because I will be copying over all the data in the CF. Cheers, A Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 25/02/2012, at 3:21 AM, Alexandru Sicoe wrote: Hello everyone, I'm battling with this contraint that I have: I need to regularly ship out timeseries data from a Cassandra cluster that sits within an enclosed network, outside of the network. I tried to select all the data within a certian time window, writing to a file, and then copying the file out but this hits the I/O performance because even for a small time window (say 5mins) I am hitting more than a million rows. It would really help if I used Cassandra to replicate the data automatically outside. The problem is they will only allow me to have outbound traffic out of the enclosed network (not inbound). 
Is there any way to configure the cluster or have 2 data centers in such a way that the data center (node or cluster) outside of the enclosed network only gets a replica of the data, without ever needing to communicate anything back? I appreciate the help, Alex
RE: Schema change causes exception when adding data
The error is that the specified column family doesn't exist. If you connect with the CLI and describe the keyspace, does it show up? Also, after adding a new column family programmatically you can't use it immediately; you have to wait for it to propagate. You can use calls to describe schema to do so: keep calling it until every node is on the same schema. -Jeremiah From: Tharindu Mathew [mailto:mcclou...@gmail.com] Sent: Wednesday, February 29, 2012 8:27 AM To: user Subject: Schema change causes exception when adding data Hi, I have a 3 node cluster and I'm dynamically updating a keyspace with a new column family. Then, when I try to write records to it I get the exception shown at [1]. How do I avoid this? I'm using Hector and the default consistency level of QUORUM. Cassandra version 0.7.8. Replication factor is 1. How can I solve my problem? [1] - me.prettyprint.hector.api.exceptions.HInvalidRequestException: InvalidRequestException(why:unconfigured columnfamily proxySummary) at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) at me.prettyprint.cassandra.service.KeyspaceServiceImpl$10.execute(KeyspaceServiceImpl.java:397) at me.prettyprint.cassandra.service.KeyspaceServiceImpl$10.execute(KeyspaceServiceImpl.java:383) at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:101) at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:156) at me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:129) at me.prettyprint.cassandra.service.KeyspaceServiceImpl.multigetSlice(KeyspaceServiceImpl.java:401) at me.prettyprint.cassandra.model.thrift.ThriftMultigetSliceQuery$1.doInKeyspace(ThriftMultigetSliceQuery.java:67) at me.prettyprint.cassandra.model.thrift.ThriftMultigetSliceQuery$1.doInKeyspace(ThriftMultigetSliceQuery.java:59) at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20) at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:72) at me.prettyprint.cassandra.model.thrift.ThriftMultigetSliceQuery.execute(ThriftMultigetSliceQuery.java:58) -- Regards, Tharindu blog: http://mackiemathew.com/
Chicago Cassandra Meetup on 3/1 (Preview of my Pycon talk)
I am going to be doing a trial run of my Pycon talk about setting up a development instance of Cassandra and accessing it from Python (Pycassa mostly, some thrift just to scare people off of using thrift) for a Chicago Cassandra Meetup. Anyone in Chicago feel free to come by. The talk is next Thursday, 3/1. See the Meetup listing for full time/place/etc. http://www.meetup.com/Cassandra-Chicago/events/53378712/ If you are going to be at Pycon, I will be presenting on Friday 3/9 @ 2:40. https://us.pycon.org/2012/schedule/presentation/122/ If anyone is interested we could probably get some kind of Cassandra Open Space going as well. I see DataStax is a Pycon sponsor, are you guys planning anything? -Jeremiah
Re: Deleting a column vs setting it's value to empty
Either one works fine. Setting it to an empty value may cause you fewer headaches, as you won't have to deal with tombstones. Deleting a non-existent column is fine. -Jeremiah On 02/10/2012 02:15 PM, Drew Kutcharian wrote: Hi Everyone, Let's say I have the following object which I would like to save in Cassandra: class User { UUID id; //row key String name; //columnKey: name, columnValue: the name of the user String description; //columnKey: description, columnValue: the description of the user } Description can be nullable. What's the best approach when a user updates her description and sets it to null? Should I delete the description column or set it to an empty string? In addition, if I go with the delete-column strategy, since I don't know what the previous value of description was (the column might not even exist), what would happen when I delete a non-existent column? Thanks, Drew
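The trade-off in the reply can be sketched with a toy column store: setting an empty value is an ordinary write, while a delete writes a tombstone that lingers until the grace period has passed and compaction runs (the GC_GRACE value below is an illustrative assumption):

```python
# Sketch contrasting the two updates discussed above with a toy column store.
# GC_GRACE here stands in for gc_grace_seconds and is an assumed toy value.
GC_GRACE = 10

def set_empty(row, col, now):
    row[col] = {"value": "", "tombstone": False, "ts": now}  # ordinary write

def delete(row, col, now):
    # A delete always writes a tombstone, even if the column never existed,
    # which is why deleting a non-existent column is harmless.
    row[col] = {"value": None, "tombstone": True, "ts": now}

def compact(row, now):
    # Tombstones are only dropped once they are older than GC_GRACE.
    return {c: m for c, m in row.items()
            if not (m["tombstone"] and now - m["ts"] > GC_GRACE)}

row = {}
set_empty(row, "description", now=0)
delete(row, "nickname", now=0)       # deleting a column that was never written
print(sorted(compact(row, now=5)))   # tombstone still held: both columns remain
print(sorted(compact(row, now=20)))  # past grace period: only 'description'
```

The empty-value column never carries a tombstone, which is the "fewer headaches" point: reads never have to skip over it, and there is nothing waiting on compaction.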
Re: Cassandra 1.0.6 multi data center question
No, not an issue. The nodes in DC2 know that they aren't supposed to have data, so they go ask the nodes in DC1 for the data to return to you. -Jeremiah On 02/09/2012 05:28 AM, Roshan Pradeep wrote: Thanks Peter for the replies. Previously it was a typing mistake and it should be getting. I checked the DC2 (with having replica 0) and noticed that there is no SSTables created. I use java hector sample program to insert data to the keyspace. After I insert a data item, I 1) Login to one of node in DC having replica count 0 using cassanda-cli. 2) Use the keyspace and list the column family. 3) I can see the data item inserted from DC having replica count 1. Is this a issue? Please clarify. Thanks again. On Thu, Feb 9, 2012 at 6:00 PM, Peter Schuller peter.schul...@infidyne.com mailto:peter.schul...@infidyne.com wrote: Again the *schema* gets propagated and the keyspace will exist everywhere. You should just have exactly zero amount of data for the keyspace in the DC w/o replicas. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Disable Nagle algoritm in thrift i.e. TCP_NODELAY
Should already be on for all of the server side stuff. All of the clients that I have used set it as well. -Jeremiah On 01/26/2012 07:17 AM, ruslan usifov wrote: Hello Is it possible set TCP_NODELAY on thrift socket in cassandra?
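At the socket level, "disable Nagle" just means setting the TCP_NODELAY option, which is what the server and client libraries do for you; a minimal sketch:

```python
import socket

# Minimal sketch of what "disable Nagle's algorithm" means at the socket
# level: setting the TCP_NODELAY option. Per the reply above, Cassandra's
# server-side Thrift sockets and common clients already do this.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay_on = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print(nodelay_on)  # True
sock.close()
```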
Re: Unbalanced cluster with RandomPartitioner
Are you deleting data or using TTL's? Expired/deleted data won't go away until the sstable holding it is compacted. So if compaction has happened on some nodes, but not on others, you will see this. The disparity is pretty big 400Gb to 20GB, so this probably isn't the issue, but with our data using TTL's if I run major compactions a couple times on that column family it can shrink ~30%-40%. -Jeremiah On 01/17/2012 12:51 PM, Marcel Steinbach wrote: We are running regular repairs, so I don't think that's the problem. And the data dir sizes match approx. the load from the nodetool. Thanks for the advise, though. Our keys are digits only, and all contain a few zeros at the same offsets. I'm not that familiar with the md5 algorithm, but I doubt that it would generate 'hotspots' for those kind of keys, right? On 17.01.2012, at 17:34, Mohit Anchlia wrote: Have you tried running repair first on each node? Also, verify using df -h on the data dirs On Tue, Jan 17, 2012 at 7:34 AM, Marcel Steinbach marcel.steinb...@chors.de mailto:marcel.steinb...@chors.de wrote: Hi, we're using RP and have each node assigned the same amount of the token space. The cluster looks like that: Address Status State LoadOwnsToken 205648943402372032879374446248852460236 1 Up Normal 310.83 GB 12.50% 56775407874461455114148055497453867724 2 Up Normal 470.24 GB 12.50% 78043055807020109080608968461939380940 3 Up Normal 271.57 GB 12.50% 99310703739578763047069881426424894156 4 Up Normal 282.61 GB 12.50% 120578351672137417013530794390910407372 5 Up Normal 248.76 GB 12.50% 141845999604696070979991707355395920588 6 Up Normal 164.12 GB 12.50% 163113647537254724946452620319881433804 7 Up Normal 76.23 GB12.50% 184381295469813378912913533284366947020 8 Up Normal 19.79 GB12.50% 205648943402372032879374446248852460236 I was under the impression, the RP would distribute the load more evenly. Our row sizes are 0,5-1 KB, hence, we don't store huge rows on a single node. 
Should we just move the nodes so that the load is more evenly distributed, or is there something off that needs to be fixed first? Thanks Marcel
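Marcel's intuition about md5 is easy to check: even digit-only keys with zeros at fixed offsets hash to tokens spread evenly over the ring, so key shape alone should not create hotspots under RandomPartitioner:

```python
import hashlib

# Quick check of the intuition above: md5 of digit-only keys (with zeros at
# fixed offsets) still spreads evenly over the token space, so the key format
# alone should not create hotspots under RandomPartitioner. The key pattern
# below is an illustrative assumption.
def token(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

RING = 2 ** 128  # md5 output space
keys = ["%08d00" % i for i in range(10000)]  # digit-only keys, trailing zeros
buckets = [0] * 8                            # one bucket per equal-sized range
for k in keys:
    buckets[token(k) * 8 // RING] += 1
print(buckets)  # each bucket close to 10000 / 8 = 1250
```

With the tokens this even, the 20 GB vs 400 GB disparity in the thread has to come from something else, such as compaction lag or a few unusually hot rows, rather than from the hash.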
Re: nodetool ring question
There were some nodetool ring load reporting issues with early version of 1.0.X don't remember when they were fixed, but that could be your issue. Are you using compressed column families, a lot of the issues were with those. Might update to 1.0.7. -Jeremiah On 01/16/2012 04:04 AM, Michael Vaknine wrote: Hi, I have a 4 nodes cluster 1.0.3 version This is what I get when I run nodetool ring Address DC RackStatus State Load OwnsToken 127605887595351923798765477786913079296 10.8.193.87 datacenter1 rack1 Up Normal 46.47 GB 25.00% 0 10.5.7.76 datacenter1 rack1 Up Normal 48.01 GB 25.00% 42535295865117307932921825928971026432 10.8.189.197datacenter1 rack1 Up Normal 53.7 GB 25.00% 85070591730234615865843651857942052864 10.5.3.17 datacenter1 rack1 Up Normal 43.49 GB 25.00% 127605887595351923798765477786913079296 I have finished running repair on all 4 nodes. I have less then 10 GB on the /var/lib/cassandra/data/ folders My question is Why nodetool reports almost 50 GB on each node? Thanks Michael
Re: How to reliably achieve unique constraints with Cassandra?
Correct: any kind of locking in Cassandra requires clocks that are in sync, and requires you to wait out the maximum possible clock skew before reading to check whether you got the lock, to prevent the issue you describe below. There was a pretty detailed discussion of locking with only Cassandra a month or so back on this list. -Jeremiah On 01/06/2012 02:42 PM, Bryce Allen wrote: On Fri, 6 Jan 2012 10:38:17 -0800 Mohit Anchlia mohitanch...@gmail.com wrote: It could be as simple as reading before writing to make sure that email doesn't exist. But I think you are looking at how to handle 2 concurrent requests for the same email? The only way I can think of is: 1) Create a new CF, say tracker 2) Write email and time uuid to CF tracker 3) Read from CF tracker 4) If you find a row other than yours then wait and read again from tracker after a few ms 5) Read from USER CF 6) Write if no rows in USER CF 7) Delete from tracker Please note you might have to modify this logic a little bit, but this should give you some idea of how to approach this problem without locking. Distributed locking is pretty subtle; I haven't seen a correct solution that uses just Cassandra, even with QUORUM read/write. I suspect it's not possible. With the above proposal, in step 4 two processes could both have inserted an entry in the tracker before either gets a chance to check, so you need a way to order the requests. I don't think the timestamp works for ordering, because it's set by the client (even the internal timestamp is set by the client), and will likely be different from when the data is actually committed and available to read by other clients.
For example: * At time 0ms, client 1 starts insert of u...@example.org * At time 1ms, client 2 also starts insert for u...@example.org * At time 2ms, client 2 data is committed * At time 3ms, client 2 reads tracker and sees that it's the only one, so enters the critical section * At time 4ms, client 1 data is committed * At time 5ms, client 2 reads tracker, and sees that is not the only one, but since it has the lowest timestamp (0ms vs 1ms), it enters the critical section. I don't think Cassandra counters work for ordering either. This approach is similar to the Zookeeper lock recipe: http://zookeeper.apache.org/doc/current/recipes.html#sc_recipes_Locks but zookeeper has sequence nodes, which provide a consistent way of ordering the requests. Zookeeper also avoids the busy waiting. I'd be happy to be proven wrong. But even if it is possible, if it involves a lot of complexity and busy waiting it's probably not worth it. There's a reason people are using Zookeeper with Cassandra. -Bryce
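Bryce's interleaving can be replayed deterministically: because client-supplied timestamps do not reflect commit order, both clients pass the "am I first?" check and mutual exclusion is violated:

```python
# Deterministic replay of the interleaving described above. Client timestamps
# do not reflect commit order, so both clients conclude they hold the "lock".
tracker = []  # (client_timestamp, client_id), appended in commit order

def committed_entries():
    return sorted(tracker)  # what a reader sees, ordered by client timestamp

# t=0ms: client 1 starts its insert (timestamp 0) but has not committed yet
# t=1ms: client 2 starts its insert (timestamp 1)
# t=2ms: client 2's write commits first
tracker.append((1, "client2"))
# t=3ms: client 2 reads: it is the only entry, so it enters the critical section
client2_holds_lock = committed_entries() == [(1, "client2")]
# t=4ms: client 1's write finally commits, with the *lower* timestamp
tracker.append((0, "client1"))
# t=5ms: client 1 reads: it has the lowest timestamp, so it also enters
client1_holds_lock = committed_entries()[0] == (0, "client1")
print(client1_holds_lock and client2_holds_lock)  # True: both hold the lock
```

This is exactly why the thread points at ZooKeeper's server-assigned sequence nodes: the ordering has to be assigned at commit time, not by the clients.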
Re: How to reliably achieve unique constraints with Cassandra?
Since a Zookeeper cluster is a quorum based system similar to Cassandra, it only goes down when n/2 nodes go down. And the same way you have to stop writing to Cassandra if N/2 nodes are down (if using QUoRUM), your App will have to wait for the Zookeeper cluster to come online again before it can proceed. On 01/06/2012 12:03 PM, Drew Kutcharian wrote: Hi Everyone, What's the best way to reliably have unique constraints like functionality with Cassandra? I have the following (which I think should be very common) use case. User CF Row Key: user email Columns: userId: UUID, etc... UserAttribute1 CF: Row Key: userId (which is the uuid that's mapped to user email) Columns: ... UserAttribute2 CF: Row Key: userId (which is the uuid that's mapped to user email) Columns: ... The issue is we need to guarantee that no two people register with the same email address. In addition, without locking, potentially a malicious user can hijack another user's account by registering using the user's email address. I know that this can be done using a lock manager such as ZooKeeper or HazelCast, but the issue with using either of them is that if ZooKeeper or HazelCast is down, then you can't be sure about the reliability of the lock. So this potentially, in the very rare instance where the lock manager is down and two users are registering with the same email, can cause major issues. In addition, I know this can be done with other tools such as Redis (use Redis for this use case, and Cassandra for everything else), but I'm interested in hearing if anyone has solved this issue using Cassandra only. Thanks, Drew
Re: How to reliably achieve unique constraints with Cassandra?
By using quorum: one side of the partition may be able to acquire locks, but the other one won't... On 01/06/2012 03:36 PM, Drew Kutcharian wrote: Bryce, I'm not sure about ZooKeeper, but I know if you have a partition between HazelCast nodes, then the nodes can acquire the same lock independently in each divided partition. How does ZooKeeper handle this situation? -- Drew On Jan 6, 2012, at 12:48 PM, Bryce Allen wrote: On Fri, 6 Jan 2012 10:03:38 -0800 Drew Kutcharian d...@venarc.com wrote: I know that this can be done using a lock manager such as ZooKeeper or HazelCast, but the issue with using either of them is that if ZooKeeper or HazelCast is down, then you can't be sure about the reliability of the lock. So this potentially, in the very rare instance where the lock manager is down and two users are registering with the same email, can cause major issues. For most applications, if the lock manager is down, you don't acquire the lock, so you don't enter the critical section. Rather than allowing inconsistency, you become unavailable (at least to writes that require a lock). -Bryce
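The quorum argument here is pure arithmetic: a lock requires a strict majority of nodes, and no split of the cluster can hand a strict majority to both sides:

```python
# Sketch of the quorum argument above: acquiring a lock (or committing a
# write) requires a strict majority of the n nodes, and no network split can
# give a strict majority to both sides at once. n = 5 is an assumed example.
def has_quorum(votes, n):
    return votes > n // 2

n = 5
for side_a in range(n + 1):
    side_b = n - side_a
    # At most one side of any partition can hold a quorum.
    assert not (has_quorum(side_a, n) and has_quorum(side_b, n))
print("no split grants two quorums")
```

This is the property that makes a quorum-based lock manager safe under partition, at the cost of availability: the minority side simply cannot acquire locks until the partition heals.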
Re: Replacing supercolumns with composite columns; Getting the equivalent of retrieving a list of supercolumns by name
Unless you are running into an issue with super columns that makes composite columns a better fit for what you are trying to do, I would just stick with super columns. If it ain't broke, don't fix it. -Jeremiah On 01/03/2012 11:21 PM, Asil Klin wrote: @Stephen: in that case, you can easily tell the names of all columns you want to retrieve, so you can make a query to retrieve that list of composite columns. @Jeremiah, so where is my best bet? Should I leave the supercolumns as they are for now, since I can't find a good way to use them if I replace them with composite columns? On Wed, Jan 4, 2012 at 4:01 AM, Stephen Pope stephen.p...@quest.com wrote: The bonus you're talking about here, how do I apply that? For example, my columns are in the form of number.id, such as 4.steve, 4.greg, 5.steve, 5.george. Is there a way to query a slice of numbers with a list of ids? As in, I want all the columns with numbers between 4 and 10 which have ids steve or greg. Cheers, Steve -Original Message- From: Jeremiah Jordan [mailto:jeremiah.jor...@morningstar.com] Sent: Tuesday, January 03, 2012 3:12 PM To: user@cassandra.apache.org Cc: Asil Klin Subject: Re: Replacing supercolumns with composite columns; Getting the equivalent of retrieving a list of supercolumns by name The main issue with replacing super columns with composite columns right now is that if you don't know all your sub-column names, you can't select multiple super columns' worth of data in the same query without getting extra stuff. You have to use a slice to get all subcolumns of a given super column, and you can't have disjoint slices, so if you want two super columns in full, you have to get all the other stuff that is in between them, or make two queries.
If you know what all of the sub-column names are, you can ask for all of the super/sub column pairs for all of the super columns you want and not get extra data. If you don't need to pull multiple super columns at a time with slices like that, then there isn't really an issue. A bonus of using composite keys like this is that if there is a specific sub-column you want from multiple super columns, you can pull all of those out with a single multiget and you don't have to pull the rest of the columns... So there are pros and cons... -Jeremiah On 01/03/2012 01:58 PM, Asil Klin wrote: I have a super column family which I always use to retrieve a list of supercolumns (with all subcolumns) by name. I am looking to replace all SuperColumns in my schema with composite columns. How could I design the schema so that I could do the equivalent of retrieving a list of supercolumns by name when using composite columns? (As of now I thought of using the supercolumn name as the first component of the composite name and the subcolumn name as the 2nd component.)
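Jeremiah's slice-versus-multiget trade-off can be modeled in a few lines of Python (an illustrative in-memory model with hypothetical row and column names, not real client code): composite column names sort as (super, sub) tuples, a slice is one contiguous range that drags in everything between its endpoints, while an exact-name multiget only works if every (super, sub) pair is known up front.

```python
# Columns sorted by composite name, as Cassandra stores them.
columns = sorted({
    ("user1", "age"): 30, ("user1", "name"): "a",
    ("user2", "age"): 25, ("user2", "name"): "b",
    ("user3", "age"): 40, ("user3", "name"): "c",
}.items())

def slice_range(cols, start, end):
    """One contiguous slice: everything with start <= name <= end."""
    return [(k, v) for k, v in cols if start <= k <= end]

def multiget(cols, names):
    """Exact-name lookup: needs every (super, sub) pair known in advance."""
    wanted = set(names)
    return [(k, v) for k, v in cols if k in wanted]

# Slicing from user1 through user3 also returns all of user2 (6 columns):
full = slice_range(columns, ("user1", ""), ("user3", "\uffff"))
# Knowing the sub-column names lets you skip user2 entirely (4 columns):
exact = multiget(columns, [("user1", "age"), ("user1", "name"),
                           ("user3", "age"), ("user3", "name")])
```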
Re: Replacing supercolumns with composite columns; Getting the equivalent of retrieving a list of supercolumns by name
The main issue with replacing super columns with composite columns right now is that if you don't know all your sub-column names, you can't select multiple super columns' worth of data in the same query without getting extra stuff. You have to use a slice to get all subcolumns of a given super column, and you can't have disjoint slices, so if you want two super columns in full, you have to get all the other stuff that is in between them, or make two queries. If you know what all of the sub-column names are, you can ask for all of the super/sub column pairs for all of the super columns you want and not get extra data. If you don't need to pull multiple super columns at a time with slices like that, then there isn't really an issue. A bonus of using composite keys like this is that if there is a specific sub-column you want from multiple super columns, you can pull all of those out with a single multiget and you don't have to pull the rest of the columns... So there are pros and cons... -Jeremiah On 01/03/2012 01:58 PM, Asil Klin wrote: I have a super column family which I always use to retrieve a list of supercolumns (with all subcolumns) by name. I am looking to replace all SuperColumns in my schema with composite columns. How could I design the schema so that I could do the equivalent of retrieving a list of supercolumns by name when using composite columns? (As of now I thought of using the supercolumn name as the first component of the composite name and the subcolumn name as the 2nd component.)
Re: Newbie question about writer/reader consistency
So you can do this with Cassandra, but you need more logic in your code. Basically, you get the last safe number, M, then get N..M; if there are any gaps, you try again, reading those numbers. As long as you are not overwriting data, and you only update the last safe number after a successful write to Cassandra, you can do this. We currently do something very similar to this for some of our data. -Jeremiah On Dec 26, 2011, at 12:38 PM, Vladimir Mosgalin wrote: Hello everybody. I am a developer of a financial-related application, and I'm currently evaluating various NoSQL databases for our current goal: storing various views which show the state of the system in different aspects after each transaction. The write load seems to be bigger than a typical SQL database would handle without problems - under a test load of tens of transactions per second, each transaction generates changes in a dozen views, which generates hundreds of messages per second total. Each message (change) for each view must be stored, as well as the resulting view (generated as a kind of update of the old view); this means multiple inserts/updates per message, which must go as a single transaction. I started to look into NoSQL databases. I'm a bit puzzled by the guarantees of atomicity and isolation that Cassandra provides, so my question will be about how to attain the required level of consistency in Cassandra (if that is possible at all). I've read various documents and introductions to Cassandra's data model but still can't understand the basics of its data consistency. This discussion http://stackoverflow.com/questions/6033888/cassandra-atomicity-isolation-of-column-updates-on-a-single-row-on-on-single-n makes me feel disappointed about consistency in Cassandra, but I wonder if there is a way to work around it. The requirements are like this. There is one writer, which modifies two tables (I'm sorry for using SQL terms, I just don't want to create more confusion by mapping them into Cassandra terms at this stage).
For the first table, it's a simple insert; the index is a unique SCN which is guaranteed to be larger than the previous one. Let's say it inserts SCN DATA 1 AAA 2 BBB 3 CCC The goal for the client (reader) is to get all the data from SCN N to SCN M without gaps. It is fine if it can't see the very latest SCN yet, that is, gets 1:AAA and 2:BBB on request SCN: 1..END; what is NOT fine is to get something like 1:AAA and 3:CCC. In other words, does Cassandra provide consistency between writer and reader regarding the order of changes? Or under some conditions (say, very fast writes - but always from a single writer - and many concurrent reads or something) might it be possible to get that kind of gap? The second question is similar, but on a bigger scale. The second table must be modified in a more complicated way; both inserts and updates of old data are required. Sometimes it's a few inserts and a few updates, which must be done atomically - under no conditions should a reader be able to see the mid-state of these inserts/updates. Fortunately, all these new changes will have a new key (new SCNs), so if it were possible to use a column in a separate table which stores the last safe SCN, it would work - but I have no faith that Cassandra offers such a level of consistency. For example, let's say it works like this: current last safe SCN: 1000 update (must be viewed as an atomic transaction): SCN DATA 1001 AAA 1002 BBB 800 1001 1003 DDD new last safe SCN: 1003 Here, readers need a means to filter out lines with SCN > 1000 until the writer is done writing the 1003:DDD line. They also need to filter out the 800:1001 line because it references an SCN which is after the current last safe one. The last safe SCN is stored somewhere, and for this pattern to work I once again need execution-order consistency - no reader should ever see the last safe: 1003 line before all the previous lines were committed; and any reader who saw the last safe: 1003 line must be able to see all the lines from that update just as they are right now.
Is this possible to do in Cassandra?
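The read-and-retry scheme Jeremiah describes could be sketched like this (the `fetch` callable is a hypothetical stand-in for a Cassandra read of one SCN row at your chosen consistency level; `last_safe` comes from the separately stored "last safe SCN" column, updated only after all row writes succeed):

```python
import time

def read_range_without_gaps(fetch, start, last_safe, retries=10, delay=0.01):
    """Read rows start..last_safe; if any SCN is missing (a replica has
    not caught up yet), re-read just the missing ones until they appear."""
    rows = {scn: fetch(scn) for scn in range(start, last_safe + 1)}
    for _ in range(retries):
        missing = [scn for scn, data in rows.items() if data is None]
        if not missing:
            # No gaps: return the rows in SCN order.
            return [rows[scn] for scn in range(start, last_safe + 1)]
        for scn in missing:
            rows[scn] = fetch(scn)
        time.sleep(delay)
    raise TimeoutError("gaps remain after retries")
```

Because SCNs are never overwritten and the last safe SCN only moves forward after a successful write, a reader that loops like this eventually sees a gap-free prefix, which is exactly the guarantee the original question asks for.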
Re: memory estimate for each key in the key cache
It is not telling you to multiply your key size by 10-12, it is telling you to multiply the key cache size reported by nodetool cfstats by 10-12. -Jeremiah On Dec 18, 2011, at 6:37 PM, Guy Incognito wrote: to be blunt, this doesn't sound right to me, unless it's doing something rather more clever to manage the memory. i mocked up a simple class containing a byte[], ByteBuffer and long, and the shallow size alone is 32 bytes. deep size with a byte[16], 1-byte bytebuffer and long is 132. this is on a 64-bit jvm on win x64, but is consistent(ish) with what i've seen in the past on linux jvms. the actual code has rather more objects than this (it's a map, it has a pair, decoratedKey) so it would be quite a bit bigger per key. On 17/12/2011 03:42, Brandon Williams wrote: On Fri, Dec 16, 2011 at 9:31 PM, Dave Brosius dbros...@mebigfatguy.com wrote: Wow, Java is a lot better than I thought if it can perform that kind of magic. I'm guessing the wiki information is just old and out of date. It's probably more like 60 + sizeof(key) With jamm and MAT it's fairly easy to test. The number is accurate last I checked. -Brandon
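Taking Brandon's figure at face value, the back-of-envelope estimate looks like this (the 60-byte constant is the approximate per-entry overhead mentioned in this thread, not an exact number):

```python
def key_cache_memory_estimate(num_keys: int, avg_key_bytes: int) -> int:
    """Rough key cache footprint using the '60 + sizeof(key)' per-entry
    estimate from the thread; treat the constant as approximate."""
    return num_keys * (60 + avg_key_bytes)

# e.g. one million cached keys of 16 bytes each -> roughly 76 MB
print(key_cache_memory_estimate(1_000_000, 16))
```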
Re: gracefully recover from data file corruptions
You need to run repair on the node once it is back up (to get back the data you just deleted). If this is happening on more than one node you could have data loss... -Jeremiah On 12/16/2011 07:46 AM, Ramesh Natarajan wrote: We are running a 30-node 1.0.5 Cassandra cluster on RHEL 5.6 x86_64, virtualized on ESXi 5.0. We are seeing a DecoratedKey assertion error during compactions, and at this point we suspect anything from OS/ESXi/HBA/iSCSI RAID. Please correct me if I am wrong: once a node gets into this state I don't see any way to recover unless I remove the corrupted data file and restart Cassandra. I am running tests with replication factor 3 and all reads and writes are done with QUORUM, so I believe there will not be data loss if I do this. If this is a correct way to recover, I would like to know how to do it gracefully in a production environment: - Disable thrift - Disable gossip - Drain the node - Kill the cassandra java process (send a SIGTERM and/or SIGKILL) - Do a filesystem sync - Remove the corrupted file from the /var/lib/cassandra/data directory - Start cassandra - Enable gossip so all pending hinted handoff occurs - Enable thrift. Thanks Ramesh
Re: Cassandra C client implementation
If you are OK linking to a C++-based library you can look at: https://github.com/minaguib/libcassandra/tree/kickstart-libcassie-0.7/libcassie It is wrapper code around libcassandra which exports a C interface. If you look at the function names etc. in the other languages, just use the similar functions from the c_glib thrift... If you are going to mess with using the c_glib thrift, make sure to check out the JIRA for it; it is new and has some issues... https://issues.apache.org/jira/browse/THRIFT/component/12313854 On 12/14/2011 09:11 AM, Vlad Paiu wrote: Hello, I am trying to integrate some Cassandra-related ops (insert, get, etc.) into an application written entirely in C, so C++ is not an option. Is there any C client library for Cassandra? I have also tried to generate Thrift glibc code for Cassandra, but at wiki.apache.org/cassandra/ThriftExamples I cannot find an example for C. Can anybody suggest a C client library for Cassandra or provide some working examples for Thrift in C? Thanks and Regards, Vlad
Re: Slow Compactions - CASSANDRA-3592
Does your issue look similar to this one? https://issues.apache.org/jira/browse/CASSANDRA-3532 It is also dealing with compaction taking 10X longer in 1.0.X On 12/13/2011 09:00 AM, Dan Hendry wrote: I have been observing that major compaction can be incredibly slow in Cassandra 1.0 and was curious to what extent anybody else has noticed similar behaviour. Essentially I believe the problem involves the combination of wide rows and expiring columns. Relevant details included in: https://issues.apache.org/jira/browse/CASSANDRA-3592 Dan Hendry (403) 660-2297
Re: cassandra in production environment
What java are you using? OpenJDK or Sun/Oracle (http://www.oracle.com/technetwork/java/javase/downloads/index.html)? If you are using OpenJDK you might try Sun. Have you run diagnostics on the disk? It is more likely there is an issue with your disk, not with Cassandra. On 12/11/2011 07:04 PM, Ramesh Natarajan wrote: Hi, We are currently testing cassandra in RHEL 6.1 64 bit environment running on ESXi 5.0 and are experiencing issues with data file corruptions. If you are using linux for production environment can you please share which OS/version you are using? thanks Ramesh
Re: Need to reconcile data from 2 drives
If you don't want downtime, you can take the original data and use the bulk sstable loader to send it back into the cluster. If you don't mind downtime, you can take all the files from both data folders and put them together, make sure there aren't any with the same names (rename them if there are), and then start Cassandra; it will pick up all the files. -Jeremiah On 12/12/2011 12:53 PM, Stephane Legay wrote: Here's the situation. We're running a 2-node cluster on EC2 (v 0.8.6). Each node writes data to an EBS volume mounted on /mnt2. On Dec. 9th, for some reason both instances were rebooted (not sure yet what triggered the reboot). But the EBS volumes were not added to /etc/fstab, and didn't mount upon reboot. Cassandra auto-started without any problems, created a new data folder on the system drive and started writing there. We just found out about the issue today with users missing data. So, to recap: - each node contains data created since 12-09-2011, stored on the system drive - each node has access to data created on or before 12-09-2011 on an EBS volume - we need to move the data stored on the system drive to the EBS volume and restart Cassandra into a stable state with all data available What's the best way for me to do this? Thanks
Re: exporting data from Cassandra cluster
Once you get all of the data on one machine you can flush/drain/compact and shut down the single node, then take the data folder off that machine and back it up. Then when you get your new Cassandra cluster set up, you can use the sstable loader to shoot the data from the backup into the new cluster. On 12/09/2011 07:09 AM, Alexandru Dan Sicoe wrote: Hi Jeremiah, The thing is I will send the data to a massive storage facility (I don't know what's behind the scenes) so I won't be backing up on one machine where I can install Cassandra. Does the sstable loader work just for copying data from a Cassandra cluster to somewhere on a disk where there is no Cassandra instance? If not, what is the best way/tool to achieve that? Cheers, Alexandru On Wed, Dec 7, 2011 at 10:00 PM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote: Stop your current cluster. Start a new cassandra instance on the machine you want to store your data on. Use the sstable loader to load the sstables from all of the current machines into the new machine. Run major compaction a couple times. You will have all of the data on one machine. On 12/07/2011 10:17 AM, Alexandru Dan Sicoe wrote: Hello everyone. 3-node Cassandra 0.8.5 cluster. I've left the system running in a production environment for long-term testing. I've accumulated about 350GB of data with RF=2. The machines I used for the tests are older and need to be replaced. Because of this I need to export the data to a permanent location. How should I export the data? In order to reduce the storage space I want to export only the non-replicated data - I mean, just one copy of the data (without the replicas). Is this possible? How? Cheers, Alexandru
Re: exporting data from Cassandra cluster
Stop your current cluster. Start a new cassandra instance on the machine you want to store your data on. Use the sstable loader to load the sstables from all of the current machines into the new machine. Run major compaction a couple times. You will have all of the data on one machine. On 12/07/2011 10:17 AM, Alexandru Dan Sicoe wrote: Hello everyone. 3-node Cassandra 0.8.5 cluster. I've left the system running in a production environment for long-term testing. I've accumulated about 350GB of data with RF=2. The machines I used for the tests are older and need to be replaced. Because of this I need to export the data to a permanent location. How should I export the data? In order to reduce the storage space I want to export only the non-replicated data - I mean, just one copy of the data (without the replicas). Is this possible? How? Cheers, Alexandru
Re: Insufficient disk space to flush
If you are writing data with QUORUM or ALL you should be safe to restart cassandra on that node. If the extra space is all from *tmp* files from compaction they will get deleted at startup. You will then need to run repair on that node to get back any data that was missed while it was full. If your commit log was on a different device you may not even have lost much. -Jeremiah On 12/01/2011 04:16 AM, Alexandru Dan Sicoe wrote: Hello everyone, 4-node Cassandra 0.8.5 cluster with RF=2. One node started throwing exceptions in its log: ERROR 10:02:46,837 Fatal exception in thread Thread[FlushWriter:1317,5,main] java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk space to flush 17296 bytes at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Caused by: java.lang.RuntimeException: Insufficient disk space to flush 17296 bytes at org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:714) at org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2301) at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:246) at org.apache.cassandra.db.Memtable.access$400(Memtable.java:49) at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:270) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) ... 3 more Checked the disk and obviously it's 100% full. How do I recover from this without losing the data? I've got plenty of space on the other nodes, so I thought of doing a decommission, which I understand reassigns ranges to the other nodes and replicates the data to them.
After that's done I plan on manually deleting the data on the node and then rejoining at the same cluster position with auto-bootstrap turned off, so that I won't get back the old data and I can continue getting new data with the node. Note, I would like to have 4 nodes in, because the other three barely handle the input load alone. These are just long-running tests until I get some better machines. One strange thing I found is that the data folder on the node that filled up the disk is 150 GB (as measured with du) while the data folder on all other 3 nodes is 50 GB. At the same time, DataStax OpsCenter shows a size of around 50GB for all 4 nodes. I thought that the node was doing a major compaction at the time it filled up the disk, but even that doesn't make sense, because shouldn't a major compaction only be capable of doubling the size, not tripling it? Does anyone know how to explain this behavior? Thanks, Alex
Re: JMX monitoring
jconsole is going to be the most up to date documentation for the JMX interface =(. -Jeremiah On 11/23/2011 10:49 AM, David McNelis wrote: Ok. in that case I think the Docs are wrong. http://wiki.apache.org/cassandra/JmxInterface has StorageService as part of org.apache.cassandra.service. Also, once I executed a CLI command, I started getting the expected output (output being that it was able to return the live nodes). -- *David McNelis* Lead Software Engineer Agentis Energy www.agentisenergy.com http://www.agentisenergy.com c: 219.384.5143 /A Smart Grid technology company focused on helping consumers of energy control an often under-managed resource./
Re: DataCenters each with their own local data source
Cassandra's multiple data center support is meant for replicating all data across multiple datacenters efficiently. You could use the ByteOrderedPartitioner to prefix data with a key and assign those keys to nodes in specific data centers, though the edge nodes would get tricky, as those would want to have replicas in other data centers; you could probably do some stuff with sentinel values, and some nodes that only replicate data and aren't the primary node for any data, to make this not happen. It is doable, though it would probably be more trouble than it is worth. I would probably just make each DC its own cluster and have client logic which knows which DC to query. -Jeremiah On Nov 22, 2011, at 6:57 PM, Mathieu Lalonde wrote: Hi, I am wondering if Cassandra's features and datacenter awareness can help me with my scalability problems. Suppose that I have 10-20 data centers, each with their own local (massive) source of time series data. I would like: - to avoid replication across data centers (this seems doable based on: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Different-KeySpaces-for-different-nodes-in-the-same-ring-td5096393.html#a5096568 ) - writes for local data to be done on the local data center (not sure about that one) - reads from a master data center to any remote data centers (not sure about that one either) It sounds like I am trying to use Cassandra in a very different way than it was intended to be used. Should I simply have a middle tier that takes care of distributing reads to multiple data centers and treat each data center as its own autonomous cluster? Thanks! Matt
Re: DataCenters each with their own local data source
Oops, I was thinking all in the same keyspace. If you make a new keyspace for each DC you can specify where to put the data and have it live in only one place. -Jeremiah On Nov 22, 2011, at 8:49 PM, Jeremiah Jordan wrote: Cassandra's multiple data center support is meant for replicating all data across multiple datacenters efficiently. You could use the ByteOrderedPartitioner to prefix data with a key and assign those keys to nodes in specific data centers, though the edge nodes would get tricky, as those would want to have replicas in other data centers; you could probably do some stuff with sentinel values, and some nodes that only replicate data and aren't the primary node for any data, to make this not happen. It is doable, though it would probably be more trouble than it is worth. I would probably just make each DC its own cluster and have client logic which knows which DC to query. -Jeremiah On Nov 22, 2011, at 6:57 PM, Mathieu Lalonde wrote: Hi, I am wondering if Cassandra's features and datacenter awareness can help me with my scalability problems. Suppose that I have 10-20 data centers, each with their own local (massive) source of time series data. I would like: - to avoid replication across data centers (this seems doable based on: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Different-KeySpaces-for-different-nodes-in-the-same-ring-td5096393.html#a5096568 ) - writes for local data to be done on the local data center (not sure about that one) - reads from a master data center to any remote data centers (not sure about that one either) It sounds like I am trying to use Cassandra in a very different way than it was intended to be used. Should I simply have a middle tier that takes care of distributing reads to multiple data centers and treat each data center as its own autonomous cluster? Thanks! Matt
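For what it's worth, the per-DC-keyspace suggestion might look something like this in cassandra-cli, following the create keyspace syntax used elsewhere on this list (the keyspace names are placeholders; depending on version you may be able to simply omit a DC from strategy_options instead of giving it a count of 0):

```
create keyspace DC1Data
  with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options = [{DC1:2, DC2:0}];

create keyspace DC2Data
  with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options = [{DC2:2, DC1:0}];
```

Each keyspace then keeps all of its replicas inside one data center, while a shared keyspace (e.g. with {DC1:2, DC2:2}) can still replicate everywhere.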
Re: 7199
Yes, that is the port nodetool needs to access. On Nov 22, 2011, at 8:43 PM, Maxim Potekhin wrote: Hello, I have this in my cassandra-env.sh JMX_PORT=7199 Does this mean that if I use nodetool from another node, it will try to connect to that particular port? Thanks, Maxim
Re: Efficiency of Cross Data Center Replication...?
If hinting is off, read repair and manual repair are the only ways data will get there (just like when a single node is down). On Nov 20, 2011, at 6:01 AM, Boris Yen wrote: A quick question: what if DC2 is down, and after a while it comes back on? How does the data get synced to DC2 in this case? (assume hinting is disabled) Thanks in advance. On Thu, Nov 17, 2011 at 10:46 AM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote: Pretty sure data is sent to the coordinating node in DC2 at the same time it is sent to replicas in DC1, so I would think 10's of milliseconds after the transport time to DC2. On Nov 16, 2011, at 3:48 PM, ehers...@gmail.com wrote: On a related note - assuming there are available resources across the board (cpu and memory on every node, low network latency, non-saturated nics/circuits/disks), what's a reasonable expectation for timing on replication? Sub-second? Less than five seconds? Ernie On Wed, Nov 16, 2011 at 4:00 PM, Brian Fleming bigbrianflem...@gmail.com wrote: Great - thanks Jake B. On Wed, Nov 16, 2011 at 8:40 PM, Jake Luciani jak...@gmail.com wrote: the former On Wed, Nov 16, 2011 at 3:33 PM, Brian Fleming bigbrianflem...@gmail.com wrote: Hi All, I have a question about inter-data centre replication: if you have 2 data centers, each with a local RF of 2 (i.e. total RF of 4) and write to a node in DC1, how efficient is the replication to DC2 - i.e. is that data: - replicated over to a single node in DC2 once and internally replicated, or - replicated explicitly to two separate nodes? Obviously from a LAN resource utilisation perspective, the former would be preferable. Many thanks, Brian -- http://twitter.com/tjake
Re: Efficiency of Cross Data Center Replication...?
Pretty sure data is sent to the coordinating node in DC2 at the same time it is sent to replicas in DC1, so I would think 10's of milliseconds after the transport time to DC2. On Nov 16, 2011, at 3:48 PM, ehers...@gmail.com wrote: On a related note - assuming there are available resources across the board (cpu and memory on every node, low network latency, non-saturated nics/circuits/disks), what's a reasonable expectation for timing on replication? Sub-second? Less than five seconds? Ernie On Wed, Nov 16, 2011 at 4:00 PM, Brian Fleming bigbrianflem...@gmail.com wrote: Great - thanks Jake B. On Wed, Nov 16, 2011 at 8:40 PM, Jake Luciani jak...@gmail.com wrote: the former On Wed, Nov 16, 2011 at 3:33 PM, Brian Fleming bigbrianflem...@gmail.com wrote: Hi All, I have a question about inter-data centre replication : if you have 2 Data Centers, each with a local RF of 2 (i.e. total RF of 4) and write to a node in DC1, how efficient is the replication to DC2 - i.e. is that data : - replicated over to a single node in DC2 once and internally replicated or - replicated explicitly to two separate nodes? Obviously from a LAN resource utilisation perspective, the former would be preferable. Many thanks, Brian -- http://twitter.com/tjake
Re: Is a direct upgrade from .6 to 1.0 possible?
You should be able to do it as long as you shut down the whole cluster for it: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Upgrading-to-1-0-tp6954908p6955316.html On 11/13/2011 02:14 PM, Timothy Smith wrote: Due to some application dependencies I've been holding off on a Cassandra upgrade for a while. Now that my last application using the old thrift client is updated I have the green light to prep my upgrade. Since I'm on .6 the upgrade is obviously a bit trickier. Do the standard instructions for upgrading from .6 to .7 still apply or do I have to step from .6 - .7 - 1.0? Thanks, Tim
Re: questions on frequency and timing of async replication between DCs
If you query with ALL, do you get the data? If you query with a range slice, do you get the data (list from the cli)? On 11/11/2011 04:10 PM, Subrahmanya Harve wrote: I have cross-DC replication set up using 0.8.7 with 3 nodes in each DC, following the +1 rule for tokens. I am seeing an issue where an insert into a DC happened successfully, but on querying from the cli or through Hector, I am not seeing the data being returned. I used the cli on every node of both DCs and every node returned blank. So the basic question is: where is my data? CL.WRITE=ONE, CL.READ=1. RF = DC:2, DC:2. Apart from checking the data directory size on each DC to verify that cross-DC replication has happened, what other steps can I take to verify that cross-DC replication is happening successfully? What tuning params can I control with regard to cross-DC replication? (frequency? batch size? etc.) Would greatly appreciate any help.
Re: Data retrieval inconsistent
I am pretty sure that the way you have K1 configured, it will be placed across both DCs as if you had one large ring. If you want it only in DC1 you need to say DC1:1, DC2:0. If you are writing and reading at ONE you are not guaranteed to get the data if RF > 1. If RF = 2 and you write with ONE, your data could be written to server 1, and then read from server 2 before it gets over there. The differing server times will only really matter for TTLs. Most everything else works off comparing user-supplied times. -Jeremiah On 11/10/2011 02:27 PM, Subrahmanya Harve wrote: I am facing an issue in a 0.8.7 cluster - I have two clusters in two DCs (rather, one cross-DC cluster) and two keyspaces. But i have only configured one keyspace to replicate data to the other DC and the other keyspace to not replicate over to the other DC. Basically this is the way i ran the keyspace creation - create keyspace K1 with placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and strategy_options = [{replication_factor:1}]; create keyspace K2 with placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options = [{DC1:2, DC2:2}]; I had to do this because i expect that K1 will get a large volume of data and i do not want this wired over to the other DC. I am writing the data at CL=ONE and reading the data at CL=ONE. I am seeing an issue where sometimes i get the data and other times i do not see the data. Does anyone know what could be going on here? A second larger question is - i am migrating from 0.7.4 to 0.8.7. I can see that there are large changes in the yaml file, but a specific question i had was - how do i configure disk_access_mode like it used to be in 0.7.4? One observation i have made is that some nodes of the cross-DC cluster are at different system times. This is something to fix, but could this be why data is sometimes retrieved and other times not? Or is there some other thing to it? Would appreciate a quick response.
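The guarantee Jeremiah is describing is the usual replica-overlap condition, which can be written as a one-line check (a generic sketch of the rule, not Cassandra API code): a read is only guaranteed to see a preceding successful write when the read and write replica counts overlap, i.e. R + W > RF.

```python
def read_your_writes(rf: int, write_cl: int, read_cl: int) -> bool:
    """True when every read replica set must intersect every write
    replica set, so a read cannot miss a preceding successful write."""
    return read_cl + write_cl > rf

print(read_your_writes(2, 1, 1))  # False: ONE/ONE at RF=2, as in the thread
print(read_your_writes(3, 2, 2))  # True: QUORUM/QUORUM at RF=3
```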
Re: Data retrieval inconsistent
No, that is what I thought you wanted. I was thinking your machines in DC1 had extra disk space or something... (I stopped replying to the dev list) On 11/10/2011 04:09 PM, Subrahmanya Harve wrote: Thanks Ed and Jeremiah for that useful info. I am pretty sure the way you have K1 configured it will be placed across both DC's as if you had one large ring. If you want it only in DC1 you need to say DC1:1, DC2:0. In fact I do want K1 to be available across both DCs as if I had a large ring; I just do not want it to replicate across DCs. Also, I did try doing it like you said, DC1:1, DC2:0, but won't that mean that all my data goes into DC1 irrespective of whether the data is written to the nodes of DC1 or DC2, thereby creating a hot DC? Since the volume of data for this case is huge, might that create a load imbalance on DC1? (Am I missing something?) On Thu, Nov 10, 2011 at 1:30 PM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote: I am pretty sure the way you have K1 configured it will be placed across both DC's as if you had one large ring. If you want it only in DC1 you need to say DC1:1, DC2:0. If you are writing and reading at ONE you are not guaranteed to get the data if RF > 1. If RF = 2 and you write with ONE, your data could be written to server 1, and then read from server 2 before it gets over there. The differing server times will only really matter for TTLs. Most everything else works off comparing user-supplied times. -Jeremiah On 11/10/2011 02:27 PM, Subrahmanya Harve wrote: I am facing an issue in a 0.8.7 cluster - I have two clusters in two DCs (rather, one cross-DC cluster) and two keyspaces. But i have only configured one keyspace to replicate data to the other DC and the other keyspace to not replicate over to the other DC.
Basically this is the way I ran the keyspace creation - create keyspace K1 with placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and strategy_options = [{replication_factor:1}]; create keyspace K2 with placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options = [{DC1:2, DC2:2}]; I had to do this because I expect that K1 will get a large volume of data and I do not want this wired over to the other DC. I am writing the data at CL=ONE and reading the data at CL=ONE. I am seeing an issue where sometimes I get the data and other times I do not. Does anyone know what could be going on here? A second, larger question is - I am migrating from 0.7.4 to 0.8.7; I can see that there are large changes in the yaml file, but a specific question I had was: how do I configure disk_access_mode like it used to be in 0.7.4? One observation I have made is that some nodes of the cross-DC cluster are at different system times. This is something to fix, but could this be why data is sometimes retrieved and other times not? Or is there some other thing to it? Would appreciate a quick response.
Re: : Cassandra reads under write-only load, read degradation after massive writes
Indexed columns cause a read before write so that the index can be updated if the column already exists. On 11/09/2011 02:46 PM, Oleg Tsernetsov wrote: When monitoring JMX metrics of Cassandra 0.8.7 loaded by a write-only test, I observe significant read activity on the column family I write to. It seems strange to me, as I expected no read activity under a write-only load. The read activity is caused by writes: when I stop the write test, the read activity disappears. The test performs parallel column writes to a single row, writing the values of a fixed column set over and over again. Furthermore, the second problem is that parallel massive reads of such a row degrade over time (even without parallel write load) and Cassandra starts burning 100% of CPU, with read latency degrading 20x compared with exactly the same row created from scratch. The test setup is 3 Cassandra nodes, read/write consistency = QUORUM. The row has 10 and above columns (tested with 10, 100, 1000, 1 cols); the higher the number of columns, the worse the observed degradation. The column family has 2 indexed columns that are written with exactly the same values on each and every write. Row key, column name and column value are all Utf8Type. Column family compaction on all the nodes does not help, and the row remains degraded. Read here means one of: reading all the columns with a slice query without bounds/with bounds; executing a column count query for a row with bounds/without bounds. I use Hector as the Cassandra client. I would be thankful if anyone could explain the read activity on write load and give any hints on row read degradation after massive write load on that row. Regards, Oleg
Re: Second Cassandra users survey
Actually, the data will be visible at QUORUM as well if you can see it with ONE. QUORUM actually gives you a higher chance of seeing the new value than ONE does. In the case of RF=3 you have a 2/3 chance of seeing the new value with QUORUM; with ONE you have 1/3... And this JIRA fixed an issue where two QUORUM reads in a row could give you the NEW value and then the OLD value: https://issues.apache.org/jira/browse/CASSANDRA-2494 So a QUORUM read after a failed write always gives consistent results now, for a single row. For multiple rows you still have issues, but you can always mitigate that in the app with something like giving all of the changes the same timestamp, then on read checking that the timestamps match, and reading the data again if they don't. I'm not arguing against atomic batch operations, they would be nice =). Just clarifying how things work now. -Jeremiah On 11/06/2011 02:05 PM, Pierre Chalamet wrote: - support for atomic operations or batches (if QUORUM fails, data should not be visible with ONE); zookeeper is solving that. I might have screwed up a little bit since I didn't talk about isolation; let's reformulate: support for read committed (using DB terminology). Cassandra is more like read uncommitted. Even if row mutations in one CF for one key are atomic on one server, stuff is not rolled back when the CL can't be satisfied at the coordinator level. Data won't be visible at QUORUM level, but when using weaker CLs, invalid data can appear imho. Also it should be possible to tell which operations failed with batch_mutate, but unfortunately it is not.
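Jeremiah's 2/3 vs 1/3 figures can be checked by enumerating which replicas a read might contact. A small sketch (toy model, not Cassandra code): suppose a failed write left the new value on only 1 of 3 replicas, and a read contacts a uniformly random subset of replicas; the newest timestamp among the contacted replicas wins.

```python
from itertools import combinations
from fractions import Fraction

def chance_of_seeing_new_value(rf, replicas_with_new_value, read_count):
    """Probability that a read contacting `read_count` random replicas out of
    `rf` includes at least one replica holding the new value."""
    fresh = set(range(replicas_with_new_value))        # replicas the write reached
    picks = list(combinations(range(rf), read_count))  # possible read replica sets
    hits = sum(1 for p in picks if fresh & set(p))
    return Fraction(hits, len(picks))

# Failed write reached only 1 of 3 replicas:
print(chance_of_seeing_new_value(3, 1, 1))  # 1/3 at ONE
print(chance_of_seeing_new_value(3, 1, 2))  # 2/3 at QUORUM
```

So a QUORUM read is strictly more likely than a ONE read to surface a partially written value, which is exactly the point being made above.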
Re: Second Cassandra users survey
- Batch read/slice from multiple column families. On 11/01/2011 05:59 PM, Jonathan Ellis wrote: Hi all, Two years ago I asked for Cassandra use cases and feature requests. [1] The results [2] have been extremely useful in setting and prioritizing goals for Cassandra development. But with the release of 1.0 we've accomplished basically everything from our original wish list. [3] I'd love to hear from modern Cassandra users again, especially if you're usually a quiet lurker. What does Cassandra do well? What are your pain points? What's your feature wish list? As before, if you're in stealth mode or don't want to say anything in public, feel free to reply to me privately and I will keep it off the record. [1] http://www.mail-archive.com/cassandra-dev@incubator.apache.org/msg01148.html [2] http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg01446.html [3] http://www.mail-archive.com/dev@cassandra.apache.org/msg01524.html
Re: Cassandra 1.0.0 - Node Load Bug
I thought this patch made it into the 1.0 release? I remember it being referenced in one of the re-rolls. On Oct 20, 2011, at 9:56 PM, Jonathan Ellis jbel...@gmail.com wrote: That looks to me like it's reporting uncompressed size as the load. Should be fixed in the 1.0 branch for 1.0.1. (https://issues.apache.org/jira/browse/CASSANDRA-3338) On Thu, Oct 20, 2011 at 11:53 AM, Dan Hendry dan.hendry.j...@gmail.com wrote: I have been playing around with Cassandra 1.0.0 in our test environment and it seems pretty sweet so far. I have however come across what appears to be a bug tracking node load. I have enabled compression and levelled compaction on all CFs (scrub + snapshot deletion) and the nodes have been operating normally for a day or two. I started getting concerned when the load as reported by nodetool ring kept increasing (it seems monotonically) despite seeing a compression ratio of ~2.5x (as a side note, I find it strange Cassandra does not provide the compression ratio via jmx or in the logs). I initially thought there might be a bug in cleaning up obsolete SSTables, but I then noticed the following discrepancy: Nodetool ring reports: 10.112.27.65 datacenter1 rack1 Up Normal 8.64 GB 50.00% 170141183460469231731687303715884105727 Yet du . -h reports only 2.4G in the data directory. After restarting the node, nodetool ring reports a more accurate: 10.112.27.65 datacenter1 rack1 Up Normal 2.35 GB 50.00% 170141183460469231731687303715884105727 Again, both compression and levelled compaction have been enabled on all CFs. Is this a known issue or has anybody else observed a similar pattern? Dan Hendry (403) 660-2297 -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Massive writes when only reading from Cassandra
I could be totally wrong here, but if you are doing a QUORUM read and a stale value is encountered among the quorum's replicas, won't a repair happen? I thought read_repair_chance = 0 just means it won't query extra nodes to check for bad values. -Jeremiah On Oct 17, 2011, at 4:22 PM, Jeremy Hanna wrote: Even after disabling hinted handoff and setting read_repair_chance to 0 on all our column families, we were still experiencing massive writes. Apparently read_repair_chance is completely ignored at any CL higher than CL.ONE. So we were doing CL.QUORUM on reads and writes and still seeing massive writes. It was because of the background read repairs being done. We did extensive logging and checking, and that's all it could be, as no mutations were coming in via thrift to those column families. In any case, just wanted to give some follow-up here as it's been an inexplicable rock in our backpack, and hopefully this clears up where that setting is actually used. I'll update the storage configuration wiki to include that caveat as well. On Sep 10, 2011, at 5:14 PM, Jeremy Hanna wrote: Thanks for the insights. I may first try disabling hinted handoff for one run of our data pipeline and see if it exhibits the same behavior. Will post back if I see anything enlightening there. On Sep 10, 2011, at 5:04 PM, Chris Goffinet wrote: You could tail the commit log with `strings` to see what keys are being inserted. On Sat, Sep 10, 2011 at 2:24 PM, Jonathan Ellis jbel...@gmail.com wrote: Two possibilities: 1) Hinted handoff (this will show up in the logs on the sending machine; on the receiving one it will just look like any other write) 2) You have something doing writes that you're not aware of; I guess you could track that down using wireshark to see where the write messages are coming from On Sat, Sep 10, 2011 at 3:56 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: Oh and we're running 0.8.4 and the RF is 3.
On Sep 10, 2011, at 3:49 PM, Jeremy Hanna wrote: In addition, the mutation stage and the read stage are backed up like:

Pool Name               Active     Pending  Blocked
ReadStage               32         773      0
RequestResponseStage    0          0        0
ReadRepairStage         0          0        0
MutationStage           158525918           0
ReplicateOnWriteStage   0          0        0
GossipStage             0          0        0
AntiEntropyStage        0          0        0
MigrationStage          0          0        0
StreamStage             0          0        0
MemtablePostFlusher     1          5        0
FILEUTILS-DELETE-POOL   0          0        0
FlushWriter             2          5        0
MiscStage               0          0        0
FlushSorter             0          0        0
InternalResponseStage   0          0        0
HintedHandoff           0          0        0
CompactionManager       n/a        29
MessagingService        n/a        0,34

On Sep 10, 2011, at 3:38 PM, Jeremy Hanna wrote: We are experiencing massive writes to column families when only doing reads from Cassandra. A set of 5 hadoop jobs are reading from Cassandra and then writing out to hdfs. That is the only thing operating on the cluster. We are reading at CL.QUORUM with hadoop and have written with CL.QUORUM. Read repair chance is set to 0.0 on all column families. However, in the logs, I'm seeing flush after flush of memtables and compactions taking place. Is there something else that would be writing based on the above description? Jeremy -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: nodetool ring Load column
Are you using compressed sstables? Or the leveled sstables? Make sure you include how you are configured in any JIRA you make; someone else was seeing a similar issue with compression turned on. -Jeremiah On Oct 14, 2011, at 1:13 PM, Ramesh Natarajan wrote: What does the Load column in nodetool ring mean? From the output below it shows 101.62 GB. However if I do a disk usage it is about 6 GB. thanks Ramesh

[root@CAP2-CNode1 cassandra]# ~root/apache-cassandra-1.0.0-rc2/bin/nodetool -h localhost ring
Address       DC           Rack   Status  State   Load       Owns    Token
                                                                     148873535527910577765226390751398592512
10.19.102.11  datacenter1  rack1  Up      Normal  101.62 GB  12.50%  0
10.19.102.12  datacenter1  rack1  Up      Normal  84.42 GB   12.50%  21267647932558653966460912964485513216
10.19.102.13  datacenter1  rack1  Up      Normal  95.47 GB   12.50%  42535295865117307932921825928971026432
10.19.102.14  datacenter1  rack1  Up      Normal  91.25 GB   12.50%  63802943797675961899382738893456539648
10.19.103.11  datacenter1  rack1  Up      Normal  93.98 GB   12.50%  85070591730234615865843651857942052864
10.19.103.12  datacenter1  rack1  Up      Normal  100.33 GB  12.50%  106338239662793269832304564822427566080
10.19.103.13  datacenter1  rack1  Up      Normal  74.1 GB    12.50%  127605887595351923798765477786913079296
10.19.103.14  datacenter1  rack1  Up      Normal  93.96 GB   12.50%  148873535527910577765226390751398592512
[root@CAP2-CNode1 cassandra]# du -hs /var/lib/cassandra/data/
6.0G    /var/lib/cassandra/data/
Re: How to speed up Waiting for schema agreement for a single node Cassandra cluster?
But truncate is still slow, especially if it can't use JNA (Windows), as it snapshots. Depending on how much data you are inserting during your unit tests, just paging through all the keys and then deleting them is the fastest way; though if you use timestamps other than now this won't work, as the timestamps need to be increasing between test runs. On Oct 4, 2011, at 9:33 AM, Joseph Norton wrote: I didn't consider using truncate because a set of potentially random Column Families are created dynamically during the test. Are there any configuration knobs that could be adjusted for drop + recreate? thanks in advance, - Joe N Joseph Norton nor...@alum.mit.edu On Oct 4, 2011, at 11:19 PM, Jonathan Ellis wrote: Truncate is faster than drop + recreate. On Tue, Oct 4, 2011 at 9:15 AM, Joseph Norton nor...@lovely.email.ne.jp wrote: Hello. For unit test purposes, I have a single node Cassandra cluster. I need to drop and re-create several keyspaces between each test iteration. This process takes approximately 10 seconds for a single node installation. Can you recommend any tricks or recipes to reduce the time required for such operations and/or for Waiting for schema agreement to complete? regards, - Joe N.

$ time ./setupDB.sh
Deleteing cassandra keyspaces
Connected to: Foo on 127.0.0.1/9160
ed9c7fc0-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
ee8c36f0-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
eeb14b20-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
Insert data
Creating cassandra keyspaces
Connected to: Foo on 127.0.0.1/9160
ef1a6d30-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
Authenticated to keyspace: Bars
ef4af310-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
ef9bab20-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
efbceec0-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
f00e4310-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
f0589280-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
f0821380-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
f0c44ca0-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
Authenticated to keyspace: Baz
f121d5f0-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
f1619e10-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
f18b4620-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
Authenticated to keyspace: Buz
f1debd50-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
f20690a0-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
f25043d0-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
f29a1e10-ee91-11e0--534d24a6e7f7
Waiting for schema agreement... ... schemas agree across the cluster
Inserting data in cassandra
Connected to: Foo on 127.0.0.1/9160
Authenticated to keyspace: Boo
Value inserted.
Value inserted.
Value inserted.
Value inserted.
Value inserted.
Value inserted.
Value inserted.
Value inserted.
Value inserted.
Value inserted.
Value inserted.
Value inserted.
Value inserted.
Value inserted.
Value inserted.
Value inserted.
Value inserted.
Value inserted.
Value inserted.
real 0m9.554s
user 0m2.729s
sys 0m0.194s

Joseph Norton nor...@alum.mit.edu -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Very large rows VS small rows
If A works for our use case, it is a much better option. A given row has to be read in full to return data from it; there used to be a limitation that a row had to fit in memory, but there is now code to page through the data, so while that isn't a limitation any more, it means rows that don't fit in memory are very slow to use. Also wide rows spread across nodes. You should also consider more nodes in your cluster. From our experience nodes perform better when they are only managing a few hundred GB each. Pretty sure that 10TB+ of data (100's * 100GB) will not perform very well on a 3 node cluster, especially if you plan to have RF=3, making it 10TB+ per node. -Jeremiah On 09/29/2011 12:20 PM, M Vieira wrote: What would be the best approach: A) millions of ~2Kb rows, where each row could have ~6 columns, or B) hundreds of ~100Gb rows, where each row could have ~1 million columns? Considerations: Most entries will be searched for (read+write) at least once a day but no more than 3 times a day. Cheap hardware across the cluster of 3 nodes, each with 16Gb mem (heap = 8Gb). Any input would be appreciated M.
Re: Very large rows VS small rows
So I need to read what I write before hitting send. Should have been: If A works for YOUR use case. and Wide rows DON'T spread across nodes well. On 09/29/2011 02:34 PM, Jeremiah Jordan wrote: If A works for our use case, it is a much better option. A given row has to be read in full to return data from it; there used to be a limitation that a row had to fit in memory, but there is now code to page through the data, so while that isn't a limitation any more, it means rows that don't fit in memory are very slow to use. Also wide rows spread across nodes. You should also consider more nodes in your cluster. From our experience nodes perform better when they are only managing a few hundred GB each. Pretty sure that 10TB+ of data (100's * 100GB) will not perform very well on a 3 node cluster, especially if you plan to have RF=3, making it 10TB+ per node. -Jeremiah On 09/29/2011 12:20 PM, M Vieira wrote: What would be the best approach: A) millions of ~2Kb rows, where each row could have ~6 columns, or B) hundreds of ~100Gb rows, where each row could have ~1 million columns? Considerations: Most entries will be searched for (read+write) at least once a day but no more than 3 times a day. Cheap hardware across the cluster of 3 nodes, each with 16Gb mem (heap = 8Gb). Any input would be appreciated M.
Re: Thrift CPU Usage
Yes. All the stress tool does is flood data through the API, no real processing or anything happens. So thrift reading/writing data should be the majority of the CPU time... On 09/26/2011 08:32 AM, Baskar Duraikannu wrote: Hello - I have been running read tests on Cassandra using stress tool. I have been noticing that thrift seems to be taking lot of CPU over 70% when I look at the CPU samples report. Is this normal? CPU usage seems to go down by 5 to 10% when I change the RPC from sync to async. Is this normal? I am running Cassandra 0.8.4 on Cent OS 5.6 ( Kernel 2.6.18.238) and Oracle JVM. - Thanks Baskar Duraikannu
Re: [BETA RELEASE] Apache Cassandra 1.0.0-beta1 released
Is it possible to update an existing column family with {sstable_compression: SnappyCompressor, compaction_strategy: LeveledCompactionStrategy}? Or will I have to make a new column family and migrate my data to it? -Jeremiah On 09/15/2011 01:01 PM, Sylvain Lebresne wrote: The Cassandra team is pleased to announce the release of the first beta for the future Apache Cassandra 1.0. Let me first stress that this is beta software and as such is *not* ready for production use. The goal of this release is to give a preview of what will be Cassandra 1.0 and more importantly to get wider testing before the final release. So please help us make Cassandra 1.0 be the best it possibly could by testing this beta release and reporting any problem you may encounter[3,4]. You can have a look at the change log[1] and the release notes[2] to see where Cassandra 1.0 differs from the 0.8 series. Apache Cassandra 1.0.0-beta1[5] is available as usual from the cassandra website: http://cassandra.apache.org/download/ Thank you for your help in testing and have fun with it. [1]: http://goo.gl/evCW0 (CHANGES.txt) [2]: http://goo.gl/HbNsV (NEWS.txt) [3]: https://issues.apache.org/jira/browse/CASSANDRA [4]: user@cassandra.apache.org [5]: https://svn.apache.org/repos/asf/cassandra/tags/cassandra-1.0.0-beta1
Re: Updates lost
Are you running on Windows? If the default timestamp is just using time.time()*1e6 you will get the same timestamp twice if the calls are close together; time.time() on Windows has only millisecond resolution. I don't use pycassa, but in the Thrift API wrapper I created for our Python code I implemented the following function for getting timestamps:

def GetTimeInMicroSec():
    """Returns the current time in microseconds; the returned value always increases with each call."""
    newTime = long(time.time() * 1e6)
    try:
        if GetTimeInMicroSec.lastTime >= newTime:
            newTime = GetTimeInMicroSec.lastTime + 1
    except AttributeError:
        pass
    GetTimeInMicroSec.lastTime = newTime
    return newTime

On 08/29/2011 04:56 PM, Peter Schuller wrote: If the client sleeps for a few ms at each loop, the success rate increases. At 15 ms, the script always succeeds so far. Interestingly, the problem seems to be sensitive to alphabetical order. Updating the value from 'aaa' to 'bbb' never has a problem. No pause needed. Is it possible the version of pycassa you're using does not guarantee that successive queries use non-identical and monotonically increasing timestamps? I'm just speculating, but if that is the case and two requests are sent with the same timestamp (due to resolution being lower than the time it takes between calls), the tie-breaker would be the column value, which jives with the fact that you're saying it seems to depend on the value. (I haven't checked current nor past versions of pycassa to determine if this is plausible. Just speculating.)
Solandra distributed search
When using Solandra, do I need to use the Solr sharding syntax in my queries? I don't think I do, because Cassandra is handling the sharding, not Solr, but just want to make sure. The Solandra wiki references the distributed search limitations, which talks about the shard syntax further down the page. From what I see with how it is implemented, I should just be able to pick a random Solandra node and do my query, since they are all backed by the same Cassandra data store. Correct? Thanks! -Jeremiah
Re: Cassandra in Multiple Datacenters Active - Standby configuration
Assign the tokens like they are two separate rings, just make sure you don't have any duplicate tokens. http://wiki.apache.org/cassandra/Operations#Token_selection The two datacenters are treated as separate rings, LOCAL_QUORUM will only delay the client as long as it takes to write the data to the local nodes. The nodes in the other datacenter will get asynchronous writes. On 08/15/2011 03:39 PM, Oleg Tsvinev wrote: Hi all, I have a question that documentation has not clear answer for. I have the following requirements: 1. Synchronously store data in datacenter DC1 on 2+ nodes 2. Asynchronously replicate the same data to DC2 and store it on 2+ nodes to act as a hot standby Now, I have configured keyspaces with o.a.c.l.NetworkTopologyStrategy with strategy_options=[{DC1:2, DC2:2}] and use LOCAL_QUORUM consistency level, following documentation here: http://www.datastax.com/docs/0.8/operations/datacenter Now, how do I assign initial tokens? If I have, say 6 nodes total, 3 in DC1 and 3 in DC2, and create a ring as if all 6 nodes share the total 2^128 space equally. Now say node N1:DC2 has key K and is in remote datacenter (for an app in DC1). Wouldn't Cassandra always forward K to the DC2 node N1 thus turning asynchronous writes into synchronous ones? Performance impact will be huge as the latency between DC1 and DC2 is significant. I hope there's an answer and I'm just missing something. My case falls under Disaster Recovery in http://www.datastax.com/docs/0.8/operations/datacenter but I don't see how Cassandra will support my use case. I appreciate any help on this. Thank you, Oleg
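The "two separate rings" token assignment from the wiki link above can be sketched numerically. A hedged example (the helper name is illustrative, not a Cassandra tool; 2**127 is the RandomPartitioner token space): compute evenly spaced tokens per datacenter independently, then offset the second DC by 1 so no token is duplicated across the cluster.

```python
# Token selection for a multi-DC cluster treated as two independent rings.
TOKEN_SPACE = 2 ** 127  # RandomPartitioner token range is [0, 2**127)

def tokens_for_dc(nodes_in_dc, offset=0):
    """Evenly spaced initial tokens for one DC, shifted by `offset`
    so tokens never collide with the other DC's tokens."""
    return [i * TOKEN_SPACE // nodes_in_dc + offset for i in range(nodes_in_dc)]

dc1 = tokens_for_dc(3)            # 3 nodes in DC1: 0, T/3, 2T/3
dc2 = tokens_for_dc(3, offset=1)  # 3 nodes in DC2: 1, T/3+1, 2T/3+1
assert len(set(dc1) | set(dc2)) == 6  # no duplicate tokens anywhere
print(dc1[0], dc2[0])
```

With this layout each DC owns its full range locally, so LOCAL_QUORUM writes only wait on local replicas while replicas in the other DC receive the write asynchronously, as described above.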
Re: thrift c++ insert Exception [Column value is required]
You can check out libcassandra for a C++ client built on top of Thrift. It is not feature complete, but it is pretty good. https://github.com/matkor/libcassandra On Aug 14, 2011, at 3:59 AM, Konstantinos Chasapis wrote: Hi, Thank you for your answer. Is there any documentation that describes all these values that I have to set? Konstantinos Chasapis On Aug 14, 2011, at 6:28 AM, Jonathan Ellis wrote: In C++ you need to set .__isset.fieldname on optional fields (e.g. .__isset.value). 2011/8/13 Hassapis Constantinos cha...@ics.forth.gr: Hi all, I'm using Cassandra 0.8.3 and Thrift for C++ and I can't insert a column in a column family. Starting from an empty keyspace, first I add a new keyspace and then a new column family and that works fine, but I can't insert a column. The code that I have written is:

transport->open();
KsDef ks_def;
ks_def.name = "test_keyspace";
ks_def.replication_factor = 0;
ks_def.strategy_class = "LocalStrategy";
std::string res;
cout << "add keyspace.." << endl;
client.system_add_keyspace(res, ks_def);
client.set_keyspace("test_keyspace");
cout << "add column family.." << endl;
CfDef cf_def;
cf_def.keyspace = "test_keyspace";
cf_def.name = "cf_name_test";
client.system_add_column_family(res, cf_def);
const string key = "test_key";
const string value = "valu_";
ColumnParent cparent;
cparent.column_family = "cf_name_test";
Column c;
c.name = "column_namess";
c.value = value;
c.timestamp = getTS();
cout << "insert key value: " << c.value << endl;
client.insert(key, cparent, c, ConsistencyLevel::ONE);
cout << "drop column family" << endl;
client.system_drop_column_family(res, "cf_name_test");
cout << "drop keyspace" << endl;
client.system_drop_keyspace(res, "test_keyspace");
transport->close();

and I receive the below exception: Default TException. [Column value is required] As you can see from the source code, I have filled in the value of the column. Thank you in advance for your help. Konstantinos Chasapis p.s. please cc me in the reply.
-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Restarting servers
You need to wait for the servers to be up again before restarting the next one. nodetool ring on one of the servers you aren't restarting will tell you when it is back up. You can also watch for Starting up server gossip in the log file to know when it is starting to join the cluster again. On 08/12/2011 01:59 PM, Jason Baker wrote: So restarting cassandra servers has a tendency to cause a lot of exceptions like MaximumRetryException: Retried 6 times. Last failure was UnavailableException() and TApplicationException: Internal error processing batch_mutate (using pycassa). If I restart the servers too quickly, I get all servers unavailable. So two questions: 1. Is there anything I can do to prevent MaximumRetryExceptions and TApplicationExceptions, or is this just a case of needing better exception handling? 2. Are there any rules of thumb regarding how much time I should allow between server restarts?
RE: Write everywhere, read anywhere
If you have RF=3, quorum won't fail with one node down, so R/W quorum will be consistent in the case of one node down. If two nodes go down at the same time, then you can get inconsistent data from a quorum write/read: the write fails with a timeout, the nodes come back up, and then one read asks the two nodes that were down what the value is while another read asks the node that was up and a node that was down. Those two reads will get different answers. From: Mike Malone [mailto:m...@simplegeo.com] Sent: Thursday, August 04, 2011 12:16 PM To: user@cassandra.apache.org Subject: Re: Write everywhere, read anywhere 2011/8/3 Patricio Echagüe patric...@gmail.com On Wed, Aug 3, 2011 at 4:00 PM, Philippe watche...@gmail.com wrote: Hello, I have a 3-node, RF=3 cluster configured to write at CL.ALL and read at CL.ONE. When I take one of the nodes down, writes fail, which is what I expect. When I run a repair, I see data being streamed from those column families... that I didn't expect. How can the nodes diverge? Does this mean that reading at CL.ONE may return inconsistent data? We abort the mutation beforehand when there are enough replicas alive. If a mutation went through and in the middle of it a replica goes down, in that case you can write to some nodes and the request will time out. In that case CL.ONE may return inconsistent data. Doesn't CL.QUORUM suffer from the same problem? There's no isolation or rollback with CL.QUORUM either. So if I do a quorum write with RF=3 and it fails after hitting a single node, a subsequent quorum read could return the old data (if it hits the two nodes that didn't receive the write) or the new data that failed mid-write (if it hits the node that did receive the write). Basically, the scenarios where CL.ALL + CL.ONE result in a read of inconsistent data could also cause a CL.QUORUM write followed by a CL.QUORUM read to return inconsistent data. Right?
The problem (if there is one) is that even in the quorum case, columns with the most recent timestamp win during repair resolution, not columns that have quorum consensus. Mike
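Mike's scenario can be made concrete with a toy model (this models replica state only, not real Cassandra behavior): RF=3, a quorum write reaches a single replica before failing, and each subsequent quorum read returns the newest timestamp among the two replicas it happens to contact.

```python
# Toy model of a failed quorum write followed by two quorum reads.
# (value, timestamp) per replica; newest timestamp wins at read time.
replicas = [("old", 1), ("old", 1), ("old", 1)]
replicas[0] = ("new", 2)  # partial write: only node 0 got it before the failure

def quorum_read(contacted):
    """Resolve a read: the column with the newest timestamp wins."""
    return max(contacted, key=lambda vt: vt[1])[0]

print(quorum_read([replicas[0], replicas[1]]))  # 'new'  (quorum includes node 0)
print(quorum_read([replicas[1], replicas[2]]))  # 'old'  (quorum misses node 0)
```

Two successive quorum reads can thus disagree after a failed write, since resolution is by timestamp rather than by consensus, which is exactly the point of the last paragraph above.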
Re: Cassandra 0.6.8 snapshot problem?
Does snapshot in 0.6 cause a flush to happen first? If not there could be data in the database that won't be in the snapshot. Though that seems like a long time for data to be sitting in the commit log and not make it to the sstables. On Thu, 2011-07-28 at 17:30 -0500, Jonathan Ellis wrote: Doesn't ring a bell. But I'd say if you upgrade and it's still a problem, then (a) you're not _worse_ off than you are now, and (b) it's a lot more likely to get fixed in modern version. On Thu, Jul 28, 2011 at 9:47 AM, Jian Fang jian.fang.subscr...@gmail.com wrote: Hi, We have an old production Cassandra 0.6.8 instance without replica, i.e., the replication factor is 1. Recently, we noticed that the snapshot data we took from this instance are inconsistent with the running instance data. For example, we took snapshot in early July 2011. From the running instance, we got a record that was created in March 2011, but on the snapshot copy, the record with the same key was different and was created in January 2011. Yesterday, we created another snapshot and reproduced the problem. I just like to know if this is a known issue for Cassandra 0.6. We are going to migrate to Cassandra 0.8, but we need to make sure this will not be a problem in 0.8. Thanks in advance, John
Re: RF=1
If you have RF=1, taking one node down is going to cause 25% of your data to be unavailable. If you want to tolerate a machines going down you need to have at least RF=2, if you want to use quorum and have a machine go down, you need at least RF=3. On Tue, 2011-08-02 at 16:22 +0200, Patrik Modesto wrote: Hi all! I've a test cluster of 4 nodes running cassandra 0.7.8, with one keyspace with RF=1, each node owns 25% of the data. As long as all nodes are alive, there is no problem, but when I shut down just one node I get UnavailableException in my application. cassandra-cli returns null and hadoop mapreduce task won't start at all. Loosing one node is not a problem for me, the data are not important, loosing even half the cluster is not a problem as long as everything runs just as with a full cluster. The error from hadoop is like this: Exception in thread main java.io.IOException: Could not get input splits at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSplits(ColumnFamilyInputFormat.java:120) at cz.xxx.yyy.zzz.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:111) at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944) at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961) at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833) at org.apache.hadoop.mapreduce.Job.submit(Job.java:476) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:506) at cz.xxx.yyy.zzz.ContextIndexer.run(ContextIndexer.java:663) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at 
cz.xxx.yyy.zzz.ContextIndexer.main(ContextIndexer.java:94) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:186) Caused by: java.util.concurrent.ExecutionException: java.io.IOException: failed connecting to all endpoints 10.0.18.87 at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSplits(ColumnFamilyInputFormat.java:116) ... 20 more Caused by: java.io.IOException: failed connecting to all endpoints 10.0.18.87 at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSubSplits(ColumnFamilyInputFormat.java:197) at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.access$200(ColumnFamilyInputFormat.java:67) at org.apache.cassandra.hadoop.ColumnFamilyInputFormat$SplitCallable.call(ColumnFamilyInputFormat.java:153) at org.apache.cassandra.hadoop.ColumnFamilyInputFormat$SplitCallable.call(ColumnFamilyInputFormat.java:138) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662)
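The arithmetic behind these RF recommendations can be sketched in plain Python (a hedged illustration, not Cassandra code; `quorum` and `tolerable_failures` are made-up helper names):

```python
def quorum(rf):
    """Replicas that must answer for a QUORUM read or write."""
    return rf // 2 + 1

def tolerable_failures(rf, required_replicas):
    """Nodes in a replica set that can be down while the level still succeeds."""
    return rf - required_replicas

# RF=1: even a read at ONE fails when the single replica is down.
assert tolerable_failures(1, 1) == 0
# RF=2 tolerates one down node at ONE; RF=3 tolerates one at QUORUM.
assert tolerable_failures(2, 1) == 1
assert quorum(3) == 2 and tolerable_failures(3, quorum(3)) == 1
```

With RF=1 each node is the sole owner of its 25% of the ring, which is why one node going down makes exactly that slice unavailable.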
Re: 8 million Cassandra data files on disk
Connect with jconsole and run garbage collection. All of the files that have a -Compacted marker with the same name will get deleted the next time a full garbage collection runs, or when the node is restarted. They have already been combined into new files; the old ones just haven't been deleted yet. On Tue, 2011-08-02 at 16:09 -0400, Yiming Sun wrote: Hi, I am new to Cassandra, and am hoping someone could help me understand the (large amount of small) data files on disk that Cassandra generates. The reason we are using Cassandra is because we are dealing with thousands to millions of small text files on disk, so we are experimenting with Cassandra hoping that by dropping the files' contents into Cassandra, it will achieve more efficient disk usage because Cassandra is going to aggregate them into bigger files (one file per column family, according to the wiki). But after we pushed a subset of the files into a single-node Cassandra v0.7.0 instance, we noted that in the Cassandra data directory for the keyspace, there are 8.5 million very small files, most are named SuperColumnFamilyName-e-n.Filter.db SuperColumnFamilyName-e-n.Compacted.db SuperColumnFamilyName-e-n.Index.db SuperColumnFamilyName-e-n.Statistics.db and among these files, the Compacted.db are always empty, Filter and Index are under 100 bytes, and Statistics are around 4k. What are these files? Why are there so many of them? We originally hoped that Cassandra was going to solve our issue with the small files we have, but now it doesn't seem to help -- we still end up with tons of small files. Is there any way to reduce/combine these small files? Thanks. -- Y.
Re: Nodetool ring not showing all nodes in cluster
All of the nodes should have the same seedlist. Don't use localhost as one of the items in it if you have multiple nodes. On Tue, 2011-08-02 at 10:10 -0700, Aishwarya Venkataraman wrote: Nodetool does not show me all the nodes. Assuming I have three nodes A, B and C. The seedlist of A is localhost. Seedlist of B is localhost, A_ipaddr and seedlist of C is localhost,B_ipaddr,A_ipaddr. I have autobootstrap set to false for all 3 nodes since they all have the correct data and do not have to migrate data from any particular node. My problem here is why doesn't nodetool ring show me all nodes in the ring? I agree that the cluster thinks that only one node is present. How do I fix this? Thanks, Aishwarya On Tue, Aug 2, 2011 at 9:56 AM, samal sa...@wakya.in wrote: ERROR 08:53:47,678 Internal error processing batch_mutate java.lang.IllegalStateException: replication factor (3) exceeds number of endpoints (1) You already answered It always keeps showing only one node and mentions that it is handling 100% of the load. The cluster thinks only one node is present in the ring: RF=3 exceeds the single endpoint it knows about, so it is effectively operating as if RF=1. Original Q: I'm not exactly sure what the problem is. But does nodetool ring show all the hosts? What is your seed list? Does the bootstrapped node have its own IP in its seed list? AFAIK gossip works even without actively joining a ring. On Tue, Aug 2, 2011 at 7:21 AM, Aishwarya Venkataraman cyberai...@gmail.com wrote: Replies inline. Thanks, Aishwarya On Tue, Aug 2, 2011 at 7:12 AM, Sorin Julean sorin.jul...@gmail.com wrote: Hi, Until someone answers with more details, few questions: 1. did you move the system keyspace as well? Yes. But I deleted the LocationInfo* files under the system folder. Shall I go ahead and delete the entire system folder? 2. are the gossip IPs of the new nodes the same as the old ones? No. The IP is different. 3. which cassandra version are you running? I am using 0.8.1 If 1. is yes and 2. 
is no, for a quick fix: take down the cluster, remove system keyspace, bring the cluster up and bootstrap the nodes. Kind regards, Sorin On Tue, Aug 2, 2011 at 2:53 PM, Aishwarya Venkataraman cyberai...@gmail.com wrote: Hello, I recently migrated 400 GB of data that was on a different cassandra cluster (3 node with RF= 3) to a new cluster. I have a 3 node cluster with replication factor set to three. When I run nodetool ring, it does not show me all the nodes in the cluster. It always keeps showing only one node and mentions that it is handling 100% of the load. But when I look at the logs, the nodes are able to talk to each other via the gossip protocol. Why does this happen ? Can you tell me what I am doing wrong ? Thanks, Aishwarya
RE: custom StoragePort?
If you are on linux see: https://github.com/pcmanus/ccm -Original Message- From: Yang [mailto:tedd...@gmail.com] Sent: Monday, July 11, 2011 3:08 PM To: user@cassandra.apache.org Subject: Re: custom StoragePort? never mind, found this.. https://issues.apache.org/jira/browse/CASSANDRA-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel On Mon, Jul 11, 2011 at 12:39 PM, Yang tedd...@gmail.com wrote: I tried to run multiple cassandra daemons on the same host, using different ports, for a test env. I thought this would work, but it turns out that the StoragePort used by OutboundTcpConnection is always assumed to be the one specified in .yaml, i.e. the code assumes that the storage port is the same everywhere. in fact this assumption seems deeply held in many places in the code, so it's a bit difficult to refactor, for example by substituting InetAddress with InetSocketAddress. I am just wondering, do you see any other value to a custom storage port, besides testing? if there is real value, maybe someone more familiar with the code could do the refactoring Thanks yang
RE: Node repair questions
The more often you repair, the quicker it will be. The more often your nodes go down the longer it will be. Repair streams data that is missing between nodes. So the more data that is different the longer it will take. Your workload is impacted because the node has to scan the data it has to be able to compare with other nodes, and if there are differences, it has to send/receive data from other nodes. -Original Message- From: A J [mailto:s5a...@gmail.com] Sent: Monday, July 11, 2011 2:43 PM To: user@cassandra.apache.org Subject: Node repair questions Hello, Have the following questions related to nodetool repair: 1. I know that Nodetool Repair Interval has to be less than GCGraceSeconds. How do I come up with an exact value of GCGraceSeconds and 'Nodetool Repair Interval'. What factors would want me to change the default of 10 days of GCGraceSeconds. Similarly what factors would want me to keep Nodetool Repair Interval to be just slightly less than GCGraceSeconds (say a day less). 2. Does a Nodetool Repair block any reads and writes on the node, while the repair is going on ? During repair, if I try to do an insert, will the insert wait for repair to complete first ? 3. I read that repair can impact your workload as it causes additional disk and cpu activity. But any details of the impact mechanism and any ballpark on how much the read/write performance deteriorates ? Thanks.
RE: Cassandra memory problem
We are running into the same issue on some of our machines. Still haven't tracked down what is causing it. From: William Oberman [mailto:ober...@civicscience.com] Sent: Thursday, July 07, 2011 7:19 AM To: user@cassandra.apache.org Subject: Re: Cassandra memory problem I think I had (and have) a similar problem: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/OOM-or-what-settings-to-use-on-AWS-large-td6504060.html My memory usage grew slowly until I ran out of mem and the OS killed my process (due to no swap). I'm still on 0.7.4, but I'm rolling out 0.8.1 next week, which I was hoping would fix the problem. I'm using Centos with Sun 1.6.0_24-b07 will On Thu, Jul 7, 2011 at 7:41 AM, Daniel Doubleday daniel.double...@gmx.net wrote: Hm - had to dig deeper and it totally looks like a native mem leak to me: We are still growing with res += 100MB a day. Cassandra is 8G now I checked the cassandra process with pmap -x Here's the human readable (aggregated) output: Format is thingy: RSS in KB Summary: Total SST: 1961616 Anon RSS: 6499640 Total RSS: 8478376 Here's a little more detail: SSTables (data and index files) ** Attic: 0 PrivateChatNotification: 38108 Schema: 0 PrivateChat: 161048 UserData: 116788 HintsColumnFamily: 0 Rooms: 100548 Tracker: 476 Migrations: 0 ObjectRepository: 793680 BlobStore: 350924 Activities: 400044 LocationInfo: 0 Libraries ** javajar: 2292 nativelib: 13028 Other ** 28201: 32 jna979649866618987247.tmp: 92 locale-archive: 1492 [stack]: 132 java: 44 ffi8TsQPY(deleted): 8 And ** [anon]: 6499640 Maybe the output of pmap is totally misleading, but my interpretation is that only 2GB of RSS is attributed to paged-in sstables. I have one large anon block which looks like this: Address Kbytes RSS Dirty Mode Mapping 00073f60 0 3093248 3093248 rwx-- [ anon ] This is the native heap that's been allocated on startup and mlocked. So there's still 3.5GB of anon memory. 
We haven't deployed https://issues.apache.org/jira/browse/CASSANDRA-2654 yet and this might be part of it, but I don't think that's the main problem. As I said, mem goes up by 100MB each day pretty linearly. Would be great if anyone could verify this by running pmap, or talk me off the roof by explaining that nothing's the way it seems. All this might be heavily OS specific, so maybe that's only on Debian? Thanks a lot Daniel On Jul 4, 2011, at 2:42 PM, Jonathan Ellis wrote: mmap'd data will be attributed to res, but the OS can page it out instead of killing the process. On Mon, Jul 4, 2011 at 5:52 AM, Daniel Doubleday daniel.double...@gmx.net wrote: Hi all, we have a mem problem with cassandra. res goes up without bounds (well, until the OS kills the process because we don't have swap) I found a thread that's about the same problem but on OpenJDK: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Very-high-memory-utilization-not-caused-by-mmap-on-sstables-td5840777.html We are on Debian with Sun JDK. Resident mem is 7.4G while heap is restricted to 3G. Anyone else seeing this with Sun JDK? Cheers, Daniel :/home/dd# java -version java version 1.6.0_24 Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode) :/home/dd# ps aux |grep java cass 28201 9.5 46.8 372659544 7707172 ? SLl May24 5656:21 /usr/bin/java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms3000M -Xmx3000M -Xmn400M ... PID USER
RE: custom reconciling columns?
The reason to break it up is that the information will then be on different servers, so you can have server 1 spending time retrieving row 1, while you have server 2 retrieving row 2, and server 3 retrieving row 3... So instead of getting 3000 things from one server, you get 1000 from 3 servers in parallel... From: Yang [mailto:tedd...@gmail.com] Sent: Wednesday, June 29, 2011 12:07 AM To: user@cassandra.apache.org Subject: Re: custom reconciling columns? ok, here is the profiling result. I think this is consistent (having been trying to recover how to effectively use yourkit ...) see attached picture since I actually do not use the thrift interface, but just directly use the thrift.CassandraServer and run my code in the same JVM as cassandra, and was running the whole thing on a single box, there is no message serialization/deserialization cost. but more columns did add on to more time. the time was spent in the ConcurrentSkipListMap operations that implement the memtable. regarding breaking up the row, I'm not sure it would reduce my run time, since our requirement is to read the entire rolling window history (we already have the TTL enabled, so the history is limited to a certain length, but it is quite long: over 1000, in some cases 5000 or more). I think accessing roughly 1000 items is not an uncommon requirement for many applications. in our case, each column has about 30 bytes of data, besides the metadata such as ttl, timestamp. at a history length of 3000, the read takes about 12ms (remember this is completely in-memory, no disk access) I just took a look at the expiring column logic; it looks like the expiration does not come into play until CassandraServer.internal_get() -> thriftifyColumns() gets called. so the above memtable access time is still spent. 
yes, then breaking up the row is going to be helpful, but only to the degree of preventing access to expired columns (btw, if this were actually built into the cassandra code it would be nicer: instead of spending multiple key lookups, I locate the row once, and then within the row there are different generation buckets, so those old generation buckets that are beyond expiration are not read); currently just accessing the 3000 live columns is already quite slow. I'm trying to see whether there are some easy magic bullets for a drop-in replacement for ConcurrentSkipListMap... Yang On Tue, Jun 28, 2011 at 4:18 PM, Nate McCall n...@datastax.com wrote: I agree with Aaron's suggestion on data model and query here. Since there is a time component, you can split the row on a fixed duration for a given user, so the row key would become userId_[timestamp rounded to day]. This provides you an easy way to roll up the information for the date ranges you need since the key suffix can be created without a read. This also benefits from spreading the read load over the cluster instead of just the replicas since you have 30 rows in this case instead of one. On Tue, Jun 28, 2011 at 5:55 PM, aaron morton aa...@thelastpickle.com wrote: Can you provide some more info: - how big are the rows, e.g. number of columns and column size ? - how much data are you asking for ? - what sort of read query are you using ? - what sort of numbers are you seeing ? - are you deleting columns or using TTL ? I would consider issues with the data churn, data model and query before looking at serialisation. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 29 Jun 2011, at 10:37, Yang wrote: I can see that as my user history grows, the read time grows proportionally (or faster than linearly). 
if my business requirements ask me to keep a month's history for each user, it could become too slow. I was suspecting that it's actually the serializing and deserializing that's taking time (I can definitely see it's cpu bound) On Tue, Jun 28, 2011 at 3:04 PM, aaron morton aa...@thelastpickle.com wrote: There is no facility to do custom reconciliation for a column. An append style operation would run into many of the same problems as the Counter type, e.g. not every node may get an append and there is a chance for lost appends unless you go to all the trouble Counters do. I would go with using a row for the user and columns for each item. Then you can have fast no-look writes. What problems are you seeing with the reads ? Cheers - Aaron Morton
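Nate's fixed-duration bucketing can be sketched like this (a hypothetical helper; the `userId_[timestamp rounded to day]` key shape follows his description, and the exact `YYYYMMDD` format is my assumption):

```python
from datetime import datetime, timezone

def bucketed_row_key(user_id, epoch_seconds):
    # The suffix is computable from the timestamp alone, so no read is
    # needed before a write, and a month of history maps to ~30 row
    # keys spread across the cluster instead of one hot row.
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y%m%d")
    return f"{user_id}_{day}"

# Two writes on the same UTC day land in the same row bucket.
assert bucketed_row_key("user42", 1309305600) == bucketed_row_key("user42", 1309305600 + 3600)
```

Reading a date range then becomes a multiget over the keys for those days, each of which can be served by a different replica in parallel.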
RE: Cassandra ACID
For your Consistency case, it is actually an ALL read that is needed, not an ALL write. ALL read, with whatever consistency level of write that you need (to support machines dying), is the only way to get consistent results in the face of a failed write which was at ONE that went to one node, but not the others. From: AJ [mailto:a...@dude.podzone.net] Sent: Friday, June 24, 2011 11:28 PM To: user@cassandra.apache.org Subject: Re: Cassandra ACID Ok, here it is reworked; consider it a summary of the thread. If I left out an important point that you think is 100% correct even if you already mentioned it, then make some noise about it and provide some evidence so it's captured sufficiently. And, if you're in a debate, please try and get to a resolution; all will appreciate it. It will be evident below that Consistency is not the only thing that is tunable, at least indirectly. Unfortunately, you still can't tunafish. Ar ar ar. Atomicity All individual writes are atomic at the row level. So, a batch mutate for one specific key will apply updates to all the columns for that one specific row atomically. If part of the single-key batch update fails, then all of the updates will be reverted since they all pertained to one key/row. Notice, I said 'reverted' not 'rolled back'. Note: atomicity and isolation are related to the topic of transactions but one does not imply the other. Even though row updates are atomic, they are not isolated from other users' updates or reads. Refs: http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic Consistency Cassandra does not provide the same scope of Consistency as defined in the ACID standard. Consistency in C* does not include referential integrity since C* is not a relational database. Any referential integrity required would have to be handled by the client. 
Also, even though the official docs say that QUORUM writes/reads is the minimal consistency_level setting to guarantee full consistency, this assumes that the write preceding the read does not fail (see comments below). Therefore, an ALL write would be necessary prior to a QUORUM read of the same data. For a multi-dc scenario use an ALL write followed by an EACH_QUORUM read. Refs: http://wiki.apache.org/cassandra/ArchitectureOverview Isolation NOTHING is isolated, because there is no transaction support in the first place. This means that two or more clients can update the same row at the same time. Their updates of the same or different columns may be interleaved and leave the row in a state that may not make sense depending on your application. Note: this doesn't mean to say that two updates of the same column will be corrupted, obviously; columns are the smallest atomic unit ('atomic' in the more general thread-safe context). Refs: None that directly address this explicitly and clearly and in one place. Durability Updates are made highly durable at a level comparable to a DBMS by the use of the commit log. However, this requires commitlog_sync: batch in cassandra.yaml. For some performance improvement with some cost in durability you can specify commitlog_sync: periodic. See discussion below for more details. Refs: Plenty + this thread. On 6/24/2011 1:46 PM, Jim Newsham wrote: On 6/23/2011 8:55 PM, AJ wrote: Can any Cassandra contributors/guru's confirm my understanding of Cassandra's degree of support for the ACID properties? I provide official references when known. Please let me know if I missed some good official documentation. Atomicity All individual writes are atomic at the row level. So, a batch mutate for one specific key will apply updates to all the columns for that one specific row atomically. If part of the single-key batch update fails, then all of the updates will be reverted since they all pertained to one key/row. 
Notice, I said 'reverted' not 'rolled back'. Note: atomicity and isolation are related to the topic of transactions but one does not imply the other. Even though row updates are atomic, they are not isolated from other users' updates or reads. Refs: http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic Consistency If you want 100% consistency, use consistency level QUORUM for both reads and writes and EACH_QUORUM in a multi-dc scenario. Refs: http://wiki.apache.org/cassandra/ArchitectureOverview This is a pretty narrow interpretation of consistency. In a traditional database, consistency prevents you from getting into a logically inconsistent state, where records in one table do not agree with records in another table. This includes referential integrity, cascading deletes, etc. It seems to me Cassandra has no support for this concept whatsoever.
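The quorum guidance debated in this thread rests on the standard replica-overlap rule: a read is guaranteed to see the latest successful write whenever the read and write replica sets must intersect, i.e. R + W > N. A minimal sketch of that check:

```python
def read_sees_latest_write(n, w, r):
    # Overlap rule: with N replicas, a write acked by W nodes and a
    # read touching R nodes must share at least one replica when
    # R + W > N, so the read set always contains the newest value.
    return r + w > n

# QUORUM writes + QUORUM reads overlap at RF=3 (2 + 2 > 3) ...
assert read_sees_latest_write(3, 2, 2)
# ... but ONE + ONE does not, which is why a partial low-CL write
# can leave readers seeing stale data until repair runs.
assert not read_sees_latest_write(3, 1, 1)
```

Note the rule only covers *successful* writes; the thread's point about a failed ONE write that still reached one node is exactly the case the rule does not cover.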
RE: RAID or no RAID
With multiple data dirs you are still limited by the space free on any one drive. So if you have two data dirs with 40GB free on each, and you have 50GB to be compacted, it won't work, but if you had a raid, you would have 80GB free and could compact... -Original Message- From: mcasandra [mailto:mohitanch...@gmail.com] Sent: Tuesday, June 28, 2011 7:55 PM To: cassandra-u...@incubator.apache.org Subject: Re: RAID or no RAID aaron morton wrote: Not sure what the intended purpose is, but we've mostly used it as an emergency disk-capacity-increase option Thats what I've used it for. Cheers How does compaction work in terms of utilizing multiple data dirs? Also, is there a reference on wiki somewhere that says not to use multiple data dirs? -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/RAID-or-no-RAID-tp6522904p6527219.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
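Jeremiah's free-space point reduces to a one-liner: compaction output must fit within a single data directory, so what matters is the largest single free extent, not the total. A sketch (illustrative only, sizes in GB):

```python
def can_compact(free_gb_per_dir, compaction_output_gb):
    # A compacted sstable is written to one directory, so free space
    # across JBOD data dirs does not pool the way a single RAID
    # volume's free space does.
    return max(free_gb_per_dir) >= compaction_output_gb

assert not can_compact([40, 40], 50)  # two data dirs, 80GB total: fails
assert can_compact([80], 50)          # one RAID volume, 80GB free: works
```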
RE: Docs: Token Selection
Run two Cassandra clusters... -Original Message- From: Eric tamme [mailto:eta...@gmail.com] Sent: Friday, June 17, 2011 11:31 AM To: user@cassandra.apache.org Subject: Re: Docs: Token Selection What I don't like about NTS is I would have to have more replicas than I need. {DC1=2, DC2=2}, RF=4 would be the minimum. If I felt that 2 local replicas was insufficient, I'd have to move up to RF=6 which seems like a waste... I'm predicting data in the TB range so I'm trying to keep replicas to a minimum. My goal is to have 2-3 replicas in a local data center and 1 replica in another dc. I think that would be enough barring a major catastrophe. But, I'm not sure this is possible. I define local as in the same data center as the client doing the insert/update. Yes, not being able to configure the replication factor differently for each data center is a bit annoying. I'm assuming you basically want DC1 to have a replication factor of {DC1:2, DC2:1} and DC2 to have {DC1:1,DC2:2}. I would very much like that feature as well, but I don't know the feasibility of it. -Eric
RE: Docs: Token Selection
Run two clusters, one which has {DC1:2, DC2:1} and one which is {DC1:1,DC2:2}. You can't have both in the same cluster, otherwise it isn't possible to tell where the data got written when you want to read it. For a given key XYZ you must be able to compute which nodes it is stored on just using XYZ, so a strategy where it is on nodes DC1_1,DC1_2, and DC2_1 when a node in DC1 is the coordinator, and to DC1_1, DC2_1 and DC2_2 when a node in DC2 is the coordinator won't work. Given just XYZ I don't know where to look for the data. But, from the way you describe what you want to happen, clients from DC1 aren't using data inserted by clients from DC2, so you should just make two different Cassandra clusters. Once for the DC1 guys which is {DC1:2, DC2:1} and one for the DC2 guys which is {DC1:1,DC2:2}. -Original Message- From: AJ [mailto:a...@dude.podzone.net] Sent: Friday, June 17, 2011 1:02 PM To: user@cassandra.apache.org Subject: Re: Docs: Token Selection Hi Jeremiah, can you give more details? Thanks On 6/17/2011 10:49 AM, Jeremiah Jordan wrote: Run two Cassandra clusters... -Original Message- From: Eric tamme [mailto:eta...@gmail.com] Sent: Friday, June 17, 2011 11:31 AM To: user@cassandra.apache.org Subject: Re: Docs: Token Selection What I don't like about NTS is I would have to have more replicas than I need. {DC1=2, DC2=2}, RF=4 would be the minimum. If I felt that 2 local replicas was insufficient, I'd have to move up to RF=6 which seems like a waste... I'm predicting data in the TB range so I'm trying to keep replicas to a minimum. My goal is to have 2-3 replicas in a local data center and 1 replica in another dc. I think that would be enough barring a major catastrophe. But, I'm not sure this is possible. I define local as in the same data center as the client doing the insert/update. Yes, not being able to configure the replication factor differently for each data center is a bit annoying. 
I'm assuming you basically want DC1 to have a replication factor of {DC1:2, DC2:1} and DC2 to have {DC1:1,DC2:2}. I would very much like that feature as well, but I don't know the feasibility of it. -Eric
RE: Docs: Why do deleted keys show up during range scans?
I am pretty sure how Cassandra works will make sense to you if you think of it that way: rows do not get deleted, columns get deleted. While you can delete a row, if I understand correctly, what happens is a tombstone is created which matches every column, so in effect it is deleting the columns, not the whole row. A row key will not be forgotten/deleted until there are no columns or tombstones which reference it. Until there are no references to that row key in any SSTables you can still get that key back from the API. -Jeremiah -Original Message- From: AJ [mailto:a...@dude.podzone.net] Sent: Monday, June 13, 2011 12:11 PM To: user@cassandra.apache.org Subject: Re: Docs: Why do deleted keys show up during range scans? On 6/13/2011 10:14 AM, Stephen Connolly wrote: store the query inverted. that way empty - deleted I don't know what that means... get the other columns? Can you elaborate? Are there docs for this, or is this a hack/workaround? the tombstones are stored for each column that had data IIRC... but at this point my grok of C* is lacking I suspected this, but wasn't sure. It sounds like when a row is deleted, a tombstone is not attached to the row, but to each column??? So, if all columns are deleted then the row is considered deleted? Hmmm, that doesn't sound right, but that doesn't mean it isn't ! ;o)
RE: Docs: Why do deleted keys show up during range scans?
Also, tombstones are not attached anywhere. A tombstone is just a column with a special value which says I was deleted. And I am pretty sure they go into SSTables etc. the exact same way regular columns do. -Original Message- From: Jeremiah Jordan [mailto:jeremiah.jor...@morningstar.com] Sent: Tuesday, June 14, 2011 11:22 AM To: user@cassandra.apache.org Subject: RE: Docs: Why do deleted keys show up during range scans? I am pretty sure how Cassandra works will make sense to you if you think of it that way, that rows do not get deleted, columns get deleted. While you can delete a row, if I understand correctly, what happens is a tombstone is created which matches every column, so in effect it is deleting the columns, not the whole row. A row key will not be forgotten/deleted until there are no columns or tombstones which reference it. Until there are no references to that row key in any SSTables you can still get that key back from the API. -Jeremiah -Original Message- From: AJ [mailto:a...@dude.podzone.net] Sent: Monday, June 13, 2011 12:11 PM To: user@cassandra.apache.org Subject: Re: Docs: Why do deleted keys show up during range scans? On 6/13/2011 10:14 AM, Stephen Connolly wrote: store the query inverted. that way empty - deleted I don't know what that means... get the other columns? Can you elaborate? Are there docs for this, or is this a hack/workaround? the tombstones are stored for each column that had data IIRC... but at this point my grok of C* is lacking I suspected this, but wasn't sure. It sounds like when a row is deleted, a tombstone is not attached to the row, but to each column??? So, if all columns are deleted then the row is considered deleted? Hmmm, that doesn't sound right, but that doesn't mean it isn't ! ;o)
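The point that a tombstone is "just a column with a special value" implies it reconciles like any other column: by timestamp, last write wins. A toy sketch of that behavior (my own illustration, not Cassandra's actual classes):

```python
from collections import namedtuple

# A cell is a column; a tombstone is the same shape with a delete flag.
Column = namedtuple("Column", "name value timestamp is_tombstone")

def reconcile(a, b):
    # Last-write-wins: the newer cell survives, whether or not it is a
    # tombstone. The row key stays visible while any cell references it.
    return a if a.timestamp >= b.timestamp else b

live = Column("c1", b"hello", 100, False)
dead = Column("c1", None, 200, True)
assert reconcile(live, dead).is_tombstone                        # newer delete wins
assert not reconcile(dead, Column("c1", b"hi", 300, False)).is_tombstone  # later write revives
```

This is also why deleted keys show up in range scans: until the tombstones are purged after gc_grace_seconds, the key still has cells referencing it.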
RE: how to know there are some columns in a row
I am pretty sure this would cut down on network traffic, but not on Disk IO or CPU use. I think Cassandra would still have to deserialize the whole column to get to the name. So if you really have a use case where you just want the name, it would be better to store a separate name-only column with no data. From: Patrick de Torcy [mailto:pdeto...@gmail.com] Sent: Wednesday, June 08, 2011 4:00 AM To: user@cassandra.apache.org Subject: Re: how to know there are some columns in a row There is no reason for ambiguities... We could add in the api another method call (similar to get_count): list<string> get_columnNames(key, column_parent, predicate, consistency_level) Get the column names present in column_parent within the predicate. The method is not O(1). It takes all the columns from disk to calculate the answer. The only benefit of the method is that you do not need to pull all their values over the Thrift interface to get their names (just to get the idea...) In fact column names can really be data in themselves, so there should be a way to retrieve them (without their values). When you have big values, it's a real show stopper to use get_slice, since a lot of unnecessary traffic would be generated... Forgive me if I am a little insistent, but it's important for us and I'm sure we are not the only ones interested in this feature... cheers
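The suggested workaround, writing a value-free companion column alongside each large column so names can be sliced cheaply, can be sketched with plain dicts (hypothetical helper names, not a real API):

```python
data, name_index = {}, {}

def put(row, col, value):
    data[(row, col)] = value
    # Zero-byte companion entry: slicing these yields the column
    # names without pulling the large values over the wire.
    name_index.setdefault(row, set()).add(col)

def column_names(row):
    return sorted(name_index.get(row, ()))

put("r1", "bigblob", "x" * 10_000)
put("r1", "meta", "small")
assert column_names("r1") == ["bigblob", "meta"]
```

In Cassandra terms the "index" would just be a second column family (or a name prefix) holding empty-valued columns, kept in step with the data columns by the client.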
RE: Backups, Snapshots, SSTable Data Files, Compaction
Don't manually delete things. Let Cassandra do it. Force a garbage collection or restart your instance and Cassandra will delete the unused files. -Original Message- From: AJ [mailto:a...@dude.podzone.net] Sent: Tuesday, June 07, 2011 10:15 AM To: user@cassandra.apache.org Subject: Re: Backups, Snapshots, SSTable Data Files, Compaction On 6/7/2011 2:29 AM, Maki Watanabe wrote: You can find useful information in: http://www.datastax.com/docs/0.8/operations/scheduled_tasks sstables are immutable. Once written to disk, they won't be updated. When you take a snapshot, the tool makes hard links to the sstable files. After a certain time, you will have some memtable flushes, so your sstable files will be merged, and obsolete sstable files will be removed. But the snapshot set will remain on your disk, for backup. Thanks for the doc source. I will be experimenting with 0.8.0 since it has many features I've been waiting for. But, still, if the snapshots don't link to all of the previous sets of .db files, then those unlinked previous file sets MUST be safe to manually delete. But, they aren't deleted until later after a GC. It's a bit confusing why they are kept after compaction up until GC when they seem to not be needed. We have Big Data plans... one node can have tens of TBs, so I'm trying to get an idea of how much disk space will be required and whether or not I can free up some disk space. Hopefully someone can still elaborate on this.
RE: Reading quorum
Only waiting for quorum responses, and then resolving the one with the latest timestamp to return to the client. From: Fredrik Stigbäck [mailto:fredrik.l.stigb...@sitevision.se] Sent: Friday, June 03, 2011 9:44 AM To: user@cassandra.apache.org Subject: Reading quorum Does reading quorum mean only waiting for quorum responses, or does it mean quorum responses with the same latest timestamp? Regards /Fredrik
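In other words, the coordinator does not require the quorum to agree; it returns the newest of the answers it collected. A sketch of that resolution step (illustrative, not Cassandra internals):

```python
def resolve(responses):
    # responses: (value, timestamp) pairs from a quorum of replicas.
    # The client gets the value with the highest timestamp; replicas
    # that returned older values are candidates for read repair.
    return max(responses, key=lambda pair: pair[1])

assert resolve([("stale", 100), ("fresh", 200), ("stale", 100)]) == ("fresh", 200)
```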
RE: Loading Keyspace from YAML in 0.8
Or at least someone should write a script which will take a YAML config and turn it into a CLI script. From: Edward Capriolo [mailto:edlinuxg...@gmail.com] Sent: Friday, June 03, 2011 12:00 PM To: user@cassandra.apache.org Subject: Re: Loading Keyspace from YAML in 0.8 On Fri, Jun 3, 2011 at 12:35 PM, Paul Loy ketera...@gmail.com wrote: ugh! On Fri, Jun 3, 2011 at 5:19 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Fri, Jun 3, 2011 at 12:14 PM, Paul Loy ketera...@gmail.com wrote: We embed cassandra in our app. When we first load a cluster, we specify one node in the cluster as the seed node. This node installs the schema using StorageService.instance.loadKeyspacesFromYAML(). This call has disappeared in 0.8. How can we do the same thing in Cassandra 0.8? Thanks, -- - Paul Loy p...@keteracel.com http://uk.linkedin.com/in/paulloy That was only a feature for migration from 0.6.X to 0.7.X. You can use bin/cassandra-cli -f file_with_defs But I would use the methods in thrift such as system_add_keyspace(). Edward -- - Paul Loy p...@keteracel.com http://uk.linkedin.com/in/paulloy Yes, Cassandra is very aggressive with deprecating stuff. However, to be fair, it is clear that StorageService is subject to change at any time. With things like this I personally do not see the harm in letting them hang around for a while. In fact I really think it should be added back, because it makes me wonder what the MANY people going from 0.6.X to 0.8.X are going to do.
RE: Appending to fields
Cassandra handles this by using a different design, you don't append anything. You use the fact that in Cassandra you have dynamic columns and you make a new column every time you want to put more data in. Then when you do finally need to read the data out you read out a slice of columns, not just one column. -Jeremiah -Original Message- From: Marcus Bointon [mailto:mar...@synchromedia.co.uk] Sent: Tuesday, May 31, 2011 2:23 PM To: user@cassandra.apache.org Subject: Appending to fields I'm wondering how cassandra implements appending values to fields. Since (so the docs tell me) there's not really any such thing as an update in Cassandra, I wonder if it falls into the same trap as MySQL does. With a query like update x set y = concat(y, 'a') where id = 1, mysql reads the entire value of y, appends the data, then writes the whole thing back, which unfortunately is an O(n^2) operation. The situation I'm doing this in involves what amount to log files on hundreds of thousands of items, many of which might need updating at once, so they're all simple appends, but it becomes unusably slow very quickly. In MySQL it's just a plain bug as it could optimise this by appending data at a known offset and then bumping up the field length counter, which is back in at least O(n) territory. Does cassandra's design avoid this problem? Marcus
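The append-as-new-column pattern Jeremiah describes can be sketched like this — a plain-Python model of the idea (the dict stands in for a Cassandra row; in a real schema the column names would be TimeUUIDs):

```python
import time
from collections import defaultdict

# Each "append" becomes a brand-new column under the row key, named by a
# timestamp, so nothing is ever read-modify-written: writes stay O(1).
# Reads pull back an ordered slice of columns.
log_rows = defaultdict(dict)   # row_key -> {column_name: value}

def append(row_key, entry, ts=None):
    ts = ts if ts is not None else time.time()
    log_rows[row_key][ts] = entry            # pure insert, no re-read

def read_slice(row_key, start=0.0, end=float("inf")):
    cols = log_rows[row_key]
    return [cols[t] for t in sorted(cols) if start <= t <= end]

append("item-1", "created", ts=1.0)
append("item-1", "shipped", ts=2.0)
print(read_slice("item-1"))                  # prints ['created', 'shipped']
```

This is why Cassandra sidesteps the MySQL concat() trap: the "log" is never stored as one growing value, so there is no whole-value rewrite on each append.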
java.lang.RuntimeException: Cannot recover SSTable with version a (current version f).
Running repair and I am getting this error:
java.lang.RuntimeException: Cannot recover SSTable with version a (current version f).
    at org.apache.cassandra.io.sstable.SSTableWriter.createBuilder(SSTableWriter.java:237)
    at org.apache.cassandra.db.CompactionManager.submitSSTableBuild(CompactionManager.java:938)
    at org.apache.cassandra.streaming.StreamInSession.finished(StreamInSession.java:107)
    at org.apache.cassandra.streaming.IncomingStreamReader.readFile(IncomingStreamReader.java:112)
    at org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:61)
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:91)
The comment by that exception is: // TODO: streaming between different versions will fail: need support for // recovering other versions to provide a stable streaming api This cluster was upgraded 0.6.8 -> 0.7.4 -> 0.7.5. Do I need to run scrub or compact or something to get all the sstables updated to the new version? Jeremiah Jordan Application Developer Morningstar, Inc. Morningstar. Illuminating investing worldwide. +1 312 696-6128 voice jeremiah.jor...@morningstar.com www.morningstar.com This e-mail contains privileged and confidential information and is intended only for the use of the person(s) named above. Any dissemination, distribution, or duplication of this communication without prior written consent from Morningstar is strictly prohibited. If you have received this message in error, please contact the sender immediately and delete the materials from any computer.
RE: Replica data distributing between racks
So we are currently running a 10 node ring in one DC, and we are going to be adding 5 more nodes in another DC. To keep the rings in each DC balanced, should I really calculate the tokens independently and just make sure none of them are the same? Something like: DC1 (RF 5): 1: 0 2: 17014118346046923173168730371588410572 3: 34028236692093846346337460743176821144 4: 51042355038140769519506191114765231716 5: 68056473384187692692674921486353642288 6: 85070591730234615865843651857942052860 7: 102084710076281539039012382229530463432 8: 119098828422328462212181112601118874004 9: 136112946768375385385349842972707284576 10: 153127065114422308558518573344295695148 DC2 (RF 3): 1: 1 (one off from DC1 node 1) 2: 34028236692093846346337460743176821145 (one off from DC1 node 3) 3: 68056473384187692692674921486353642290 (two off from DC1 node 5) 4: 102084710076281539039012382229530463435 (three off from DC1 node 7) 5: 136112946768375385385349842972707284580 (four off from DC1 node 9) Originally I was thinking I should spread the DC2 nodes evenly in between every other DC1 node. Or does it not matter where they are in respect to the DC1 nodes, as long as they fall somewhere after every other DC1 node? So it is DC1-1, DC2-1, DC1-2, DC1-3, DC2-2, DC1-4, DC1-5... -Original Message- From: Jonathan Ellis [mailto:jbel...@gmail.com] Sent: Tuesday, May 03, 2011 9:14 AM To: user@cassandra.apache.org Subject: Re: Replica data distributing between racks Right, when you are computing balanced RP tokens for NTS you need to compute the tokens for each DC independently. On Tue, May 3, 2011 at 6:23 AM, aaron morton aa...@thelastpickle.com wrote: I've been digging into this and was able to reproduce something; not sure if it's a fault, and I can't work on it any more tonight. To reproduce: - 2 node cluster on my mac book - set the tokens as if they were nodes 3 and 4 in a 4-node cluster, e.g. 
node 1 with 85070591730234615865843651857942052864 and node 2 127605887595351923798765477786913079296 - set cassandra-topology.properties to put the nodes in DC1 on RAC1 and RAC2 - create a keyspace using NTS and strategy_options = [{DC1:1}] Inserted 10 rows; they were distributed as - node 1 - 9 rows - node 2 - 1 row I *think* the problem has to do with TokenMetadata.firstTokenIndex(). It often says the closest token to a key is node 1 because in effect... - node 1 is responsible for 0 to 85070591730234615865843651857942052864 - node 2 is responsible for 85070591730234615865843651857942052864 to 127605887595351923798765477786913079296 - AND node 1 does the wrap around from 127605887595351923798765477786913079296 to 0 as keys that would insert past the last token in the ring array wrap to 0 because insertMin is false. Thoughts ? Aaron On 3 May 2011, at 10:29, Eric tamme wrote: On Mon, May 2, 2011 at 5:59 PM, aaron morton aa...@thelastpickle.com wrote: My bad, I missed the way TokenMetadata.ringIterator() and firstTokenIndex() work. Eric, can you show the output from nodetool ring ? Sorry if the previous paste was way too unformatted, here is a pastie.org link with nicer formatting of nodetool ring output than plain text email allows. http://pastie.org/private/50khpakpffjhsmgf66oetg -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
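Jeremiah's per-DC token arithmetic above can be checked with a few lines of Python — a sketch of the math only, where the +1 bump mirrors the offsets in his list (only the colliding token needs bumping):

```python
# Balanced RandomPartitioner tokens are computed independently per DC,
# then any DC2 token that collides with a DC1 token is bumped by 1.
RING = 2 ** 127

def balanced_tokens(n):
    step = RING // n                       # ideal spacing for an n-node ring
    return [i * step for i in range(n)]

dc1 = balanced_tokens(10)                  # existing 10-node DC
dc2 = balanced_tokens(5)                   # new 5-node DC
taken = set(dc1)
dc2 = [t + 1 if t in taken else t for t in dc2]   # only token 0 collides

print(dc1[1])   # 17014118346046923173168730371588410572  (DC1 node 2)
print(dc2[0])   # 1                                        (DC2 node 1)
print(dc2[1])   # 34028236692093846346337460743176821145  (DC2 node 2)
```

These reproduce the values in the message above, which supports Jonathan's point: with NTS each DC is its own balanced ring, so placement relative to the other DC's nodes doesn't matter beyond avoiding exact token collisions.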
RE: best way to backup
The files inside the keyspace folders are the SSTables. From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Friday, April 29, 2011 4:49 PM To: user@cassandra.apache.org Subject: Re: best way to backup William, Some info on the sstables from me http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/ If you want to know more check out the BigTable and original Facebook papers, linked from the wiki http://wiki.apache.org/cassandra/ArchitectureOverview Aaron On 29 Apr 2011, at 23:43, William Oberman wrote: Dumb question, but referenced twice now: which files are the SSTables and why is backing them up incrementally a win? Or should I not bother to understand internals, and instead just roll with the backup my keyspace(s) and system in a compressed tar strategy, as while it may be excessive, it's guaranteed to work and work easily (which I like, a great deal). will On Fri, Apr 29, 2011 at 4:58 AM, Daniel Doubleday daniel.double...@gmx.net wrote: What we are about to set up is a time machine like backup. This is more like an add on to the s3 backup. Our boxes have an additional larger drive for local backup. We create a new backup snapshot every x hours which hardlinks the files in the previous snapshot (a bit like Cassandra's incremental_backups feature) and then we sync that snapshot dir with the cassandra data dir. We can do archiving / backup to external system from there without impacting the main data raid. But the main reason to do this is to have an 'omg we screwed up big time and deleted / corrupted data' recovery. On Apr 28, 2011, at 9:53 PM, William Oberman wrote: Even with N-nodes for redundancy, I still want to have backups. I'm an amazon person, so naturally I'm thinking S3. Reading over the docs, and messing with nodeutil, it looks like each new snapshot contains the previous snapshot as a subset (and I've read how cassandra uses hard links to avoid excessive disk use). 
When does that pattern break down? I'm basically debating if I can do a rsync like backup, or if I should do a compressed tar backup. And I obviously want multiple points in time. S3 does allow file versioning, if a file or file name is changed/reused over time (only matters in the rsync case). My only concerns with compressed tars is I'll have to have free space to create the archive and I get no delta space savings on the backup (the former is solved by not allowing the disk space to get so low and/or adding more nodes to bring down the space, the latter is solved by S3 being really cheap anyways). -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
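Daniel's "time machine" scheme above can be sketched with stdlib Python — a simplified model (like `rsync --link-dest` or `cp -al`), assuming SSTable immutability so that a same-named file in the previous snapshot is safe to hard-link; the file names are hypothetical:

```python
import os
import shutil
import tempfile
from pathlib import Path

# Each new snapshot directory hard-links the previous snapshot's files,
# so unchanged SSTables cost no extra disk space; only newly flushed
# files are actually copied.
def take_snapshot(data_dir, prev_snap, new_snap):
    os.makedirs(new_snap)
    for name in os.listdir(data_dir):
        prev = os.path.join(prev_snap, name) if prev_snap else None
        if prev and os.path.exists(prev):
            os.link(prev, os.path.join(new_snap, name))    # reuse, zero cost
        else:
            shutil.copy2(os.path.join(data_dir, name),
                         os.path.join(new_snap, name))     # new file

base = tempfile.mkdtemp()
data = os.path.join(base, "data")
os.makedirs(data)
Path(data, "a-Data.db").write_text("sstable a")
take_snapshot(data, None, os.path.join(base, "snap1"))
Path(data, "b-Data.db").write_text("sstable b")
take_snapshot(data, os.path.join(base, "snap1"), os.path.join(base, "snap2"))
print(sorted(os.listdir(os.path.join(base, "snap2"))))   # ['a-Data.db', 'b-Data.db']
```

Each snapshot directory is a complete point-in-time view, yet `a-Data.db` exists on disk only once — which is the delta-space win William is weighing against compressed tars.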
Changing replica placement strategy
If I am currently only running with one data center, can I change the replica_placement_strategy from org.apache.cassandra.locator.RackUnawareStrategy to org.apache.cassandra.locator.NetworkTopologyStrategy without issue? We are planning to add another data center in the near future and want to be able to use NetworkTopologyStrategy. I am pretty sure RackUnawareStrategy and NetworkTopologyStrategy pick the same nodes to put data on if there is only one DC, so it should be ok right? Jeremiah Jordan Application Developer Morningstar, Inc.
Link to Hudson on the download page is broken
The Apache Hudson server address needs to be updated on the download page, it is now: https://builds.apache.org The link to the latest builds from the download page: http://cassandra.apache.org/download/ Needs to be updated from: http://hudson.zones.apache.org/hudson/job/Cassandra/lastSuccessfulBuild/artifact/cassandra/build/ To: https://builds.apache.org/hudson/job/Cassandra/lastSuccessfulBuild/artifact/cassandra/build/ The old link doesn't work anymore. -Jeremiah Jeremiah Jordan Application Developer Morningstar, Inc.
RE: Abnormal memory consumption
Connect with jconsole and watch the memory consumption graph. Click the force GC button and watch what the low point is; that is how much memory is being used for persistent stuff, the rest is garbage generated while satisfying queries. Run a query and watch how the graph spikes up; that is how much is needed for the query. Like others have said, Cassandra isn't using 600 MB of RAM, the Java Virtual Machine is using 600 MB of RAM, because your settings told it it could. The JVM will use as much memory as your settings allow it to. If you really are putting that little data into your test server, you should be able to tune everything down to only 256 MB easily (I do this for test instances of Cassandra that I spin up to run some tests on), maybe further. -Jeremiah From: openvictor Open [mailto:openvic...@gmail.com] Sent: Wednesday, April 06, 2011 7:59 PM To: user@cassandra.apache.org Subject: Re: Abnormal memory consumption Hello Paul, Thank you for the tip. The random port attribution policy of JMX was really making me mad ! Good to know there is a solution for that problem. Concerning the rest of the conversation, my only concern is that as an administrator and a student it is hard to constantly watch Cassandra instances so that they don't crash. As much as I love the principle of Cassandra, being constantly afraid of memory consumption is an issue in my opinion. That being said, I took a new 16 GB server today, but I don't want Cassandra to eat up everything if it is not needed, because Cassandra will have some neighbors such as Tomcat and Solr on this server. And for me it is very weird that on my small instance where I put a lot of constraints like memtable_throughput_in_mb to 6, Cassandra uses 600 MB of RAM for 6 MB of data. It seems to be a little bit of an overkill to me... And so far I failed to find any information on what this massive overhead can be... Thank you for your answers and for taking the time to answer my questions. 
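For reference, the kind of heap cap Jeremiah describes for a small test node is set in conf/cassandra-env.sh (0.7-era layout; treat the exact values as an illustration to tune for your own workload):

```shell
# conf/cassandra-env.sh — cap the JVM heap for a small test instance.
# By default Cassandra sizes the heap from system RAM; overriding both
# variables pins it down so the JVM cannot balloon to 600 MB+.
MAX_HEAP_SIZE="256M"
HEAP_NEWSIZE="64M"
```

With both variables set, the JVM's -Xms/-Xmx are fixed at 256 MB, so what jconsole shows after a forced GC reflects real live data rather than whatever the default sizing allowed.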
2011/4/6 Paul Choi paulc...@plaxo.com You can use JMX over ssh by doing this: http://blog.reactive.org/2011/02/connecting-to-cassandra-jmx-via-ssh.html Basically, you use SSH -D to do dynamic application port forwarding. In terms of scaling, you'll be able to afford 120 GB RAM/node in 3 years if you're successful. Or, a machine with much less RAM and flash-based storage. :) Seriously, though, the formula in the tuning guidelines is a guideline. You can probably get acceptable performance with much less. If not, you can shard your app such that you host a few CFs per cluster. I doubt you'll need to though. From: openvictor Open openvic...@gmail.com Reply-To: user@cassandra.apache.org Date: Mon, 4 Apr 2011 18:24:25 -0400 To: user@cassandra.apache.org Subject: Re: Abnormal memory consumption Okay, I see. But isn't there a big issue for scaling here? Imagine that I am the developer of a certain very successful website: At year 1 I need 20 CF. I might need to have 8 GB of RAM. Year 2 I need 50 CF because I added functionalities to my wonderful website; will I need 20 GB of RAM? And if at year three I had 300 column families, will I need 120 GB of RAM per node? Or did I miss something about memory consumption? Thank you very much, Victor 2011/4/4 Peter Schuller peter.schul...@infidyne.com And about the production 7 GB of RAM is sufficient ? Or 11 GB is the minimum ? Thank you for your inputs for the JVM I'll try to tune that Production mem reqs are mostly dependent on memtable thresholds: http://www.datastax.com/docs/0.7/operations/tuning If you enable key caching or row caching, you will have to adjust accordingly as well. -- / Peter Schuller
Secondary Index keeping track of column names
In 0.7.X is there a way to have an automatic secondary index which keeps track of what keys contain a certain column? Right now we are keeping track of this manually, so we can quickly get all of the rows which contain a given column, it would be nice if it was automatic. -Jeremiah Jeremiah Jordan Application Developer Morningstar, Inc.
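The manual bookkeeping described above amounts to maintaining an inverted index alongside the data — a sketch of the idea in plain Python (dicts stand in for column families; names are illustrative):

```python
from collections import defaultdict

# On every insert, also record the row key under an "index row" named for
# the column, so "which rows contain column X?" becomes a single lookup.
# This is exactly the bookkeeping a built-in index would automate.
data = defaultdict(dict)       # row_key -> {column: value}
col_index = defaultdict(set)   # column  -> {row_keys containing it}

def insert(row_key, column, value):
    data[row_key][column] = value
    col_index[column].add(row_key)     # the extra manual write

def rows_with_column(column):
    return sorted(col_index[column])

insert("row1", "price", 10)
insert("row2", "price", 12)
insert("row2", "volume", 7)
print(rows_with_column("price"))       # prints ['row1', 'row2']
```

The cost is the second write per insert (and cleanup on delete), which is why having Cassandra maintain it automatically would be attractive.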
Thrift version
Anyone know if 0.7.4 will work with thrift 0.6? Or do I have to keep thrift 0.5 around to use it? Thanks! Jeremiah Jordan Application Developer Morningstar, Inc.