Re: Storing big objects into columns
In a project I would like to store big objects in columns, serialized: for example entire images (several KB to several MB), flash animations (several MB), etc. Does someone use Cassandra with such relatively big columns, and if so, does it work well? Are there any drawbacks to this method? Not in production, but I've done testing with values on the order of a few megs. Expect compaction to be entirely disk bound rather than CPU bound. Make sure latency is acceptable even when data sizes grow beyond memory size. -- / Peter Schuller
Re: Usage Pattern : "unique" value of a key.
You're right when you say it's unlikely that two threads will have the same timestamp, but it can happen. So it could work for user creation, but maybe not for a more write-intensive problem. Um, sorry, I thought you were solving exactly the case of duplicate user creation. If you're trying to solve concurrent updates to Cassandra in general, consider using ZooKeeper. By the way, the lock algorithm in ZooKeeper is very much like the one you described - but ZooKeeper is the right tool for this job. Moreover, we cannot rely on fully time-synchronized nodes in the cluster (only on nodes synchronized to within a few ms), so a second node could theoretically write a smaller timestamp after the first node. This is not a problem - that node will simply lose the race, since Cassandra will ignore updates with a timestamp older than the timestamp of the current value. An even worse case could be the one illustrated here (http://noisette.ch/cassandra/cassandra_unique_key_pattern.png): nodes are synchronized, but something goes wrong (slow) during the write, so both nodes think the key belongs to them. So my idea of writing a lock is not well suited as-is; try the following modification: when either user performs write{K, lock A}, it passes the timestamp recorded earlier, at the moment of performing the very first read of K. So the scenario for user A is:
1. record the current timestamp from the machine clock - T1
2. read K; K does not exist
3. perform write{K, lock A, timestamp = T1}
3.1 Cassandra sees no current value in the memtable for K, so the write succeeds. Cassandra records the timestamp of the value K,A to be T1
4. read K, and compare the lock to be A (for your original solution) or the returned data timestamp == T1 (for the variant proposed by me)
Then the user B scenario would be:
1. record the current timestamp from the machine clock. Its value is T0, which is earlier than T1.
2. read K; K does not exist
3. slowness on: pause for a couple of (milli)seconds, GCing or drinking coffee, so user A executes its scenario above
4. slowness off: perform write{K, lock B, timestamp = T0}
4.1 on the Cassandra side, this write will be ignored, because the current timestamp of K is T1, which is later than T0
5. read K, and see lock == A instead of B (in your original solution), or timestamp != T0 (in mine)
6. user B understands it has lost the race - do something with it.
Of course this scenario will work only if clocks are synced and the variation between the machine clocks of users A and B is much less than the duration of the write-read round trip. In practice this means you'll need to introduce delays of 50-100 ms between the write and the last read on both users. So this could work if a 100 ms delay is acceptable. Otherwise use ZooKeeper - at least until Cassandra has version vector support implemented. Umpf, that was a long story ;-)
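To make the read-back check above concrete, here is a small self-contained Java sketch that simulates Cassandra's last-write-wins timestamp rule with a plain in-memory map. The class and method names are mine, not a real client API; with a real cluster the write/read steps would go through your Thrift client instead.

import java.util.concurrent.ConcurrentHashMap;

// Minimal model of the scheme above. The map stands in for a Cassandra
// column; merge() mimics last-write-wins by client-supplied timestamp.
public class TimestampLockSketch {

    static final class Versioned {
        final String owner; final long timestamp;
        Versioned(String owner, long timestamp) { this.owner = owner; this.timestamp = timestamp; }
    }

    final ConcurrentHashMap<String, Versioned> store = new ConcurrentHashMap<>();

    // Write with an explicit timestamp; writes carrying older timestamps lose.
    void write(String key, String owner, long timestamp) {
        store.merge(key, new Versioned(owner, timestamp),
                (old, neu) -> neu.timestamp > old.timestamp ? neu : old);
    }

    public static void main(String[] args) {
        TimestampLockSketch c = new TimestampLockSketch();
        long t0 = 100, t1 = 101; // user B sampled its clock just before user A

        // Both users read K and see nothing; then the writes race.
        c.write("K", "A", t1); // user A: write{K, lock A, T1} - accepted
        c.write("K", "B", t0); // user B: write{K, lock B, T0} - ignored, T0 < T1

        // Steps 5/6: each user reads back and checks owner and timestamp.
        Versioned cur = c.store.get("K");
        System.out.println("lock owner = " + cur.owner);       // A
        System.out.println("A won: " + (cur.timestamp == t1)); // true
        System.out.println("B won: " + (cur.timestamp == t0)); // false, B backs off
    }
}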
RE: about the data directory
Thanks very much -Original Message- From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller Sent: Friday, January 14, 2011 4:40 PM To: user@cassandra.apache.org Subject: Re: about the data directory As an administrator, I want to know why I can read the data from any node, since the data is only kept on the replicas. Can you tell me? Thanks in advance. It's part of the point of Cassandra. You talk to the cluster, period. It's Cassandra's job to keep track of where data lives, and client applications don't care. This is a fundamental design goal. -- / Peter Schuller
Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?
It's possible that I am misunderstanding the question in some way. The row keys can be time UUIDs, and with those row keys as column names, you can use the comparator TimeUUIDType to have them sorted by time automatically. On Fri, Jan 14, 2011 at 9:18 AM, Aaron Morton aa...@thelastpickle.com wrote: You could make the time a fixed-width integer and prefix your row keys with it, then set the comparator to ascii or utf. Some issues: - Will you have time collisions? - Not sure what you are storing in the super columns, but there are limitations: http://wiki.apache.org/cassandra/CassandraLimitations - If you are using Cassandra 0.7, have you looked at the secondary indexes? http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes If you provide some more info on the problem you're trying to solve we may be able to help some more. Cheers Aaron On 14 Jan, 2011, at 04:27 PM, Aklin_81 asdk...@gmail.com wrote: I would like to keep the references of other rows as names of super columns and sort those super columns according to time. Is there any way I could implement that? Thanks in advance! -- Roshan Blog: http://roshandawrani.wordpress.com/ Twitter: @roshandawrani http://twitter.com/roshandawrani Skype: roshandawrani
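If it helps to see how such time-sortable names can be minted, below is a rough, self-contained Java sketch of building a version 1 (time-based) UUID from the wall clock, following the RFC 4122 bit layout; TimeUUIDType compares these by their embedded timestamp. In practice the helper shipped with your client library would normally do this - the code here is only an illustration.

import java.security.SecureRandom;
import java.util.UUID;

// Build a version 1 (time-based) UUID from the current wall clock.
public class TimeUuidSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    public static UUID timeUuidNow() {
        // 100-nanosecond intervals since the UUID epoch (1582-10-15).
        long t = System.currentTimeMillis() * 10000L + 0x01B21DD213814000L;
        long msb = (t & 0xFFFFFFFFL) << 32        // time_low
                 | ((t >>> 32) & 0xFFFFL) << 16   // time_mid
                 | 0x1000L                        // version 1
                 | ((t >>> 48) & 0x0FFFL);        // time_hi
        // Random clock sequence and node, with the RFC 4122 variant bits set.
        long lsb = (RANDOM.nextLong() & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID u = timeUuidNow();
        System.out.println(u + " version=" + u.version()); // version=1
    }
}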
Re: Timeout Errors while running Hadoop over Cassandra
The Cassandra logs strangely show no errors at the time of failure. Changing RpcTimeoutInMillis seemed to help. Though it slowed the job down considerably, it seems to be finishing with the timeout value raised to 1 minute. Unfortunately, I cannot be sure it will continue to work if the data increases further. Hopefully we will be upgrading to the recently released final version of 0.7.0. Thanks for all the help and suggestions. Warm regards, Jairam Chandar On 13/01/2011 14:47, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: On Jan 12, 2011, at 12:40 PM, Jairam Chandar wrote: Hi folks, We have a Cassandra 0.6.6 cluster running in production. We want to run Hadoop (version 0.20.2) jobs over this cluster in order to generate reports. I modified the word_count example in the contrib folder of the Cassandra distribution. While the program runs fine for small datasets (on the order of 100-200 MB) on small clusters (2 machines), it starts to give errors when run on a bigger cluster (5 machines) with a much larger dataset (400 GB). Here is the error that we get:

java.lang.RuntimeException: TimedOutException()
 at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:186)
 at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:236)
 at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:104)
 at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
 at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
 at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:98)
 at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
 at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: TimedOutException()
 at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11094)
 at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:628)
 at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:602)
 at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:164)
 ... 11 more

I wonder if messing with RpcTimeoutInMillis in storage-conf.xml would help. I came across this page on the Cassandra wiki - http://wiki.apache.org/cassandra/HadoopSupport - and tried modifying the ulimit and changing batch sizes. These did not help. Though the number of successful map tasks increased, it eventually fails since the total number of map tasks is huge. Any idea on what could be causing this? The program we are running is a very slight modification of the word_count example with respect to reading from Cassandra, the only change being the specific keyspace, column family and columns. The rest of the reading code is the same as the word_count example in the source code for Cassandra 0.6.6. Thanks and regards, Jairam Chandar
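For reference, the setting discussed above is a top-level element of the 0.6-era storage-conf.xml. A value matching the '1 min' figure mentioned here would look like the following (the 60000 is purely an example):

<!-- How long (in ms) a coordinator waits on replicas before throwing TimedOutException -->
<RpcTimeoutInMillis>60000</RpcTimeoutInMillis>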
Different comparator types for column and supercolumn don't work
Hello, I'm new to Cassandra. I'm using Cassandra release 0.7.0 (local, single node). I can't perform write operations when the column and supercolumn families have different comparator types. For example, if I run in the CLI the commands given by Jonathan Ellis in issue https://issues.apache.org/jira/browse/CASSANDRA-1712, I get the following output:

[default@Keyspace1] create keyspace KS1
8bb2fc2d-1fcb-11e0-add0-a9c93d38c544
[default@Keyspace1] use KS1
Authenticated to keyspace: KS1
[default@KS1] create column family CFCli with column_type='Super' and comparator='LongType' and subcomparator='UTF8Type'
97742bbe-1fcb-11e0-add0-a9c93d38c544
[default@KS1] set CFCli['newrow'][1234567890]['column'] = 'value'
'column' could not be translated into a LongType.

I also tried a setup with the example keyspace included with the release (loaded via the StorageService bean's loadSchemaFromYAML method):

ColumnFamily: Super3 (Super)
 A column family with supercolumns, whose column names are Longs (8 bytes)
 Columns sorted by: org.apache.cassandra.db.marshal.LongType/org.apache.cassandra.db.marshal.BytesType
 Subcolumns sorted by: org.apache.cassandra.db.marshal.LongType
 Row cache size / save period: 0.0/0
 Key cache size / save period: 20.0/3600
 Memtable thresholds: 0.2953125/63/60
 GC grace seconds: 864000
 Compaction min/max thresholds: 4/32

CLI output:

[default@Keyspace1] set Super3['account_value']['1:1'][1234567890] = 'value1'
A long is exactly 8 bytes: 3
[default@Keyspace1] set Super3['account_value'][1234567890]['test'] = 'value1'
'test' could not be translated into a LongType.
[default@Keyspace1] set Super3['account_value'][1234567890][1234567890] = 'value1'
A long is exactly 8 bytes: 10
[default@Keyspace1] set Super3[1234567890][1234567890][1234567890] = 'value1'
Syntax error at position 11: mismatched input '1234567890' expecting set null
[default@Keyspace1] set Super3['account_value']['test'][1234567890] = 'value1'
A long is exactly 8 bytes: 4
[default@Keyspace1] set Super3[1234567890]['test']['column'] = 'value1'
Syntax error at position 11: mismatched input '1234567890' expecting set null

According to the CLI help the format is: set cf['key']['super']['col'] = value, thus the errors generated seem weird to me. What am I doing wrong? Thanks in advance, Kind regards, Karin
Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?
@Roshan Yes, I thought about that, but then I wouldn't be able to use the RandomPartitioner. @Aaron Do you mean like this: 'timeUUID + row_key' as the supercolumn names? Then, when retrieving the row_key from such a column name, will I be required to parse the name? How do I do that exactly? Some issues: - Will you have time collisions? Mostly I won't be having time collisions; if they happen in 1% of cases, I don't mind. - Not sure what you are storing in the super columns, but there are limitations. I would be storing a maximum of 5 subcolumns inside and would be retrieving them all together. - If you are using Cassandra 0.7, have you looked at the secondary indexes? Yes I did, but I think they are not helpful in my case. This is what I am trying to do (this is from an older post that I made earlier on the mailing list):

I am working on a project of a questions/answers forum that allows a user to follow questions on certain topics from his followees. I want to build the user's news feed so that it comprises only those questions that have been posted by his followees and are tagged with topics that he is following. A simple news-feed design that shows all the posts from the network would be easy to build with Cassandra by executing fast writes to all followers of a user about the post from the user. But for my application, there is an additional filter of 'followed topics' (i.e., the user receives posts created by his followees on topics the user is following). I was thinking of implementing it this way: initially write to all followers the postID of posts from their network, by adding a supercolumn to the rows of all followers in the News-feed supercolumn family, with the supercolumn name as a timestamp (for sorting by time) and 5 subcolumns containing the topic tags of that post. At read time, compare the subcolumn values with the topics the user is following; if they match, show the post. (I would be required to fetch the list of followed topics of the user at read time - should I therefore store the topic list as a supercolumn in this News-feed supercolumn family as well?) An important point to note: often the posts will have zero subcolumns, which would mean the post has to be shown without validating against the user's list of followed topics. There is another view for the users which allows them to see all the posts from their followees (without topic filters). In that case no checking of subcolumns for topics will be performed. I got good insights from Tyler on this, but he was recommending an approach which, although beneficial for read performance, denormalizes heavily - something like 70-80x. I currently fear that approach and would like to test this one. Any comments or feedback greatly appreciated... thanks so much!

On 1/14/11, Roshan Dawrani roshandawr...@gmail.com wrote: It's possible that I am misunderstanding the question in some way. The row keys can be time UUIDs, and with those row keys as column names, you can use the comparator TimeUUIDType to have them sorted by time automatically. On Fri, Jan 14, 2011 at 9:18 AM, Aaron Morton aa...@thelastpickle.com wrote: You could make the time a fixed-width integer and prefix your row keys with it, then set the comparator to ascii or utf. Some issues: - Will you have time collisions? - Not sure what you are storing in the super columns, but there are limitations: http://wiki.apache.org/cassandra/CassandraLimitations - If you are using Cassandra 0.7, have you looked at the secondary indexes?
http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes If you provide some more info on the problem you're trying to solve we may be able to help some more. Cheers Aaron On 14 Jan, 2011, at 04:27 PM, Aklin_81 asdk...@gmail.com wrote: I would like to keep the references of other rows as names of super columns and sort those super columns according to time. Is there any way I could implement that? Thanks in advance! -- Roshan Blog: http://roshandawrani.wordpress.com/ Twitter: @roshandawrani http://twitter.com/roshandawrani Skype: roshandawrani
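On the 'timeUUID + row_key' parsing question above: one hand-rolled encoding (not a built-in Cassandra feature) is to concatenate the 16 raw UUID bytes with the key bytes, then split at byte 16 when reading. Note that under a BytesType comparator such names sort byte-wise rather than strictly by time, so Aaron's fixed-width timestamp prefix may sort more predictably. A sketch, with all names my own:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Pack a time UUID and a row key into one supercolumn name, and unpack it.
// The UUID part is always 16 bytes, so the split point is fixed.
public class CompositeNameSketch {

    static byte[] compose(UUID timeUuid, String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(16 + key.length);
        buf.putLong(timeUuid.getMostSignificantBits());
        buf.putLong(timeUuid.getLeastSignificantBits());
        buf.put(key);
        return buf.array();
    }

    static UUID uuidPart(byte[] name) {
        ByteBuffer buf = ByteBuffer.wrap(name);
        return new UUID(buf.getLong(), buf.getLong());
    }

    static String keyPart(byte[] name) {
        return new String(name, 16, name.length - 16, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        UUID u = UUID.randomUUID(); // stand-in; use a time-based UUID in practice
        byte[] name = compose(u, "post123");
        System.out.println(uuidPart(name).equals(u)); // true
        System.out.println(keyPart(name));            // post123
    }
}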
Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?
On Fri, Jan 14, 2011 at 7:15 PM, Aklin_81 asdk...@gmail.com wrote: @Roshan Yes, I thought about that, but then I wouldn't be able to use the Random Partitioner. Can you please expand a bit on this? What is this restriction? Can you point me to some relevant documentation on this? Thanks.
Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?
I am not sure, but I guess all the rows of a certain time range will go to just one node and will not be evenly distributed, because the timeUUID will not be random but sequential according to time... I am not sure anyway... On Fri, Jan 14, 2011 at 7:18 PM, Roshan Dawrani roshandawr...@gmail.com wrote: On Fri, Jan 14, 2011 at 7:15 PM, Aklin_81 asdk...@gmail.com wrote: @Roshan Yes, I thought about that, but then I wouldn't be able to use the Random Partitioner. Can you please expand a bit on this? What is this restriction? Can you point me to some relevant documentation on this? Thanks.
Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?
I too believed so, but am not totally sure. On 1/14/11, Rajkumar Gupta rajkumar@gmail.com wrote: I am not sure, but I guess all the rows of a certain time range will go to just one node and will not be evenly distributed, because the timeUUID will not be random but sequential according to time... I am not sure anyway... On Fri, Jan 14, 2011 at 7:18 PM, Roshan Dawrani roshandawr...@gmail.com wrote: On Fri, Jan 14, 2011 at 7:15 PM, Aklin_81 asdk...@gmail.com wrote: @Roshan Yes, I thought about that, but then I wouldn't be able to use the Random Partitioner. Can you please expand a bit on this? What is this restriction? Can you point me to some relevant documentation on this? Thanks.
Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?
I am not clear what you guys are trying to do and say :-) So, let's take some specifics... Say you want to create rows in some column family (say CF_A), and as you create them, you want to store their row keys as column names in some other column family (say CF_B) - possibly for filtering keys based on time later, etc... Now your rows in CF_A may be keyed on a TimeUUID, and if you store these keys as column names in CF_B, which has its comparator as TimeUUID, then you get your column names time-sorted automatically. Now CF_A may be split across nodes - is that of any concern to you? Are you expecting any storage relationship between the column names of CF_B and the rows of CF_A? rgds, Roshan On Fri, Jan 14, 2011 at 7:58 PM, Aklin_81 asdk...@gmail.com wrote: I too believed so, but am not totally sure. On 1/14/11, Rajkumar Gupta rajkumar@gmail.com wrote: I am not sure, but I guess all the rows of a certain time range will go to just one node and will not be evenly distributed, because the timeUUID will not be random but sequential according to time... I am not sure anyway...
Problem starting Cassandra on Ubuntu
Hi, I just installed Cassandra on Ubuntu using the package manager, but I cannot start it. I get the following error in the logs:

INFO [main] 2011-01-14 15:37:49,758 AbstractCassandraDaemon.java (line 74) Heap size: 1051525120/1051525120
WARN [main] 2011-01-14 15:37:49,826 CLibrary.java (line 73) Obsolete version of JNA present; unable to read errno. Upgrade to JNA 3.2.7 or later
WARN [main] 2011-01-14 15:37:49,827 CLibrary.java (line 73) Obsolete version of JNA present; unable to read errno. Upgrade to JNA 3.2.7 or later
WARN [main] 2011-01-14 15:37:49,827 CLibrary.java (line 105) Unknown mlockall error 0
INFO [main] 2011-01-14 15:37:49,841 DatabaseDescriptor.java (line 121) Loading settings from file:/etc/cassandra/cassandra.yaml
ERROR [main] 2011-01-14 15:37:49,965 DatabaseDescriptor.java (line 388) Fatal error: null; mapping values are not allowed here
Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?
I just read that Cassandra internally creates an MD5 hash that is used for distributing the load, by sending each key to the node responsible for the range within which its MD5 hash falls. So even when we create sequential keys, their MD5 hashes are not sequential, and hence they are not sent to the same node. This was my misunderstanding of the concept; sorry for creating confusion! So... with this, I think I will be able to use a timeUUID as the row key!? Aaron, if you could kindly share your views on my response to your queries above. On 1/14/11, Roshan Dawrani roshandawr...@gmail.com wrote: I am not clear what you guys are trying to do and say :-) So, let's take some specifics... Say you want to create rows in some column family (say CF_A), and as you create them, you want to store their row keys as column names in some other column family (say CF_B) - possibly for filtering keys based on time later, etc... Now your rows in CF_A may be keyed on a TimeUUID, and if you store these keys as column names in CF_B, which has its comparator as TimeUUID, then you get your column names time-sorted automatically. Now CF_A may be split across nodes - is that of any concern to you? Are you expecting any storage relationship between the column names of CF_B and the rows of CF_A? rgds, Roshan On Fri, Jan 14, 2011 at 7:58 PM, Aklin_81 asdk...@gmail.com wrote: I too believed so, but am not totally sure. On 1/14/11, Rajkumar Gupta rajkumar@gmail.com wrote: I am not sure, but I guess all the rows of a certain time range will go to just one node and will not be evenly distributed, because the timeUUID will not be random but sequential according to time... I am not sure anyway...
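For intuition, here is a rough Java sketch of deriving a RandomPartitioner-style token from a key: an MD5 digest interpreted as a non-negative big integer. This mirrors the idea described above, not Cassandra's exact internal code.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Keys that are adjacent as strings get tokens scattered over the whole
// ring, which is why sequential keys still spread across nodes.
public class Md5TokenSketch {
    static BigInteger token(String key) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(digest).abs();
    }

    public static void main(String[] args) throws Exception {
        for (String k : new String[] { "key1", "key2", "key3" }) {
            System.out.println(k + " -> " + token(k));
        }
    }
}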
Re: limiting columns in a row
Hi, does this seem like a generally useful feature? I do think this could be a useful feature, if only because I don't think there is any satisfactory/efficient way to do this client side. If so, would it be hard to implement (maybe it could be done at compaction time like the TTL feature)? Off the top of my hat (aka, I haven't really thought that through, but I'll still give my opinion), I see the following difficulties:
1) You can only do this limiting during major compaction, or in the same cases as CASSANDRA-1074 for minor ones, since you need to make sure the N columns you are keeping are not deleted ones. Or you'll want to disable deletes altogether on the CF with this 'limit' option (I feel like this last option would really simplify things).
2) Even if the removal of the columns exceeding the limit is eventual (and it will be), you'll want queries to only ever return columns inside the limit (otherwise the feature would be too unpredictable). But I think this will be quite challenging. That is, slice queries from the start of the row are easy; everything else is harder (at least if you want to make it efficient).
That was my 2 cents. Anyway, you can always open a JIRA ticket. -- Sylvain On Fri, Jan 14, 2011 at 7:38 AM, mike dooley doo...@apple.com wrote: hi, the time-to-live feature in 0.7 is very nice and it made me want to ask about a somewhat similar feature. i have a stream of data consisting of entities and associated samples. so i create a row for each entity and the columns in each row contain the samples for that entity. when i get around to processing an entity i only care about the most recent N samples, so i read the most recent N columns and delete all the rest. what i would like is a column family property that allows me to specify a maximum number of columns per row. then i could just keep writing and not have to do the deletes. in my case it would be fine if the limit is only 'eventually' applied (so that sometimes there might be extra columns). does this seem like a generally useful feature? if so, would it be hard to implement (maybe it could be done at compaction time like the TTL feature)? thanks, -mike
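Until such a feature exists, the client-side routine Mike describes (read the most recent N columns, delete the rest) boils down to the trimming logic below: a self-contained sketch using a sorted map to stand in for a row whose column names are timestamps. All names are my own, and in real code the returned names would feed a batch delete through your client.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Client-side emulation of "max N columns per row": keep the N columns
// with the largest (most recent) timestamp names, delete everything older.
public class TrimRowSketch {

    static List<Long> staleColumns(TreeMap<Long, String> row, int keepNewest) {
        List<Long> newestFirst = new ArrayList<>(row.descendingKeySet());
        return newestFirst.size() <= keepNewest
                ? Collections.emptyList()
                : newestFirst.subList(keepNewest, newestFirst.size());
    }

    public static void main(String[] args) {
        TreeMap<Long, String> row = new TreeMap<>();
        for (long t = 1; t <= 8; t++) row.put(t, "sample-" + t);
        // Keep the 3 newest samples (6, 7, 8); the rest are to be deleted.
        System.out.println(staleColumns(row, 3)); // [5, 4, 3, 2, 1]
    }
}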
live data migration from mysql to cassandra
Hello dear community, please share your experience: how do you make a live (without downtime) migration from MySQL or another RDBMS to Cassandra?
Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?
No, you do not need to shut up, please! :) You may be clearing up further misconceptions of mine on the topic! Anyway, the link between the 1st and 2nd paragraphs was this: since the distribution of rows among nodes is not affected by the key itself (as you rightly said) but by the MD5 hash of the key, I can use just about any key, including a TimeUUIDType key (which would be helpful in my case), with the RandomPartitioner. On 1/14/11, Roshan Dawrani roshandawr...@gmail.com wrote: On Fri, Jan 14, 2011 at 8:51 PM, Aklin_81 asdk...@gmail.com wrote: I just read that Cassandra internally creates an MD5 hash that is used for distributing the load, by sending each key to the node responsible for the range within which its MD5 hash falls. So even when we create sequential keys, their MD5 hashes are not sequential, and hence they are not sent to the same node. This was my misunderstanding of the concept; sorry for creating confusion! So... with this, I think I will be able to use a timeUUID as the row key!? Now, what really is the link between your corrected understanding and the conclusion in the 2nd paragraph? :-) I miss the link you are using to come from paragraph 1 to paragraph 2. Just because you use a time UUID as the row key, there is no storage guarantee because of that. Distribution and ordering of rows across nodes is based only on what partitioner you are using - it is not (only) related to the type of the key. Maybe I should just shut up now, as I don't seem to be understanding your requirement :-)
Re: live data migration from mysql to cassandra
On Fri, Jan 14, 2011 at 10:40 AM, ruslan usifov ruslan.usi...@gmail.com wrote: Hello dear community, please share your experience: how do you make a live (without downtime) migration from MySQL or another RDBMS to Cassandra? There is no built-in way to do this. I remember hearing at Hadoop World this year that the HBase guys have a system to read MySQL slave logs and replay them into HBase. Since the whole NoSQL community seems to do this, maybe we can 'borrow' that idea. Edward
Do you have a site in production environment with Cassandra? What client do you use?
Hey, If you have a site in a production environment, or are considering one, what client do you use to interact with Cassandra? I know that there are several clients available out there depending on the language you use, but I would love to know which clients are being used widely in production environments and are best to work with (supporting most of the required features, with good performance). Also, preferably tell us about the technology stack for your applications. Any suggestions or comments appreciated. Thanks Ertio
Re: cassandra row cache
Digest reads could be being dropped..? On Thu, Jan 13, 2011 at 4:11 PM, Jonathan Ellis jbel...@gmail.com wrote: On Thu, Jan 13, 2011 at 2:00 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Is it possible that you are reading at READ.ONE and that READ.ONE only warms the cache on 1 of your three nodes = 20%? A 2nd read warms another = 60%, and by the third read all the replicas are warm = 99%? This would be true if digest reads were not warming caches. Digest reads do go through the cache path. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: cassandra row cache
That's possible, yes. He'd want to make sure there aren't any of those WARN messages in the logs. On Fri, Jan 14, 2011 at 11:46 AM, Mike Malone m...@simplegeo.com wrote: Digest reads could be being dropped..? On Thu, Jan 13, 2011 at 4:11 PM, Jonathan Ellis jbel...@gmail.com wrote: On Thu, Jan 13, 2011 at 2:00 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Is it possible that you are reading at READ.ONE and that READ.ONE only warms the cache on 1 of your three nodes = 20%? A 2nd read warms another = 60%, and by the third read all the replicas are warm = 99%? This would be true if digest reads were not warming caches. Digest reads do go through the cache path. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Do you have a site in production environment with Cassandra? What client do you use?
I use Hector, if that counts... On Jan 14, 2011 7:25 PM, Ertio Lew ertio...@gmail.com wrote: Hey, If you have a site in a production environment, or are considering one, what client do you use to interact with Cassandra? I know that there are several clients available out there depending on the language you use, but I would love to know which clients are being used widely in production environments and are best to work with (supporting most of the required features, with good performance). Also, preferably tell us about the technology stack for your applications. Any suggestions or comments appreciated. Thanks Ertio
Re: Do you have a site in production environment with Cassandra? What client do you use?
What technology stack do you use? On 1/14/11, Ran Tavory ran...@gmail.com wrote: I use Hector, if that counts... On Jan 14, 2011 7:25 PM, Ertio Lew ertio...@gmail.com wrote: Hey, If you have a site in a production environment, or are considering one, what client do you use to interact with Cassandra? I know that there are several clients available out there depending on the language you use, but I would love to know which clients are being used widely in production environments and are best to work with (supporting most of the required features, with good performance). Also, preferably tell us about the technology stack for your applications. Any suggestions or comments appreciated. Thanks Ertio
Re: Do you have a site in production environment with Cassandra? What client do you use?
Java On Jan 14, 2011 8:25 PM, Ertio Lew ertio...@gmail.com wrote: What technology stack do you use? On 1/14/11, Ran Tavory ran...@gmail.com wrote: I use Hector, if that counts... On Jan 14, 2011 7:25 PM, Ertio Lew ertio...@gmail.com wrote: Hey, If you have a site in a production environment, or are considering one, what client do you use to interact with Cassandra? I know that there are several clients available out there depending on the language you use, but I would love to know which clients are being used widely in production environments and are best to work with (supporting most of the required features, with good performance). Also, preferably tell us about the technology stack for your applications. Any suggestions or comments appreciated. Thanks Ertio
phpcassa never returns (infinite loop)?!!!
I am trying to use phpcassa. I use the following example:

CassandraConn::add_node('localhost', 9160);
$users = new CassandraCF('rhg', 'Users'); // ColumnFamily
$users->insert('1', array('email' => 't...@example.com', 'password' => 'test'));

When I run it, it never returns, and the Apache process eats 100% CPU. I am using Cassandra 0.7. Any idea why this happens? Thanks
Cassandra in less than 1G of memory?
Hello. According to the JVM heap size topic at http://wiki.apache.org/cassandra/MemtableThresholds , Cassandra would need at least 1 GB of memory to run. Is it possible to have a running Cassandra cluster with machines that have less than that memory... say 512 MB? I can live with slow transactions, no compactions, etc., but do not want an OutOfMemory error. The reason for a smaller bound for Cassandra is that I want to leave room for other processes to run. Please help with specific parameters to tune. Thanks, Rajat
Re: Newbie Replication/Cluster Question
On Thu, Jan 13, 2011 at 2:32 PM, Mark Moseley moseleym...@gmail.com wrote: On Thu, Jan 13, 2011 at 1:08 PM, Gary Dusbabek gdusba...@gmail.com wrote: It is impossible to properly bootstrap a new node into a system where there are not enough nodes to satisfy the replication factor. The cluster as it stands doesn't contain all the data you are asking it to replicate on the new node. Ok, maybe I'm thinking of replication_factor backwards. I took it to mean how many nodes would have *full* copies of the whole of the keyspace's data, in which case, with my keyspace at replication_factor=2, the still-alive node would have 100% of the data to replicate to the wiped-clean node - in which case all the data would be there to bootstrap. I was assuming replication_factor=2 in a 2-node cluster means both nodes have a full replica of the data. Do I have that wrong? What's also confusing is that I did this same test on a clean node that wasn't clustered yet (which is interesting, since it doesn't complain then about replication_factor > # of nodes), so unless it was throwing away data as I was inserting it, it'd all be there. Is the general rule then that the max replication factor must be # of nodes - 1? If replication_factor == # of nodes, then if you lost a box, it seems like your cluster would be toast. Perhaps the better question would be: if I have a two-node cluster and I want to be able to lose one box completely and replace it (without losing the cluster), what settings would I need? Or is that an impossible scenario? In production, I'd imagine a 3-node cluster being the minimum, but even there I could see each box having a full replica, though probably not beyond 3.
Re: Do you have a site in production environment with Cassandra? What client do you use?
We've done hundreds of gigs in and out of Cassandra 0.6.8 with pycassa 0.3. Working on upgrading to 0.7 and pycassa 1.03. I don't know if we're using it wrong, but the connection object being tied to a particular keyspace is a constraint that isn't that awesome for us, as we have a number of keyspaces used simultaneously. Haven't looked into it yet. On Fri, Jan 14, 2011 at 1:52 PM, Mike Wynholds m...@carbonfive.com wrote: We have one in production with Ruby / the fauna Cassandra gem and Cassandra 0.6.x. The project is live but is stuck in a sort of private beta, so it hasn't really been run through any load scenarios. ..mike.. -- Michael Wynholds | Carbon Five | 310.821.7125 x13 | m...@carbonfive.com On Fri, Jan 14, 2011 at 9:24 AM, Ertio Lew ertio...@gmail.com wrote: Hey, If you have a site in a production environment, or are considering one, what client do you use to interact with Cassandra? I know that there are several clients available out there depending on the language you use, but I would love to know which clients are being used widely in production environments and are best to work with (supporting most of the required features, with good performance). Also, preferably tell us about the technology stack for your applications. Any suggestions or comments appreciated. Thanks Ertio
Re: Cassandra in less than 1G of memory?
Dear Rajat, Yes it is possible; I have the same constraints. However I must warn you: from what I see, Cassandra memory consumption is not bounded in 0.6.x on Debian 64-bit. Here is an example of an instance launched on a node:

root 19093 0.1 28.3 1210696 *570052* ? Sl Jan11 9:08 /usr/bin/java -ea -Xms128M *-Xmx512M* -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar org.apache.cassandra.thrift.CassandraDaemon

Look at the second bold value: Xmx indicates the maximum memory that Cassandra can use; it is set to 512 MB, so it should easily fit into 1 GB. Now look at the first one: 570 MB > 512 MB. Moreover, if I come back in one day the first value will be even higher, probably around 610 MB. Actually it increases to the point where I need to restart it, otherwise other programs are shut down by Linux so Cassandra can further expand its memory usage... By the way, it's a call to other Cassandra users: am I the only one to encounter this problem? Best regards, Victor K. 2011/1/14 Rajat Chopra rcho...@makara.com Hello. According to the JVM heap size topic at http://wiki.apache.org/cassandra/MemtableThresholds , Cassandra would need at least 1 GB of memory to run. Is it possible to have a running Cassandra cluster with machines that have less than that memory... say 512 MB? I can live with slow transactions, no compactions, etc., but do not want an OutOfMemory error. The reason for a smaller bound for Cassandra is that I want to leave room for other processes to run. Please help with specific parameters to tune. Thanks, Rajat
Re: Newbie Replication/Cluster Question
Perhaps the better question would be: if I have a two-node cluster and I want to be able to lose one box completely and replace it (without losing the cluster), what settings would I need? Or is that an impossible scenario? In production, I'd imagine a 3-node cluster being the minimum, but even there I could see each box having a full replica, though probably not beyond 3. Or perhaps, in the case of losing a box completely in a 2-node RF=2 cluster, do I need to lower the replication_factor on the still-alive box, bootstrap the replaced node back in, and then change the replication_factor back to 2?
Cassandra-Maven-Plugin
OK, I nearly have the Cassandra-Maven-Plugin ready. It has the following goals:
run: launches Cassandra in the foreground and blocks until you press ^C, at which point Maven terminates. Use-case: running integration tests from your IDE; live development from your IDE.
start: launches Cassandra in the background. Cassandra will be torn down when Maven ends or if the stop goal is called. Use-case: running integration tests from Maven; live development from your IDE with e.g. jetty.
clean: clears out the Cassandra database directory in ${basedir}/target/cassandra. Use-case: resetting the dataset.
load: runs the cassandra-cli with a file as input. Use-case: creating keyspaces and pre-populating the dataset.
stop: shuts down the background Cassandra instance started by start. Use-case: running integration tests from Maven.
So for example, if you are developing a web application using Maven, you would use a command like:
mvn cassandra:clean cassandra:start cassandra:load jetty:run
which would start up Cassandra with a clean dataset and then start up Jetty (which presumably connects via a client library to Cassandra). Similarly, you can use cassandra-maven-plugin, jetty-maven-plugin, maven-failsafe-plugin and selenium-maven-plugin to run web integration tests as part of your build. So I have some questions:
1. Is there a standard file extension for the scripts that get passed to cassandra-cli?
2. Is there any other obvious goal I have missed?
There is a small bit of tidy-up left, and then I just have to add some integration tests and the site documentation. Once I have all that in place I will raise a JIRA with the full source code against CASSANDRA, and hopefully a friendly committer will pick it up and commit it into the tree. While waiting for a committer, testers will be welcome. If it gets accepted I will then see about getting it released and published on central. Expect to see the JIRA sometime Monday or Tuesday. -Stephen
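For illustration only, wiring the plugin into a pom.xml might look roughly like this. The coordinates and the configuration parameter are hypothetical, since the plugin had not been published anywhere at the time of writing:

<!-- Hypothetical coordinates - the plugin was not yet released when this was posted. -->
<plugin>
  <groupId>org.apache.cassandra</groupId>
  <artifactId>cassandra-maven-plugin</artifactId>
  <version>...</version>
  <configuration>
    <!-- script fed to the load goal, e.g. keyspace definitions for cassandra-cli -->
    <script>${basedir}/src/cassandra/load.script</script>
  </configuration>
</plugin>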
Re: Newbie Replication/Cluster Question
On Fri, Jan 14, 2011 at 4:29 PM, Aaron Morton aa...@thelastpickle.com wrote: Here are some slides I did last year that have a simple explanation of RF: http://www.slideshare.net/mobile/aaronmorton/well-railedcassandra24112010-5901169 Short version is, generally no single node contains all the data in the db. Normally the RF is going to be less than the number of nodes, and the higher the RF, the more concurrent node failures you can handle (when writing at QUORUM).
- At RF 3 you can keep reading and writing with 1 node down. If you lose a second node, the cluster will appear to be down for a portion of the keys. The portion depends on the total number of nodes.
- At RF 5 the cluster will be up for all keys if you have 2 nodes down. If you have 3 down, the cluster will appear down for only a portion of the keys; again the portion depends on the total number of nodes.
It's a bit more complicated though: when I say a 'node is down' I mean one of the nodes that the key would have been written to is down (the 3 or 5 above). So if you had 10 nodes and RF 5, you could have 4 nodes down and the cluster still be available for all keys, so long as there are still 3 natural endpoints for each key. Hope that helps. Aaron On 15/01/2011, at 8:52 AM, Mark Moseley moseleym...@gmail.com wrote: Perhaps the better question would be: if I have a two-node cluster and I want to be able to lose one box completely and replace it (without losing the cluster), what settings would I need? Or is that an impossible scenario? In production, I'd imagine a 3-node cluster being the minimum, but even there I could see each box having a full replica, though probably not beyond 3. Or perhaps, in the case of losing a box completely in a 2-node RF=2 cluster, do I need to lower the replication_factor on the still-alive box, bootstrap the replaced node back in, and then change the replication_factor back to 2? Excellent, thanks! I'll definitely be checking those out. I just want to make sure I've got the hang of DR before we start deploying Cassandra, and I'd hate to figure all this out later on with angry customers standing over my shoulder :)
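To put numbers on Aaron's RF examples: the standard quorum arithmetic is quorum = floor(RF/2) + 1. With RF=3 the quorum is 2, so reads and writes at QUORUM keep working with 1 replica down; with RF=5 the quorum is 3, so they keep working with 2 replicas down.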
Re: Cassandra in less than 1G of memory?
On Fri, Jan 14, 2011 at 2:13 PM, Victor Kabdebon victor.kabde...@gmail.com wrote: Dear Rajat, Yes it is possible; I have the same constraints. However I must warn you: from what I see, Cassandra memory consumption is not bounded in 0.6.x on Debian 64-bit. Here is an example of an instance launched on a node:

root 19093 0.1 28.3 1210696 570052 ? Sl Jan11 9:08 /usr/bin/java -ea -Xms128M -Xmx512M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar org.apache.cassandra.thrift.CassandraDaemon

Look at the second bold value: Xmx indicates the maximum memory that Cassandra can use; it is set to 512 MB, so it should easily fit into 1 GB. Now look at the first one: 570 MB > 512 MB. Moreover, if I come back in one day the first value will be even higher, probably around 610 MB. Actually it increases to the point where I need to restart it, otherwise other programs are shut down by Linux so Cassandra can further expand its memory usage... By the way, it's a call to other Cassandra users: am I the only one to encounter this problem? Best regards, Victor K. 2011/1/14 Rajat Chopra rcho...@makara.com Hello. According to the JVM heap size topic at http://wiki.apache.org/cassandra/MemtableThresholds , Cassandra would need at least 1 GB of memory to run. Is it possible to have a running Cassandra cluster with machines that have less than that memory... say 512 MB? I can live with slow transactions, no compactions, etc., but do not want an OutOfMemory error. The reason for a smaller bound for Cassandra is that I want to leave room for other processes to run. Please help with specific parameters to tune. Thanks, Rajat

-Xmx512M is not an overall memory limit. MMAP'ed files also consume memory. Try setting the disk access mode to standard, not mmap (or mmap_index_only).
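For reference, the disk access mode Edward mentions is configured in the 0.6-era storage-conf.xml; the element below is shown from memory, so check your own config file for the exact surrounding context:

<!-- 'standard' avoids memory-mapped I/O; 'auto', 'mmap' and 'mmap_index_only' use mmap'ed files -->
<DiskAccessMode>standard</DiskAccessMode>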
is it possible to map one input from a file and one from cassandra?
Hi, Cassandra supports Hadoop for map/reduce over Cassandra. Now I am digging to find a way to map over a file and Cassandra together. I mean, if both of them were files on my disk, it would be possible by using splits. But in this kind of situation, which way is possible? For example:

in Cassandra)
key1 | value1 | value2
key2 | value3 | value4
key3 | value5 | value6

in a file)
key1 | value1 | value2
key2 | value7 | value4
key3 | value7 | value6

Both are very huge. I want to get a diff between them: which keys are deleted? which values are changed? Thanks.
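One common pattern for this (a sketch, not something the word_count example ships with) is a reduce-side join: run one mapper over the file and one over the column family, have each mapper tag its output value with its source, and compare the tagged values per key in a reducer like the one below. All names are my own; the job setup (for example MultipleInputs plus ColumnFamilyInputFormat) is omitted.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Each mapper emits (rowKey, "C:" + values) for rows read from Cassandra
// and (rowKey, "F:" + values) for lines read from the file, so the reducer
// sees both sides (or only one) for every key.
public class DiffReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        String fromCassandra = null, fromFile = null;
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("C:")) fromCassandra = s.substring(2);
            else if (s.startsWith("F:")) fromFile = s.substring(2);
        }
        if (fromCassandra == null) {
            ctx.write(key, new Text("only in file"));      // key deleted from Cassandra
        } else if (fromFile == null) {
            ctx.write(key, new Text("only in Cassandra")); // key deleted from the file
        } else if (!fromCassandra.equals(fromFile)) {
            ctx.write(key, new Text("changed: " + fromCassandra + " -> " + fromFile));
        } // identical rows emit nothing
    }
}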
Re: Cassandra in less than 1G of memory?
mmapping only consumes memory that the OS can afford to feed it. On Fri, Jan 14, 2011 at 7:29 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Fri, Jan 14, 2011 at 2:13 PM, Victor Kabdebon victor.kabde...@gmail.com wrote: Dear Rajat, Yes it is possible; I have the same constraints. However I must warn you: from what I see, Cassandra memory consumption is not bounded in 0.6.x on Debian 64-bit. Here is an example of an instance launched on a node: root 19093 0.1 28.3 1210696 570052 ? Sl Jan11 9:08 /usr/bin/java -ea -Xms128M -Xmx512M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar org.apache.cassandra.thrift.CassandraDaemon Look at the second bold value: Xmx indicates the maximum memory that Cassandra can use; it is set to 512 MB, so it should easily fit into 1 GB. Now look at the first one: 570 MB > 512 MB. Moreover, if I come back in one day the first value will be even higher, probably around 610 MB. Actually it increases to the point where I need to restart it, otherwise other programs are shut down by Linux so Cassandra can further expand its memory usage... By the way, it's a call to other Cassandra users: am I the only one to encounter this problem? Best regards, Victor K. 2011/1/14 Rajat Chopra rcho...@makara.com Hello. According to the JVM heap size topic at http://wiki.apache.org/cassandra/MemtableThresholds , Cassandra would need at least 1 GB of memory to run. Is it possible to have a running Cassandra cluster with machines that have less than that memory... say 512 MB? I can live with slow transactions, no compactions, etc., but do not want an OutOfMemory error. The reason for a smaller bound for Cassandra is that I want to leave room for other processes to run. Please help with specific parameters to tune. Thanks, Rajat -Xmx512M is not an overall memory limit. MMAP'ed files also consume memory. Try setting the disk access mode to standard, not mmap (or mmap_index_only). -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Cassandra in less than 1G of memory?
Hi Jonathan, hi Edward,
Jonathan: but it looks like mmapping wants to consume the entire memory of my server. It goes up to 1.7 GB for a ridiculously small amount of data. Am I doing something wrong, or is there something I should change to prevent this never-ending increase in memory consumption?
Edward: I am not sure; I will try to check that tomorrow, but my disk access mode is standard, not mmap.
Anyway, thank you very much, Victor K.
PS: here is, some hours later, the result of ps aux | grep cassandra:
root 19093 0.1 30.0 1243940 *605060* ? Sl Jan11 10:15 /usr/bin/java -ea -Xms128M *-Xmx512M* -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar org.apache.cassandra.thrift.CassandraDaemon
2011/1/15 Jonathan Ellis jbel...@gmail.com mmapping only consumes memory that the OS can afford to feed it. On Fri, Jan 14, 2011 at 7:29 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Fri, Jan 14, 2011 at 2:13 PM, Victor Kabdebon victor.kabde...@gmail.com wrote: Dear Rajat, Yes it is possible; I have the same constraints. However I must warn you: from what I see, Cassandra memory consumption is not bounded in 0.6.x on Debian 64-bit. Here is an example of an instance launched on a node: root 19093 0.1 28.3 1210696 570052 ?
Sl Jan11 9:08 /usr/bin/java -ea -Xms128M -Xmx512M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar org.apache.cassandra.thrift.CassandraDaemon Look at the second bold value: Xmx indicates the maximum memory that Cassandra can use; it is set to 512 MB, so it should easily fit into 1 GB. Now look at the first one: 570 MB > 512 MB. Moreover, if I come back in one day the first value will be even higher, probably around 610 MB. Actually it increases to the point where I need to restart it, otherwise other programs are shut down by Linux so Cassandra can further expand its memory usage... By the way, it's a call to other Cassandra users: am I the only one to encounter this problem? Best regards, Victor K. 2011/1/14 Rajat Chopra rcho...@makara.com Hello. According to the JVM heap size topic at http://wiki.apache.org/cassandra/MemtableThresholds , Cassandra would need at least 1 GB of memory to run. Is it possible to have a running Cassandra cluster with machines that have less than that memory... say 512 MB? I can live with slow transactions, no compactions, etc., but do not want an OutOfMemory error. The reason for a smaller bound for Cassandra is that I want