Re: Installing Thrift with Solandra
You are trying to run Solandra from the resources directory. Follow these steps: 1) don't use root - use a regular user 2) cd /tmp/ 3) git clone git://github.com/tjake/Solandra.git 4) cd Solandra 5) ant, and once you get BUILD SUCCESSFUL 6) cd solandra-app 7) ./start-solandra.sh

On Tue, Jun 7, 2011 at 10:29 PM, Jean-Nicolas Boulay Desjardins jnbdzjn...@gmail.com wrote: I found start-solandra.sh in the resources folder, but when I execute it I still get an error. http://dl.dropbox.com/u/20599297/Screen%20shot%202011-06-08%20at%201.27.26%20AM.png Thanks again.

On Tue, Jun 7, 2011 at 12:23 PM, Jean-Nicolas Boulay Desjardins jnbdzjn...@gmail.com wrote: OK. So I have to install Thrift and Cassandra, then Solandra. I am asking because I followed the instructions on your Git page but I get this error:

# cd solandra-app; ./start-solandra.sh
-bash: ./start-solandra.sh: No such file or directory

Thanks again :)

On Tue, Jun 7, 2011 at 7:55 AM, Jake Luciani jak...@gmail.com wrote: This seems to be a common cause of confusion. Let me try again. Solandra doesn't integrate your Cassandra data into Solr. It simply provides a scalable backend for Solr by building on Cassandra. The inverted index lives in its own Cassandra keyspace. What you have in the end is two functionally different components (Cassandra and Solr) in one logical service. Jake

On Tuesday, June 7, 2011, Jean-Nicolas Boulay Desjardins jnbdzjn...@gmail.com wrote: I just saw a post you made on Stack Overflow, where you said: "The Solandra project which is replacing Lucandra no longer uses thrift, only Solr." So I use Solr to access my data in Cassandra? Thanks again...

On Tue, Jun 7, 2011 at 1:39 AM, Jean-Nicolas Boulay Desjardins jnbdzjn...@gmail.com wrote: Thanks again :) OK... But the tutorial says that I need to build a Thrift interface for Cassandra:

./compiler/cpp/thrift -gen php ../PATH-TO-CASSANDRA/interface/cassandra.thrift

How do I do this? Where is the interface folder? Again, tjake, thanks a lot for your time and help.

On Mon, Jun 6, 2011 at 11:13 PM, Jake Luciani jak...@gmail.com wrote: To access Cassandra in Solandra, it's the same as regular Cassandra. To access Solr you use one of the PHP Solr libraries: http://wiki.apache.org/solr/SolPHP

On Mon, Jun 6, 2011 at 11:04 PM, Jean-Nicolas Boulay Desjardins jnbdzjn...@gmail.com wrote: I am trying to install Thrift with Solandra. Normally, when I just want to install Thrift with Cassandra, I follow this tutorial: https://wiki.fourkitchens.com/display/PF/Using+Cassandra+with+PHP But how can I do the same for Solandra? Thrift with PHP...

-- Name / Nom: Boulay Desjardins, Jean-Nicolas
Website / Site Web: www.jeannicolas.com
Misc Performance Questions
A few questions:

- Is there a performance hit when dropping a CF? What if it contains 0.5 TB of data? If not, is there a quick and painless way to drop a large amount of data with minimal perf hit?
- Is there a performance hit running multiple keyspaces on a cluster versus only one keyspace, given a constant total data size? Is there some quantity limit?
- Using the Random Partitioner, but with RF = 1, will the rows still be spread out evenly on the cluster, or will there be an affinity to a single node (like the one receiving the data from the client)?
- I see a lot of mention of using RAID-0, but not RAID-5/6. Why? Even though Cassandra can tolerate a down node due to data loss, it would still be more efficient to just rebuild a bad HDD live, right?
- Maybe perf related: will there be a problem having multiple keyspaces on a cluster, all with different replication factors from 1-3?

Thanks!
Re: how to retrieve data from supercolumns with phpcassa?
Hi, can you please tell me how to create a supercolumn and retrieve data from it using phpcassa? My data looks like: student_details{id{sid, lesson_id, answers{time_expired, answer_opted}}}
Re: how to retrieve data from supercolumns with phpcassa?
You'll find a response to this question on the phpcassa mailing list... where you asked the same question. -sd

On Wed, Jun 8, 2011 at 10:22 AM, amrita amritajayakuma...@gmail.com wrote:
> Hi, can you please tell me how to create a supercolumn and retrieve data from it using phpcassa? student_details{id{sid, lesson_id, answers{time_expired, answer_opted}}}
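For what it's worth, a minimal sketch of super column reads and writes. Class names and include paths follow the thobbs/phpcassa API of that era and may differ in your version; the keyspace name and server address are assumptions for the example:

<?php
// Sketch only: include paths and class names vary between phpcassa versions.
require_once 'phpcassa/connection.php';
require_once 'phpcassa/columnfamily.php';

// 'Keyspace1' and the server list are placeholders.
$pool = new ConnectionPool('Keyspace1', array('127.0.0.1:9160'));
$cf = new ColumnFamily($pool, 'student_details');  // a super column family

// Writing: for a super CF the value is a nested array,
//   super column name => (subcolumn => value)
$cf->insert('1', array(
    'answers' => array('time_expired' => '0', 'answer_opted' => 'b'),
));

// Reading: get() on a super CF returns the nested structure back.
$row = $cf->get('1');
print_r($row);  // array('answers' => array('time_expired' => '0', ...))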
Re: Misc Performance Questions
Hi AJ,

On Wed, Jun 8, 2011 at 9:29 AM, AJ a...@dude.podzone.net wrote:

> Is there a performance hit when dropping a CF? What if it contains 0.5 TB of data? If not, is there a quick and painless way to drop a large amount of data with minimal perf hit?

Dropping a CF is quick - it snapshots the files (which creates hard links) and removes the CF definition. To actually delete the data, remove the snapshot files from your data directory.

> Is there a performance hit running multiple keyspaces on a cluster versus only one keyspace, given a constant total data size? Is there some quantity limit?

There is a tiny amount of memory used per keyspace, but unless you have very many keyspaces you won't notice any impact of running multiple keyspaces. There is, however, a difference in running multiple column families versus putting everything in the same column family and separating them with e.g. a key prefix. E.g. if you have a large data set and a small one, it will be quicker to query the small one if it is in its own column family.

> Using the Random Partitioner, but with RF = 1, will the rows still be spread out evenly on the cluster, or will there be an affinity to a single node (like the one receiving the data from the client)?

The rows will be spread out the same way - RF=1 doesn't affect the load balancing.

> I see a lot of mention of using RAID-0, but not RAID-5/6. Why? Even though Cassandra can tolerate a down node due to data loss, it would still be more efficient to just rebuild a bad HDD live, right?

There's a trade-off - RAID-0 will give better performance, but rebuilds are over a network. With RF > 1, RAID-0 is enough that you're unlikely to lose data, but as you say, replacing a failed node will be slower.

> Maybe perf related: will there be a problem having multiple keyspaces on a cluster, all with different replication factors from 1-3?

No.

Richard.

-- Richard Low Acunu | http://www.acunu.com | @acunu
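To make the cleanup step concrete, a hedged sketch; the keyspace name and the default /var/lib/cassandra layout are assumptions, so adjust for your install:

# after 'drop column family Data;' the old sstables live on as a snapshot
nodetool -h 127.0.0.1 clearsnapshot Keyspace1
# equivalently, remove the snapshot directory by hand:
rm -rf /var/lib/cassandra/data/Keyspace1/snapshots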
Re: how to know there are some columns in a row
There is no reason for ambiguities... We could add to the API another method call (similar to get_count): get_columnNames

list<string> get_columnNames(key, column_parent, predicate, consistency_level)

Gets the column names present in column_parent within the predicate. The method is not O(1): it takes all the columns from disk to calculate the answer. The only benefit of the method is that you do not need to pull all their values over the Thrift interface to get their names (just to give the idea...). In fact, column names can really be data in themselves, so there should be a way to retrieve them (without their values). When you have big values, it's a real showstopper to use get_slice, since a lot of unnecessary traffic would be generated... Forgive me if I am a little insistent, but it's important for us and I'm sure we are not the only ones interested in this feature... cheers
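Spelled out in cassandra.thrift style, the proposal might look like the sketch below. To be clear, this method does not exist in the real API; it only restates the signature suggested above, using binary for the names since that is how the Thrift API actually types column names:

/* Hypothetical addition to the Cassandra service in cassandra.thrift.
 * NOT part of the actual API; modeled on the real get_count signature. */
list<binary> get_columnNames(1:required binary key,
                             2:required ColumnParent column_parent,
                             3:required SlicePredicate predicate,
                             4:required ConsistencyLevel consistency_level)
  throws (1:InvalidRequestException ire,
          2:UnavailableException ue,
          3:TimedOutException te),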
Data directories
Hi, is there a way to control which sstables go to which data directory? I have a fast but space-limited SSD and a much slower RAID, and I'd like to put latency-sensitive data on the SSD and leave the other data on the RAID. Is this possible? If not, how well does Cassandra play with symlinks?
Re: Misc Performance Questions
Thank you Richard!

On 6/8/2011 2:57 AM, Richard Low wrote:
> <snip> There is, however, a difference in running multiple column families versus putting everything in the same column family and separating them with e.g. a key prefix. E.g. if you have a large data set and a small one, it will be quicker to query the small one if it is in its own column family.

I assumed that a read would be O(1) for any size CF since Cass is implemented with hashmaps. Do you know why size matters? (forgive the pun)
Re: Misc Performance Questions
On Wed, Jun 8, 2011 at 12:30 PM, AJ a...@dude.podzone.net wrote:
> I assumed that a read would be O(1) for any size CF since Cass is implemented with hashmaps. Do you know why size matters? (forgive the pun)

You may not notice a difference, but it can happen. For a query, each SSTable is queried. If there is more data, then there are (most likely) more SSTables to query, slowing it down. For point queries this isn't so bad, because the Bloom filters will help, but for range queries you will notice a big difference: you will have to do more seeks to skip over unwanted data. It will also help buffer caching to separate them - the small SSTables are more likely to remain in cache.

-- Richard Low Acunu | http://www.acunu.com | @acunu
Re: Data directories
No. https://issues.apache.org/jira/browse/CASSANDRA-2749 is open to track this but nobody is working on it to my knowledge. Cassandra is fine with symlinks at the data directory level but I don't think that helps you, since you really want to move the sstables themselves. (Cassandra is NOT fine with symlinked sstable files, or with any moving around of sstable files while it is running.) 2011/6/8 Héctor Izquierdo Seliva izquie...@strands.com: Hi, Is there a way to control what sstables go to what data directory? I have a fast but space limited ssd, and a way slower raid, and i'd like to put latency sensitive data into the ssd and leave the other data in the raid. Is this possible? If not, how well does cassandra play with symlinks? -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Data directories
On Wed, 2011-06-08 at 08:42 -0500, Jonathan Ellis wrote:
> Cassandra is fine with symlinks at the data directory level, but I don't think that helps you, since you really want to move the sstables themselves. <snip>

I was planning on creating another keyspace and moving the slow sstables there. Everything would of course be done while the node is stopped. Thanks for your help
Re: Installing Thrift with Solandra
Krish Pan, THANKS! Also thank you for putting BUILD SUCCESSFUL in uppercase :) But it seems it is still not working. This time when I go into the solandra-app directory, start-solandra.sh is there, and when I run ./start-solandra.sh I get this: http://dl.dropbox.com/u/20599297/Screen%20shot%202011-06-08%20at%2011.00.15%20AM.png And it just stays stuck there. Any ideas? Thanks again.

On Wed, Jun 8, 2011 at 2:32 AM, Krish Pan ceo.co...@gmail.com wrote:
> You are trying to run Solandra from the resources directory. Follow these steps: 1) don't use root - use a regular user 2) cd /tmp/ 3) git clone git://github.com/tjake/Solandra.git 4) cd Solandra 5) ant, and once you get BUILD SUCCESSFUL 6) cd solandra-app 7) ./start-solandra.sh <snip>
Retrieving a column from a fat row vs retrieving a single row
Hi, I have an index I use to translate ids. I usually only read one column at a time, and it's becoming a bottleneck. I could rewrite the application to read a bunch at a time, but it would make the application logic much harder, as it would involve buffering incoming data. As far as I know, to read a single column Cassandra will deserialize a bunch of them and then pick the correct one (64 KB of data, right?). Would it be faster to have a row for each id I want to translate? This would make the key cache less effective, but the amount of data read should be smaller. Thanks!
Re: Multiple large disks in server - setup considerations
On Wed, Jun 8, 2011 at 12:19 AM, AJ a...@dude.podzone.net wrote:
> On 6/7/2011 9:32 PM, Edward Capriolo wrote:
>> <snip> I do not like large disk set-ups. I think they end up not being economical. Most low latency use cases want a high RAM-to-disk ratio. Two machines with 32GB RAM are usually less expensive than one machine with 64GB RAM. For a machine with 1TB drives (or multiple 1TB drives) it is going to be difficult to get enough RAM to help with random read patterns. Also, cluster operations like joining, decommissioning, or repair can take a *VERY* long time, maybe a day. More, smaller servers (blade style) are more agile.
> Is there some rule-of-thumb as to how much RAM is needed per GB of data? I know it probably depends, but if you could try to explain the best you can, that would be great! I too am projecting big data requirements.

The way this is normally explained is the active set. I.e. you have 100,000,000 users, but at any given time only 1,000,000 are active, thus you need enough RAM to keep those users cached. No, there is no rule of thumb; it depends on access patterns. In the most extreme case you are using Cassandra for an ETL workload. In that case your data will far exceed your RAM, and since most operations will be like a full table scan, caching is almost hopeless and useless. On the other side, there are those that want every lookup to be predictable low latency with totally random reads, and those might want to maintain a 1-1 ratio.

I would track these things over time:
- reads/writes to C*
- disk utilization
- size of CF on disk
- cache hit rate
- latency

And eventually you find what your ratio is. I.e.:

last month: I had 30 reads/sec, my disk was 40% utilized, my column family was 40 GB, my cache hit rate was 70%, my latency was 1ms
this month: I had 45 reads/sec, my disk was 95% utilized, my column family was 40 GB, my cache hit rate was 30%, my latency was 5ms

Conclusion: my disk is maxed and my cache hit rate is dropping. I probably need more nodes or more RAM.
Re: Retrieving a column from a fat row vs retrieving a single row
> As far as I know, to read a single column Cassandra will deserialize a bunch of them and then pick the correct one (64KB of data, right?)

Assuming the default setting of 64kb, the average amount deserialized given random column access should be 8 kb (not true with row cache, but with large rows presumably you don't have row cache).

> Would it be faster to have a row for each id I want to translate? This would make the key cache less effective, but the amount of data read should be smaller.

It depends on what bottlenecks you're optimizing for. A key is expensive in the sense that it (1) increases the size of Bloom filters for the column family, (2) increases the memory cost of index sampling, and (3) increases the total data size (typically) because the row key is duplicated in both the index and data files. The cost of deserializing the same data repeatedly is CPU. So if you're nowhere near bottlenecking on disk and the memory trade-off is reasonable, it may be a suitable optimization. However, consider that unless you're doing order-preserving partitioning, accessing those rows will be effectively random w.r.t. the locations on disk you're reading from, so you're adding a lot of overhead in terms of disk I/O unless your data set fits comfortably in memory.

-- / Peter Schuller
Re: Installing Thrift with Solandra
Looks like it is running. You can verify by running jps; it will show you a process with the name "jar". Try this:

cd ../reuters-demo
./1-download_data.sh
./2-import_data.sh

While data is loading, open the file ./website/index.html in your favorite browser.

On Wed, Jun 8, 2011 at 8:04 AM, Jean-Nicolas Boulay Desjardins jnbdzjn...@gmail.com wrote:
> Krish Pan, THANKS! Also thank you for putting BUILD SUCCESSFUL in uppercase :) But it seems it is still not working. This time when I go into the solandra-app directory, start-solandra.sh is there, and when I run ./start-solandra.sh I get this: http://dl.dropbox.com/u/20599297/Screen%20shot%202011-06-08%20at%2011.00.15%20AM.png And it just stays stuck there. Any ideas? Thanks again. <snip>
nosql yes but yescql, no?
Gotta love Eric! http://www.slideshare.net/jericevans/nosql-yes-but-yescql-no https://twitter.com/#!/jericevans/status/78118651043127297 -- SriSatish Ambati Director of Engineering, DataStax @srisatish
Re: nosql yes but yescql, no?
On 06/08/2011 01:23 PM, SriSatish Ambati wrote:
> Gotta love Eric! http://www.slideshare.net/jericevans/nosql-yes-but-yescql-no

Good resource. Thanks for sharing it with us, SriSatish. Regards

-- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186
Re: nosql yes but yescql, no?
While I agree the Thrift API sucks, I'd love to see that solved at the binary level, with CQL on top of that. JK

On Wed, Jun 8, 2011 at 2:50 PM, Marcos Ortiz mlor...@uci.cu wrote:
> Good resource. Thanks for sharing it with us, SriSatish. <snip>

-- It's always darkest just before you are eaten by a grue.
Re: nosql yes but yescql, no?
I think that's partly the idea of it. CQL could end up being the way forward, and it currently builds on Thrift. Then, if it becomes the API/client of record to build on, it could move to something more efficient underneath, and CQL itself wouldn't have to change at all.

On Jun 8, 2011, at 1:29 PM, Jeffrey Kesselman wrote:
> While I agree the Thrift API sucks, I'd love to see that solved at the binary level, with CQL on top of that. JK <snip>
Re: Installing Thrift with Solandra
Thanks again... Here it gets a bit more complex. I added Solandra to the /tmp folder like you told me, and the data also... Everything seems to work. The problem is I am running Solandra in a VM on my Mac OS X; the VM is Ubuntu Server. On that VM I have a DNS server... and one of my domain names is jean-nicolas.name... I put the website folder in it to test out Solandra. The page loads, but no content. So I went into the code to find where it was getting the content. I found reutors.js, and in it was an address: http://localhost:8983/solandra/reutors So I used this command in the terminal (in my VM, obviously): curl http://localhost:8983/solandra/reutors Then I got: NOT FOUND... in some HTML code... So it seems it cannot find the data... Also it was followed by a series of <br/> with nothing between them... So does that mean Solandra is not working? Or is it something else? Thanks again for your time and help... I know I am a bit slow :) Thanks again...

On Wed, Jun 8, 2011 at 12:55 PM, Krish Pan ceo.co...@gmail.com wrote:
> Looks like it is running. You can verify by running jps; it will show you a process with the name "jar". Try this: cd ../reuters-demo ./1-download_data.sh ./2-import_data.sh While data is loading, open the file ./website/index.html in your favorite browser. <snip>
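For reference, one way to poke the index directly is a plain Solr select query against the demo core. The URL below assumes the core is named reuters (note the spelling, versus "reutors" above) and the default port 8983:

curl 'http://localhost:8983/solandra/reuters/select?q=*:*&rows=1'

A 404 here would suggest the core name in the URL doesn't match; an empty result set would suggest the import step never populated the index.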
Re: Installing Thrift with Solandra
Also, how can I back up the data that I loaded? On the next reboot I am going to lose all the data that I loaded, and as you know, it takes time... I tried to copy the Solandra folder to another folder outside /tmp... but I am not sure that is enough. Thanks!

On Wed, Jun 8, 2011 at 2:39 PM, Jean-Nicolas Boulay Desjardins jnbdzjn...@gmail.com wrote:
> Thanks again... Here it gets a bit more complex. I added Solandra to the /tmp folder like you told me, and the data also... Everything seems to work. <snip>
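A rough sketch of the usual approach, with placeholder paths: Solandra keeps its index in ordinary Cassandra data files, so the durable fix is to point the data and commit log directories somewhere outside /tmp in the configuration, and to snapshot for backups:

# snapshot flushes memtables, then hard-links the current sstables
nodetool -h 127.0.0.1 snapshot
# copy the snapshot directories somewhere safe (paths are placeholders)
cp -r <data_file_directory>/<keyspace>/snapshots /backup/solandra/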
RE: how to know there are some columns in a row
I am pretty sure this would cut down on network traffic, but not on disk IO or CPU use. I think Cassandra would still have to deserialize the whole column to get to the name. So if you really have a use case where you just want the name, it would be better to store a separate, name-only column with no value.

From: Patrick de Torcy [mailto:pdeto...@gmail.com] Sent: Wednesday, June 08, 2011 4:00 AM To: user@cassandra.apache.org Subject: Re: how to know there are some columns in a row
> There is no reason for ambiguities... We could add to the API another method call (similar to get_count): list<string> get_columnNames(key, column_parent, predicate, consistency_level) <snip>
Re: nosql yes but yescql, no?
That makes sense :)

On Wed, Jun 8, 2011 at 2:37 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote:
> I think that's partly the idea of it. CQL could end up being the way forward, and it currently builds on Thrift. Then, if it becomes the API/client of record to build on, it could move to something more efficient underneath, and CQL itself wouldn't have to change at all. <snip>

-- It's always darkest just before you are eaten by a grue.
hadoop/pig notes
I decided to try out hadoop/pig + cassandra. I had my ups and downs getting the script I wanted to work. I'm sure everyone who tries will have their own experiences/problems, but mine were:

- Everything I needed to know was in http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html and http://wiki.apache.org/cassandra/HadoopSupport
- Java is really picky about hostnames. I'm in EC2, and rather than rely on DNS, I basically have all of my machines share an /etc/hosts file. But the command-line hostname wasn't returning the same thing as /etc/hosts, which caused all kinds of weird hadoop issues at first. (I had the hostname as "foo" and /etc/hosts had "foo.prod".)
- I forgot I had iptables on. It's always easier to start with no firewalls (this is true when configuring anything, of course).
- Use the same version of everything everywhere. For hadoop/pig, I was having issues until I used the combination of hadoop-0.20.2 + pig-0.8.1.
- For hadoop's mapred-site.xml you HAVE to supply a port (hostname:port), and there isn't a standard one; it seems arbitrary. I used 8021, based on notes in a case somewhere from hadoop (I think trying to standardize). See the config sketch at the end of this message.

It took me a while to figure out the syntax of Pig Latin, but I finally managed to get a script that does a count of all columns in a column family:

rows = LOAD 'cassandra://keyspace/columnfamily' USING CassandraStorage();
filter_rows = FILTER rows BY $1 is not null;
counts = FOREACH filter_rows GENERATE COUNT($1);
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
dump sum_of_bag;

I'm trying to see the impact of running hadoop on the same servers as cassandra now. And yes, I've seen the note in the wiki about the clever partitioning of cassandra nodes to allow for web latency nodes + hadoop processing nodes :-)
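For concreteness, the mapred-site.xml entry in question might look like this; the hostname is a placeholder, and mapred.job.tracker is the standard Hadoop 0.20 property for the JobTracker address:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- host is a placeholder; 8021 is the arbitrary port noted above -->
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>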
Re: CLI set command returns null, ver 0.8.0
Can you provide the cli script to create the schema and info on how many nodes you have. Thanks

- Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com

On 8 Jun 2011, at 16:12, AJ wrote:

Can anyone help? The CLI seems to be having issues. The count command isn't working either:

[default@Keyspace1] count User[long(1)];
Expected 8 or 0 byte long (13)
java.lang.RuntimeException: Expected 8 or 0 byte long (13)
    at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:284)
    at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)
    at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1] count User[1];;
Expected 8 or 0 byte long (1)
java.lang.RuntimeException: Expected 8 or 0 byte long (1)
    at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:284)
    at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)
    at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1] count User['1'];
Expected 8 or 0 byte long (1)
java.lang.RuntimeException: Expected 8 or 0 byte long (1)
    at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:284)
    at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)
    at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1] count User['12345678'];
null
java.lang.RuntimeException
    at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:292)
    at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)
    at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1]

Granted, there are no rows in the CF yet (see probs below), but this exception seems to happen during the parsing stage. I've checked everything else, AFAIK, so I'm at a loss. Much obliged.

On 6/7/2011 12:44 PM, AJ wrote:

The log only shows INFO level messages about flushes, etc. The debug mode of the CLI shows an exception after the set:

[al@mars ~]$ cassandra-cli -h 192.168.1.101 --debug
Connected to: "Test Cluster" on 192.168.1.101/9160
Welcome to the Cassandra CLI.
Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.
[default@unknown] use Keyspace1;
Authenticated to keyspace: Keyspace1
[default@Keyspace1] set User[1]['name']='aaa';
null
java.lang.RuntimeException
    at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:292)
    at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)
    at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1]
Re: hadoop/pig notes
I need to update the wiki with better Pig info. I did put some information in the getting-started docs of pygmalion, but it would be good to transfer that to Cassandra's wiki and add to it. FWIW: https://github.com/jeromatron/pygmalion/wiki/Getting-Started Thanks for the rundown, William!

On Jun 8, 2011, at 4:11 PM, William Oberman wrote:
> I decided to try out hadoop/pig + cassandra. I had my ups and downs getting the script I wanted to work. I'm sure everyone who tries will have their own experiences/problems. <snip>
Re: Retrieving a column from a fat row vs retrieving a single row
Just to make things less clear: if you have one row that you are continually writing, it may end up spread out over several SSTables. Compaction helps here to reduce the number of files that must be accessed, so long as it can keep up. But if you want to read column X and the row is fragmented over 5 SSTables, then each one must be accessed. https://issues.apache.org/jira/browse/CASSANDRA-2319 is open to try to reduce the number of seeks. For now, take a look at nodetool cfhistograms to see how many sstables are read for your queries.

Cheers

- Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com

On 9 Jun 2011, at 04:50, Peter Schuller wrote:
> Assuming the default setting of 64kb, the average amount deserialized given random column access should be 8 kb (not true with row cache, but with large rows presumably you don't have row cache). <snip>
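For example (keyspace and column family names are placeholders):

nodetool -h localhost cfhistograms Keyspace1 Users

The SSTables column of the output is a histogram of how many sstables were touched per recent read; counts clustered above 1 suggest the fragmentation described above.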
Re: how to know there are some columns in a row
> I am pretty sure this would cut down on network traffic, but not on Disk IO or CPU use.

Well, that's the same for the get_count method! I think that would be OK, since the network traffic is the real problem (big values...). Storing the column names in a separate column could be a solution of course, but it generates duplicate data, with a risk of inconsistencies (and more work).
Re: how to know there are some columns in a row
> Forgive me if I am a little insistent, but it's important for us and I'm sure we are not the only ones interested in this feature...

Not an issue, it's how things get done :) Create a JIRA ticket (https://issues.apache.org/jira/browse/CASSANDRA) with your ideas to start the process, and ask others to vote if they would also like to see it. If you have time to donate to the feature, include that on the ticket. Thanks

- Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com

On 9 Jun 2011, at 06:55, Jeremiah Jordan wrote:
> I am pretty sure this would cut down on network traffic, but not on Disk IO or CPU use. I think Cassandra would still have to deserialize the whole column to get to the name. <snip>
Re: [RELEASE] 0.8.0
Is anyone willing to upgrade libcassandra for C++ to support the new features in 0.8.0? Or has anyone started to work on it? Thanks

On Jun 3, 2011, at 7:36 AM, Eric Evans wrote:

I am very pleased to announce the official release of Cassandra 0.8.0. If you haven't been paying attention to this release, this is your last chance, because by this time tomorrow all your friends are going to be raving, and you don't want to look silly. So why am I resorting to hyperbole? Well, for one because this is the release that debuts the Cassandra Query Language (CQL). In one fell swoop Cassandra has become more than NoSQL; it's MoSQL. Cassandra also has distributed counters now. With counters, you can count stuff, and counting stuff rocks. A kickass use-case for Cassandra is spanning data-centers for fault-tolerance and locality, but doing so has always meant sending data in the clear, or tunneling over a VPN. New for 0.8.0: encryption of intranode traffic. If you're not motivated to go upgrade your clusters right now, you're either not easily impressed, or you're very lazy. If it's the latter, would it help knowing that rolling upgrades between releases are now supported? Yeah. You can upgrade your 0.7 cluster to 0.8 without shutting it down. You see what I mean? Then go read the release notes[1] to learn about the full range of awesomeness, then grab a copy[2] and become a (fashionably) early adopter. Drivers for CQL are available in Python[3], Java[3], and Node.js[4]. As usual, a Debian package is available from the project's APT repository[5]. Enjoy!

[1]: http://goo.gl/CrJqJ (NEWS.txt)
[2]: http://cassandra.debian.org/download
[3]: http://www.apache.org/dist/cassandra/drivers
[4]: https://github.com/racker/node-cassandra-client
[5]: http://wiki.apache.org/cassandra/DebianPackaging

-- Eric Evans eev...@rackspace.com
Running a cluster with 256mb RAM nodes
I'd like to start using cassandra for a certain part of my database that has high write volume. I'm setting up a 3-node cluster; however, my site doesn't make enough money yet to justify 3 nodes meeting the hardware recommendation (http://wiki.apache.org/cassandra/CassandraHardware) of 4gb RAM. Instead I'm trying to get it working with nodes that have 256mb RAM (running in a VM). I looked around and found a couple of places where people mention successfully running cassandra nodes with only 256mb, e.g. http://news.ycombinator.com/item?id=2074114 and http://groups.google.com/group/reddit-dev/browse_thread/thread/f7bc839dbc62d0ad/92af1e790f2fe05c, but they don't give any details about the settings they've changed.

It took a while, but I've settled on some settings that don't give me an OutOfMemoryException under load and still seem to have acceptable performance (quick writes with throughput that's good enough for now; higher latency reads, but that's okay for my use). They're a bit on the conservative side, but I'd rather have them low and never get an OOM than risk it. The JVM memory settings are the auto-calculated ones (running cassandra 0.7.6-2): -Xms122M -Xmx122M -Xmn30M

I have 4 CFs. The settings I've changed are, for each CF (sketched as CLI commands after this message):
- MemtableThroughputInMB to 1mb (yes, that's very low; that's part of my question)
- MemtableOperationsInMillions to 0.02 (20k operations)
- cached keys to 20,000

The docs at http://wiki.apache.org/cassandra/MemtableThresholds warn that tons of tiny memtables is bad. Why? Also, am I correct in believing that it's OK to change the memtable throughput/operations later on once I have larger nodes, and that there will be no lasting bad effects (e.g. I can just trigger a compaction, or even bring new nodes online and remove the old ones)? I've tested this setup doing reads and writes, but I haven't tried any operations (e.g. moving a node to a different token, bootstrapping a new node). Are there any operations I need to watch out for that could cause an OOM, or other problematic settings that should be tuned that haven't caused problems yet but could in certain cases? Thanks, Donny
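For reference, a sketch of those per-CF settings as cassandra-cli statements on 0.7.x; attribute names follow that era's CLI, so check `help update column family;` on your version, and MyCF stands in for each of the four column families:

update column family MyCF with
    memtable_throughput = 1 and
    memtable_operations = 0.02 and
    keys_cached = 20000;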
Re: Running a cluster with 256mb RAM nodes
I once built a 4-node ring on my laptop, with a 64MB heap for each instance. I could write and read on it, but nodetool repair caused an OOM. You should test essential operations with estimated data loaded, under expected traffic. Btw, I'm using a 96MB x 4 node ring on my laptop now, just for my private lab. It survives repair :-)

maki

On 2011/06/09, at 10:52, Donny Nadolny donny.nado...@gmail.com wrote:
> I'd like to start using cassandra for a certain part of my database that has high write volume. I'm setting up a 3-node cluster; however, my site doesn't make enough money yet to justify 3 nodes meeting the hardware recommendation of 4gb RAM. Instead I'm trying to get it working with nodes that have 256mb RAM (running in a VM). <snip>
Re: CLI set command returns null, ver 0.8.0
Thanks Aaron, I created a script and everything went OK. I think the problem is when you try to update a CF. Below, I try to change the column comparator and it complains that the 'comparators do not match'. Can you enlighten me on what that means? There is no data in the CF at this point.

[default@Keyspace1] create column family User3;
503dba20-924b-11e0--f1169bb35ddf
Waiting for schema agreement...
... schemas agree across the cluster
[default@Keyspace1] set User3['1']['name'] = 'mike';
org.apache.cassandra.db.marshal.MarshalException: cannot parse 'name' as hex bytes
java.lang.RuntimeException: org.apache.cassandra.db.marshal.MarshalException: cannot parse 'name' as hex bytes
    at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:292)
    at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)
    at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1] describe keyspace;
Keyspace: Keyspace1:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
  Options: [datacenter1:1]
  Column Families:
    ColumnFamily: User3
      Key Validation Class: org.apache.cassandra.db.marshal.BytesType
      Default column value validator: org.apache.cassandra.db.marshal.BytesType
      Columns sorted by: org.apache.cassandra.db.marshal.BytesType
      Row cache size / save period in seconds: 0.0/0
      Key cache size / save period in seconds: 20.0/14400
      Memtable thresholds: 0.2859375/61/1440 (millions of ops/MB/minutes)
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Replicate on write: false
      Built indexes: []
[default@Keyspace1]
/** Here, I figure the error above is because it cannot find the column called 'name' because it's using the BytesType column name sorter/comparator, so I try to change it below. */
[default@Keyspace1] update column family User3 with comparator = UTF8Type;
comparators do not match.
java.lang.RuntimeException: comparators do not match.
    at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:292)
    at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)
    at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1]

What does "comparators do not match" mean? Thanks, Mike

On 6/8/2011 4:37 PM, aaron morton wrote:
> Can you provide the cli script to create the schema and info on how many nodes you have. Thanks <snip>
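Not an authoritative answer, but a sketch of two workarounds, assuming the 0.8 CLI's syntax: declare the comparator when the CF is created (changing the comparator of an existing CF is rejected, which is what "comparators do not match" is reporting), or use the CLI's assume command to tell it how to interpret a BytesType CF:

/* Option 1: declare the types up front */
create column family User4 with comparator = UTF8Type
    and key_validation_class = UTF8Type
    and default_validation_class = UTF8Type;
set User4['1']['name'] = 'mike';

/* Option 2 (sketch): tell the CLI how to decode the existing CF */
assume User3 comparator as utf8;
assume User3 validator as utf8;
assume User3 keys as utf8;
set User3['1']['name'] = 'mike';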
Is there a way from a running Cassandra node to determine whether or not itself is up?
Is there a way (preferably an exposed method accessible through Thrift) for a running Cassandra node to determine whether or not it itself is up? (Per Cassandra standards, I'm assuming based on the gossip protocol.) Another way to think of what I'm looking for is basically running nodetool ring just on myself, but I'm only interested in knowing whether I'm Up or Down. I'm currently using the describe_cluster method, but earlier today when the commit logs for a node filled up and it appeared down to the other nodes, describe_cluster() still worked fine, thus failing the check. Thanks, Suan
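A rough sketch of the "nodetool ring just on myself" idea from the question; 10.0.0.1 stands in for the node's own listen address:

# JMX-based, so it reports the local process's own view of the ring
nodetool -h localhost ring | grep 10.0.0.1 | grep -q Up && echo up || echo down

The caveat described above still applies: a node's local view can say Up while its peers have marked it Down, so running the same check from a second host gives a truer picture.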