Re: Introducing DSBench
Here is a link to get started with DSBench: https://github.com/datastax/dsbench-labs#getting-started and DataStax Labs: https://downloads.datastax.com/#labs

On Thu, Jan 30, 2020 at 11:47 AM Jonathan Shook wrote:
> Some of you may remember NGCC talks on metagener (now VirtualDataSet)
> and engineblock from 2015 and 2016. The main themes went something
> along the lines of "testing c* with realistic workloads is hard,
> sizing cassandra is hard, we need tools in this space that go beyond
> what cassandra-stress can do but don't require math phd skills."
>
> We just released our latest attempt at solving this difficult problem
> set. It's called DSBench and it's free to download from DataStax Labs.
> Looking forward to your feedback and hope this tool can prove valuable
> for your sizing, stress testing, and performance benchmarking needs.
Introducing DSBench
Some of you may remember NGCC talks on metagener (now VirtualDataSet) and engineblock from 2015 and 2016. The main themes went something along the lines of "testing c* with realistic workloads is hard, sizing cassandra is hard, we need tools in this space that go beyond what cassandra-stress can do but don't require math phd skills."

We just released our latest attempt at solving this difficult problem set. It's called DSBench and it's free to download from DataStax Labs.

Looking forward to your feedback and hope this tool can prove valuable for your sizing, stress testing, and performance benchmarking needs.
Re: Replacing Redis
Benson, I was considering using Redis for a specific project. Can you elaborate a bit on your problem with it? What were the circumstances, loading factors, etc.?

On Fri, Feb 18, 2011 at 9:19 AM, Benson Margulies <bimargul...@gmail.com> wrote:
> Redis times out at random regardless of what we configure for client
> timeouts; the platform-sensitive binaries are painful for us since we
> support many platforms; just to name two reasons.

On Fri, Feb 18, 2011 at 10:04 AM, Joshua Partogi <joshua.j...@gmail.com> wrote:
> Any reason why you want to do that?

On Sat, Feb 19, 2011 at 1:32 AM, Benson Margulies <bimargul...@gmail.com> wrote:
> I'm about to launch off on replacing redis with cassandra. I wonder if
> anyone else has ever been there and done that.

--
http://twitter.com/jpartogi
Re: Stress test inconsistencies
Would you share with us the changes you made, or problems you found?

On Wed, Jan 26, 2011 at 10:41 AM, Oleg Proudnikov <ol...@cloudorange.com> wrote:
> Hi All,
> I was able to run contrib/stress at a very impressive throughput. A
> single-threaded client was able to pump 2,000 inserts per second with
> 0.4 ms latency. A multithreaded client was able to pump 7,000 inserts
> per second with 7 ms latency.
> Thank you very much for your help!
> Oleg
Re: Do you have a site in production environment with Cassandra? What client do you use?
clients:
- Java and MVEL + Hector
- Perl + thrift

Usage: high-traffic monitoring harness with dynamic mapping and loading of handlers. Cassandra was part of the "do more with less hardware" approach to designing this system.

On Fri, Jan 14, 2011 at 11:24 AM, Ertio Lew <ertio...@gmail.com> wrote:
> Hey,
> If you have a site in a production environment, or are considering
> one, what is the client that you use to interact with Cassandra? I
> know that there are several clients available out there according to
> the language you use, but I would love to know what clients are being
> used widely in production environments and are best to work with
> (support most required features for performance). Also, preferably
> tell about the technology stack for your applications.
> Any suggestions, comments appreciated.
> Thanks
> Ertio
Re: Java client
Perhaps. I use Hector. I have a bit of rework to do moving from .6 to .7. This is something I wasn't anticipating in my earlier planning.

Had Pelops been around when I started using Hector, I would have probably chosen it over Hector. The Pelops client seemed to be better conceived as far as programmer experience and simplicity went. Since then, Hector has had a v2 upgrade to its API which breaks much of what you would have done in version .6 and before. Conceptually speaking, they appear more similar now than before the Hector changes.

I'm dreading having to do a significant amount of work on my client interface because of the incompatible API changes, but I will have to in order to get my client/server caught up to the currently supported branch. That is just part of the cost of doing business with Cassandra at the moment. Hopefully after 1.0 on the server and some of the clients, this type of thing will be more unusual.

2011/1/19 Noble Paul നോബിള് नोब्ळ् <noble.p...@gmail.com>:
> Thanks everyone. I guess I should go with Hector.

On 18 Jan 2011 17:41, Alois Bělaška <alois.bela...@gmail.com> wrote:
> Definitely Pelops https://github.com/s7/scale7-pelops

2011/1/18 Noble Paul നോബിള് नोब्ळ् <noble.p...@gmail.com>:
> What is the most commonly used java client library? Which is the most
> mature/feature complete?
> Noble
Re: Reclaim deleted rows space
I believe the following condition within submitMinorIfNeeded(...) determines whether to continue, so it's not a hard loop:

    // if (sstables.size() >= minThreshold) ...

On Thu, Jan 6, 2011 at 2:51 AM, shimi <shim...@gmail.com> wrote:
> According to the code it makes sense. submitMinorIfNeeded() calls
> doCompaction(), which calls submitMinorIfNeeded(). With
> minimumCompactionThreshold = 1, submitMinorIfNeeded() will always run
> compaction.
> Shimi

On Thu, Jan 6, 2011 at 10:26 AM, shimi <shim...@gmail.com> wrote:
> On Wed, Jan 5, 2011 at 11:31 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>> Pretty sure there's logic in there that says don't bother compacting
>> a single sstable.
> No. You can do it. Based on the log I have a feeling that it triggers
> an infinite compaction loop.

On Wed, Jan 5, 2011 at 2:26 PM, shimi <shim...@gmail.com> wrote:
> How is minor compaction triggered? Is it triggered only when a new
> SSTable is added?
> I was wondering if triggering a compaction with
> minimumCompactionThreshold set to 1 would be useful. If this can
> happen I assume it will do compaction on files with similar size and
> remove deleted rows on the rest.
> Shimi

On Tue, Jan 4, 2011 at 9:56 PM, Peter Schuller <peter.schul...@infidyne.com> wrote:
>> I don't have a problem with disk space. I have a problem with the
>> data size.
> [snip]
>> Bottom line is that I want to reduce the number of requests that go
>> to disk. Since there is enough data that is no longer valid, I can do
>> it by reclaiming the space. The only way to do that is by running
>> major compaction. I can wait and let Cassandra do it for me, but then
>> the data size will get even bigger and the response time will be
>> worse. I can do it manually, but I prefer it to happen in the
>> background with less impact on the system.
>
> Ok - that makes perfect sense then. Sorry for misunderstanding :)
>
> So essentially, for workloads that are teetering on the edge of cache
> warmness and are subject to significant overwrites or removals, it may
> be beneficial to perform much more aggressive background compaction,
> even though it might waste lots of CPU, to keep the in-memory working
> set down.
>
> There was talk (I think in the compaction redesign ticket) about
> potentially improving the use of bloom filters such that obsolete data
> in sstables could be eliminated from the read set without
> necessitating actual compaction; that might help address cases like
> these too.
>
> I don't think there's a pre-existing silver bullet in a current
> release; you probably have to live with the need for
> greater-than-theoretically-optimal memory requirements to keep the
> working set in memory.
>
> --
> / Peter Schuller

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
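As a rough illustration of the guard discussed above (a sketch only, with illustrative names, not the actual Cassandra source), the check that keeps minor compaction from looping forever on a single sstable might look like this:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the submitMinorIfNeeded(...) guard discussed in
// this thread. With minThreshold forced down to 1, a lone sstable would
// qualify for compaction again and again, producing the loop described.
class CompactionSketch {
    static boolean shouldCompact(List<String> sstables, int minThreshold) {
        // Require both the configured threshold and more than one sstable;
        // compacting a single sstable into itself makes no progress.
        return sstables.size() >= minThreshold && sstables.size() > 1;
    }

    public static void main(String[] args) {
        List<String> one = new ArrayList<>(List.of("sstable-1"));
        System.out.println(shouldCompact(one, 4)); // below threshold
        System.out.println(shouldCompact(one, 1)); // single-sstable guard
    }
}
```

Both calls above print `false`: the first because one sstable is below the threshold of four, the second because the single-sstable guard stops the infinite loop even at threshold 1.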
Re: SSD vs. HDD
SSDs are not reliable after a (relatively low, compared to spinning disk) number of writes. They may significantly boost performance if used on the journal storage, but will suffer short lifetimes for highly-random write patterns. In general, plan to replace them frequently. Whether they are worth it, given the performance improvement over the cost of replacement x hardware x logistics, is generally a calculus problem. It's difficult to make a generic rationale for or against them.

You might be better off in general by throwing more memory at your servers, and isolating your random access from your journaled data.

Is there any pattern to your reads and writes/deletes? If it is fully random across your keys, then you have the worst-case scenario. Sometimes you can impose access patterns or structural patterns in your app which make caching more effective. Good questions to ask about your data access: Is there a user session which shows an access pattern to proximal data? Are there sets of accesses which always happen close together? Are there keys or maps which add extra indirection?

I'm not familiar with your situation. I was just providing some general ideas.

Jonathan Shook

On Wed, Nov 3, 2010 at 2:32 PM, Alaa Zubaidi <alaa.zuba...@pdf.com> wrote:
> Hi,
> we have continuous high-throughput writes, reads and deletes, and we
> are trying to find the best hardware. Does using SSD for Cassandra
> improve performance? Did anyone compare SSD vs. HDD? And any
> recommendations on SSDs?
> Thanks, Alaa
Re: SSD vs. HDD
Ah. Point taken on the random-access SSD performance. I was trying to emphasize the relative failure rates given the two scenarios. I didn't mean to imply that SSD random-access performance was not a likely improvement here, just that it was a complicated trade-off in the grand scheme of things. Thanks for catching my goof.

On Wed, Nov 3, 2010 at 3:58 PM, Tyler Hobbs <ty...@riptano.com> wrote:
> SSDs will not generally improve your write performance very much, but
> they can significantly improve read performance. You do *not* want to
> waste an SSD on the commitlog drive, as even a slow HDD can write
> sequentially very quickly.
>
> For the data drive, they might make sense. As Jonathan talks about, it
> has a lot to do with your access patterns. If you either (1) delete
> parts of rows, (2) update parts of rows, or (3) insert new columns
> into existing rows frequently, you'll end up with rows spread across
> several SSTables (which are on disk). This means that each read may
> require several seeks, which are very slow for HDDs but very quick for
> SSDs. Of course, the randomness of what rows you access is also
> important, but Jonathan did a good job of covering that. Don't forget
> about the effects of caching here, too.
>
> The only way to tell if it is cost-effective is to test your
> particular access patterns (using a configured stress.py test or,
> preferably, your actual application).
>
> - Tyler
>
> On Wed, Nov 3, 2010 at 3:44 PM, Jonathan Shook <jsh...@gmail.com> wrote:
>> SSDs are not reliable after a (relatively low, compared to spinning
>> disk) number of writes. They may significantly boost performance if
>> used on the journal storage, but will suffer short lifetimes for
>> highly-random write patterns. In general, plan to replace them
>> frequently. Whether they are worth it, given the performance
>> improvement over the cost of replacement x hardware x logistics, is
>> generally a calculus problem. It's difficult to make a generic
>> rationale for or against them.
Re: Re: Broken pipe
I have been able to reproduce this, although it was a bug in application client code. If you keep a thrift client around after it has had an exception, it may generate this error. In my case, I was holding a reference via ThreadLocal to a stale storage object.

Another symptom which may help identify this scenario is that the broken client will not initiate any network traffic, not even a SYN packet. You may have to shut down other client traffic on the client node in order to see this.

2010/4/28 Jonathan Ellis <jbel...@gmail.com>:
> did you check the log for exceptions?
>
> On Wed, Apr 28, 2010 at 12:08 AM, Bingbing Liu <rucb...@gmail.com> wrote:
>> but the situation is that, at the beginning everything goes well,
>> then when the get_range_slices gets about 13,000,000 rows (with the
>> key range set to 2000) the exception happens. and when i do the same
>> thing on a smaller data set, no such thing happens.
>>
>> 2010-04-28
>> Bingbing Liu
>>
>> From: Jonathan Ellis
>> Sent: 2010-04-27 20:51:11
>> To: user
>> Cc: rucbing
>> Subject: Re: Broken pipe
>>
>> get_range_slices works fine in the system tests, so something is
>> wrong on your client side.
Some possibilities:
- sending to a non-Thrift port
- using an incompatible set of Thrift bindings than the one your server supports
- mixing a framed client with a non-framed server or vice versa

[moving followups to user list]

2010/4/27 Bingbing Liu <rucb...@gmail.com>:
> when i use get_range_slices, i get the exceptions, i don't know what
> happens. hope someone can help me.
>
> org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe
>     at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:142)
>     at org.apache.thrift.protocol.TBinaryProtocol.writeI32(TBinaryProtocol.java:152)
>     at org.apache.thrift.protocol.TBinaryProtocol.writeMessageBegin(TBinaryProtocol.java:80)
>     at org.apache.cassandra.thrift.Cassandra$Client.send_get_range_slices(Cassandra.java:592)
>     at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:586)
>     at org.clouddb.test.GrepSelect.main(GrepSelect.java:64)
> Caused by: java.net.SocketException: Broken pipe
>     at java.net.SocketOutputStream.socketWrite0(Native Method)
>     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>     at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>     at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:140)
>     ... 5 more
>
> 2010-04-27
> Bingbing Liu

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
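As background for the framed/non-framed mismatch listed above: a framed Thrift transport prefixes every message with a 4-byte big-endian length, so an unframed server reads those length bytes as if they were protocol data and fails (or vice versa). A plain-Java sketch of what the framing adds on the wire, with no Thrift dependency:

```java
import java.nio.ByteBuffer;

// Illustrative only: shows why a framed client confuses a non-framed
// server. A framed transport sends [4-byte length][payload]; an unframed
// peer expects the payload bytes to start immediately.
class FramingSketch {
    static byte[] frame(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length); // big-endian length prefix
        buf.put(payload);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes();
        byte[] framed = frame(payload);
        System.out.println(framed.length);                    // 9
        System.out.println(ByteBuffer.wrap(framed).getInt()); // 5
    }
}
```

The fix in practice is simply to make both sides agree: wrap the client socket in a framed transport only if the server is configured for framing.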
Re: how to recover cassandra data
Don't forget about the tombstones (delete markers). They are still present on the other two nodes, so they will replicate to the 3rd node and finish off your deleted data.

On Mon, Aug 2, 2010 at 9:30 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> On Mon, Aug 2, 2010 at 9:11 AM, john xie <shanfengg...@gmail.com> wrote:
>> ReplicationFactor = 3
>> one day I stopped 192.168.1.147 and removed the cassandra data by
>> mistake. can I recover 192.168.1.147's cassandra data by restarting
>> cassandra?
>>
>> <DataFileDirectories>
>>   <DataFileDirectory>/data1/cassandra/</DataFileDirectory>
>>   <DataFileDirectory>/data2/cassandra/</DataFileDirectory>
>>   <DataFileDirectory>/data3/cassandra/</DataFileDirectory>
>> </DataFileDirectories>
>>
>> /data3 mounts /dev/sdd. i removed /data3 and formatted /dev/sdd.
>>
>> Address        Status  Load      Range                                    Ring
>>                                  135438270110006521520577363629178401179
>> 192.168.1.148  Up      50.38 GB  5243502939295338512484974245382898       |--|
>> 192.168.1.145  Up      48.38 GB  63161078970569359253391371326773726097   |  |
>> 192.168.1.147  ?       23.5 GB   79546317728707787532885001681404757282   |  |
>> 192.168.1.146  Up      26.34 GB  135438270110006521520577363629178401179  |--|
>
> Since you have a replication factor of three, if you bring a new node
> in through auto-bootstrap, data will migrate back to it since there
> are two other copies. Nothing is lost.
Re: Multiget capabilities
CordiS,
The general approach for this kind of change is to implement it yourself and submit a patch. In such a case, you may still have to be thoughtful and patient in order to get everyone on board. I wish you luck.

On Mon, Jul 26, 2010 at 6:51 AM, CordiS <cor...@willworkforfood.ru> wrote:
> Thank you for nothing.

2010/7/26 aaron morton <aa...@thelastpickle.com>:
> There is no way to request data from more than one ColumnFamily. The
> general approach is to de-normalise the data so all the information
> you need for a query can be returned from a single Column Family. I
> think this applies to both your questions.
> Aaron

On 26 Jul 2010, at 22:51, CordiS wrote:
> Hello,
> I am interested in two features that I have not been able to find in
> the API docs and mailing lists.
> First of all, is there any way to omit the CF name in ColumnPath or
> ColumnParent (or, better, to enumerate the CFs to be retrieved)? It
> would commonly be used to fetch all the data of a complex object
> identified by key.
> Secondly, it would be great to have the ability to fetch differently
> structured data in a single request, by providing map<key : string,
> list<ColumnParent>> and a Predicate to multiget_slice(). Is it
> possible for this to be implemented? If so, when?
> Thank you.
Re: SV: How to stop cassandra server, installed from debian/ubuntu package
If only one instance of Cassandra is running on each node, then use something like:

    pkill -f 'java.*cassandra'

If more than one (not recommended for various reasons), then you should modify the scripts to put a unique token in the process name. Something like -Dprocname=... will work. Then you can modify your pkill -f to be instance specific.

On Mon, Jul 26, 2010 at 10:05 AM, Lee Parker <l...@socialagency.com> wrote:
> Which debian/ubuntu packages are you using? I am using the ones that
> are maintained by Eric Evans, and the init.d script stops the server
> correctly.
> Lee Parker

On Mon, Jul 26, 2010 at 9:22 AM, <miche...@hermanus.cc> wrote:
> This is how I have been doing it: pkill cassandra. Then I do a
> netstat -anp | grep 8080, look for the java service id running, and
> then kill that java id, e.g. kill <java id>.
>
> --Original Message--
> From: Thorvaldsson Justus
> To: 'user@cassandra.apache.org'
> Reply-To: user@cassandra.apache.org
> Subject: SV: How to stop cassandra server, installed from debian/ubuntu package
> Sent: Jul 26, 2010 4:14 PM
>
> I use a standard close, CTRL-C. I don't run it as a daemon. Dunno, but
> I think it works fine =)
>
> -----Original Message-----
> From: o...@notrly.com [mailto:o...@notrly.com]
> Sent: 26 July 2010 15:52
> To: user@cassandra.apache.org
> Subject: How to stop cassandra server, installed from debian/ubuntu package
>
> Hi, this might be a dumb question, but I was wondering how do I stop
> the cassandra server. I installed it using the debian package, so I
> start cassandra by running /etc/init.d/cassandra. I looked at the
> script and tried /etc/init.d/cassandra stop, but it looks like it just
> tries to start cassandra again, so I get the port-in-use exception.
> Thanks
>
> Sent via my BlackBerry from Vodacom - let your email find you!
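The unique-token approach suggested above can be sketched as follows; `procname` is just an illustrative system-property name (the JVM ignores unknown -D properties, so any marker works), not something the Cassandra scripts define:

```shell
# Start each instance with its own marker in the JVM arguments, e.g.:
#   java -Dprocname=cassandra-node1 ... org.apache.cassandra.thrift.CassandraDaemon
# Then a pattern match on the full command line targets exactly one instance:
#   pkill -f 'java.*-Dprocname=cassandra-node1'

# Safe demonstration with a dummy process instead of a real Cassandra JVM:
sleep 60 &
DUMMY=$!
# pgrep -f matches against the full command line, just like pkill -f
pgrep -f 'sleep 60' >/dev/null && echo "matched"
kill "$DUMMY"
```

The same pattern given to `pgrep -f` (to verify what would be matched) and then `pkill -f` (to do the kill) is a good habit, since an overly broad pattern can take down unrelated JVMs.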
Re: Cassandra to store 1 billion small 64KB Blobs
think supercolumns (if I'm right in terms) (a-z, A-Z, 0-9)

- the 64k blobs' metadata (which one belongs to which file) should be stored separately in cassandra
- for hardware we rely on solaris / opensolaris with ZFS in the backend
- write operations occur much more often than reads
- memory should hold the hash values mainly for fast search (not the binary data)
- read operations (restore from cassandra) may be async: get about 1000 blobs, group them, restore

So my question is too: 2 or 3 big boxes, or 10 to 20 small boxes for storage?
Could we separate caching: hash-value CFs cached and indexed, binary-data CFs not?
Writes happen around the clock, not at tremendous speed but constantly.
Would compaction of the database need really much disk space?
Is it reliable at this size (more my fear)?

thx for thinking and answers... greetings, Mike

2010/7/23 Jonathan Shook <jsh...@gmail.com>:
> There are two scaling factors to consider here. In general, the worst
> case growth of operations in Cassandra is kept near to O(log2(N)). Any
> worse growth would be considered a design problem, or at least a
> high-priority target for improvement. This is important for
> considering the load generated by very large column families, as
> binary search is used when the bloom filter doesn't exclude rows from
> a query. O(log2(N)) is basically the best achievable growth for this
> type of data, but the bloom filter improves on it in some cases by
> paying a lower cost every time.
>
> The other factor to be aware of is the reduction of binary search
> performance for datasets which can put disk seek times into high
> ranges. This is mostly a direct consideration for those installations
> which will be doing lots of cold reads (not cached data) against large
> sets. Disk seek times are much more limited (low) for adjacent or near
> tracks, and generally much higher when tracks are sufficiently far
> apart (as in a very large data set).
> This can compound with other factors when session times are longer,
> but that is to be expected with any system. Your storage system may
> have completely different characteristics depending on caching, etc.
>
> The read performance is still quite high relative to other systems for
> a similar data set size, but the drop-off in performance may be much
> worse than expected if you are wanting it to be linear. Again, this is
> not unique to Cassandra. It's just an important consideration when
> dealing with extremely large sets of data, when memory is not likely
> to be able to hold enough hot data for the specific application.
>
> As always, the real questions have lots more to do with your specific
> access patterns, storage system, etc. I would look at the benchmarking
> info available on the lists as a good starting point.
>
> On Fri, Jul 23, 2010 at 11:51 AM, Michael Widmann <michael.widm...@gmail.com> wrote:
>> Hi
>> We plan to use cassandra as a data storage on at least 2 nodes with
>> RF=2 for about 1 billion small files. We have about 48TB of disc
>> space behind each node.
>> Now my question is - is this possible with cassandra - reliably -
>> meaning every blob is stored on 2 jbods? We may grow up to nearly
>> 40TB or more of cassandra storage data. Has anyone out there done
>> something similar?
>> For retrieval of the blobs we are going to index them with a hash
>> value (meaning hashes are used to store the blob), so we can search
>> fast for the entry in the database and combine the blobs into a
>> normal file again.
>> thanks for the answer
>> michael
>>
>> --
>> bayoda.com - Professional Online Backup Solutions for Small and
>> Medium Sized Companies
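The hash-keyed blob layout discussed above (store each blob under the hash of its contents, keep the small hash values hot in memory rather than the binary data) can be sketched like this; the class and method names are illustrative, not part of any Cassandra API:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of content-addressed blob keys as described in the thread: each
// 64KB blob is stored under the hex digest of its contents, so lookups and
// caching only need the compact hash keys, not the blobs themselves.
class BlobKeySketch {
    static String blobKey(byte[] blob) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(blob)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    public static void main(String[] args) {
        byte[] chunk = new byte[64 * 1024]; // one 64KB blob
        String key = blobKey(chunk);
        System.out.println(key.length());   // 64 hex characters
    }
}
```

A file is then reassembled by looking up its ordered list of blob keys (the "which one belongs to which file" metadata above) and concatenating the fetched blobs.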
Re: Cassandra behaviour
My guess: Your test is beating up your system. The system may need more memory or disk throughput or CPU in order to keep up with that particular test. Check some of the posts on the list with "deferred processing" in the body to see why. Also, can you post the error log?

On Mon, Jul 26, 2010 at 11:23 AM, tsuraan <tsur...@gmail.com> wrote:
> I have a system where we're currently using Postgres for all our data
> storage needs, but on a large table the index checks for primary keys
> are really slowing us down on insert. Cassandra sounds like a good
> alternative (not saying postgres and cassandra are equivalent; just
> that I think they are both reasonable fits for our particular
> product), so I tried running the py_stress tool on a recent repos
> checkout. I'm using code that's recent enough that it doesn't pay
> attention to the keyspace definitions in cassandra.yaml, so the values
> for cached info are just what py_stress defined when it made the
> keyspace it uses. I didn't change anything in cassandra.yaml, but I
> did change cassandra.in.sh to use 2G of RAM rather than 1G. I then ran
> python stress.py -o insert -n 1000000000 (that's one billion). I left
> for a day, and when I came back cassandra had run out of RAM, and
> stress.py had crashed somewhere around 120,000,000 inserts. This
> brings up a few questions:
>
> - is Cassandra's RAM use proportional to the number of values that
> it's storing? I know that it uses bloom filters for preventing lookups
> of non-existent keys, but since bloom filters are designed to give an
> accuracy/space tradeoff, Cassandra should sacrifice accuracy in order
> to prevent crashes, if it's just bloom filters that are using all the
> RAM.
>
> - When I start Cassandra again, it appears to go into an eternal
> read/write loop, using between 45% and 90% of my CPU. It says it's
> compacting tables, but it's been doing that for hours, and it only has
> 70GB of data stored. How can cassandra be run on huge datasets, when
> 70GB appears to take forever to compact?
> I assume I'm doing something wrong, but I don't see a ton of tunables
> to play with. Can anybody give me advice on how to make cassandra keep
> running under a high insert load?
Re: Cassandra Graphical Modeling
+1 for Inkscape/SVG

On Mon, Jul 26, 2010 at 1:07 PM, uncle mantis <uncleman...@gmail.com> wrote:
> What do you all use for this? I am currently using MySQL Workbench for
> my SQL projects. PowerPoint? Visio? Gimp? Pencil and paper?
> Thanks for the help!
> Regards,
> Michael
Re: Cassandra Graphical Modeling
As long as you only want to edit yEd files and print them, it's great. Anything else to do with it is proprietary and expensive (for me, at least).

On Mon, Jul 26, 2010 at 7:12 PM, Ashwin Jayaprakash <ashwin.jayaprak...@gmail.com> wrote:
> yEd ( http://www.yworks.com/en/products_yed_about.html ) is a pretty
> good tool. No setup required, free, very versatile and good for
> drawing graphs quickly.
Re: CRUD test
That's a question that many Java developers would like the answer to. Unfortunately, anything better than milliseconds requires JNI, since the current JVM doesn't officially support anything higher. There are solutions to this particular problem, but for most people, milliseconds are sufficient outside of testing. This is because the likelihood of making two conflicting changes to the same row/column in the same session within the same millisecond is pretty low for actual users and real scenarios. Tests tend to be a little unrealistic in the sense that they happen quickly within a short amount of time and aren't dependent on the timing of people or other systems. If you are using a remove-and-replace scheme, it could still be a problem.

The way I get around it for now is to use the microsecond unit of time with a millisecond source (getCurrentMillis()), and increment it artificially when it would return the same value twice in a row. It's a hack, but it works for my purposes.

On Sun, Jul 25, 2010 at 12:54 AM, Oleg Tsvinev <oleg.tsvi...@gmail.com> wrote:
> Thank you guys for your help! Yes, I am using
> System.currentTimeMillis() in my CRUD test. Even though I'm still
> using it, my tests now run as expected. I do not use cassandra-cli
> anymore.
> @Ran great job on Hector, I wish there was more documentation but I
> managed.
> @Jonathan, what is the recommended time source?
> I use batch_mutation to insert and update multiple columns atomically.
> Do I have to use batch_mutation for deletion, too?

On Sat, Jul 24, 2010 at 2:36 PM, Jonathan Shook <jsh...@gmail.com> wrote:
> Just to clarify, microseconds may be used, but they provide the same
> behavior as milliseconds if they aren't using a higher time resolution
> underneath. In some cases, the microseconds are generated simply as
> milliseconds * 1000, which doesn't actually fix any sequencing bugs.
On Sat, Jul 24, 2010 at 3:46 PM, Ran Tavory <ran...@gmail.com> wrote:
> Hi Oleg, I didn't follow the entire thread, but just to let you know
> that the 0.6.* version of the CLI uses microseconds as the time unit
> for timestamps. Hector also uses micros to match that; however,
> previous versions of hector (as well as the CLI) used milliseconds,
> not micros. So if you're using hector version 0.6.0-11 or earlier, or
> by any chance are in some other way mixing millis into your app (are
> you using System.currentTimeMillis() somewhere?), then the behavior
> you're seeing is expected.

On Sat, Jul 24, 2010 at 1:06 AM, Jonathan Shook <jsh...@gmail.com> wrote:
> I think you are getting it. As far as what means what at which level,
> it's really about using them consistently in every case. The [row] key
> (or [row] key range) is a top-level argument for all of the
> operations, since it is the key to mapping the set of responsible
> nodes. The key is the part of the name of any column which most
> affects how the load is apportioned in the cluster, so it is used very
> early in request processing.
On Fri, Jul 23, 2010 at 4:22 PM, Peter Minearo <peter.mine...@reardencommerce.com> wrote:
> Consequently, the remove should look like:
>
> ColumnPath cp1 = new ColumnPath("Super2");
> cp1.setSuper_column("Best Western".getBytes());
> client.remove(KEYSPACE, "hotel", cp1, System.currentTimeMillis(), ConsistencyLevel.ONE);
>
> ColumnPath cp2 = new ColumnPath("Super2");
> cp2.setSuper_column("Econolodge".getBytes());
> client.remove(KEYSPACE, "hotel", cp2, System.currentTimeMillis(), ConsistencyLevel.ONE);
>
> -----Original Message-----
> From: Peter Minearo [mailto:peter.mine...@reardencommerce.com]
> Sent: Fri 7/23/2010 2:17 PM
> To: user@cassandra.apache.org
> Subject: RE: CRUD test
>
> CORRECTION:
>
> ColumnPath cp1 = new ColumnPath("Super2");
> cp1.setSuper_column("Best Western".getBytes());
> cp1.setColumn("name".getBytes());
> client.insert(KEYSPACE, "hotel", cp1, "Best Western of SF".getBytes(), System.currentTimeMillis(), ConsistencyLevel.ALL);
>
> -----Original Message-----
> From: Peter Minearo [mailto:peter.mine...@reardencommerce.com]
> Sent: Friday, July 23, 2010 2:14 PM
> To: user@cassandra.apache.org
> Subject: RE: CRUD test
>
> Interesting!! Let me rephrase to make sure I understood what is going on:
>
> When inserting data via the insert function/method:
>
> void insert(string keyspace, string key, ColumnPath column_path, binary value, i64 timestamp, ConsistencyLevel consistency_level)
>
> The key parameter is the actual key to the row, which contains
> SuperColumns. The 'ColumnPath' gives the path within the key.
>
> INCORRECT
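The millisecond-to-microsecond clock hack described in this thread (multiply the millisecond clock by 1000, and bump the result when two calls would collide) can be sketched as follows; `MicrosClock` is an illustrative name, not a Hector or Cassandra API:

```java
// Sketch of the timestamp hack discussed above: derive "microsecond"
// timestamps from a millisecond source, incrementing artificially so no
// two calls ever return the same value. Names are illustrative.
class MicrosClock {
    private long last = 0;

    synchronized long nextMicros() {
        long now = System.currentTimeMillis() * 1000; // millis -> pseudo-micros
        if (now <= last) {
            now = last + 1; // bump to keep timestamps strictly increasing
        }
        last = now;
        return now;
    }

    public static void main(String[] args) {
        MicrosClock clock = new MicrosClock();
        long a = clock.nextMicros();
        long b = clock.nextMicros();
        System.out.println(b > a); // always true, even within one millisecond
    }
}
```

This guarantees ordering only within a single process; timestamps from different clients can still collide, which is an inherent limit of client-supplied timestamps.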
Re: CRUD test
Just to clarify, microseconds may be used, but they provide the same behavior as milliseconds if they aren't using a higher time resolution underneath. In some cases, the microseconds are generated simply as milliseconds * 1000, which doesn't actually fix any sequencing bugs.

On Sat, Jul 24, 2010 at 3:46 PM, Ran Tavory <ran...@gmail.com> wrote:
> Hi Oleg, I didn't follow the entire thread, but just to let you know
> that the 0.6.* version of the CLI uses microseconds as the time unit
> for timestamps. Hector also uses micros to match that; however,
> previous versions of hector (as well as the CLI) used milliseconds,
> not micros. So if you're using hector version 0.6.0-11 or earlier, or
> by any chance are in some other way mixing millis into your app (are
> you using System.currentTimeMillis() somewhere?), then the behavior
> you're seeing is expected.

On Sat, Jul 24, 2010 at 1:06 AM, Jonathan Shook <jsh...@gmail.com> wrote:
> I think you are getting it. As far as what means what at which level,
> it's really about using them consistently in every case. The [row] key
> (or [row] key range) is a top-level argument for all of the
> operations, since it is the key to mapping the set of responsible
> nodes. The key is the part of the name of any column which most
> affects how the load is apportioned in the cluster, so it is used very
> early in request processing.
On Fri, Jul 23, 2010 at 4:22 PM, Peter Minearo peter.mine...@reardencommerce.com wrote: Consequentially the remove should look like: ColumnPath cp1 = new ColumnPath(Super2); cp1.setSuper_column(Best Western.getBytes()); client.remove(KEYSPACE, hotel, cp1, System.currentTimeMillis(), ConsistencyLevel.ONE); ColumnPath cp2 = new ColumnPath(Super2); cp2.setSuper_column(Econolodge.getBytes()); client.remove(KEYSPACE, hotel, cp2, System.currentTimeMillis(), ConsistencyLevel.ONE); -Original Message- From: Peter Minearo [mailto:peter.mine...@reardencommerce.com] Sent: Fri 7/23/2010 2:17 PM To: user@cassandra.apache.org Subject: RE: CRUD test CORRECTION: ColumnPath cp1 = new ColumnPath(Super2); cp1.setSuper_column(Best Western.getBytes()); cp1.setColumn(name.getBytes()); client.insert(KEYSPACE, hotel, cp1, Best Western of SF.getBytes(), System.currentTimeMillis(), ConsistencyLevel.ALL); -Original Message- From: Peter Minearo [mailto:peter.mine...@reardencommerce.com] Sent: Friday, July 23, 2010 2:14 PM To: user@cassandra.apache.org Subject: RE: CRUD test Interesting!! Let me rephrase to make sure I understood what is going on: When Inserting data via the insert function/method: void insert(string keyspace, string key, ColumnPath column_path, binary value, i64 timestamp, ConsistencyLevel consistency_level) The key parameter is the actual Key to the Row, which contains SuperColumns. The 'ColumnPath' gives the path within the Key. 
INCORRECT:

ColumnPath cp1 = new ColumnPath("Super2");
cp1.setSuper_column("hotel".getBytes());
cp1.setColumn("Best Western".getBytes());
client.insert(KEYSPACE, "name", cp1, "Best Western of SF".getBytes(), System.currentTimeMillis(), ConsistencyLevel.ALL);

CORRECT:

ColumnPath cp1 = new ColumnPath("Super2");
cp1.setSuper_column("name".getBytes());
cp1.setColumn("Best Western".getBytes());
client.insert(KEYSPACE, "hotel", cp1, "Best Western of SF".getBytes(), System.currentTimeMillis(), ConsistencyLevel.ALL);

-Original Message- From: Jonathan Shook [mailto:jsh...@gmail.com] Sent: Friday, July 23, 2010 1:49 PM To: user@cassandra.apache.org Subject: Re: CRUD test Correct. After the initial insert,

cassandra> get Keyspace1.Super2['name']
=> (super_column=hotel,
     (column=Best Western, value=Best Western of SF, timestamp=1279916772571)
     (column=Econolodge, value=Econolodge of SF, timestamp=1279916772573))
Returned 1 results.

... and ...

cassandra> get Keyspace1.Super2['hotel']
Returned 0 results.

On Fri, Jul 23, 2010 at 3:41 PM, Peter Minearo peter.mine...@reardencommerce.com wrote: The Model Should look like:

Super2 = {
  hotel: {
    Best Western: {name: Best Western of SF}
    Econolodge: {name: Econolodge of SF}
  }
}

Are the CRUD Operations not referencing this correctly? -Original Message- From: Jonathan Shook [mailto:jsh...@gmail.com] Sent: Friday, July 23, 2010 1:34 PM To: user@cassandra.apache.org Subject: Re: CRUD test There seem to be data consistency bugs in the test. Are name and hotel being used
Re: CRUD test
I suspect that it is still your timestamps. You can verify this with a fake timestamp generator that is simply incremented on each getTimestamp(). 1 millisecond is a long time for code that is wrapped tightly in a test. You are likely using the same logical timestamp for multiple operations. On Thu, Jul 22, 2010 at 6:29 PM, Peter Minearo peter.mine...@reardencommerce.com wrote: I am able to reproduce his problem. If you take the default storage-conf.xml file and use the Super2 ColumnFamily with the code below, you will see that the data is not getting re-created once you run the delete. It seems to not allow you to create data via Thrift. HOWEVER, data can be created via the command-line tool.

import java.io.UnsupportedEncodingException;
import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ColumnPath;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.InvalidRequestException;
import org.apache.cassandra.thrift.NotFoundException;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.thrift.SuperColumn;
import org.apache.cassandra.thrift.TimedOutException;
import org.apache.cassandra.thrift.UnavailableException;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class CrudTest {
    private static final String KEYSPACE = "Keyspace1";

    public static void main(String[] args) {
        CrudTest client = new CrudTest();
        try {
            client.run();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void run() throws TException, InvalidRequestException, UnavailableException,
            UnsupportedEncodingException, NotFoundException, TimedOutException {
        TTransport tr = new TSocket("localhost", 9160);
        TProtocol proto = new TBinaryProtocol(tr);
        Cassandra.Client client = new Cassandra.Client(proto);
        tr.open();
        System.out.println(" CREATING DATA *");
        createData(client);
        getData(client);
        System.out.println();
        System.out.println(" DELETING DATA *");
        deleteData(client);
        getData(client);
        System.out.println();
        System.out.println(" CREATING DATA *");
        createData(client);
        getData(client);
        tr.close();
    }

    private void createData(Cassandra.Client client) throws InvalidRequestException,
            UnavailableException, TimedOutException, TException {
        ColumnPath cp1 = new ColumnPath("Super2");
        cp1.setSuper_column("hotel".getBytes());
        cp1.setColumn("Best Western".getBytes());
        client.insert(KEYSPACE, "name", cp1, "Best Western of SF".getBytes(), System.currentTimeMillis(), ConsistencyLevel.ALL);
        ColumnPath cp2 = new ColumnPath("Super2");
        cp2.setSuper_column("hotel".getBytes());
        cp2.setColumn("Econolodge".getBytes());
        client.insert(KEYSPACE, "name", cp2, "Econolodge of SF".getBytes(), System.currentTimeMillis(), ConsistencyLevel.ALL);
    }

    private void deleteData(Cassandra.Client client) throws InvalidRequestException,
            UnavailableException, TimedOutException, TException {
        client.remove(KEYSPACE, "hotel", new ColumnPath("Super2"), System.currentTimeMillis(), ConsistencyLevel.ONE);
    }

    private void getData(Cassandra.Client client) throws InvalidRequestException,
            UnavailableException, TimedOutException, TException {
        SliceRange sliceRange = new SliceRange();
        sliceRange.setStart(new byte[] {});
        sliceRange.setFinish(new byte[] {});
        SlicePredicate slicePredicate = new SlicePredicate();
        slicePredicate.setSlice_range(sliceRange);
        getData(client, slicePredicate);
    }

    private void
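The fake timestamp generator suggested at the top of this message can be as small as the following sketch (the getTimestamp() name is hypothetical): it ignores the wall clock entirely, so every operation in a tightly-looped test gets a distinct, strictly increasing timestamp.

```java
// Sketch of a fake timestamp generator for tests: a plain counter, so no two
// operations can ever share a logical timestamp regardless of how fast the
// test loop runs.
public class FakeClock {
    private long counter = 0;

    public synchronized long getTimestamp() {
        return ++counter;
    }

    public static void main(String[] args) {
        FakeClock clock = new FakeClock();
        System.out.println(clock.getTimestamp()); // 1
        System.out.println(clock.getTimestamp()); // 2
    }
}
```

Passing clock.getTimestamp() instead of System.currentTimeMillis() in the test above would make the insert/delete/insert sequence unambiguous.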
Re: Cassandra to store 1 billion small 64KB Blobs
There are two scaling factors to consider here. In general the worst case growth of operations in Cassandra is kept near to O(log2(N)). Any worse growth would be considered a design problem, or at least a high priority target for improvement. This is important for considering the load generated by very large column families, as binary search is used when the bloom filter doesn't exclude rows from a query. O(log2(N)) is basically the best achievable growth for this type of data, but the bloom filter improves on it in some cases by paying a lower cost every time. The other factor to be aware of is the reduction of binary search performance for datasets which can put disk seek times into high ranges. This is mostly a direct consideration for those installations which will be doing lots of cold reads (not cached data) against large sets. Disk seek times are much more limited (low) for adjacent or near tracks, and generally much higher when tracks are sufficiently far apart (as in a very large data set). This can compound with other factors when session times are longer, but that is to be expected with any system. Your storage system may have completely different characteristics depending on caching, etc. The read performance is still quite high relative to other systems for a similar data set size, but the drop-off in performance may be much worse than expected if you are wanting it to be linear. Again, this is not unique to Cassandra. It's just an important consideration when dealing with extremely large sets of data, when memory is not likely to be able to hold enough hot data for the specific application. As always, the real questions have lots more to do with your specific access patterns, storage system, etc. I would look at the benchmarking info available on the lists as a good starting point. 
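As a rough illustration of that growth (plain arithmetic, not a benchmark), the worst-case comparison counts for a binary search over N sorted entries:

```java
// Illustrative arithmetic only: worst-case comparisons for a binary search
// grow as ceil(log2(N)), which is why very large sets stay tractable.
public class Log2Growth {
    public static void main(String[] args) {
        for (long n : new long[] {1_000L, 1_000_000L, 1_000_000_000L}) {
            long steps = (long) Math.ceil(Math.log(n) / Math.log(2));
            System.out.println(n + " rows -> ~" + steps + " comparisons");
        }
    }
}
```

A billion rows costs only about three times as many comparisons as a thousand rows; the dominant real-world cost is the disk seeks behind those comparisons, as described above.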
On Fri, Jul 23, 2010 at 11:51 AM, Michael Widmann michael.widm...@gmail.com wrote: Hi, we plan to use Cassandra as a data storage on at least 2 nodes with RF=2 for about 1 billion small files. We have about 48TB of disk space behind each node. Now my question: is this possible with Cassandra, reliably (meaning every blob is stored on 2 JBODs)? We may grow to nearly 40TB or more of Cassandra storage data ... has anyone done something similar? For retrieval of the blobs we are going to index them with a hash value (the hashes are used to store the blobs), so we can search fast for the entry in the database and combine the blobs back into a normal file. Thanks for the answer, Michael
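For the hash-keyed blob scheme Michael describes, a minimal sketch (hypothetical helper; SHA-256 chosen only as an example digest) of deriving a row key from a chunk's content:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Content-addressed keys for small blobs: the SHA-256 hex digest of each
// chunk becomes its row key, so a lookup goes straight to the blob and a
// file manifest is just an ordered list of chunk hashes.
public class BlobKey {
    static String keyFor(byte[] chunk) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(chunk);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // SHA-256 is available on all JREs
        }
    }

    public static void main(String[] args) {
        byte[] chunk = "example blob data".getBytes(StandardCharsets.UTF_8);
        System.out.println(keyFor(chunk).length()); // 64 hex characters
    }
}
```

Identical chunks hash to the same key, which also gives deduplication for free; reassembling a file means fetching its manifest's hashes in order.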
Re: CRUD test
There seem to be data consistency bugs in the test. Are name and hotel being used in a pair-wise way? Specifically, the first test is creating one and checking for the other. On Fri, Jul 23, 2010 at 2:46 PM, Oleg Tsvinev oleg.tsvi...@gmail.com wrote: Jonathan, I followed your suggestion. Unfortunately, the CRUD test still does not work for me. Can you provide the simplest CRUD test possible that works? On Fri, Jul 23, 2010 at 10:59 AM, Jonathan Shook jsh...@gmail.com wrote: I suspect that it is still your timestamps. You can verify this with a fake timestamp generator that is simply incremented on each getTimestamp(). 1 millisecond is a long time for code that is wrapped tightly in a test. You are likely using the same logical timestamp for multiple operations. On Thu, Jul 22, 2010 at 6:29 PM, Peter Minearo peter.mine...@reardencommerce.com wrote: I am able to reproduce his problem. If you take the default storage-conf.xml file and use the Super2 ColumnFamily with the code below, you will see that the data is not getting re-created once you run the delete. It seems to not allow you to create data via Thrift. HOWEVER, data can be created via the command-line tool. 
import java.io.UnsupportedEncodingException; import java.util.List; import org.apache.cassandra.thrift.Cassandra; import org.apache.cassandra.thrift.Column; import org.apache.cassandra.thrift.ColumnOrSuperColumn; import org.apache.cassandra.thrift.ColumnParent; import org.apache.cassandra.thrift.ColumnPath; import org.apache.cassandra.thrift.ConsistencyLevel; import org.apache.cassandra.thrift.InvalidRequestException; import org.apache.cassandra.thrift.NotFoundException; import org.apache.cassandra.thrift.SlicePredicate; import org.apache.cassandra.thrift.SliceRange; import org.apache.cassandra.thrift.SuperColumn; import org.apache.cassandra.thrift.TimedOutException; import org.apache.cassandra.thrift.UnavailableException; import org.apache.thrift.TException; import org.apache.thrift.protocol.TBinaryProtocol; import org.apache.thrift.protocol.TProtocol; import org.apache.thrift.transport.TSocket; import org.apache.thrift.transport.TTransport; public class CrudTest { private static final String KEYSPACE = Keyspace1; public static void main(String[] args) { CrudTest client = new CrudTest(); try { client.run(); } catch (Exception e) { e.printStackTrace(); } } public void run() throws TException, InvalidRequestException, UnavailableException, UnsupportedEncodingException, NotFoundException, TimedOutException { TTransport tr = new TSocket(localhost, 9160); TProtocol proto = new TBinaryProtocol(tr); Cassandra.Client client = new Cassandra.Client(proto); tr.open(); System.out.println( CREATING DATA *); createData(client); getData(client); System.out.println(); System.out.println( DELETING DATA *); deleteData(client); getData(client); System.out.println(); System.out.println( CREATING DATA *); createData(client); getData(client); tr.close(); } private void createData(Cassandra.Client client) throws InvalidRequestException, UnavailableException, TimedOutException, TException { ColumnPath cp1 = new ColumnPath(Super2); cp1.setSuper_column(hotel.getBytes()); cp1.setColumn(Best 
Western.getBytes()); client.insert(KEYSPACE, name, cp1, Best Western of SF.getBytes(), System.currentTimeMillis(), ConsistencyLevel.ALL); ColumnPath cp2 = new ColumnPath(Super2); cp2.setSuper_column(hotel.getBytes()); cp2.setColumn(Econolodge.getBytes()); client.insert(KEYSPACE, name, cp2, Econolodge of SF.getBytes(), System.currentTimeMillis(), ConsistencyLevel.ALL); } private void deleteData(Cassandra.Client client) throws InvalidRequestException, UnavailableException, TimedOutException, TException { client.remove(KEYSPACE, hotel, new ColumnPath(Super2), System.currentTimeMillis
Re: CRUD test
Correct. After the initial insert, cassandra get Keyspace1.Super2['name'] = (super_column=hotel, (column=Best Western, value=Best Western of SF, timestamp=1279916772571) (column=Econolodge, value=Econolodge of SF, timestamp=1279916772573)) Returned 1 results. ... and ... cassandra get Keyspace1.Super2['hotel'] Returned 0 results. On Fri, Jul 23, 2010 at 3:41 PM, Peter Minearo peter.mine...@reardencommerce.com wrote: The Model Should look like: Super2 = { hotel: { Best Western: {name: Best Western of SF} Econolodge: {name: Econolodge of SF} } } Are the CRUD Operations not referencing this correctly? -Original Message- From: Jonathan Shook [mailto:jsh...@gmail.com] Sent: Friday, July 23, 2010 1:34 PM To: user@cassandra.apache.org Subject: Re: CRUD test There seem to be data consistency bugs in the test. Are name and hotel being used in a pair-wise way? Specifically, the first test is using creating one and checking for the other. On Fri, Jul 23, 2010 at 2:46 PM, Oleg Tsvinev oleg.tsvi...@gmail.com wrote: Johathan, I followed your suggestion. Unfortunately, CRUD test still does not work for me. Can you provide a simplest CRUD test possible that works? On Fri, Jul 23, 2010 at 10:59 AM, Jonathan Shook jsh...@gmail.com wrote: I suspect that it is still your timestamps. You can verify this with a fake timestamp generator that is simply incremented on each getTimestamp(). 1 millisecond is a long time for code that is wrapped tightly in a test. You are likely using the same logical time stamp for multiple operations. On Thu, Jul 22, 2010 at 6:29 PM, Peter Minearo peter.mine...@reardencommerce.com wrote: I am able to reproduce his problem. If you take the default storage-conf.xml file and utilize the Super2 ColumnFamily with the code below. You will see that the data is not getting created once you run the delete. It seems to not allow you to create data via Thrift. HOWEVER, data can be created via the command line tool. 
import java.io.UnsupportedEncodingException; import java.util.List; import org.apache.cassandra.thrift.Cassandra; import org.apache.cassandra.thrift.Column; import org.apache.cassandra.thrift.ColumnOrSuperColumn; import org.apache.cassandra.thrift.ColumnParent; import org.apache.cassandra.thrift.ColumnPath; import org.apache.cassandra.thrift.ConsistencyLevel; import org.apache.cassandra.thrift.InvalidRequestException; import org.apache.cassandra.thrift.NotFoundException; import org.apache.cassandra.thrift.SlicePredicate; import org.apache.cassandra.thrift.SliceRange; import org.apache.cassandra.thrift.SuperColumn; import org.apache.cassandra.thrift.TimedOutException; import org.apache.cassandra.thrift.UnavailableException; import org.apache.thrift.TException; import org.apache.thrift.protocol.TBinaryProtocol; import org.apache.thrift.protocol.TProtocol; import org.apache.thrift.transport.TSocket; import org.apache.thrift.transport.TTransport; public class CrudTest { private static final String KEYSPACE = Keyspace1; public static void main(String[] args) { CrudTest client = new CrudTest(); try { client.run(); } catch (Exception e) { e.printStackTrace(); } } public void run() throws TException, InvalidRequestException, UnavailableException, UnsupportedEncodingException, NotFoundException, TimedOutException { TTransport tr = new TSocket(localhost, 9160); TProtocol proto = new TBinaryProtocol(tr); Cassandra.Client client = new Cassandra.Client(proto); tr.open(); System.out.println( CREATING DATA *); createData(client); getData(client); System.out.println(); System.out.println( DELETING DATA *); deleteData(client); getData(client); System.out.println(); System.out.println( CREATING DATA *); createData(client); getData(client); tr.close(); } private void createData(Cassandra.Client client) throws InvalidRequestException, UnavailableException, TimedOutException, TException { ColumnPath cp1 = new ColumnPath(Super2); cp1.setSuper_column(hotel.getBytes()); cp1.setColumn(Best 
Western.getBytes()); client.insert(KEYSPACE, name, cp1, Best Western of SF.getBytes(), System.currentTimeMillis
Re: CRUD test
I think you are getting it. As far as what means what at which level, it's really about using them consistently in every case. The [row] key (or [row] key range) is a top-level argument for all of the operations, since it is the key to mapping the set of responsible nodes. The key is the part of the name of any column which most affects how the load is apportioned in the cluster, so it is used very early in request processing. On Fri, Jul 23, 2010 at 4:22 PM, Peter Minearo peter.mine...@reardencommerce.com wrote: Consequentially the remove should look like: ColumnPath cp1 = new ColumnPath(Super2); cp1.setSuper_column(Best Western.getBytes()); client.remove(KEYSPACE, hotel, cp1, System.currentTimeMillis(), ConsistencyLevel.ONE); ColumnPath cp2 = new ColumnPath(Super2); cp2.setSuper_column(Econolodge.getBytes()); client.remove(KEYSPACE, hotel, cp2, System.currentTimeMillis(), ConsistencyLevel.ONE); -Original Message- From: Peter Minearo [mailto:peter.mine...@reardencommerce.com] Sent: Fri 7/23/2010 2:17 PM To: user@cassandra.apache.org Subject: RE: CRUD test CORRECTION: ColumnPath cp1 = new ColumnPath(Super2); cp1.setSuper_column(Best Western.getBytes()); cp1.setColumn(name.getBytes()); client.insert(KEYSPACE, hotel, cp1, Best Western of SF.getBytes(), System.currentTimeMillis(), ConsistencyLevel.ALL); -Original Message- From: Peter Minearo [mailto:peter.mine...@reardencommerce.com] Sent: Friday, July 23, 2010 2:14 PM To: user@cassandra.apache.org Subject: RE: CRUD test Interesting!! Let me rephrase to make sure I understood what is going on: When Inserting data via the insert function/method: void insert(string keyspace, string key, ColumnPath column_path, binary value, i64 timestamp, ConsistencyLevel consistency_level) The key parameter is the actual Key to the Row, which contains SuperColumns. The 'ColumnPath' gives the path within the Key. 
INCORRECT:

ColumnPath cp1 = new ColumnPath("Super2");
cp1.setSuper_column("hotel".getBytes());
cp1.setColumn("Best Western".getBytes());
client.insert(KEYSPACE, "name", cp1, "Best Western of SF".getBytes(), System.currentTimeMillis(), ConsistencyLevel.ALL);

CORRECT:

ColumnPath cp1 = new ColumnPath("Super2");
cp1.setSuper_column("name".getBytes());
cp1.setColumn("Best Western".getBytes());
client.insert(KEYSPACE, "hotel", cp1, "Best Western of SF".getBytes(), System.currentTimeMillis(), ConsistencyLevel.ALL);

-Original Message- From: Jonathan Shook [mailto:jsh...@gmail.com] Sent: Friday, July 23, 2010 1:49 PM To: user@cassandra.apache.org Subject: Re: CRUD test Correct. After the initial insert,

cassandra> get Keyspace1.Super2['name']
=> (super_column=hotel,
     (column=Best Western, value=Best Western of SF, timestamp=1279916772571)
     (column=Econolodge, value=Econolodge of SF, timestamp=1279916772573))
Returned 1 results.

... and ...

cassandra> get Keyspace1.Super2['hotel']
Returned 0 results.

On Fri, Jul 23, 2010 at 3:41 PM, Peter Minearo peter.mine...@reardencommerce.com wrote: The Model Should look like:

Super2 = {
  hotel: {
    Best Western: {name: Best Western of SF}
    Econolodge: {name: Econolodge of SF}
  }
}

Are the CRUD Operations not referencing this correctly? -Original Message- From: Jonathan Shook [mailto:jsh...@gmail.com] Sent: Friday, July 23, 2010 1:34 PM To: user@cassandra.apache.org Subject: Re: CRUD test There seem to be data consistency bugs in the test. Are name and hotel being used in a pair-wise way? Specifically, the first test is creating one and checking for the other. On Fri, Jul 23, 2010 at 2:46 PM, Oleg Tsvinev oleg.tsvi...@gmail.com wrote: Jonathan, I followed your suggestion. Unfortunately, the CRUD test still does not work for me. Can you provide the simplest CRUD test possible that works? On Fri, Jul 23, 2010 at 10:59 AM, Jonathan Shook jsh...@gmail.com wrote: I suspect that it is still your timestamps. 
You can verify this with a fake timestamp generator that is simply incremented on each getTimestamp(). 1 millisecond is a long time for code that is wrapped tightly in a test. You are likely using the same logical time stamp for multiple operations. On Thu, Jul 22, 2010 at 6:29 PM, Peter Minearo peter.mine...@reardencommerce.com wrote: I am able to reproduce his problem. If you take the default storage-conf.xml file and utilize the Super2 ColumnFamily with the code below. You will see that the data is not getting created once you run the delete. It seems to not allow you
Re: more questions on Cassandra ACID properties
You are correct. In this case, Cassandra would journal two writes to the same logical row, but they would be 2 independent writes. Writes do not depend on reads, so they are self-contained. If either column exists already, it will be overwritten. These journaled actions would then be applied to the memtables, and optionally to the on-disk structures, depending on the configuration. (Asynchronous accumulation and flushing provides the best performance, but write-through persistence is an option in the config.) The memtables may have to be read and written, but they only keep a logical instance of each row, from what I know. Maybe a dev can confirm this. On Tue, Jul 20, 2010 at 2:58 PM, Alex Yiu bigcontentf...@gmail.com wrote: Hi, I have more questions on Cassandra ACID properties. Say, I have a row that has 3 columns already: colA, colB and colC. And, if two *concurrent* clients perform a different insert(...) into the same row, one insert for colD and the other insert for colE, then Cassandra would guarantee both columns will be added to the same row. Is that correct? That is, insert(...) of a column does NOT involve reading and rewriting other existing columns of the same row? That is, we do not face the following situation: client X: read colA, colB and colC; then write colA, colB, colC and colD. client Y: read colA, colB and colC; then write colA, colB, colC and colE. BTW, it seems to me that the insert() API as described in the wiki page: http://wiki.apache.org/cassandra/API should handle updating an existing column as well, by replacing the existing column value. If that is the case, I guess we should change the wording from insert to insert or update in the wiki doc. And, ideally, the insert(...) API operation name would be adapted to update_or_insert(...). Looking forward to replies that may confirm my understanding. Thanks! Regards, Alex Yiu
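The column-independent write behavior described above can be modeled in plain Java (this simulates the semantics with an in-memory map; it does not touch Cassandra):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Models the semantics described above: two concurrent writers add different
// columns to the same "row" without reading or rewriting each other's
// columns, so both additions survive.
public class RowModel {
    static Map<String, String> concurrentInsert() {
        Map<String, String> row = new ConcurrentHashMap<>();
        row.put("colA", "a"); row.put("colB", "b"); row.put("colC", "c");
        Thread x = new Thread(() -> row.put("colD", "d"));
        Thread y = new Thread(() -> row.put("colE", "e"));
        x.start(); y.start();
        try {
            x.join(); y.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return row;
    }

    public static void main(String[] args) {
        System.out.println(concurrentInsert().size()); // 5 -- colD and colE both landed
    }
}
```

In Cassandra the analogous per-column independence is why neither of the read-then-rewrite interleavings Alex describes can occur.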
Re: get_range_slices
FYI: https://issues.apache.org/jira/browse/CASSANDRA-1145 Yes, it's a bug. CL.ONE is a reasonable workaround. On Thu, Jul 8, 2010 at 11:04 PM, Mike Malone m...@simplegeo.com wrote: I think the answer to your question is no, you shouldn't. I'm feeling far too lazy to do even light research on the topic, but I remember there being a bug where replicas weren't consolidated and you'd get a result set that included data from each replica that was consulted for a query. That could be what you're seeing. Are you running the most recent release? Try dropping to CL.ONE and see if you only get one copy. If that fixes it, I'd suggest searching JIRA. Mike On Thu, Jul 8, 2010 at 6:40 PM, Jonathan Shook jsh...@gmail.com wrote: Should I ever expect multiples of the same key (with non-empty column sets) from the same get_range_slices call? I've verified that the column data is identical byte-for-byte as well, including column timestamps.
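The consolidation step that the bug skipped can be modeled in miniature (hypothetical types; this assumes one entry per key should survive, preferring the highest timestamp):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Models replica consolidation: when the same row key comes back from
// several replicas, keep a single entry per key (here, the newest one).
public class Consolidate {
    static class Row {
        final String key; final long timestamp;
        Row(String key, long timestamp) { this.key = key; this.timestamp = timestamp; }
    }

    static Map<String, Row> merge(List<Row> fromReplicas) {
        Map<String, Row> out = new LinkedHashMap<>();
        for (Row r : fromReplicas)
            out.merge(r.key, r, (a, b) -> a.timestamp >= b.timestamp ? a : b);
        return out;
    }

    public static void main(String[] args) {
        List<Row> raw = List.of(new Row("k1", 10), new Row("k1", 12), new Row("k2", 5));
        System.out.println(merge(raw).size()); // 2 -- duplicate k1 collapsed
    }
}
```

Reading at CL.ONE sidesteps the issue simply because only one replica's rows ever reach the client.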
get_range_slices
Should I ever expect multiples of the same key (with non-empty column sets) from the same get_range_slices call? I've verified that the column data is identical byte-for-byte, as well, including column timestamps?
Re: Identifying Tombstones
Or the same key, in some cases. If you have multiple operations against the same columns 'at the same time', the ordering may be indefinite. This can happen if the effective resolution of your timestamp is coarse enough to bracket multiple operations. Milliseconds are not fine enough in many cases, and will be less adequate going forward. On Thu, Jul 1, 2010 at 9:08 AM, Jonathan Ellis jbel...@gmail.com wrote: On Thu, Jul 1, 2010 at 6:44 AM, Jools jool...@gmail.com wrote: Should you try to write to the same column family using the same key as a tombstone, it will be silently ignored. Only if you perform the write with a lower timestamp than the delete you previously performed. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Implementing Counter on Cassandra
Until then, a pragmatic solution, however undesirable, would be to have only a single logical thread/task/actor that is allowed to read, modify, and update. If this doesn't work for your application, then a (distributed) lock manager may be used until such time that you can take it out. Some are using ZooKeeper for this. On Tue, Jun 29, 2010 at 11:45 AM, Ryan King r...@twitter.com wrote: On Tue, Jun 29, 2010 at 9:42 AM, Utku Can Topçu u...@topcu.gen.tr wrote: Hey Guys, Currently in a project I'm involved in, I need to have some columns holding incremented data. The easy approach for implementing a counter with increments, as I figured out, is read - increment - insert; however, this approach is not an atomic operation and can easily be corrupted over time. Do you have any best practices for implementing an atomic counter on Cassandra? https://issues.apache.org/jira/browse/CASSANDRA-1072
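An in-process analogue of serializing the increments (an AtomicLong stands in for whatever serializes access, whether a single logical actor or a ZooKeeper lock; names hypothetical):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: when the read-increment-insert sequence is funneled through one
// serialized point, concurrent increments can no longer interleave
// destructively, so no updates are lost.
public class SingleWriterCounter {
    private final AtomicLong value = new AtomicLong();

    public long increment() { return value.incrementAndGet(); }
    public long get() { return value.get(); }

    public static void main(String[] args) throws InterruptedException {
        SingleWriterCounter c = new SingleWriterCounter();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> { for (int j = 0; j < 1000; j++) c.increment(); });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        System.out.println(c.get()); // 4000 -- no lost updates
    }
}
```

With the unserialized read-increment-insert pattern against Cassandra, two writers reading the same value would each write value+1 and one increment would vanish; serializing access is exactly what prevents that.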
Re: Distributed work-queues?
Ideas:

Use a checkpoint that moves forward in time for each logical partition of the workload.

Establish a way of dividing up jobs between clients that doesn't require synchronization. One way of doing this would be to modulo the key by the number of logical workers, allowing them to graze directly on the job data. Doing it this way means that you have to make the workers smart enough to checkpoint properly, handle exceptions, etc. Jobs may be dispatched out-of-order in this scheme, so you would have to decide how to handle explicit sequencing requirements. Some jobs have idempotent results only when executed in the same order, and keeping operations idempotent allows for simpler failure recovery. If your workers are capable of absorbing the workload, then backlogging won't hurt too much. Otherwise, you'll see strange ordering of things in your application when they would otherwise need to look more consistent. You might find it easier to just take the hit of having a synchronized dispatcher, but make it as lean as possible. Another way to break the workload up is to have logical groupings of jobs according to a natural boundary in your domain model, and to run a synchronized dispatcher for each of those.

Using the job columns to keep track of who owns a job may not be the best approach. You may have to do row scans on column data, which is a Cassandra anti-pattern. Without an atomic check-and-modify operation, there is no way to do it that avoids possible race conditions or extra state management. This may be one of the strongest arguments for putting such an operation into Cassandra.

You can set up your job name/keying such that every job result is logically ordered to come immediately after the job definition. Row key range scans would still be close to optimal, but would carry a marker for jobs which had been completed. This would allow clients to self-checkpoint, as long as result insertions are atomic row-wise. (I think they are.) 
Another worker could clean up rows which were subsequently consumed (results no longer needed) after some gap in time. The client can avoid lots of tombstones by only looking where there should be additional work (checkpoint time). Pick a character that is not natural for your keys and make it a delimiter. Require that all keys in the job CF be aggregate and fully-qualified. Clients might be able to remove job rows that allow for it after completion, but jobs which were dispatched to multiple workers may end up with orphaned result rows to be cleaned up. .. just some drive-by ramblings .. Jonathan On Sat, Jun 26, 2010 at 3:56 PM, Andrew Miklas and...@pagerduty.com wrote: Hi all, Has anyone written a work-queue implementation using Cassandra? There's a section in the UseCase wiki page for A distributed Priority Job Queue which looks perfect, but unfortunately it hasn't been filled in yet. http://wiki.apache.org/cassandra/UseCases#A_distributed_Priority_Job_Queue I've been thinking about how best to do this, but every solution I've thought of seems to have some serious drawback. The range ghost problem in particular creates some issues. I'm assuming each job has a row within some column family, where the row's key is the time at which the job should be run. To find the next job, you'd do a range query with a start a few hours in the past, and an end at the current time. Once a job is completed, you delete the row. The problem here is that you have to scan through deleted-but-not-yet-GCed rows each time you run the query. Is there a better way? Preventing more than one worker from starting the same job seems like it would be a problem too. You'd either need an external locking manager, or have to use some other protocol where workers write their ID into the row and then immediately read it back to confirm that they are the owner of the job. Any ideas here? Has anyone come up with a nice implementation? Is Cassandra not well suited for queue-like tasks? Thanks, Andrew
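The synchronization-free modulo division of jobs sketched above, in miniature (hypothetical key scheme):

```java
// Sketch of the modulo scheme: each of N workers grazes only on the job keys
// that map to its own slot, so no dispatcher or lock manager is needed.
public class ModuloPartition {
    static int workerFor(String jobKey, int workerCount) {
        // floorMod keeps the result in [0, workerCount) even for negative hashes.
        return Math.floorMod(jobKey.hashCode(), workerCount);
    }

    public static void main(String[] args) {
        String[] jobs = {"job-1", "job-2", "job-3", "job-4"};
        int workers = 3, covered = 0;
        for (String key : jobs) {
            int w = workerFor(key, workers); // exactly one worker claims each job
            if (w >= 0 && w < workers) covered++;
        }
        System.out.println(covered); // 4 -- every job lands on exactly one worker
    }
}
```

Because the assignment is a pure function of the key, two workers can never claim the same job; the trade-off, as noted above, is that each worker must handle its own checkpointing and failure recovery.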
Re: java.lang.RuntimeException: java.io.IOException: Value too large for defined data type
Actually, you shouldn't expect errors in the general case, unless you are simply trying to use data that can't fit in available heap. There are some practical limitations, as always. If there aren't enough resources on the server side to service the clients, the expectation should be that the servers degrade gracefully in performance, or in the worst case throw an error specific to resource exhaustion or explicit resource throttling. The fact that Cassandra does some background processing complicates this a bit. There are things which can cause errors after the fact, but these are generally considered resource tuning issues and are somewhat clear cut. There are specific changes in the works to bring background load exceptions into view of a client session, where users normally expect them. @see https://issues.apache.org/jira/browse/CASSANDRA-685 But otherwise, users shouldn't be expecting that simply increasing client load can blow up their Cassandra cluster. Any time this happens, it should be considered a bug or a misfeature. Devs please correct me here if I'm wrong. Jonathan On Tue, Jun 15, 2010 at 6:44 PM, Charles Butterfield charles.butterfi...@nextcentury.com wrote: Benjamin Black b at b3k.us writes: I am only saying something obvious: if you don't have sufficient resources to handle the demand, you should reduce demand, increase resources, or expect errors. Doing lots of writes without much heap space is such a situation (whether or not it is happening in this instance), but there are many others. This constraint is not specific to Cassandra. Hence, there is no free lunch. b I guess my point is that I have rarely run across database servers that die from either too many client connections, or too rapid client requests. They generally stop accepting incoming connections when there are too many connection requests, and further they do not queue and acknowledge an unbounded number of client requests on any given connection. 
In the example at hand, Julie has 8 clients, each of which is in a loop that writes 100 rows at a time (via batch_mutate), waits for successful completion, then writes another bunch of 100, until it completes all of the rows it is supposed to write (typically 100,000). So at any one time, each client should have about 10 MB of request (100 rows x 100 KB/row), times 8 clients, for a max pending request of no more than 80 MB. Further, each request is running with CL=ALL, so in theory, the request should not complete until each row has been handed off to the ultimate destination node, and perhaps written to the commit log (that part is not clear to me). It sounds like something else must be gobbling up either an unbounded amount of heap, or alternatively, a bounded, but large amount of heap. In the former case it is unclear how to make the application robust. In the latter, it would be helpful to understand what the heap usage upper bound is, and what parameters might have a significant effect on that value. To clarify the history here -- initially we were writing with CL=0 and had great performance but ended up killing the server. It was pointed out that we were really asking the server to accept and acknowledge an unbounded number of requests without waiting for any final disposition of the rows. So we had a doh! moment. That is why we went to the other extreme of CL=ALL, to let the server fully dispose of each request before acknowledging it and getting the next. TIA -- Charlie
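Charlie's back-of-the-envelope bound can be checked with a quick sketch (figures are the ones reported in the thread; decimal KB-to-MB conversion is assumed to match his round numbers):

```python
# Rough upper bound on data pending acknowledgement across all clients,
# using the figures reported in the thread.
rows_per_batch = 100   # rows per batch_mutate call
row_size_kb = 100      # ~100 KB per row
clients = 8            # concurrent client loops

batch_mb = rows_per_batch * row_size_kb / 1000   # MB in flight per client
total_mb = batch_mb * clients                    # MB in flight cluster-wide

print(f"per-client batch: ~{batch_mb:.0f} MB, total pending: ~{total_mb:.0f} MB")
```

This reproduces the ~10 MB per client and ~80 MB total cited above, which is why the unbounded-heap suspicion in the message is reasonable: 80 MB of pending requests alone should not exhaust a normally sized heap.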
Re: Some questions about using Cassandra
There is JSON import and export, if you want a form of external backup. No, you can't hook event subscribers into the storage engine. You can modify it to do this, however. It may not be trivial. An easier way to do this would be to have a boundary system (or dedicated thread, for example) consume data in small amounts, using some temporal criterion, with a checkpoint. If the results of consuming the data are idempotent, you don't have to use a checkpoint, necessarily, but some cyclic rework may occur. If your storage layout includes temporal names, it should be straightforward. The details of exactly how would depend on your storage layout, but it is not unusual as far as requirements go. On Tue, Jun 15, 2010 at 7:49 PM, Anthony Ikeda anthony.ik...@cardlink.com.au wrote: We are currently looking at a distributed database option and so far Cassandra ticks all the boxes. However, I still have some questions. Is there any need for archiving of Cassandra and what backup options are available? As it is a no-data-loss system I’m guessing archiving is not exactly relevant. Is there any concept of Listeners such that when data is added to Cassandra we can fire off another process to do something with that data? E.g. create a copy in a secondary database for Business Intelligence reports? Send the data to an LDAP server? Anthony Ikeda Java Analyst/Programmer Cardlink Services Limited Level 4, 3 Rider Boulevard Rhodes NSW 2138 Web: www.cardlink.com.au | Tel: + 61 2 9646 9221 | Fax: + 61 2 9646 9283 ** This e-mail message and any attachments are intended only for the use of the addressee(s) named above and may contain information that is privileged and confidential. If you are not the intended recipient, any display, dissemination, distribution, or copying is strictly prohibited. If you believe you have received this e-mail message in error, please immediately notify the sender by replying to this e-mail message or by telephone to (02) 9646 9222.
Please delete the email and any attachments and do not retain the email or any attachments in any form. **
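The checkpointed-consumer pattern Jonathan describes above can be sketched as follows (all names are hypothetical; a plain dict stands in for a temporally keyed column family, and the checkpoint would be persisted somewhere durable in real code):

```python
# Sketch: consume temporally keyed data in small batches, carrying a
# checkpoint so work is not repeated after a restart. If downstream
# processing is idempotent, a stale checkpoint only causes some rework.
def consume(store, checkpoint, batch_size, handle):
    """store: dict of timestamp -> value; checkpoint: last timestamp handled."""
    pending = sorted(ts for ts in store if ts > checkpoint)
    for ts in pending[:batch_size]:
        handle(ts, store[ts])   # e.g. forward to a BI database or LDAP
        checkpoint = ts         # persist this durably in real code
    return checkpoint

# Usage: two passes drain a five-item store in batches of three.
events = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
seen = []
cp = consume(events, 0, 3, lambda ts, v: seen.append(ts))
cp = consume(events, cp, 3, lambda ts, v: seen.append(ts))
print(seen)  # [1, 2, 3, 4, 5]
```

The design choice here mirrors the advice in the thread: rather than hooking the storage engine, a boundary process polls on a temporal criterion, which keeps Cassandra itself unmodified.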
Re: Some questions about using Cassandra
Doh! Replace "of" with "if" in the top line. On Tue, Jun 15, 2010 at 7:57 PM, Jonathan Shook jsh...@gmail.com wrote: There is JSON import and export, of you want a form of external backup.
Re: Cassandra Write Performance, CPU usage
Rishi, I am not yet knowledgeable enough to answer your question in more detail. I would like to know more about the specifics as well. There are counters you can use via JMX to show logical events, but this will not always translate to good baseline information that you can use in scaling estimates. I would like to see a good analysis that characterizes the scaling factors of different parts of the system, both from load characterization and from an algorithmic perspective. This is a common area of inquiry. Maybe we should start http://wiki.apache.org/cassandra/ScalabilityFactors On Thu, Jun 10, 2010 at 11:05 PM, Rishi Bhardwaj khichri...@yahoo.com wrote: Hi Jonathan Thanks for such an informative reply. My application may end up doing such continuous bulk writes to Cassandra and thus I was interested in such a performance case. I was wondering what all the CPU overheads are for each row/column written to Cassandra? You mentioned updating of bloom filters; would that be the main CPU overhead, or might there even be copying of data happening? I want to investigate all the factors in play here and whether there is a possibility for improvement. Is it possible to profile Cassandra and see what may be the bottleneck here? The auxiliary I/O you had mentioned for the Bloom filters, wouldn't that occur with the I/O for the SSTable, in which case the extra I/O for the bloom filter gets piggybacked with the SSTable I/O? I guess I don't understand the Cassandra internals too well but wanted to see how much Cassandra can achieve for continuous bulk writes. Has anyone done any bulk write experiments with Cassandra? Is Cassandra performance always expected to be bottlenecked by CPU when doing continuous bulk writes? Thanks for all the help, Rishi From: Jonathan Shook jsh...@gmail.com To: user@cassandra.apache.org Sent: Thu, June 10, 2010 7:39:24 PM Subject: Re: Cassandra Write Performance, CPU usage You are testing Cassandra in a way that it was not designed to be used.
Bandwidth to disk is not a meaningful example for nearly anything except for filesystem benchmarking and things very nearly the same as filesystem benchmarking. Unless the usage patterns of your application match your test data, there is not a good reason to expect a strong correlation between this test and actual performance. Cassandra is not simply shuffling data through IO when you write. There are calculations that have to be done as writes filter their way through various stages of processing. The point of this is to minimize the overall effort Cassandra has to make in order to retrieve the data again. One example would be bloom filters. Each column that is written requires bloom filter processing and potentially auxiliary IO. Some of these steps are allowed to happen in the background, but if you try, you can cause them to stack up on top of the available CPU and memory resources. In such a case (continuous bulk writes), you are causing all of these costs to be taken in more of a synchronous (not delayed) fashion. You are not allowing the background processing that helps reduce client blocking (by deferring some processing) to do its magic. On Thu, Jun 10, 2010 at 7:42 PM, Rishi Bhardwaj khichri...@yahoo.com wrote: Hi I am investigating Cassandra write performance and see very heavy CPU usage from Cassandra. I have a single node Cassandra instance running on a dual core (2.66 Ghz Intel ) Ubuntu 9.10 server. The writes to Cassandra are being generated from the same server using BatchMutate(). The client makes exactly one RPC call at a time to Cassandra. Each BatchMutate() RPC contains 2 MB of data and once it is acknowledged by Cassandra, the next RPC is done. Cassandra has two separate disks, one for commitlog with a sequential b/w of 130MBps and the other a solid state disk for data with b/w of 90MBps. Tuning various parameters, I observe that I am able to attain a maximum write performance of about 45 to 50 MBps from Cassandra. 
I see that the Cassandra java process consistently uses 100% to 150% of CPU resources (as shown by top) during the entire write operation. Also, iostat clearly shows that the max disk bandwidth is not reached anytime during the write operation, every now and then the i/o activity on commitlog disk and the data disk spike but it is never consistently maintained by cassandra close to their peak. I would imagine that the CPU is probably the bottleneck here. Does anyone have any idea why Cassandra beats the heck out of the CPU here? Any suggestions on how to go about finding the exact bottleneck here? Some more information about the writes: I have 2 column families, the data though is mostly written in one column family with column sizes of around 32k and each row having around 256 or 512 columns. I would really appreciate any help here. Thanks, Rishi
Re: Perl/Thrift/Cassandra strangeness
I was misreading the result with the original slice range. I should have been expecting exactly 2 ColumnOrSuperColumns, which is what I got. I was erroneously expecting only 1. Thanks! Jonathan 2010/6/8 Ted Zlatanov t...@lifelogs.com: On Mon, 7 Jun 2010 17:20:56 -0500 Jonathan Shook jsh...@gmail.com wrote: JS The point is to get the last super-column. ... JS Is the Perl Thrift client problematic, or is there something else that JS I am missing? Try Net::Cassandra::Easy; if it does what you want, look at the debug output or trace the code to see how the predicate is specified so you can duplicate that in your own code. In general yes, the Perl Thrift interface is problematic. It's slow and semantically inconsistent. Ted
Re: Perl/Thrift/Cassandra strangeness
Possible bug... Using a slice range with the empty sentinel values, and a count of 1, sometimes yields 2 ColumnOrSuperColumns, sometimes 1. The inconsistency had led me to believe that the count was not working, hence the additional confusion. There was a particular key which returns exactly 2 ColumnOrSuperColumns. This happened repeatedly, even when other data was inserted before or after. All of the other keys were returning the expected 1 ColumnOrSuperColumn. Once I added a 4th super column to the key in question, it started behaving the same as the others, yielding exactly 1 ColumnOrSuperColumn. Here is the code for the predicate: my $predicate = new Cassandra::SlicePredicate(); my $slice_range = new Cassandra::SliceRange(); $slice_range->{start} = ''; $slice_range->{finish} = ''; $slice_range->{reversed} = 1; $slice_range->{count} = 1; $predicate->{slice_range} = $slice_range; The columns are in the right order (reversed), so I'll get what I need by accessing only the first result in each slice. If I wanted to iterate the returned list of slices, it would manifest as a bug in my client. (Cassandra 6.1/Thrift/Perl) On Tue, Jun 8, 2010 at 11:18 AM, Jonathan Shook jsh...@gmail.com wrote: I was misreading the result with the original slice range. I should have been expecting exactly 2 ColumnOrSuperColumns, which is what I got. I was erroneously expecting only 1. Thanks! Jonathan 2010/6/8 Ted Zlatanov t...@lifelogs.com: On Mon, 7 Jun 2010 17:20:56 -0500 Jonathan Shook jsh...@gmail.com wrote: JS The point is to get the last super-column. ... JS Is the Perl Thrift client problematic, or is there something else that JS I am missing? Try Net::Cassandra::Easy; if it does what you want, look at the debug output or trace the code to see how the predicate is specified so you can duplicate that in your own code. In general yes, the Perl Thrift interface is problematic. It's slow and semantically inconsistent. Ted
Perl/Thrift/Cassandra strangeness
I have a structure like this: CF:Status { Row(Component42) { SuperColumn(1275948636203) (epoch millis) { sub columns... } } } The supercolumns are dropped in periodically by system A, which is using Hector. System B uses a lightweight Perl/Thrift client to reduce process overhead. (It gets called as a process frequently.) This will go away at some point, but for now it is the de-facto means for integrating the two systems. According to the API docs, under get_range_slices: "The empty string ("") can be used as a sentinel value to get the first/last existing key (or first/last column in the column predicate parameter)." This seems to conflict directly with the error message that I am getting: "column name must not be empty", which ISA Cassandra::InvalidRequestException. The point is to get the last super-column. I've also tried to set the predicate's slice_range for all columns, reversed, limit 1, but it simply returns multiple super columns. Is the Perl Thrift client problematic, or is there something else that I am missing?
Re: Conditional get
It sounds like you are getting a handle on it, but maybe in a round-about way. Here are some ways I like to conceptualize Cassandra. Maybe they can shorten your walk. Either the grid analogy or the maps-of-maps analogy can apply, as they both map conceptually to the way that we use a column family. -- The maps-of-maps analogy: Please try to think of the column as the intersection between a row key and a column name. This captures the most essential concepts. It's easier for me to think of it in terms of a sorted map to a sorted map, where: * the outer map is the set of rows whose (map) keys and (map) values are (Cassandra) keys and (Cassandra) rows * the inner map for each row key is the set of columns whose keys and values are column names and column data. * column data is essentially a molecule of (column name, column value, storage timestamp). It can be thought of as the value, but it is stored as a 3-tuple. -- The grid analogy: (This one is my favorite) In the grid analogy, rows may be undefined. Rows that are defined may have columns that are undefined. Two things to think about when using this analogy: Cassandra doesn't have to store undefined values, except during deletes and before anti-entropy takes them away. Cassandra operates behind the scenes in row-major order. That means that while you can think of it in terms of a Cartesian intersection, you should know that rows will always be accessed first. -- Another layer outward is the column family, which is also a map. Another layer inward is the sub-column, which is also a map. Don't get confused by super columns or sub columns. Super/Sub columns are really API sugar to reduce some of the work of using your own serialized aggregates within a normal column value. I find that the confusion is usually not worth the trouble when starting out. On the other hand, were you to implement your own aggregate types within a column value, the purpose of super/sub columns would seem obvious.
It's just a little overly complex because of the supporting types in the API. Since this was basically bolted on to the standard column support, it looks like normal column behavior to the core Cassandra machinery. Neither the column family layer nor the subcolumn layer has been given the same attention as the basic row-column with respect to performance and scalability. This may change in the future. For now, consider that row-keys and column-names are the places where Cassandra is able to scale the best. Jonathan On Sat, Jun 5, 2010 at 4:06 PM, Peter Schuller peter.schul...@infidyne.com wrote: Eric wrote a good explanation with sample code at http://www.rackspacecloud.com/blog/2010/05/12/cassandra-by-example/ Regarding the schema description and analogy problem mentioned in the article; I found that reading the BigTable paper helped a lot for me. It seemed very useful to me to think of a ColumnFamily in Cassandra as a sorted (on keys) on-disk table of entries with efficiency guarantees with respect to range queries and locality on disk. Please correct me if I am wrong, but the data model as I now understand it essentially boils down to a sorted table of the form (readers who don't know the answer, please don't assume I'm right unless someone in the know confirms it; I don't want to add to the confusion): rowkeyN+0,columnM+0 data rowkeyN+0,columnM+1 data ... rowkeyN+1 data rowkeyN+2 data ... Where each piece of data is the column (I am ignoring super columns for now). The table, other than being sorted, is indexed on row key and column name. Is this correct? In my head I think of it as there being some N amount of keys (not the cassandra term) that are interesting to the application, which end up mapping to the actual key (not the cassandra term) in the table. So, in a column family "users", we might have a "john doe" whose age is 47.
This means we have a key (not the cassandra term) which is "users,john doe,age" and whose value is "47" (ignoring time stamps, ignoring keys that contain commas, and ignoring column names being semantically part of the data). So, given: users,john doe,age We have, in cassandra terms: column family: users key: john doe column name: age The fact that different column families are in different files, to me, seems mostly to be an implementation detail since performance characteristics (sorting, locality on disk) should be the same as if it were just one huge table (ignoring compaction concerns, etc). The API exposed by cassandra is not one of a generalized multi-level key, but rather one with specific concepts of ColumnFamily, Column and SuperColumn. These essentially provide a two-level key (in the case of a CF with C:s) and a three-level key (in the case of a CF with SC:s with C:s), with the caveat that three-level keys are still only indexed on their first two components (even though they are still sorted on disk). Does this make sense?
Re: Conditional get
Sorry for the extra post. This version has confusing parts removed and better formatting. It sounds like you are getting a handle on it, but maybe in a round-about way. Here are some ways I like to conceptualize Cassandra. Maybe they can help. Either the grid analogy or the maps-of-maps analogy can apply, as they both map conceptually to the way that we use a column family. The maps-of-maps analogy: Think of it in terms of a sorted map to a sorted map, where: *) the outer map is the set of rows whose (map) keys and (map) values are (Cassandra) keys and (Cassandra) rows *) the inner map for each row key is the set of columns whose keys and values are column names and column data. *) column data is essentially a molecule of (column name, column value, storage timestamp). It can be thought of as the value, but it is stored as a 3-tuple. The grid analogy: (This one is my favorite) Think of the column as the intersection between a row key and a column name. *) Rows may be undefined. *) Rows that are defined may have columns that are undefined. *) Cassandra doesn't have to store undefined values, except during deletes and before housekeeping takes them away. *) Cassandra operates behind the scenes in row-major order. That means that while you can think of it in terms of a Cartesian intersection, you should know that rows will always be accessed first. -- Another layer outward is the column family, which is also a map. Another layer inward is the sub-column, which is also a map. Don't get confused by super columns or sub columns. Super/Sub columns are really API sugar to reduce some of the work of using your own serialized aggregates within a normal column value. I find that the confusion is usually not worth the trouble when starting out. On the other hand, were you to implement your own aggregate types within a column value, the purpose of super/sub columns would seem obvious. It's just a little overly complex because of the supporting types in the API.
Since this was basically bolted on to the standard column support, it looks like normal column behavior to the core Cassandra machinery. Neither the column family layer nor the subcolumn layer has been given the same attention as the basic row-column with respect to performance and scalability. This may change in the future. For now, consider that row-keys and column-names are the places where Cassandra scales the best. Jonathan
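The maps-of-maps analogy above can be made concrete with plain dicts (a sketch only; real Cassandra keeps these maps sorted on disk and per-node, and the wall-clock timestamp here stands in for the client-supplied one):

```python
# Sketch of the data model: column family -> row key -> column name -> column.
# Each stored "column" is a (name, value, timestamp) 3-tuple, per the analogy.
import time

cf = {}  # column family: outer map of row key -> inner map of columns

def insert(row_key, col_name, value):
    row = cf.setdefault(row_key, {})           # rows spring into existence
    row[col_name] = (col_name, value, time.time())  # the 3-tuple

insert("john doe", "age", "47")
name, value, ts = cf["john doe"]["age"]
print(name, value)  # age 47

# Row-major order: columns are always reached through their row first.
for key in sorted(cf):
    for col in sorted(cf[key]):
        pass  # a real slice query walks columns within one row like this
```

The grid analogy falls out of the same structure: an undefined row is simply a missing outer key, and an undefined column is a missing inner key, so nothing is stored for empty intersections.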
Re: Seeds, autobootstrap nodes, and replication factor
If I may ask, why the need for frequent topology changes? On Fri, Jun 4, 2010 at 1:21 PM, Benjamin Black b...@b3k.us wrote: On Fri, Jun 4, 2010 at 11:14 AM, Philip Stanhope pstanh...@wimba.com wrote: I guess I'm thick ... What would be the right choice? Our data demands have already been proven to scale beyond what an RDB can handle for our purposes. We are quite pleased with Cassandra read/write/scale out. Just trying to understand the operational considerations. Cassandra supports online topology changes, but those operations are not cheap. If you are expecting frequent addition and removal of nodes from a ring, things will be very unstable or slow (or both). As I already mentioned, having a large cluster (and 40 nodes qualifies right now) with RF=number of nodes is going to make read and write operations get more and more expensive as the cluster grows. While you might see reasonable performance at current, small scale, it will not be the case when the cluster gets large. I am not aware of anything like Cassandra (or any other Dynamo system) that supports such extensive replication and topology churn. You might have to write it. b
Re: Range search on keys not working?
Can you clarify what you mean by 'random between nodes'? On Wed, Jun 2, 2010 at 8:15 AM, David Boxenhorn da...@lookin2.com wrote: I see. But we could make this work if the random partitioner was random only between nodes, but was still ordered within each node. (Or if there were another partitioner that did this.) That way we could get everything we need from each node separately. The results would not be ordered, but they would be correct. On Wed, Jun 2, 2010 at 4:09 PM, Sylvain Lebresne sylv...@yakaz.com wrote: So why do the start and finish range parameters exist? Because especially if you want to iterate over all your keys (which as stated by Ben above is the only meaningful way to use get_range_slices() with the random partitioner), you'll want to paginate that. And that's where the 'start' and 'finish' are useful (to be fair, the 'finish' part is not so useful in practice with the random partitioner). -- Sylvain On Wed, Jun 2, 2010 at 3:53 PM, Ben Browning ben...@gmail.com wrote: Martin, On Wed, Jun 2, 2010 at 8:34 AM, Dr. Martin Grabmüller martin.grabmuel...@eleven.de wrote: I think you can specify an end key, but it should be a key which does exist in your column family. Logically, it doesn't make sense to ever specify an end key with the random partitioner. If you specified a start key of 'aaa' and an end key of 'aac' you might get back as results 'aaa', 'zfc', 'hik', etc. And, even if you have a key of 'aab' it might not show up. Key ranges only make sense with an order-preserving partitioner. The only time to ever use a key range with the random partitioner is when you want to iterate over all keys in the CF. Ben But maybe I'm off the track here and someone else here knows more about this key range stuff. Martin From: David Boxenhorn [mailto:da...@lookin2.com] Sent: Wednesday, June 02, 2010 2:30 PM To: user@cassandra.apache.org Subject: Re: Range search on keys not working? In other words, I should check the values as I iterate, and stop iterating when I get out of range?
I'll try that! On Wed, Jun 2, 2010 at 3:15 PM, Dr. Martin Grabmüller martin.grabmuel...@eleven.de wrote: When not using OPP, you should not use something like 'CATEGORY/' as the end key. Use the empty string as the end key and limit the number of returned keys, as you did with the 'max' value. If I understand correctly, the end key is used to generate an end token by hashing it, and there is not the same correspondence between 'CATEGORY' and 'CATEGORY/' as for hash('CATEGORY') and hash('CATEGORY/'). At least, this was the explanation I gave myself when I had the same problem. The solution is to iterate through the keys by always using the last key returned as the start key for the next call to get_range_slices, and then to drop the first element from the result. HTH, Martin From: David Boxenhorn [mailto:da...@lookin2.com] Sent: Wednesday, June 02, 2010 2:01 PM To: user@cassandra.apache.org Subject: Re: Range search on keys not working? The previous thread where we discussed this is called "key is sorted?" On Wed, Jun 2, 2010 at 2:56 PM, David Boxenhorn da...@lookin2.com wrote: I'm not using OPP. But I was assured on earlier threads (I asked several times to be sure) that it would work as stated below: the results would not be ordered, but they would be correct. On Wed, Jun 2, 2010 at 2:51 PM, Torsten Curdt tcu...@vafer.org wrote: Sounds like you are not using an order preserving partitioner? On Wed, Jun 2, 2010 at 13:48, David Boxenhorn da...@lookin2.com wrote: Range search on keys is not working for me. I was assured in earlier threads that range search would work, but the results would not be ordered. I'm trying to get all the rows that start with "CATEGORY.". I'm doing: String start = "CATEGORY."; . . . keyspace.getSuperRangeSlice(columnParent, slicePredicate, start, "CATEGORY/", max) . . . in a loop, setting start to the last key each time - but I'm getting rows that don't start with "CATEGORY."!! How do I get all rows that start with "CATEGORY."?
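Martin's iteration recipe (re-issue the range query from the last key returned, then drop the duplicated first element) looks roughly like this sketch, with `get_range_slices` faked as a function over a sorted key list:

```python
# Sketch: paginate over all keys by restarting the range query from the
# last key seen, dropping the repeated first element on each pass.
def get_range_slices(keys, start, count):
    """Stand-in for the Thrift call: up to `count` keys >= start."""
    return [k for k in sorted(keys) if k >= start][:count]

def iterate_all(keys, page=3):
    out, start = [], ""
    while True:
        batch = get_range_slices(keys, start, page)
        if start:
            batch = batch[1:]   # drop the start key, already seen last pass
        if not batch:
            return out
        out.extend(batch)
        start = batch[-1]

print(iterate_all(["a", "b", "c", "d", "e"], page=2))  # ['a', 'b', 'c', 'd', 'e']
```

Note this fake uses lexical key order; under the real random partitioner the pages arrive in token order instead, which is exactly why checking each returned key against the desired prefix (rather than relying on an end key) is necessary.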
Re: Giant sets of ordered data
Either OPP by key, or within a row by column name. I'd suggest the latter. If you have structured data to stick under a column (named by the timestamp), then you can serialize and unserialize it yourself, or you can use a supercolumn. It's effectively the same thing. Cassandra only provides the super column support as a convenience layer as it is currently implemented. That may change in the future. You didn't make clear in your question why a standard column would be less suitable. I presumed you had layered structure within the timestamp, hence my response. How would you logically partition your dataset according to natural application boundaries? This will answer most of your question. If you have a dataset which can't be partitioned into a reasonable size row, then you may want to use OPP and key concatenation. What do you mean by giant? On Wed, Jun 2, 2010 at 10:32 AM, David Boxenhorn da...@lookin2.com wrote: How do I handle giant sets of ordered data, e.g. by timestamps, which I want to access by range? I can't put all the data into a supercolumn, because it's loaded into memory at once, and it's too much data. Am I forced to use an order-preserving partitioner? I don't want the headache. Is there any other way?
Re: Giant sets of ordered data
If you want to do range queries on the keys, you can use OPP to do this: (example using UTF-8 lexicographic keys, with bursts split across rows according to row size limits) Events: { 20100601.05.30.003: { 20100601.05.30.003: value 20100601.05.30.007: value ... } } With a future version of Cassandra, you may be able to use the same basic datatype for both key and column name, as keys will be binary like the rest, I believe. I'm not aware of specific performance improvements when using OPP range queries on keys vs iterating over known keys. I suspect (hope) that round-tripping to the server should be reduced, which may be significant. Does anybody have decent benchmarks that tell the difference? On Wed, Jun 2, 2010 at 11:53 AM, Ben Browning ben...@gmail.com wrote: With a traffic pattern like that, you may be better off storing the events of each burst (I'll call them group) in one or more keys and then storing these keys in the day key. EventGroupsPerDay: { 20100601: { 123456789: group123, // column name is timestamp group was received, column value is key 123456790: group124 } } EventGroups: { group123: { 123456789: value1, 123456799: value2 } } If you think of Cassandra as a toolkit for building scalable indexes it seems to make the modeling a bit easier. In this case, you're building an index by day to lookup events that come in as groups. So, first you'd fetch the slice of columns for the day you're interested in to figure out which groups to look at then you'd fetch the events in those groups. There are plenty of alternate ways to divide up the data among rows also - you could use hour keys instead of days as an example. On Wed, Jun 2, 2010 at 11:57 AM, David Boxenhorn da...@lookin2.com wrote: Let's say you're logging events, and you have billions of events. What if the events come in bursts, so within a day there are millions of events, but they all come within microseconds of each other a few times a day? 
How do you find the events that happened on a particular day if you can't store them all in one row? On Wed, Jun 2, 2010 at 6:45 PM, Jonathan Shook jsh...@gmail.com wrote: Either OPP by key, or within a row by column name. I'd suggest the latter. If you have structured data to stick under a column (named by the timestamp), then you can serialize and unserialize it yourself, or you can use a supercolumn. It's effectively the same thing. Cassandra only provides the super column support as a convenience layer as it is currently implemented. That may change in the future. You didn't make clear in your question why a standard column would be less suitable. I presumed you had layered structure within the timestamp, hence my response. How would you logically partition your dataset according to natural application boundaries? This will answer most of your question. If you have a dataset which can't be partitioned into a reasonable size row, then you may want to use OPP and key concatenation. What do you mean by giant? On Wed, Jun 2, 2010 at 10:32 AM, David Boxenhorn da...@lookin2.com wrote: How do I handle giant sets of ordered data, e.g. by timestamps, which I want to access by range? I can't put all the data into a supercolumn, because it's loaded into memory at once, and it's too much data. Am I forced to use an order-preserving partitioner? I don't want the headache. Is there any other way?
Re: Giant sets of ordered data
Insert "if you want to use long values for keys and column names" above paragraph 2. I forgot that part. On Wed, Jun 2, 2010 at 1:29 PM, Jonathan Shook jsh...@gmail.com wrote: If you want to do range queries on the keys, you can use OPP to do this: (example using UTF-8 lexicographic keys, with bursts split across rows according to row size limits) Events: { 20100601.05.30.003: { 20100601.05.30.003: value 20100601.05.30.007: value ... } } With a future version of Cassandra, you may be able to use the same basic datatype for both key and column name, as keys will be binary like the rest, I believe. I'm not aware of specific performance improvements when using OPP range queries on keys vs iterating over known keys. I suspect (hope) that round-tripping to the server should be reduced, which may be significant. Does anybody have decent benchmarks that tell the difference?
What if the events come in bursts, so within a day there are millions of events, but they all come within microseconds of each other a few times a day? How do you find the events that happened on a particular day if you can't store them all in one row?

On Wed, Jun 2, 2010 at 6:45 PM, Jonathan Shook jsh...@gmail.com wrote:

Either OPP by key, or within a row by column name. I'd suggest the latter. If you have structured data to stick under a column (named by the timestamp), then you can serialize and unserialize it yourself, or you can use a supercolumn. It's effectively the same thing. Cassandra only provides the super column support as a convenience layer as it is currently implemented. That may change in the future. You didn't make clear in your question why a standard column would be less suitable. I presumed you had layered structure within the timestamp, hence my response. How would you logically partition your dataset according to natural application boundaries? This will answer most of your question. If you have a dataset which can't be partitioned into a reasonable size row, then you may want to use OPP and key concatenation. What do you mean by giant?

On Wed, Jun 2, 2010 at 10:32 AM, David Boxenhorn da...@lookin2.com wrote:

How do I handle giant sets of ordered data, e.g. by timestamps, which I want to access by range? I can't put all the data into a supercolumn, because it's loaded into memory at once, and it's too much data. Am I forced to use an order-preserving partitioner? I don't want the headache. Is there any other way?
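[Editor's note] The two-level layout Ben describes (a per-day index row pointing at per-burst group rows) can be sketched with plain dicts standing in for column families. The row names EventGroupsPerDay/EventGroups come from the thread; the helper functions are illustrative, not Cassandra API calls.

```python
# Dicts stand in for column families (names of helpers are illustrative).
event_groups_per_day = {}   # day key -> {arrival_ts: group_key}
event_groups = {}           # group key -> {event_ts: event_value}

def store_burst(day, arrival_ts, group_key, events):
    """Record one burst: index it under its day, store the events in a group row."""
    event_groups_per_day.setdefault(day, {})[arrival_ts] = group_key
    event_groups[group_key] = dict(events)

def events_for_day(day):
    """Two-step lookup: slice the day's index row, then fetch each group row."""
    result = {}
    for group_key in event_groups_per_day.get(day, {}).values():
        result.update(event_groups[group_key])
    return result

store_burst("20100601", 123456789, "group123", {123456789: "v1", 123456799: "v2"})
store_burst("20100601", 123456790, "group124", {123456800: "v3"})
print(sorted(events_for_day("20100601")))   # [123456789, 123456799, 123456800]
```

Each index row stays small (one column per burst), so no single row has to hold millions of event columns.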
Re: Can't get data after building cluster
Depending on the key, the request would have been proxied to the first or second node. The CLI uses a consistency level of ONE, meaning that only a single node's data would have been considered when you get(). Also, the responsible nodes for a given key are mapped accordingly at request time, and proxy requests are made internally on your behalf. This allows R + W > N to hold, where N is the replication factor. It chooses the subset of active nodes responsible for a key in a deterministic way. See http://www.slideshare.net/benjaminblack/introduction-to-cassandra-replication-and-consistency for more information.

On Tue, Jun 1, 2010 at 1:43 AM, David Boxenhorn da...@lookin2.com wrote:

I don't think it can be the case that at most data in the token range assigned to that node will be affected - the new node had no knowledge of any of our data. Any fake data that it might have had through some error on my part could not have been within the range of real data. I had 4.25 G of data on the 1st server, and as far as I could tell I couldn't access any of it.

On Tue, Jun 1, 2010 at 9:10 AM, Jonathan Ellis jbel...@gmail.com wrote:

To elaborate: If you manage to screw things up to where it thinks a node has data, but it does not (adding a node without bootstrap would do this, for instance, which is probably what you did), at most data in the token range assigned to that node will be affected.

On Tue, Jun 1, 2010 at 12:45 AM, David Boxenhorn da...@lookin2.com wrote:

You say no, but that is exactly what I just observed. Can I have some more explanation? To recap: I added a server to my cluster. It had some junk in the system/LocationInfo files from previous, unsuccessful attempts to add the server to the cluster. (They were unsuccessful because I hadn't opened the port on that computer.) When I finally succeeded in adding the 2nd server, the 1st server started returning null when I tried to get data using the CLI.
I stopped the 2nd server, deleted the files in system, restarted, and everything worked. I'm afraid that this, or some similar scenario will do the same, after I go live. How can I protect myself? On Mon, May 31, 2010 at 10:10 PM, Jonathan Ellis jbel...@gmail.com wrote: No. On Mon, May 31, 2010 at 10:47 AM, David Boxenhorn da...@lookin2.com wrote: So this means that I can take my entire cluster off line if I make a mistake adding a new server??? Yikes! On Mon, May 31, 2010 at 6:41 PM, David Boxenhorn da...@lookin2.com wrote: OK. Got it working. I had some data in the 2nd server from previous failed attempts at hooking up to the cluster. When I deleted that data and tried again, it said bootstrapping and my 1st server started working again. On Mon, May 31, 2010 at 4:50 PM, David Boxenhorn da...@lookin2.com wrote: I am trying to get a cluster up and working for the first time. I got one server up and running, with lots of data on it, which I can see with the CLI. I added my 2nd server, they seem to recognize each other. Now I can't see my data with the CLI. I do a get and it returns null. The data files seem to be intact. What happened??? How can I fix it? -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
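[Editor's note] The overlap rule cited above (R + W > N, where R and W are the read and write consistency counts and N the replication factor) can be checked with a one-liner; the function name is mine, not a Cassandra API.

```python
def overlapping_quorums(r, w, n):
    """True when every read quorum must intersect every write quorum (R + W > N)."""
    return r + w > n

# CLI-style reads at ONE against writes at ONE with RF=3: stale reads are possible.
print(overlapping_quorums(1, 1, 3))   # False
# QUORUM reads and writes at RF=3: every read overlaps the latest write.
print(overlapping_quorums(2, 2, 3))   # True
```

This is why a get() at consistency ONE can return null even though the data exists on another replica.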
Re: writing speed test
Also, what do you mean specifically by 'slow'? Which measurements are you looking at? What are your baseline constraints for your test system?

2010/6/1 史英杰 shiyingjie1...@gmail.com:

Hi, It would be better if we knew which Consistency Level you chose, and what the schema of the test data is.

On Jun 1, 2010 at 4:48 PM, Shuai Yuan yuansh...@supertool.net.cn wrote:

Hi all, I'm testing the writing speed of cassandra with 4 servers. I'm confused by the behavior of cassandra.

---env---
load-data app written in c++, using libcassandra (w/ modified batch insert)
20 writing threads in 2 processes running on 2 servers

---optimization---
1. turn log level to INFO
2. JVM has 8G heap
3. 32 concurrent reads / 128 writes in storage-conf.xml, other caches enlarged as well

---result---
1 - monitoring by `date; nodetool -h host ring`
I add all the load together and measure the writing speed by (load_difference / time_difference), and I get about 15MB/s for the whole cluster.
2 - monitoring by `iostat -m 10`
I can watch the disk I/O at the system level and see about 10MB/s - 65MB/s for a single machine. Very big variance over time.
3 - monitoring by `iptraf -g`
In this way I watch the communication between servers and get about 10MB/s for a single machine.

---opinion---
So, have you checked the writing speed of cassandra? I feel it's quite slow currently. Could anyone confirm this is the normal writing speed of cassandra, or provide some way of improving it?

--
Kevin Yuan
www.yuan-shuai.info
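[Editor's note] Measurement 1 above (sum the per-node load from two `nodetool ring` snapshots and divide the growth by the elapsed time) can be sketched as below; the node loads and timing are made-up numbers chosen to reproduce the ~15MB/s figure.

```python
def cluster_write_rate_mb_s(loads_before_mb, loads_after_mb, elapsed_s):
    """Aggregate write throughput from two load snapshots (MB per node)."""
    delta_mb = sum(loads_after_mb) - sum(loads_before_mb)
    return delta_mb / elapsed_s

# Hypothetical snapshots for a 4-node cluster taken 60 seconds apart.
before = [1200.0, 1150.0, 1300.0, 1250.0]
after  = [1425.0, 1375.0, 1525.0, 1475.0]
print(cluster_write_rate_mb_s(before, after, 60))   # 15.0
```

Note this measures post-compaction on-disk growth, which is why it can differ so much from the raw `iostat` numbers.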
Re: Which kind of applications are Cassandra fit for?
There is no easy answer to this. The requirements vary widely even within a particular type of application. If you have a list of specific requirements for a given application, it is easier to say whether it is a good fit. If you need a schema marshaling system, then you will have to build it into your application somewhere. Some client libraries support this type of interface. Otherwise, Cassandra doesn't make you pay for the kitchen sink if you don't need it enough to let it take up space and time in your application. The storage layout of Cassandra mimics lists, sets, and maps, as used by programmers everywhere. Cassandra is responsible for getting the data to and from those in-memory structures. Because there is little conceptual baggage between the in-storage representation and the in-memory representation, this is easier to optimize for the general case. There are a few necessary optimizations for dealing with the underlying storage medium, but the core concepts are generic. There are lots of bells and whistles, but they tend to fall in the happy zone between need-to-have, and want-to-have. Because Cassandra provides a generic service for data storage (in sets, lists, maps, and combinations of these), it serves as a good building block for close-to-the-metal designs, or as a layer to build more strongly-typed or schema-constrained systems on top of. I know this didn't answer your question, but maybe it got you in the ballpark. Jonathan On Tue, Jun 1, 2010 at 7:43 AM, 史英杰 shiyingjie1...@gmail.com wrote: Hi,ALL I found that most applications on Cassandra are for web applications, such as store friiend information or digg information, and they get good performance, many companies or groups want to move their applications to Cassandra, so which kind of applications are Cassandra fit for? Thanks a lot! Yingjie
Re: Order Preserving Partitioner
I don't think that queries on a key range are valid unless you are using OPP. As far as hashing the key for OPP goes, I take it to be the same as not using OPP. It's really a matter of where it gets done, but it has much the same effect. (I think) Jonathan

On Wed, May 26, 2010 at 12:51 PM, Peter Hsu pe...@motivecast.com wrote:

Correct me if I'm wrong here. Even though you can get your results with Random Partitioner, it's a lot less efficient if you're going across different machines to get your results. If you're doing a lot of range queries, it makes sense to have things ordered sequentially, so that if you do need to go to disk, the reads will be faster, rather than lots of random reads across your system. It's also my understanding that if you go with OPP, you could hash your key yourself using md5 or sha-1 to effectively get random partitioning. So it's a bit of a pain, but not impossible, to do a split between OPP and RP for your different columnfamilies/keyspaces.

On May 26, 2010, at 2:32 AM, David Boxenhorn wrote:

Just in case you don't know: You can do range searches on keys even with Random Partitioner, you just won't get the results in order. If this is good enough for you (e.g. if you can order the results on the client, or if you just need to get the right answer, but not the right order), then you should use Random Partitioner. (I bring this up because it confused me until recently.)

On Wed, May 26, 2010 at 5:14 AM, Steve Lihn stevel...@gmail.com wrote:

I have a question on using Order Preserving Partitioner. Many rowKeys in my system will be related to dates, so it seems natural to use Order Preserving Partitioner instead of the default Random Partitioner. However, I have been warned that special attention has to be applied for Order Preserving Partitioner to work properly (basically to ensure a good key distribution and avoid hot spots), and reverting back to Random may not be easy.
Also, not every rowKey is related to dates; for these, using Random Partitioner is okay, but there is only one place to set the Partitioner. (Note: The intention of this warning is actually to discredit Cassandra and persuade me not to use it.) It seems the choice of Partitioner is defined in storage-conf.xml and is a global property. My question is: why does it have to be a global property? Is there a future plan to make it customizable per KeySpace (just like you would choose hash or range partitioning for different tables/data in an RDBMS)? Thanks, Steve
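[Editor's note] Peter's suggestion — keep OPP but hash the keys yourself in column families where you want random-style distribution — can be sketched as below. The helper names are mine; the point is that an md5 prefix makes lexicographic (OPP) order behave like a random spread while the natural key stays recoverable.

```python
import hashlib

def randomized_key(natural_key):
    """Prefix a key with its md5 so OPP spreads it like RandomPartitioner would."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return digest + ":" + natural_key

def original_key(stored_key):
    """Recover the natural key (the md5 hex digest never contains ':')."""
    return stored_key.split(":", 1)[1]

k = randomized_key("20100526.entry42")
print(original_key(k))   # 20100526.entry42
```

Date-keyed column families skip the prefix and keep their range-scannable order; hot write-by-date rows get the prefix and spread evenly around the ring.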
Re: Doing joins between column familes
I wrote some Iterable* methods to do this for column families that share key structure with OPP. It is on the hector examples page. Caveat emptor. It does iterative chunking of the working set for each column family, so that you can set the nominal transfer size when you construct the Iterator/Iterable. I've been very happy with the performance of it, even over large ranges of keys. This is with OrderPreservingPartitioner because of other requirements, so it may not be a good example for comparison with a random partitioner, which is preferred. Doing joins as such on the server works against the basic design of Cassandra. The server does a few things very well only because it isn't overloaded with extra faucets and kitchen sinks. However, I'd like to be able to load auxiliary classes into the server runtime in a modular way, just for things like this. Maybe we'll get that someday. My impression is that there is much more common key structure in a workable Cassandra storage layout than in a conventional ER model. This is the nature of the beast when you are organizing your information more according to access patterns than fully normal relationships. That is one of the fundamental design trade-offs of using a hash structure over a schema. Having something that lets you deploy a fully normal schema on a hash store can be handy, but it can also obscure the way that your application indirectly exercises the storage layer. The end-result may be that the layout is less friendly to the underlying mechanisms of Cassandra. I'm not saying that it is bad to have a tool to do this, only that it can make it easy to avoid thinking about Cassandra storage in terms of what it really is. There may be ways to optimize the OCM queries, but that takes you down the road of query optimization, which can be quite nebulous. 
My gut instinct is to focus more on the layout, using aggregate keys and common key structure where you can, so that you can take advantage of the parallel queries more of the time. On Wed, May 26, 2010 at 3:13 PM, Charlie Mason charlie@gmail.com wrote: On Wed, May 26, 2010 at 7:45 PM, Dodong Juan dodongj...@gmail.com wrote: So I am not sure if you guys are familiar with OCM . Basically it is an ORM for Cassandra. Been testing it In case anyone is interested I have posted a reply on the OCM issue tracker where this was also raised. http://github.com/charliem/OCM/issues/closed#issue/5/comment/254717 Charlie M
Re: Cassandra's 2GB row limit and indexing
The example is a little confusing... but:

1) sharding
You can square the capacity by having a 2-level map: CF1 -> row -> value -> CF2 -> row -> value. This means finding some natural subgrouping or hash that provides a good distribution.

2) hashing
You can also use some additional key hashing to spread the rows over a wider space: find a delimiter that works for you and identify the row that owns it by domain + delimiter + hash(domain) modulo some divisor, for example.

3) overflow
You can implement some overflow logic to create overflow rows which act like (2), but less sparse:
while count(columns) for candidate row > some threshold, try row + delimiter + subrow++
This is much easier when you are streaming data in, as opposed to poking the random value here and there.

Just some ideas. I'd go with 2), and find a way to adjust the modulo to minimize the row spread. 2) isn't guaranteed to provide uniformity, but 3) isn't guaranteed to provide very good performance. Perhaps a combination of them both? The count is readily accessible, so it may provide for some informed choices at run time. I'm assuming your column sizes are fairly predictable. Has anybody else tackled this before?

On Wed, May 26, 2010 at 8:52 PM, Richard West r...@clearchaos.com wrote:

Hi all, I'm currently looking at new database options for a URL shortener in order to scale well with increased traffic as we add new features. Cassandra seems to be a good fit for many of our requirements, but I'm struggling a bit to find ways of designing certain indexes in Cassandra due to its 2GB row limit. The easiest example of this is that I'd like to create an index by the domain that shortened URLs are linking to, mostly for spam control, so it's easy to grab all the links to any given domain. As far as I can tell the typical way to do this in Cassandra is something like:

DOMAIN = { // columnfamily
  thing.com: { // row key
    timestamp: shorturl567, // column name: value
    timestamp: shorturl144,
    timestamp: shorturl112,
    ...
  }
  somethingelse.com: {
    timestamp: shorturl817,
    ...
  }
}

The values here are keys for another columnfamily containing various data on shortened URLs. The problem with this approach is that a popular domain (e.g. blogspot.com) could be used in many millions of shortened URLs, so would have that many columns and hit the row size limit mentioned at http://wiki.apache.org/cassandra/CassandraLimitations. Does anyone know an effective way to design this type of one-to-many index around this limitation (could be something obvious I'm missing)? If not, are the changes proposed for https://issues.apache.org/jira/browse/CASSANDRA-16 likely to make this type of design workable? Thanks in advance for any advice, Richard
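[Editor's note] A sketch adapting idea 2 above to the URL-shortener index. One deliberate change, labeled as an assumption: the bucket is derived from the short URL rather than the domain, so that a single hot domain (the case that hits the 2GB row limit) actually spreads across several rows. All names and the bucket count are hypothetical.

```python
import hashlib

BUCKETS = 16  # tune so each bucket row stays well under the 2GB row limit

def domain_row_key(domain, shorturl):
    """Row that owns one (domain, shorturl) entry: domain + '#' + bucket."""
    bucket = int(hashlib.md5(shorturl.encode("utf-8")).hexdigest(), 16) % BUCKETS
    return "%s#%02d" % (domain, bucket)

def rows_for_domain(domain):
    """All rows to multiget when reading every link for a domain."""
    return ["%s#%02d" % (domain, b) for b in range(BUCKETS)]

print(domain_row_key("blogspot.com", "shorturl567") in rows_for_domain("blogspot.com"))   # True
```

Reads for a domain become a fixed-fanout multiget over BUCKETS rows instead of one unbounded row, which is the trade described in option 3's "overflow" variant as well.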
Re: data model and queries.
Every system has its limits. When you say to imagine there are billions of users without providing any other real data, it limits the discussion strictly to the hypothetical (and hyperbolic, usually). The only reasonable answer we could provide would be about the types of limitations we know about and how they manifest. Here are the ones I know of off the top of my head, but you'll need to provide more specific constraints to get a better answer from anybody. * you must be able to fit a unit of work/transfer in memory, don't assume streaming support * you may not scale subcolumns within a supercolumn * compaction requires more than 2N storage * very large or growing datasets require active monitoring for storage headroom I'm sure there are others that I've forgotten. If you are going to be storing a virtually unlimited (billions of...) amount of information, how do you intend to scale your storage? What are your performance requirements? What is your synchronous consistency requirement? What is your asynchronous consistency requirement? What's the nature of the workload? Is it batching loads, or many fine units of work all the time? That said, these types of questions should not be unusual for any large system. I think the gist of your answer is probably, but there will be growing pains, as with any other system. One of the benefits of Cassandra is the ability to make design trade-offs which have a direct impact on scalability and consistency, which leaves you with more options when you hit a speed bump. Another is that when there are speed bumps which are considered a significant problem for more than a few people, they get some attention. (Thanks, devs). On Sun, May 23, 2010 at 5:04 AM, Kartal Guner kgu...@hakia.com wrote: I am trying to find out if Cassandra will fill my needs. I have a data model similar to below. 
Users = { // ColumnFamily
  user1 = { // Key for Users ColumnFamily
    message1 = { // Supercolumn
      text: hello // Column
      type: html // Column
      rating: 88 // Column
    }
    ...
    messageN
  }
  ...
  CountryN
}

Imagine there can be billions of users and hundreds of thousands of messages per user. After a message entry it will not be updated. I want to do queries such as:
* Get all messages for user1 with type = HTML
* Get top 100 messages for user1, ordered by rating.
1) Is this possible with cassandra?
2) Do I have the right datamodel? Can it be optimized?
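[Editor's note] The two queries above are typically served in Cassandra by maintaining extra index rows at write time rather than filtering at read time. A sketch with dicts standing in for column families; all row and helper names are illustrative.

```python
# Index rows maintained at write time (dicts stand in for column families).
messages_by_type = {}   # (user, type) -> {message_id: None}
ratings_by_user = {}    # user -> [(rating, message_id), ...]

def store_message(user, message_id, text, mtype, rating):
    """Write the message and update both index rows in the same batch."""
    messages_by_type.setdefault((user, mtype), {})[message_id] = None
    ratings_by_user.setdefault(user, []).append((rating, message_id))

def messages_of_type(user, mtype):
    """Query 1: all messages for a user with a given type."""
    return sorted(messages_by_type.get((user, mtype), {}))

def top_by_rating(user, limit=100):
    """Query 2: top-N messages for a user, ordered by rating."""
    return [m for _, m in sorted(ratings_by_user.get(user, []), reverse=True)[:limit]]

store_message("user1", "message1", "hello", "html", 88)
store_message("user1", "message2", "hi", "text", 95)
store_message("user1", "message3", "hey", "html", 70)
print(messages_of_type("user1", "html"))   # ['message1', 'message3']
print(top_by_rating("user1", 2))           # ['message2', 'message1']
```

In a real layout the rating index would use the rating as the column name so Cassandra keeps it sorted; messages are write-once here, so the indexes never need updating in place.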
Re: Why Cassandra performs better in 15 nodes than in 20 nodes?
It would be helpful to know the replication factor and consistency levels of your reads and writes. 2010/5/23 史英杰 shiyingjie1...@gmail.com: Thanks for your reply! //Were all of those 20 nodes running real hardware (i.e. NOT VMs)? Yes, there are 20 real servers running in the cluster, and one Casssandra instance runs on each server. //Did your driver application(s) run on real hardware and how many threads did you use? The clients run on one server of the 20 servers, I used 10 threads to run the write and read tasks. How many threads can make Cassandra get good throughput? Thanks! 2010/5/23 Mark Robson mar...@gmail.com On 23 May 2010 13:42, 史英杰 shiyingjie1...@gmail.com wrote: Hi, All I am now doing some tests on Cassandra, and I found that both writes and reads on 15 nodes are faster than that of 20 nodes, how many servers does one Cassandra system contains during the real applications? Thanks a lot ! Yingjie I'd ask Were all of those 20 nodes running real hardware (i.e. NOT VMs)? and Did your driver application(s) run on real hardware and how many threads did you use? Cassandra can only get good throughput with a lot of client threads, not just a few. Mark
Re: list of columns
I think you are correct, David. What Bill is asking for specifically is not in the API. Bill, if this is a performance concern (i.e., your column values are/could be vastly larger than your column names, and you need to query the namespace before loading the values), then you might consider keeping a separate column family which just contains the column names and timestamps with empty values. On Sun, May 16, 2010 at 4:37 AM, David Boxenhorn da...@lookin2.com wrote: Bill, I am a new user of Cassandra, so I've been following this discussion with interest. I think the answer is no, except for the brute force method of looping through all your data. It's like asking for a list of all the files on your C: drive. The term column is very misleading, since columns are really leaves of a tree structure, not columns of a tabular structure. Anybody want to tell me I'm wrong? BTW, Bill, I think we've corresponded before, here: http://www.dehora.net/journal/2004/04/whats_in_a_name.html On Fri, May 14, 2010 at 2:23 AM, Bill de hOra b...@dehora.net wrote: A SlicePredicate/SliceRange can't exclude column values afaik. Bill Jonathan Shook wrote: get_slice see: http://wiki.apache.org/cassandra/API under get_slice and SlicePredicate On Thu, May 13, 2010 at 9:45 AM, Bill de hOra b...@dehora.net wrote: get_count returns the number of columns, not the names of those columns? I should have been specific, by list the columns, I meant list the column names. Bill Gary Dusbabek wrote: We have get_count at the thrift level. You supply a predicate and it returns the number of columns that match. There is also multi_get_count, which is the same operation against multiple keys. Gary. On Thu, May 13, 2010 at 04:18, Bill de hOra b...@dehora.net wrote: Admin question - is there a way to list the columns for a particular key? Bill
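[Editor's note] The separate names-only column family suggested above can be sketched as follows: a mirror row written alongside the data row, holding only column names with empty values. Dicts stand in for the two column families; names are illustrative.

```python
# Data row and its names-only mirror, written together at update time.
data_row = {}
names_row = {}

def put_column(name, value):
    """Write a column and record its name in the mirror row."""
    data_row[name] = value
    names_row[name] = b""   # empty value: the name itself is the payload

put_column("title", "x" * 10000)   # large values we don't want to read back
put_column("body", "y" * 100000)
print(sorted(names_row))   # ['body', 'title']  -- listed without touching the big values
```

Listing the names then becomes a cheap slice of the small mirror row instead of a get_slice that drags every large value across the wire.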
Re: key is sorted?
Although, if replication factor spans all nodes, then the disparity in row allocation should be a non-issue when using OrderPreservingPartitioner. On Wed, May 12, 2010 at 6:42 PM, Vijay vijay2...@gmail.com wrote: If you use Random partitioner, You will NOT get RowKey's sorted. (Columns are sorted always). Answer: If used Random partitioner True True Regards, /VJ On Wed, May 12, 2010 at 1:25 AM, David Boxenhorn da...@lookin2.com wrote: You do any kind of range slice, e.g. keys beginning with abc? But the results will not be ordered? Please answer one of the following: True True True False False False Explain? Thanks! On Sun, May 9, 2010 at 8:27 PM, Vijay vijay2...@gmail.com wrote: True, The Range slice support was enabled in Random Partitioner for the hadoop support. Random partitioner actually hash the Key and those keys are sorted so we cannot have the actual key in order (Hope this doesnt confuse you)... Regards, /VJ On Sun, May 9, 2010 at 12:00 AM, David Boxenhorn da...@lookin2.com wrote: This is something that I'm not sure that I understand. Can somebody confirm/deny that I understand it? Thanks. If you use random partitioning, you can loop through all keys with a range query, but they will not be sorted. True or False? On Sat, May 8, 2010 at 3:45 AM, AJ Chen ajc...@web2express.org wrote: thanks, that works. -aj On Fri, May 7, 2010 at 1:17 PM, Stu Hood stu.h...@rackspace.com wrote: Your IPartitioner implementation decides how the row keys are sorted: see http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner . You need to be using one of the OrderPreservingPartitioners if you'd like a reasonable order for the keys. -Original Message- From: AJ Chen ajc...@web2express.org Sent: Friday, May 7, 2010 3:10pm To: user@cassandra.apache.org Subject: key is sorted? I have a super column family for topic, key being the name of the topic. 
<ColumnFamily Name="Topic" CompareWith="UTF8Type" ColumnType="Super" CompareSubcolumnsWith="BytesType" />

When I retrieve the rows, the rows are not sorted by the key. Is the row key sorted in cassandra by default? -aj

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
Re: how does cassandra compare with mongodb?
You can choose to have keys ordered by using an OrderPreservingPartitioner, with the trade-off that key ranges can get denser on certain nodes than others.

On Wed, May 12, 2010 at 7:48 PM, philip andrew philip14...@gmail.com wrote:

Hi, From my understanding, Cassandra entities are indexed on only one key, so this can be a problem if you are searching for example by two values, such as if you are storing an entity with an x,y and then wish to search for entities in a box, i.e. x>5 and x<10 and y>5 and y<10. MongoDB can do this; Cassandra cannot, due to only indexing on one key. Cassandra can scale automatically just by adding nodes, giving almost infinite storage easily; MongoDB requires database administration to add nodes, setting up replication or allowing sharding, but it's not too complex. MongoDB requires you to create shard keys if you want to scale horizontally; Cassandra just works automatically when scaling horizontally. Cassandra requires the schema to be defined before the database starts; MongoDB can have any schema at run-time, just like a normal database. In the end I chose MongoDB as I require more indexes than Cassandra provides, although I really like Cassandra's ability to store an almost infinite amount of data just by adding nodes. Thanks, Phil

On Thu, May 13, 2010 at 5:57 AM, S Ahmed sahmed1...@gmail.com wrote:

I tried searching mail-archive, but the search feature is a bit wacky (or more probably I don't know how to use it). What are the key differences between Cassandra and Mongodb? Is there a particular use case where each solution shines?
Re: Is SuperColumn necessary?
This is one of the sticking points with the key concatenation argument. You can't simply access subpartitions of data along an aggregate name using a concatenated key unless you can efficiently address a range of the keys according to a property of a subset. I'm hoping this will bear out with more of this discussion. Another facet of this issue is performance with respect to storage layout. Presently, columns within a row are inherently organized for efficient range operations. The key space is not generally optimal in this way. I'm hoping to see some discussion of this, as well.

On Tue, May 11, 2010 at 6:17 AM, vd vineetdan...@gmail.com wrote:

Hi, Can we make a range search on ID:ID format, as this would be treated as a single ID by the API, or can it bifurcate on ':'? If not, then how can we avoid usage of supercolumns where we need to associate 'n' number of rows with a single ID. Like:
CatID1 -> articleID1
CatID1 -> articleID2
CatID1 -> articleID3
CatID1 -> articleID4
How can we map such scenarios with simple column families? Rgds.

On Tue, May 11, 2010 at 2:11 PM, Torsten Curdt tcu...@vafer.org wrote:

Exactly.

On Tue, May 11, 2010 at 10:20, David Boxenhorn da...@lookin2.com wrote:

Don't think of it as getting rid of supercolumns. Think of it as adding superdupercolumns, supertriplecolumns, etc. Or, in sparse array terminology:

array[dim1][dim2][dim3]...[dimN] = value

Or, as said above:

<Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
  <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
    <Column Name="ThingThatsNowSuperColumnName" Type="Long">
      <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
        <Column Name="ThingThatCantCurrentlyBeRepresented"/>
      </Column>
    </Column>
  </Column>
</Column>
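[Editor's note] vd's CatID:articleID question is usually answered with a key-range slice under OPP. A sketch of the range bounds; the ';' end-bound trick assumes ASCII keys, since ';' is the character immediately after ':' in lexicographic order.

```python
def category_range(cat_id):
    """Start/end of the key range holding every 'cat:article' key for cat_id.

    ';' sorts just after ':', so under lexicographic (OPP) ordering the
    half-open range [cat_id + ':', cat_id + ';') covers exactly the keys
    with the 'cat_id:' prefix -- and nothing from e.g. 'CatID10:'."""
    return cat_id + ":", cat_id + ";"

keys = sorted(["CatID1:articleID1", "CatID1:articleID2",
               "CatID10:articleID1", "CatID2:articleID1"])
start, end = category_range("CatID1")
print([k for k in keys if start <= k < end])   # ['CatID1:articleID1', 'CatID1:articleID2']
```

This is exactly the efficiency concern raised above: the range is addressable only because the partitioner preserves key order; under RandomPartitioner the same concatenated keys cannot be sliced this way.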
Re: Is SuperColumn necessary?
Agreed On Mon, May 10, 2010 at 12:01 PM, Mike Malone m...@simplegeo.com wrote: On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook jsh...@gmail.com wrote: I have to disagree about the naming of things. The name of something isn't just a literal identifier. It affects the way people think about it. For new users, the whole naming thing has been a persistent barrier. I'm saying we shouldn't be worried too much about coming up with names and analogies until we've decided what it is we're naming. As for your suggestions, I'm all for simplifying or generalizing the how it works part down to a more generalized set of operations. I'm not sure it's a good idea to require users to think in terms building up a fluffy query structure just to thread it through a needle of an API, even for the simplest of queries. At some point, the level of generic boilerplate takes away from the semantic hand rails that developers like. So I guess I'm suggesting that how it works and how we use it are not always exactly the same. At least they should both hinge on a common conceptual model, which is where the naming becomes an important anchoring point. If things are done properly, client libraries could expose simplified query interfaces without much effort. Most ORMs these days work by building a propositional directed acyclic graph that's serialized to SQL. This would work the same way, but it wouldn't be converted into a 4GL. Mike Jonathan On Mon, May 10, 2010 at 11:37 AM, Mike Malone m...@simplegeo.com wrote: Maybe... but honestly, it doesn't affect the architecture or interface at all. I'm more interested in thinking about how the system should work than what things are called. Naming things are important, but that can happen later. Does anyone have any thoughts or comments on the architecture I suggested earlier? Mike On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang zson...@gmail.com wrote: Yes, the column here is not appropriate. 
Maybe we need not create new terms; in Google's Bigtable, the term "qualifier" is a good one.

On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn da...@lookin2.com wrote:

That would be a good time to get rid of the confusing "column" term, which incorrectly suggests a two-dimensional tabular structure. Suggestions:

1. A hypercube (or hypocube, if only two dimensions): replace "key" and "column" with "1st dimension", "2nd dimension", etc.
2. A file system: replace "key" and "column" with "directory" and "subdirectory"
3. A tuple tree: "Column family" replaced by "top-level tuple", whose value is the set of keys, whose value is the set of supercolumns of the key, whose value is the set of columns for the supercolumn, etc.
4. Etc.

On Thu, May 6, 2010 at 2:28 AM, Mike Malone m...@simplegeo.com wrote:

Nice, Ed, we're doing something very similar but less generic. Now replace all of the various methods for querying with a simple query interface that takes a Predicate, allow the user to specify (in storage-conf) which levels of the nested Columns should be indexed, and completely remove Comparators and have people subclass Column / implement IColumn and we'd really be on to something ;).

Mock storage-conf.xml:

<Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
  <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
    <Column Name="ThingThatsNowSuperColumnName" Type="Long">
      <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
        <Column Name="ThingThatCantCurrentlyBeRepresented"/>
      </Column>
    </Column>
  </Column>
</Column>

Thrift:

struct NamePredicate {
  1: required list<binary> column_names,
}
struct SlicePredicate {
  1: required binary start,
  2: required binary end,
}
struct CountPredicate {
  1: required struct predicate,
  2: required i32 count=100,
}
struct AndPredicate {
  1: required Predicate left,
  2: required Predicate right,
}
struct SubColumnsPredicate {
  1: required Predicate columns,
  2: required Predicate subcolumns,
}
... OrPredicate, OtherUsefulPredicates ...
query(predicate, count, consistency_level) # Count here would be total count of leaf values returned, whereas CountPredicate specifies a column count for a particular sub-slice.

Not fully baked... but I think this could really simplify stuff and make it more flexible. Downside is it may give people enough rope to hang themselves, but at least the predicate stuff is easily distributable. I'm thinking I'll play around with implementing some of this stuff myself if I have any free time in the near future. Mike

On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com wrote:

Very interesting, thanks!

On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote:

Follow-up from last week's discussion
Re: Is SuperColumn necessary?
I'm not sure this is much of an improvement. It does illustrate, however, the desire to couch the concepts in terms that each of us is already comfortable with. Nearly every set of terms borrowed from an existing system carries baggage which doesn't map appropriately. It's not that sparse multidimensional arrays are an unfamiliar construct. It's more that "sparse" may or may not apply depending on the part of your data you are describing. "Multidimensional" implies uniformity of structure, which is not to be taken for granted. Arrays are just one way to think of the structures; they also serve well as maps and sets (which can be modeled using arrays as well). There are certain semantics of sets, lists, and maps which people have wired into their brains, and reducing it all to arrays is likely to create more confusion. I think if we want to borrow terms from another system, it shouldn't be a computing system, or at least it should be so different or fundamental that the terms have to be re-understood free of baggage.

On Sun, May 9, 2010 at 1:30 AM, David Boxenhorn da...@lookin2.com wrote:

Guys, this is beginning to sound like MUMPS! http://en.wikipedia.org/wiki/MUMPS In MUMPS, all variables are sparse, multidimensional arrays which can be stored to disk. It is an arcane, and archaic, language (does anyone but me remember it?), but it has been used successfully for years. Maybe we can learn something from it. I like the terminology of sparse multidimensional arrays very much - it really clarifies my thinking. A column family would just be a variable.

On Fri, May 7, 2010 at 7:06 PM, Ed Anuff e...@anuff.com wrote: On Thu, May 6, 2010 at 11:10 PM, Mike Malone m...@simplegeo.com wrote:

The upshot is, the Cassandra data model would go from being "it's a nested dictionary, just kidding no it's not!" to being "it's a nested dictionary, for serious." Again, these are all just ideas...
but I think this simplified data model would allow you to express pretty much any query as a graph of simple primitives like Predicates, Filters, Aggregations, Transformations, etc. The indexes would allow you to cheat when evaluating certain types of queries - if you get a SlicePredicate on an indexed thingy, you don't have to enumerate the entire set of sub-thingies, for example.

This would be my dream implementation. I'm working on an application that needs that sort of capability. SuperColumns lead you toward thinking that should be done in the Cassandra tier, but then they fall short. So my thought was that I was just going to model everything in Cassandra as regular column families and columns, using composite keys and composite column names a la the code I shared above, and then implement the n-level hierarchy in the app tier. It looks like your suggestion is to take it in the other direction and make it part of the fundamental data model, which would be very useful if it could be made to work without big tradeoffs.
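The composite-column-name approach can be sketched as flattening a hierarchy into ordinary sorted column names. This is a simplified illustration, not the code Ed shared: real composite encodings use length-prefixed byte components so ordering stays correct, whereas this sketch uses a delimiter character only to keep the example short.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative sketch of flattening an n-level hierarchy into composite
// column names in one row. A delimiter is used here for brevity; real
// implementations encode components as length-prefixed bytes.
public class CompositeNames {
    static final char SEP = ':';

    // Joins hierarchy components into a single sortable column name.
    static String composite(String... parts) {
        return String.join(String.valueOf(SEP), parts);
    }

    public static void main(String[] args) {
        // One row holds a two-level hierarchy: author -> post -> field.
        SortedMap<String, String> row = new TreeMap<>();
        row.put(composite("arin", "post1", "title"), "WTF is a SuperColumn");
        row.put(composite("arin", "post2", "title"), "Another post");
        row.put(composite("ed", "post1", "title"), "Composite keys");

        // A "sub-slice" is just a range query over the composite names:
        // everything with prefix "arin" + SEP, i.e. keys in ["arin:", "arin;").
        SortedMap<String, String> arinsPosts =
            row.subMap("arin" + SEP, "arin" + (char) (SEP + 1));
        System.out.println(arinsPosts.keySet());
    }
}
```

Because the names sort hierarchically, a prefix range slice retrieves one subtree, which is exactly the operation SuperColumns provide for a single level of nesting - but this scheme works for any depth.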
Re: replacing columns via remove and insert
I found the issue. Timestamp ordering was broken because I generated a timestamp for the group of operations, then used Hector's remove, which generates its own internal timestamp, and then re-used my timestamp, not wary of the missing timestamp field on the remove operation. The fix was to simply regenerate my timestamp after any Hector operation which generates its own. In my case, Hector generates its own internal timestamp for removes, but not for other operations. Until the timestamp resolution is better than milliseconds, it's very possible to end up with the same timestamp for tightly grouped operations, which may lead to unexpected behavior. I've submitted a request to simplify this.

On Wed, May 5, 2010 at 5:03 PM, Jonathan Shook jsh...@gmail.com wrote:

When I try to replace a set of columns, like this: 1) remove all columns under a CF/row, 2) batch insert columns into the same CF/row... the columns cease to exist. Is this expected? This is just across 2 nodes with Replication Factor 2 and Consistency Level QUORUM.
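The millisecond-collision problem is easy to reproduce outside of any client library. A minimal sketch follows; the monotonic generator is a common illustrative workaround, not Hector's API:

```java
// Demonstrates why millisecond timestamps can collide for tightly grouped
// operations, and one common workaround: a monotonically increasing
// pseudo-microsecond clock. Illustrative only; not Hector's API.
public class Timestamps {
    private static long last = 0;

    // Returns a strictly increasing pseudo-microsecond timestamp, bumping
    // by one within the same millisecond to avoid collisions.
    static synchronized long uniqueMicros() {
        long now = System.currentTimeMillis() * 1000;
        if (now <= last) {
            now = last + 1;
        }
        last = now;
        return now;
    }

    public static void main(String[] args) {
        // Two back-to-back millisecond timestamps frequently collide,
        // so a remove and an insert can end up with the same timestamp
        // and the remove's tombstone wins.
        long a = System.currentTimeMillis();
        long b = System.currentTimeMillis();
        System.out.println("same millisecond: " + (a == b));

        // The monotonic generator never collides.
        System.out.println(Timestamps.uniqueMicros() < Timestamps.uniqueMicros());
    }
}
```

This also explains the symptom in the quoted question: if the batch insert carries the same timestamp as the preceding remove, the tombstone from the remove takes precedence and the columns "cease to exist."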
Re: Cassandra training on May 21 in Palo Alto
Dallas

On Thu, May 6, 2010 at 4:28 PM, Jonathan Ellis jbel...@gmail.com wrote:

We're planning that now. Where would you like to see one?

On Thu, May 6, 2010 at 2:40 PM, S Ahmed sahmed1...@gmail.com wrote:

Do you have rough ideas when you would be doing the next one? Maybe in 1 or 2 months, or much later?

On Tue, May 4, 2010 at 8:50 PM, Jonathan Ellis jbel...@gmail.com wrote:

Yes, although when and where are TBD.

On Tue, May 4, 2010 at 7:38 PM, Mark Greene green...@gmail.com wrote:

Jonathan, Awesome! Any plans to offer this training again in the future for those of us who can't make it this time around? -Mark

On Tue, May 4, 2010 at 5:07 PM, Jonathan Ellis jbel...@gmail.com wrote:

I'll be running a day-long Cassandra training class on Friday, May 21. I'll cover:
- Installation and configuration
- Application design
- Basics of Cassandra internals
- Operations
- Tuning and troubleshooting

Details at http://riptanobayarea20100521.eventbrite.com/

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
replacing columns via remove and insert
When I try to replace a set of columns, like this: 1) remove all columns under a CF/row, 2) batch insert columns into the same CF/row... the columns cease to exist. Is this expected? This is just across 2 nodes with Replication Factor 2 and Consistency Level QUORUM.
Re: Cassandra and Request routing
I think you may have found the "eventually" in "eventually consistent." With a replication factor of 1, you are allowing the client thread to continue to the read on node #2 before the data is replicated there. Try setting your replication factor higher for different results.

Jonathan

On Tue, May 4, 2010 at 12:14 AM, Olivier Mallassi omalla...@octo.com wrote:

Hi all, I can't figure out how to deal with request routing... In fact I have two nodes in the Test Cluster, and I wrote the client as specified here: http://wiki.apache.org/cassandra/ThriftExamples#Java. The keyspace is the default one (Keyspace1, ReplicationFactor 1...). The seeds are well configured (using the IP), i.e. the Cassandra log indicates that the servers are up. Everything goes well if I write and read the data on node #1, for instance. Yet, if I write the data on node #1 and then read the same data (using the key) on node #2, no data is found. Did I miss something? As far as I understood, I should be able to reach any node in the cluster, and the node should be able to redirect the request to the right node. Thank you for your answers and your time. Best Regards. Olivier.

--
Olivier Mallassi
OCTO Technology
50, Avenue des Champs-Elysées 75008 Paris
Mobile: (33) 6 28 70 26 61
Tél: (33) 1 58 56 10 00
Fax: (33) 1 58 56 10 01
http://www.octo.com
Octo Talks! http://blog.octo.com
Re: Cassandra and Request routing
I may be wrong here. Someone please correct me if I am. There may be a race condition if you aren't increasing your replication factor. If you insert to node A with replication factor 1, and then get from node B with replication factor 1, it should be possible (and even more likely in uneven loading scenarios) to see the results you described before. The ability to set the replication factor on inserts and gets allows you to decide when (if) and how much (little) to pay the price for consistency.

On Tue, May 4, 2010 at 2:31 AM, Olivier Mallassi omalla...@octo.com wrote:

:) I think this is simpler and I am just stupid. I retried with clean data and commit log directories and everything works well. I must have missed something (maybe when I upgraded from 0.5.1 to 0.6), but anyway, I am just testing.

On Tue, May 4, 2010 at 8:47 AM, Jonathan Shook jsh...@gmail.com wrote:

I think you may have found the "eventually" in "eventually consistent." With a replication factor of 1, you are allowing the client thread to continue to the read on node #2 before the data is replicated there. Try setting your replication factor higher for different results.

Jonathan

On Tue, May 4, 2010 at 12:14 AM, Olivier Mallassi omalla...@octo.com wrote:

Hi all, I can't figure out how to deal with request routing... In fact I have two nodes in the Test Cluster, and I wrote the client as specified here: http://wiki.apache.org/cassandra/ThriftExamples#Java. The keyspace is the default one (Keyspace1, ReplicationFactor 1...). The seeds are well configured (using the IP), i.e. the Cassandra log indicates that the servers are up. Everything goes well if I write and read the data on node #1, for instance. Yet, if I write the data on node #1 and then read the same data (using the key) on node #2, no data is found. Did I miss something? As far as I understood, I should be able to reach any node in the cluster, and the node should be able to redirect the request to the right node. Thank you for your answers and your time. Best Regards. Olivier.

--
Olivier Mallassi
OCTO Technology
50, Avenue des Champs-Elysées 75008 Paris
Mobile: (33) 6 28 70 26 61
Tél: (33) 1 58 56 10 00
Fax: (33) 1 58 56 10 01
http://www.octo.com
Octo Talks! http://blog.octo.com
Re: Cassandra and Request routing
Ah! Thank you. Explained better here: http://www.slideshare.net/benjaminblack/introduction-to-cassandra-replication-and-consistency

On Tue, May 4, 2010 at 8:38 PM, Robert Coli rc...@digg.com wrote:

On 5/4/10 7:16 AM, Jonathan Shook wrote: "I may be wrong here. Someone please correct me if I am. ... The ability to set the replication factor on inserts and gets allows you to decide when (if) and how much (little) to pay the price for consistency."

You mean Consistency Level, not Replication Factor.

=Rob
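The distinction Rob draws matters operationally: replication factor (N) is fixed per keyspace, while consistency level (how many replicas must acknowledge a read or write) is chosen per request. A sketch of the standard quorum-overlap rule from Dynamo-style systems follows; this is the general rule, not code from Cassandra itself:

```java
// Sketch of the quorum-overlap rule for eventually consistent stores:
// a read is guaranteed to observe the latest acknowledged write when
// the read set and write set must share at least one replica, i.e.
// R + W > N. Illustrative; not Cassandra source code.
public class QuorumRule {
    static boolean readSeesLatestWrite(int n, int w, int r) {
        // n = replication factor, w = write consistency, r = read consistency
        return r + w > n;
    }

    public static void main(String[] args) {
        // N=1 with ONE writes and ONE reads: 1 + 1 > 1, consistent.
        System.out.println(readSeesLatestWrite(1, 1, 1));
        // N=3 with ONE writes and ONE reads: no overlap guarantee.
        System.out.println(readSeesLatestWrite(3, 1, 1));
        // N=3 with QUORUM writes and QUORUM reads: 2 + 2 > 3, consistent.
        System.out.println(readSeesLatestWrite(3, 2, 2));
    }
}
```

Note that with N=1 every request trivially satisfies the rule, which is consistent with Olivier's finding that the original symptom came from stale data directories rather than replication lag.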
Re: Search Sample and Relation question because UDDI as Key
I am only speaking to your second question. It may be helpful to think of modeling your storage layout in terms of:
* lists
* sets
* hash maps
... and certain combinations of these. Since there are no schema-defined relations, your relations may appear implicitly between different views or copies of your data. The relationship can be assumed to be explicit only to the extent that it is used that way, or even (in some cases) enforced by a boundary layer in your software. For accessing data by value, you can try to do your bookkeeping (indexing) as you go, by maintaining auxiliary maps directly via your application. Scanning by value is really not a strong point for Cassandra, and in fact is one of the trade-offs made when moving to a DHT (http://en.wikipedia.org/wiki/Distributed_hash_table) data store. There has been discussion around putting some form of value indexing in at some point in the future, but the plans appear indefinite. Even then, it would move workload into the hub which may otherwise be better handled in a client node.

On Sun, May 2, 2010 at 4:33 PM, CleverCross | Falk Wolsky falk.wol...@clevercross.eu wrote:

Hello,

1) Can you provide a solution or a sample for (full-text) searching across Columns and SuperColumns? What is the way to realize this? Hadoop/MapReduce? Do you see a possibility to build/use an index for columns? Why this: in a given data model we must use UUIDs as keys, and so currently have no way to search values from columns (or do we?)

2) How can we realize a relation? For example (http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model), Arin describes well a simple data model for building a blog. But how can we read (filter) all posts in BlogEntries by a single author (i.e., filter the SuperColumns by a column inside a SuperColumn)? The relation in this example is Author - BlogEntries... To filter the data, one would need to specify a column/value combination in a get(...) call. I know well that Cassandra is not a relational database!

But without these relations the usage is very limited (specialized). Thanks in advance - and thanks for Cassandra! With Hector I built an (Apache) Cocoon transformer...

With kind regards,
Falk Wolsky
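The "bookkeeping as you go" approach described above can be sketched as maintaining an inverted index alongside the primary data. The class below is a hypothetical in-memory stand-in for two column families, not a Cassandra client API:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of application-maintained secondary indexing: every write to
// the primary map also updates an auxiliary index keyed by value.
// Hypothetical, in-memory stand-in for two column families.
public class ManualIndex {
    // Primary data: post id -> author.
    private final Map<String, String> posts = new HashMap<>();
    // Auxiliary index: author -> post ids.
    private final Map<String, Set<String>> byAuthor = new HashMap<>();

    void put(String postId, String author) {
        String old = posts.put(postId, author);
        if (old != null) {
            byAuthor.get(old).remove(postId); // keep the index consistent
        }
        byAuthor.computeIfAbsent(author, k -> new TreeSet<>()).add(postId);
    }

    // "Filter posts by author" becomes a direct lookup, not a scan.
    Set<String> postsBy(String author) {
        return byAuthor.getOrDefault(author, Collections.emptySet());
    }

    public static void main(String[] args) {
        ManualIndex idx = new ManualIndex();
        idx.put("post1", "arin");
        idx.put("post2", "arin");
        idx.put("post3", "falk");
        System.out.println(idx.postsBy("arin"));
    }
}
```

The trade-off is that the application (or a boundary layer, as suggested above) must keep both structures consistent on every write, since the store itself enforces no relation between them.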
Re: Storage Layout Questions
Ah, now I understand. Supercolumns it is.

On Wed, Apr 28, 2010 at 9:40 AM, Jonathan Ellis jbel...@gmail.com wrote:

I don't think you are missing anything. You'll have to pick your poison. FWIW, if each BAR has relatively few fields then supercolumns aren't bad. It's when a BAR has a dynamically growing number of fields (subcolumns) that you get in trouble with that model.

On Tue, Apr 27, 2010 at 4:24 PM, Jonathan Shook jsh...@gmail.com wrote:

I'm trying to model a one-to-many set of data in which both sides of the relation may grow arbitrarily large. There are arbitrarily many FOOs. For each FOO, there are arbitrarily many BARs. Both types are modeled as objects containing multiple fields (columns) in the application. Given a key-addressable FOO element, I'd like to be able to do range access operations on the associated BARs according to their temporal names. I wish to avoid:

1) using a super column to nest the temporal ids (or column names) within a row of the primary key, due to the memory-based limitations of super column deserialization (and the implicit compute costs that go with it);
2) keeping a separate map between the FOO type and the BAR type;
3) serializing all BAR types into the value field of each FOO-keyed, BAR-named column.

Were the super column addressing more scalable, I'd see it as a natural fit. Does anybody have an elegant solution to this which I am overlooking? In the absence of ideas, I'd like some feedback on the trade-offs of the above avoids.

Jonathan

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
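The range-by-temporal-name requirement can be sketched with a sorted map per FOO row, where each BAR is a column whose name is its timestamp. This is an in-memory stand-in for a column family with a Long comparator; all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of range access over temporally named columns: each FOO row
// holds BAR columns keyed by timestamp, so "the BARs between t1 and t2"
// is a contiguous slice. In-memory stand-in for a column family with a
// Long (or TimeUUID) comparator.
public class TemporalSlices {
    private final Map<String, NavigableMap<Long, String>> rows = new HashMap<>();

    void insert(String fooKey, long barTimestamp, String barValue) {
        rows.computeIfAbsent(fooKey, k -> new TreeMap<>()).put(barTimestamp, barValue);
    }

    // Equivalent of a slice query: BARs for fooKey in [from, to).
    SortedMap<Long, String> slice(String fooKey, long from, long to) {
        return rows.getOrDefault(fooKey, new TreeMap<>())
                   .subMap(from, true, to, false);
    }

    public static void main(String[] args) {
        TemporalSlices ts = new TemporalSlices();
        ts.insert("foo1", 100L, "bar-a");
        ts.insert("foo1", 200L, "bar-b");
        ts.insert("foo1", 300L, "bar-c");
        System.out.println(ts.slice("foo1", 100L, 300L).values());
    }
}
```

Whether this lives in one supercolumn row, in composite column names, or in a separate row per FOO is exactly the "pick your poison" trade-off discussed above; the access pattern is the same in each case.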
Re: error during snapshot
The allocation of memory may have failed based on the available virtual memory, regardless of whether the memory would have been subsequently accessed by the process. Some systems do the work of allocating physical pages only when they are accessed for the first time; I'm not sure if yours is one of them.

On Tue, Apr 27, 2010 at 10:45 AM, Lee Parker l...@socialagency.com wrote:

Adding a swapfile fixed the error, but it doesn't look as though the process is using the swap file at all. Lee Parker

On Tue, Apr 27, 2010 at 9:49 AM, Eric Hauser ewhau...@gmail.com wrote:

Have you read this? http://forums.sun.com/thread.jspa?messageID=9734530 I don't think EC2 instances have any swap.

On Tue, Apr 27, 2010 at 10:16 AM, Lee Parker l...@socialagency.com wrote:

Can anyone help with this? It is preventing me from getting backups of our cluster. Lee Parker

On Mon, Apr 26, 2010 at 10:02 PM, Lee Parker l...@socialagency.com wrote:

I was attempting to get a snapshot on our Cassandra nodes. I get the following error every time I run nodetool ... snapshot.
Exception in thread "main" java.io.IOException: Cannot run program "ln": java.io.IOException: error=12, Cannot allocate memory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
	at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:221)
	at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1060)
	at org.apache.cassandra.db.Table.snapshot(Table.java:256)
	at org.apache.cassandra.service.StorageService.takeAllSnapshot(StorageService.java:1005)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
	at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
	at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
	at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
	at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
	at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
	at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1426)
	at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
	at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1264)
	at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1359)
	at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
	at sun.rmi.transport.Transport$1.run(Transport.java:159)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
	at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
	... 34 more

The nodes are both Amazon EC2 Large instances with 7.5G RAM (6G allocated to the Java heap), two cores, and only 70G of data in Cassandra. They have plenty of available RAM and HD space. Has anyone else run into this error? Lee Parker
Storage Layout Questions
I'm trying to model a one-to-many set of data in which both sides of the relation may grow arbitrarily large. There are arbitrarily many FOOs. For each FOO, there are arbitrarily many BARs. Both types are modeled as objects containing multiple fields (columns) in the application. Given a key-addressable FOO element, I'd like to be able to do range access operations on the associated BARs according to their temporal names. I wish to avoid:

1) using a super column to nest the temporal ids (or column names) within a row of the primary key, due to the memory-based limitations of super column deserialization (and the implicit compute costs that go with it);
2) keeping a separate map between the FOO type and the BAR type;
3) serializing all BAR types into the value field of each FOO-keyed, BAR-named column.

Were the super column addressing more scalable, I'd see it as a natural fit. Does anybody have an elegant solution to this which I am overlooking? In the absence of ideas, I'd like some feedback on the trade-offs of the above avoids.

Jonathan