Re: Ball is rolling on High Performance Cassandra Cookbook second edition
Sounds good. One thing I'd like to see is more coverage of Cassandra internals. Out of the box Cassandra's great, but a little inside knowledge can be very useful because it helps you design your applications to work with Cassandra, rather than having to make endless optimizations later that could probably have been avoided had you done your implementation slightly differently.

Another thing that may be worth adding would be a recipe that shows an approach to evaluating Cassandra for your organization/use case. I realize that's going to vary on a case-by-case basis, but one thing I've noticed is that some people dive in without really thinking through whether Cassandra is actually the right fit for what they're doing. It sort of becomes a hammer for anything that looks like a nail.

On Tue, Jun 26, 2012 at 10:25 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

Hello all,

It has not been very long since the first book was published, but several things have been added to Cassandra and a few things have changed. I am putting together a list of changed content, for example features like the old per-column-family memtable flush settings versus the new system with the global variable. My editors have given me the green light to grow the second edition from ~200 pages currently up to 300 pages! This gives us the ability to add more items/sections to the text. Some things were missing from the first edition, such as Hector support. Nate has offered to help me in this area.

Please feel free to contact me with any ideas and suggestions of recipes you would like to see in the book. Also get in touch if you want to write a recipe. Several people added content to the first edition and it would be great to see that type of participation again.

Thank you,
Edward

--
Courtney Robinson
court...@crlog.info
http://crlog.info
07535691628 (No private #s)
CF design
I was hoping someone could share their opinions on the following CF designs or suggest a better way of doing it.

My app is constantly receiving new data that contains URLs. I was thinking of hashing the URL to form a key. The data is a JSON object with several properties. For now many of its properties will be ignored and only 4 are of interest: URL, title, username and user_rating. Often the same URL is received again but shared by a different user. I'm wondering if anyone can suggest a better approach than what I propose below, which needs to be able to answer the following.

Queries

I'll be asking the following questions:

1) Give me the N most frequently shared items over:
   a) The last 30 minutes
   b) The last 24 hours
   c) Between D1 and D2 (where D1 and D2 represent the start and end dates of interest)
2) Give me the N most shared items over the 3 time periods above WHERE the average user rating is above 5
3) Give me X for the item with the ID 123 (where X is a property of the item with the ID 123)

Proposed approach

Use timestamps as keys in the CF; that should take care of the queries under 1 and partially handle 2. Use the columns to store the JSON data: common fields such as the title, which will be the same no matter how many users share the same link, get their own columns in the row. The other column names will be the users' usernames, and the value of each of those columns will be any JSON left over once the common fields have been taken out.

For the rest of 2, I can get the N items we're interested in and calculate the average user rating for each item client side. Of course, using the timestamp as the key means we need to maintain an index from the "real" keys/IDs to each item, which would allow us to answer "Give me the item with the ID 123".

Finally, to address 3, I was thinking: using the index, get the timestamp of the item, and on the client side find the property of interest.
CF1
  Timestamp1
    Title       value
    ID          ID1
    Username3   {"rating":5}
    Username2   {"rating":0}
    Username2   {"rating":4}
  Timestamp2
    Title       Value1
    ID          ID2
    Username24  {"rating":1}
    Username87  {"rating":9}
    Username7   {"rating":2}

CF2
  ID1   Timestamp1
  ID2   Timestamp2

In the Username columns I'd ideally like to avoid storing the other properties as JSON, but I couldn't think of a sensible way of doing it when that JSON grows to have 10 other properties. Does this sound like a sensible approach to designing my CFs?
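To make queries 1 and 2 above concrete, here is a minimal in-memory sketch in Python. It is not Cassandra client code: the `SHARES` list and the `top_shared` helper are hypothetical names, and each share is assumed to be one (timestamp, item ID, username, rating) record. It only illustrates the client-side counting and averaging the post describes.

```python
from collections import defaultdict

# Hypothetical share events: (timestamp_seconds, item_id, username, rating).
SHARES = [
    (100, "ID1", "Username3", 5),
    (160, "ID1", "Username2", 9),
    (170, "ID2", "Username24", 1),
    (200, "ID2", "Username87", 9),
    (220, "ID1", "Username7", 7),
    (900, "ID2", "Username9", 2),
]


def top_shared(shares, start, end, n, min_avg_rating=None):
    """Return up to n (item_id, share_count, avg_rating) for start <= ts < end."""
    ratings = defaultdict(list)
    for ts, item_id, _username, rating in shares:
        if start <= ts < end:
            ratings[item_id].append(rating)

    rows = []
    for item_id, values in ratings.items():
        avg = sum(values) / len(values)
        # Query 2: keep only items whose average rating is strictly above the bar.
        if min_avg_rating is None or avg > min_avg_rating:
            rows.append((item_id, len(values), avg))

    # Most shared first; ties broken by item id so the order is deterministic.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows[:n]
```

In a real deployment the time window would presumably map to a key or column range read rather than a scan of a Python list; how well that works depends on the partitioner and row layout, which this sketch does not model.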
Re: CQL DELETE statement
Cool... Okay, the plan is to eventually not use Thrift underneath for the CQL stuff, right? Once that is done and the new transport is in place, or even while the new transport is being designed, is this not something that's worth looking into again? I think it'd be a nice feature.

-Original Message-
From: Jonathan Ellis
Sent: Monday, April 18, 2011 3:24 AM
To: user@cassandra.apache.org
Cc: Tyler Hobbs
Subject: Re: CQL DELETE statement

Very old. https://issues.apache.org/jira/browse/CASSANDRA-494

On Sun, Apr 17, 2011 at 7:49 PM, Tyler Hobbs ty...@datastax.com wrote:

You are correct, but this is also a limitation with the Thrift API -- it's not CQL specific. It turns out that deleting a slice of columns is difficult. There's an old JIRA ticket somewhere that describes the issues.

On Sun, Apr 17, 2011 at 7:45 PM, Courtney Robinson sa...@live.co.uk wrote:

Looking at the CQL spec, it doesn't seem to be possible to delete a range of columns for a given key without specifying the individual columns to be removed, e.g.

DELETE col1 .. col20 from CF WHERE KEY=key|(key1,key2)

Am I correct in thinking so, or have I missed that somewhere?

--
Tyler Hobbs
Software Engineer, DataStax
Maintainer of the pycassa Cassandra Python client library

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
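Since the thread establishes that neither CQL nor Thrift can delete a range of columns directly, a workaround a client could fall back to is reading the column names in the range first and then deleting each one by name. The sketch below shows only that logic against a plain Python dict standing in for a row; `delete_column_range` is a hypothetical helper, not part of any Cassandra client.

```python
def delete_column_range(row, start, finish):
    """Remove every column whose name is in [start, finish]; return the removed names.

    `row` is a dict of column name -> value standing in for one Cassandra row.
    """
    # Step 1: the "slice read" that discovers which column names fall in the range.
    doomed = sorted(name for name in row if start <= name <= finish)
    # Step 2: one delete per discovered column name.
    for name in doomed:
        del row[name]
    return doomed


row = {"col01": "a", "col02": "b", "col10": "c", "col20": "d", "col21": "e"}
removed = delete_column_range(row, "col02", "col20")
```

Against a real cluster the read and the deletes are separate operations, so columns written into the range in between would presumably survive; that gap is one reason a server-side range delete would be the nicer feature.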
codeigniter+phpcassa
For anyone using CodeIgniter who is interested: I've written a little library to integrate CodeIgniter with phpcassa, and consequently Cassandra. It gives you access to CodeIgniter's $this->db instance, which has only the library's methods and phpcassa's.

Follow-up tutorial: http://crlog.info/2011/04/07/apache-cassandra-phpcassa-code-igniter-large-scale-php-app-in-5-minutes/
Library download: https://github.com/zcourts/cassandraci
Re: Designing a decent data model for an online music shop...confused/stuck on decisions
Thanks for the response. I haven't checked on the status of phpcassa in a while, but does it now work with 0.7? That was one of the main reasons I switched to Pandra; it seemed more up to date.

From: Tyler Hobbs
Sent: Monday, March 07, 2011 2:40 AM
To: user@cassandra.apache.org
Subject: Re: Designing a decent data model for an online music shop...confused/stuck on decisions

Regarding PHP performance with Cassandra, THRIFT-638 was recently resolved and it shows some big performance improvements. I'll be upgrading the Thrift package that ships with phpcassa soon to include this fix, so you may want to compare performance numbers before and after.

On Sun, Mar 6, 2011 at 8:03 PM, Courtney e-mailadr...@hotmail.com wrote:

We're in a bit of a predicament. We have an e-music store currently built in PHP using CodeIgniter/MySQL... The current system has 100K+ users and a decent song collection. Over the last few months I've been playing with Cassandra... needless to say I'm impressed, but I have a few questions.

Firstly, I want to avoid rewriting the entire site if possible, so my instinct is to replace the database layer in CodeIgniter... Is this something anyone would recommend, and are there any gotchas in doing that?

I can't say I've been terribly happy with PHP accessing Cassandra. When sample data of the same size and type was put into MySQL and into Cassandra (30K records in the table), the pages with PHP connecting to Cassandra took longer to load. I thought maybe it was my setup that needed tweaking, and I've played with as many options as I could, but the best I've gotten is matching query time. The query speed test was simply taking timestamps right before the query call and right after it returned... Is this something anyone else has seen? Any comments or suggestions? I've tried using Thrift, phpcassa and Pandra with pretty similar numbers.

My other thought was that maybe it was the way I designed my CFs. At first I used super columns to model the user account CF, based on a post I read by Arin (WTF is a super column), but I later changed to using normal CFs. I'm trying to make this work but I get the feeling my approach is somewhat... I don't know, misguided. Here's a breakdown of the current model.

CF:Users{ uid fname lname username password street }

There are some additional columns in place for a user, but I'm keeping it simple...

CF:Library{ uid songid ... other info about user library }

CF:Songs{ songid title artistid }

This is all still very relational (considering I go on to have CFs for playlists and artists) and I'm not sure if this is a good design for the data, but... when I looked into combining some of the info and removing some CFs, I ran into the issue of replicating data all over the place. If, for example, I stored the artist name in the library for each record, then the artist would be replicated for every song they have, for every user who has that song in their library. Where do you draw the line in deciding how much is okay to replicate?

As much as I don't like the idea of building the application from scratch, I'm considering rebuilding it in Java/JSP just to get the benefit of using the Hector client. (The efforts of the guys doing the PHP libs are much appreciated, but PHP doesn't seem to go too well with Cas.) I'm in the process of making decisions because the upgrade/rebuild needs to have a fairly steady working version by October, and I don't want to go wrong before even starting.

Recommendations, suggestions and advice are all welcome. (Any experience with PHP and Cas. is also welcome; since all my fav. libs are in PHP, I'm reluctant to turn away.)

--
Tyler Hobbs
Software Engineer, DataStax
Maintainer of the pycassa Cassandra Python client library
Designing a decent data model for an online music shop...confused/stuck on decisions
We're in a bit of a predicament. We have an e-music store currently built in PHP using CodeIgniter/MySQL... The current system has 100K+ users and a decent song collection. Over the last few months I've been playing with Cassandra... needless to say I'm impressed, but I have a few questions.

Firstly, I want to avoid rewriting the entire site if possible, so my instinct is to replace the database layer in CodeIgniter... Is this something anyone would recommend, and are there any gotchas in doing that?

I can't say I've been terribly happy with PHP accessing Cassandra. When sample data of the same size and type was put into MySQL and into Cassandra (30K records in the table), the pages with PHP connecting to Cassandra took longer to load. I thought maybe it was my setup that needed tweaking, and I've played with as many options as I could, but the best I've gotten is matching query time. The query speed test was simply taking timestamps right before the query call and right after it returned... Is this something anyone else has seen? Any comments or suggestions? I've tried using Thrift, phpcassa and Pandra with pretty similar numbers.

My other thought was that maybe it was the way I designed my CFs. At first I used super columns to model the user account CF, based on a post I read by Arin (WTF is a super column), but I later changed to using normal CFs. I'm trying to make this work but I get the feeling my approach is somewhat... I don't know, misguided. Here's a breakdown of the current model.

CF:Users{ uid fname lname username password street }

There are some additional columns in place for a user, but I'm keeping it simple...

CF:Library{ uid songid ... other info about user library }

CF:Songs{ songid title artistid }

This is all still very relational (considering I go on to have CFs for playlists and artists) and I'm not sure if this is a good design for the data, but... when I looked into combining some of the info and removing some CFs, I ran into the issue of replicating data all over the place. If, for example, I stored the artist name in the library for each record, then the artist would be replicated for every song they have, for every user who has that song in their library. Where do you draw the line in deciding how much is okay to replicate?

As much as I don't like the idea of building the application from scratch, I'm considering rebuilding it in Java/JSP just to get the benefit of using the Hector client. (The efforts of the guys doing the PHP libs are much appreciated, but PHP doesn't seem to go too well with Cas.) I'm in the process of making decisions because the upgrade/rebuild needs to have a fairly steady working version by October, and I don't want to go wrong before even starting.

Recommendations, suggestions and advice are all welcome. (Any experience with PHP and Cas. is also welcome; since all my fav. libs are in PHP, I'm reluctant to turn away.)
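The replication trade-off in this post can be sketched with plain Python dicts standing in for the column families. Everything here is illustrative: the sample songs, the `denormalise` helper and the `library_wide` layout are assumptions, not the poster's schema. The point is only to compare how many lookups each layout needs to render a user's library.

```python
# Normalised layout (as in the post): three "column families" joined client-side.
songs = {
    "s1": {"title": "Song A", "artistid": "a1"},
    "s2": {"title": "Song B", "artistid": "a1"},
}
artists = {"a1": {"name": "Artist X"}}
library = {"u1": ["s1", "s2"]}


def library_normalised(uid):
    """One read for the library row, then an extra read per song and per artist."""
    result = []
    for songid in library[uid]:
        song = songs[songid]
        artist = artists[song["artistid"]]
        result.append((song["title"], artist["name"]))
    return result


def denormalise(uid):
    """Build a wide library row that copies title and artist name beside each song id."""
    return {
        songid: {
            "title": songs[songid]["title"],
            "artist": artists[songs[songid]["artistid"]]["name"],
        }
        for songid in library[uid]
    }


# Hypothetical denormalised column family: one wide row per user.
library_wide = {"u1": denormalise("u1")}


def library_denormalised(uid):
    """A single row read answers the page; the cost is duplicated artist names."""
    return [(col["title"], col["artist"]) for col in library_wide[uid].values()]
```

Both functions return the same page data. The denormalised row trades storage, plus the work of rewriting copies if an artist name ever changes, for a single read per page, so where to draw the line depends on how often the copied fields change compared with how often they are read.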
Re: Fwd: SPA2011 - June 12th-15th - BCS London, UK - Call for Sessions
Anyone else in London interested in this?

--
From: Jonathan Ellis jbel...@gmail.com
Sent: Monday, February 14, 2011 10:30 PM
To: user user@cassandra.apache.org
Subject: Fwd: SPA2011 - June 12th-15th - BCS London, UK - Call for Sessions

In case any of the London crowd is interested:

-- Forwarded message --
From: Mike Hill mikewh...@gmail.com
Date: Mon, Feb 14, 2011 at 4:14 PM
Subject: SPA2011 - June 12th-15th - BCS London, UK - Call for Sessions
To: mikewh...@gmail.com

SPA2011 - June 12th-15th - BCS London, UK - Call for Sessions

Submissions Deadline: 28th February 2011.

To find out more, and submit a proposal, visit http://www.spaconference.org

We would like to invite you to present a session at this leading software development conference. SPA2011 will continue the well established SPA tradition of learning through interaction, with sessions exploring the latest advancements in software development practice.

We're looking for sessions which are interactive, thought provoking and have not been seen before in this form (may be a topic you've covered before, but it must be truly interactive). They can be about technology or teams, practice or process - in fact anything to do with advancing the state of the practice in software development.

We welcome submissions from everyone. If you're not experienced with presenting sessions at SPA, you'll be supported by our well-established shepherding process, which has ensured the standard of sessions at SPA is exceptionally high.

Presenters will receive free attendance to the conference. See the website for conditions.

Don't be shy! This year you can submit a rough proposal and get your peers to give you feedback!

To find out more, and submit a proposal, visit http://www.spaconference.org

The submission deadline is 28th February 2011.

Willem van den Ende
Rob Bowley
Programme Chairs SPA 2011
progra...@spaconference.org

--
You received this message because you are subscribed to the Google Groups NOSQL group.
To post to this group, send email to nosql-discuss...@googlegroups.com.
To unsubscribe from this group, send email to nosql-discussion+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/nosql-discussion?hl=en.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
0.7 PHP thrift example
Does anyone have a working 0.7 Thrift example in PHP? I compiled the 0.5 version of Thrift and built the PHP bindings, but when I try to run the PHP example on the wiki I get:

TException: Error: Attempt to send non-object type as a T_STRUCT
Re: TSocket timing out
It may also be an idea to check the node's memory usage. I encountered this on a few occasions and I simply killed any unneeded processes that were eating away at my node's memory. In each instance it worked fine once there was about 300MB of free memory.

From: Patricio Echagüe
Sent: Sunday, January 30, 2011 12:46 AM
To: user@cassandra.apache.org
Subject: Re: TSocket timing out

The recommendation is to wait a few milliseconds and retry. For example, if you use Hector (I don't think it is your case), Hector will retry against different nodes in your cluster, and the retry mechanism is tunable as well.

On Sat, Jan 29, 2011 at 2:20 PM, buddhasystem potek...@bnl.gov wrote:

When I do a lot of inserts into my cluster (10k at a time) I get timeouts from Thrift, the TSocket.py module. What do I do?

Thanks,
Maxim

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/TSocket-timing-out-tp5973548p5973548.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
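The "wait a few milliseconds and retry" advice can be sketched as a small wrapper in Python. This is a generic illustration rather than Thrift or Hector code: the `TimedOut` exception and `with_retries` helper are hypothetical names, and doubling the delay between attempts is an assumption of this sketch (the thread only says to wait briefly and retry).

```python
import time


class TimedOut(Exception):
    """Stand-in for a client-side timeout such as a Thrift socket timeout."""


def with_retries(operation, attempts=3, delay=0.01, sleep=time.sleep):
    """Call operation(); on TimedOut, wait and try again, up to `attempts` calls.

    The wait doubles after each failure (delay, 2*delay, 4*delay, ...).
    The last TimedOut is re-raised if every attempt fails.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except TimedOut as error:
            last_error = error
            # No point sleeping after the final failed attempt.
            if attempt < attempts - 1:
                sleep(delay * (2 ** attempt))
    raise last_error
```

A caller would wrap each insert (or each small batch) in `with_retries(lambda: ...)`, catching whatever exception its client library actually raises on a timeout in place of `TimedOut`.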
Re: Cassandra for graph data structure
Efficient, I'm not totally sure on yet. Would need to do some testing. A quick mock-up should highlight any tradeoffs being made. I think it's been decided we'll test it out by creating a simple client using Thrift, and if the results look good we'll have an attempt using Hector as a layer.

--
From: Ran Tavory ran...@gmail.com
Sent: Saturday, September 25, 2010 1:41 PM
To: user@cassandra.apache.org
Subject: Re: Cassandra for graph data structure

Courtney, this certainly sounds interesting and, as Nate suggested, we're always looking for valuable contributions. A few things to keep in mind:
- I'm curious, as Lucas has asked: is it possible to create an efficient graph API over Cassandra, and what are the tradeoffs?
- If the API is general enough and the functionality is reusable then we'd be happy to add it to Hector. If not, you can create a library that uses Hector as a layer.

On Friday, September 24, 2010, Courtney Robinson sa...@live.co.uk wrote:

Nate, Lucas, thanks for the responses.

Nate, I think it would be asking a bit much to suggest the Hector team implement convenience methods for graph representations. But if we went ahead and forked Hector, I'd be sure to contribute back what I can and just release it as another client or, if the final product can be merged with Hector... I'd like thoughts on any features outside my own use case though, so that we can build it to handle other use cases as well.

Lucas, I understand what you're saying, but I've had a quick play with neo4j and the expense we'd pay for reads offsets a lot of the setbacks I'd run into using neo4j, not to mention having to learn it...

--
From: Nate McCall n...@riptano.com
Sent: Friday, September 24, 2010 4:14 PM
To: user@cassandra.apache.org
Subject: Re: Cassandra for graph data structure

My idea however was to fork hector, remove all the stuff i don't need and turn it into a graph API sitting on top of Cassandra.

We are always looking for ideas and design feedback regarding Hector. Please feel free to make suggestions or fork and send pull requests. http://groups.google.com/group/hector-users
Cassandra for graph data structure
Apologies for the first e-mail with the misleading subject; I was reading a thread and mistakenly replied.

I've been using Cassandra for a while now with no problems. I have a new project coming up that we're penciling out the data structure for. The best we've come up with has turned into a graph structure. I just want to know what people think, because I know there are graph DBs out there like neo4j etc. My idea, however, was to fork Hector, remove all the stuff I don't need and turn it into a graph API sitting on top of Cassandra. The main reason for this approach is that I am already very familiar with Cassandra, and it will be faster to write a client or modify an existing one than to learn a new API. Do you think there are any gotchas in this approach? Any tips, pointers?
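One common way to lay a graph over a column store is an adjacency row per vertex: the row key is the vertex ID and each column name is a neighbour's ID. The post does not say which layout was being considered, so the Python sketch below is only an assumption about what such a graph API might look like, with a dict standing in for the column family; `add_edge`, `neighbours` and `reachable` are hypothetical names.

```python
from collections import deque

# Stand-in for one column family: row key = vertex id,
# columns = {neighbour id: edge label}.
edges = {}


def add_edge(src, dst, label=""):
    """Store a directed edge as a column in the source vertex's row."""
    edges.setdefault(src, {})[dst] = label
    edges.setdefault(dst, {})  # make sure the target vertex has a row too


def neighbours(vertex):
    """One row read: the column names are the outgoing neighbours, in sorted order."""
    return sorted(edges.get(vertex, {}))


def reachable(start):
    """Breadth-first traversal; every hop costs one more row read."""
    seen = [start]
    queue = deque([start])
    while queue:
        for nxt in neighbours(queue.popleft()):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen
```

The likely gotcha this exposes is that a traversal turns into one read per vertex visited, so deep or wide traversals may cost many round trips compared with a database built for graph queries.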
Re: Cassandra for graph data structure
Nate, Lucas, thanks for the responses.

Nate, I think it would be asking a bit much to suggest the Hector team implement convenience methods for graph representations. But if we went ahead and forked Hector, I'd be sure to contribute back what I can and just release it as another client or, if the final product can be merged with Hector... I'd like thoughts on any features outside my own use case though, so that we can build it to handle other use cases as well.

Lucas, I understand what you're saying, but I've had a quick play with neo4j and the expense we'd pay for reads offsets a lot of the setbacks I'd run into using neo4j, not to mention having to learn it...

--
From: Nate McCall n...@riptano.com
Sent: Friday, September 24, 2010 4:14 PM
To: user@cassandra.apache.org
Subject: Re: Cassandra for graph data structure

My idea however was to fork hector, remove all the stuff i don't need and turn it into a graph API sitting on top of Cassandra.

We are always looking for ideas and design feedback regarding Hector. Please feel free to make suggestions or fork and send pull requests. http://groups.google.com/group/hector-users
column limit on multiget_slice or get_slice
Is it possible to get the first x columns from a row without knowing the column names? So far I've been working with just grabbing all the columns in a row, or getting a specific column that I know the name of. If it is possible, can anyone point me in the right direction on how to do this? I'm using 0.6.4 with the Thrift interface in Java. I use Hector, but I'd much prefer knowing how it's done via Thrift first :) Thanks
Re: column limit on multiget_slice or get_slice
Ahhh, excellent. Thank you.

From: Chen Xinli
Sent: Tuesday, September 14, 2010 10:51 AM
To: user@cassandra.apache.org
Subject: Re: column limit on multiget_slice or get_slice

You can use get_slice:

    public List<ColumnOrSuperColumn> get_slice(String keyspace, String key, ColumnParent column_parent, SlicePredicate predicate, ConsistencyLevel consistency_level) throws InvalidRequestException, UnavailableException, TimedOutException, TException;

In the SlicePredicate's SliceRange, set start and finish to empty and count to x.

2010/9/14 Courtney Robinson sa...@live.co.uk

Is it possible to get the first x columns from a row without knowing the column names? So far I've been working with just grabbing all the columns in a row, or getting a specific column that I know the name of. If it is possible, can anyone point me in the right direction on how to do this? I'm using 0.6.4 with the Thrift interface in Java. I use Hector, but I'd much prefer knowing how it's done via Thrift first :) Thanks

--
Best Regards,
Chen Xinli
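To illustrate what an empty start and finish with a count of x selects, here is a small Python model of the slice semantics described in the reply. It is not the Thrift API: the `get_slice` function below is a simplified stand-in operating on a dict, and plain string sorting is an assumption standing in for whatever comparator the column family is configured with.

```python
def get_slice(row, start="", finish="", count=100):
    """Return up to `count` (name, value) pairs in column-name order.

    An empty `start` means "from the first column" and an empty `finish`
    means "to the last column", mirroring the advice in the reply above.
    """
    names = sorted(row)  # stand-in for the column family's comparator ordering
    selected = [
        name
        for name in names
        if (start == "" or name >= start) and (finish == "" or name <= finish)
    ]
    return [(name, row[name]) for name in selected[:count]]


row = {"c": 3, "a": 1, "b": 2, "d": 4}
first_two = get_slice(row, count=2)       # first x columns, names not known up front
middle = get_slice(row, "b", "c", count=10)
```

Which columns count as "first" therefore depends entirely on the comparator: with a time-based or numeric comparator the first x columns are the x smallest names under that ordering, not the x most recently written.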
Row limits
Are there any limits (implied or otherwise) on how many columns there can be in a single row? My understanding has always been that there is no limit on how many columns you can have in a single row, but I've just read Arin's "WTF is a super column" post again and I got the impression he was saying that if it's a normal row, i.e. not Super, there is a limit, but if the row is within a structure of type Super then there is a potentially unbounded number of columns to be had. Is my original understanding correct? Have I just misinterpreted his article?

regards,
Courtney
Re: Few questions regarding cassandra deployment on windows
I haven't looked at your previous e-mail(s) or the responses to them, but have a look at
http://prettyprint.me/2010/02/14/running-cassandra-as-an-embedded-service/
The post was written by one of the guys who maintains the Hector Cassandra client. In any case, the simple and short answer is yes, he did it, so ...

From: kannan chandrasekaran
Sent: Wednesday, September 08, 2010 1:20 AM
To: user@cassandra.apache.org
Subject: Re: Few questions regarding cassandra deployment on windows

Can you please elaborate on why you think Cassandra would not be suitable for this?

The main reasons why we are considering Cassandra:
1) We are focusing on moving to a distributed architecture very soon, and using Cassandra as a backend naturally lends itself to this.
2) Our schema is relatively simple and we wanted quick read and write access. Cassandra response times were faster than MySQL and we expect it to satisfy our requirements (without the need for a cache layer).
3) I believe with 0.7's live schema updates, the need for changing the XML files and restarting the service would go away, so I believe use case 2 is only difficult in the 0.6 versions...

I am more interested in knowing if we can start/run/stop Cassandra as an embedded service within a JVM.

Thanks
Kannan

From: Benjamin Black b...@b3k.us
To: user@cassandra.apache.org
Sent: Tue, September 7, 2010 4:38:41 PM
Subject: Re: Few questions regarding cassandra deployment on windows

This does not sound like a good application for Cassandra at all. Why are you using it?

On Tue, Sep 7, 2010 at 3:42 PM, kannan chandrasekaran ckanna...@yahoo.com wrote:

Hi All,

We are currently considering Cassandra for our application.

Platform:
* a single-node cluster
* Windows '08
* 64-bit JVM

For the sake of brevity, let Cassandra service = a single-node Cassandra server running as an embedded service inside a JVM.

My use cases:
1) Start with a schema (a keyspace and a set of column families under it) in a Cassandra service.
2) Need to be able to replicate the same schema structure (add new keyspaces/column families, with different names of course).
3) Because of some existing limitations in my application, I need to be able to write to the keyspace/column families from one Cassandra service and read the written changes from a different Cassandra service. Both the write and the read Cassandra services share the same data directory. I understand that the application has to take care of any naming collisions.

A couple of questions related to the above use cases:
1) I want to spawn a new JVM and launch Cassandra as an embedded service programmatically instead of using startup.bat. I would like to know if that is possible, and any pointers in that direction would be really helpful. (use case 1)
2) I understand that there are provisions for live schema changes in 0.7 (thank you guys!!!), but since I can't use a beta version in production, I am restricted to 0.6 for now. Is it possible to support use case 2 in 0.6.5? More specifically, I am planning to make runtime changes to the storage.conf xml file followed by a Cassandra service restart.
3) Can I switch the data directory at run time? (use case 3) In order not to disrupt reads while the writes are in progress, I am thinking of something like: copy the existing data dir into a new location; write to the new data directory; once the write is complete, switch pointers and restart the Cassandra service so it reads from the new directory and picks up the updated changes.

Any help is greatly appreciated.

Thanks
Kannan
indexing methods
A few of us are working on a book for Cassandra and got to the point where we (well, I did anyway) wanted to include an example of a non-trivial inverted index. I've been playing around with different ideas on how I could store the data, and I've had a look at the previous threads that touched on the subject, but with the 2 or 3 ideas I've seen on the list someone always points out something that punches a hole in the approach.

I've been playing around with the idea of using a column family for the index where I store the term as the row key; each column name is then a 64-bit long and its value is the doc ID. If the column name represents a ranking for the doc ID it stores, and the CompareWith option is LongType, then once a term is retrieved the first x columns would represent the most related docs for that term. I'd go into more detail but I'm using my phone to write this and I think that gets the idea across.

Of course my first thought on this is: is it scalable? In a system where possibly millions of docs are related to one term, is it a good idea to have potentially that many columns in one row, all associated with the one row key, which is the term? I just want to know what others think, and whether you have any suggestions or have a similar thing implemented that you're able to share.

On a side note, there has been a bit of talk about secondary indexes in 0.7. Can anyone shed some light on that, or point me to any presentation or the like where it's mentioned, so I can get a better idea of what it's for?

Thanks,
Courtney
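The index layout described in this post (term as row key, numeric rank as column name, doc ID as value) can be modelled in a few lines of Python. This is an in-memory sketch only: the `index` dict and the `add_posting` and `top_docs` helpers are hypothetical, and it assumes that a lower rank number means a more relevant document, since columns sorted as longs come back smallest first.

```python
# Stand-in for the index column family: row key = term,
# columns = {rank (int, sorted like LongType): doc id}.
index = {}


def add_posting(term, rank, doc_id):
    """Write one column: name = rank, value = doc id."""
    index.setdefault(term, {})[rank] = doc_id


def top_docs(term, x):
    """Read the first x columns of the term's row, i.e. the x lowest rank numbers."""
    row = index.get(term, {})
    return [row[rank] for rank in sorted(row)[:x]]
```

One hole this makes visible: column names are unique within a row, so two docs given the same rank for a term would overwrite each other. A real design would presumably need the rank combined with something unique, such as the doc ID, in the column name.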