Re: Cassandra Demo/Tutorial Applications
There are several large data sets on the net you could use to build a demo with: search logs, Wikipedia, UK govt data. DBpedia may be interesting as they have some of the data extracted out already. --- Sent from my phone Ian Holsman - 703 879-3128 On 13/03/2010, at 4:46 PM, Jonathan Ellis wrote: On Fri, Mar 12, 2010 at 1:55 PM, Krishna Sankar wrote: I was looking at this from CASSANDRA-873 as well as hands-on homework (!) for my OSCON tutorial. Have a couple of questions. Would appreciate insights: A) Cassandra-873 suggests Lucandra as one demo application B) Are there other ideas that will bring out the various aspects of Cassandra ? multi-user blog (single-user is too easy :) - extra credit: with full-text search using Lucandra; discussion forum - also w/ FTS C) What would be the goal of demo apps ? Tutorial to help folks learn the ins and outs of Cassandra ? Showcase capabilities ? I think Cassandra-873 belongs to the latter; Twissandra most probably belongs to the former. I think you nailed it. D) Hadoop on Cassandra might be a good demo/tutorial Sure, I'll buy that. I can't think of any standalone projects for that, but "compute a twissandra tag cloud" would be pretty cool. (Might need to write a twissandra bot to load stuff in to make an interesting cloud. :) E) How would one structure the infrastructure for the demo/tutorials ? What assumptions can we make in creating them ? As AMIs to be run in EC2 ? I'd probably go with "virtualbox images" as being simpler for people who don't have an AWS key already. (VB can read vmware player images, I think. But there is no free vmware for OS X, so you'd want to check that before going w/ vmware format.) Or just have people d/l cassandra and a configuration xml. Probably easier than teaching people to use virtualbox who haven't before. Also to be run on 2-3 local machines for folks who can spare some ? Or as multiple processes - all in one machine ? You're not going to have time to teach cluster management. Keep it to 1.
Re: finding Cassandra servers
+1 on erics comments We could create a branch or git fork where you guys could develop it, and if it reaches a usable state and others find it interesting it could get integrated in then On 3/3/10, Eric Evans wrote: > On Wed, 2010-03-03 at 10:05 -0600, Ted Zlatanov wrote: >> I can do a patch+ticket for this in the core, making it optional and >> off by default, or do the same for a contrib/ service as you >> suggested. So I'd appreciate a +1/-1 quick vote on whether this can >> go in the core to save me from rewriting the patch later. > > I don't think voting is going to help. Voting doesn't do anything to > develop consensus and it seems pretty clear that no consensus exists > here. > > It's entirely possible that you've identified a problem that others > can't see, or haven't yet encountered. I don't see it, but then maybe > I'm just thick. > > Either way, if you think this is important, the onus is on you to > demonstrate the merit of your idea and contrib/ or a github project is > one way to do that (the latter has the advantage of not needing to rely > on anyone else). > > > -- > Eric Evans > eev...@rackspace.com > > -- Sent from my mobile device
Re: Cassandra News Page
Hi Sal. we'll be moving off the incubator site shortly. we'll address that when we go to cassandra.apache.org regards Ian On Feb 18, 2010, at 4:06 PM, Sal Fuentes wrote: > This is just a thought, but I think some type of *latest news* page would be > nice to have on the main site (http://incubator.apache.org/cassandra/) even > if its a bit outdated. Not sure if this has been previously considered. > > -- > Salvador Fuentes Jr. -- Ian Holsman i...@holsman.net
Re: Scalable data model for a Metadata database
Hi Jared. you might want to look at graph databases (HyperGraphDB or neo4j for example) for use cases like this. what it seems like you are asking for is a semantic knowledge base ala freebase.com. Tools like Protégé (protege.stanford.edu/) and Gremlin (gremlin.tinkerpop.com) are helpful for this kind of thing as well. the other issue you are going to encounter is when you want to link up 2 things, for example marriage: find all people whose sex == ‘male’ and age >= 20 and age <= 29 and who are married to someone called Michelle who is older than 27. HTH Ian On Feb 10, 2010, at 3:51 AM, Jared Winick wrote: > Thanks for the specific suggestions Jonathan, I really appreciate it. > > On Tue, Feb 9, 2010 at 9:37 AM, Jonathan Ellis wrote: >> On Tue, Feb 9, 2010 at 10:01 AM, Jared Winick wrote: >>> Somehow I need to partition the data better. Would a recommendation >>> be to “split” the “sex” key into multiple keys? For example I could >>> append the year and month to the key (“sex_022010”) to partition the >>> data by the month it was inserted. >> >> That's one possibility. Another would be to kill two birds with one >> stone and add the age to that key, so you'd have male_20 (probably >> better: male_1990), etc. >> >> Fundamentally TANSTAAFL and if you need to scale queries w/ lots of >> criteria like this you will have to choose (sometimes from more than >> one of) these options: >> >> - have a lot of machines so you can parallelize brute force queries, >> e.g. w/ Hadoop >> - precompute specific "indexes" like sex_birthdate above >> - note, with supercolumns you can also materialize the whole >> "person" in subcolumns, rather than doing an extra lookup for each >> index hit >> - use less-specific indexes (e.g. separate sex & birthdate indexes to >> continue the example) and do more work on the client >> >> -Jonathan >> -- Ian Holsman i...@holsman.net
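[Editor's note] A minimal sketch of the "precompute specific indexes" option above, written against the 0.5-era Thrift-generated Python bindings (the same style of client shown elsewhere in this archive). The keyspace 'Keyspace1', the column families 'People' and 'PeopleIndex', and the helper functions are invented names for illustration, not anything from the thread:

import time
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra
from cassandra.ttypes import (ColumnPath, ColumnParent, SlicePredicate,
                              SliceRange, ConsistencyLevel)

# Plain Thrift connection to a single node (default RPC port 9160).
transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9160))
client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
transport.open()

KEYSPACE = 'Keyspace1'                      # hypothetical keyspace / CF names
now = lambda: int(time.time() * 1e6)        # i64 timestamp in microseconds

def add_person(person_id, sex, birth_year, attrs):
    # 1. Store the person's attributes (plain strings) in their own row.
    for name, value in attrs.items():
        client.insert(KEYSPACE, person_id,
                      ColumnPath('People', None, name),
                      value, now(), ConsistencyLevel.ONE)
    # 2. Maintain a precomputed index row such as "male_1990":
    #    column name = person key, column value unused.
    index_key = '%s_%d' % (sex, birth_year)
    client.insert(KEYSPACE, index_key,
                  ColumnPath('PeopleIndex', None, person_id),
                  '', now(), ConsistencyLevel.ONE)

def people_matching(sex, birth_year, limit=1000):
    # Reading the index row back returns the keys of every matching person.
    predicate = SlicePredicate(slice_range=SliceRange('', '', False, limit))
    cols = client.get_slice(KEYSPACE, '%s_%d' % (sex, birth_year),
                            ColumnParent('PeopleIndex', None),
                            predicate, ConsistencyLevel.ONE)
    return [c.column.name for c in cols]

The trade-off Jonathan describes shows up directly: every criterion you want to query cheaply means another index row that has to be kept up to date on every write.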
Re: Cassandra versus HBase performance study
Hi Brian. was there any performance changes on the other tests with v0.5 ? the graphs on the other pages looks remarkably identical. On Feb 4, 2010, at 11:45 AM, Brian Frank Cooper wrote: > 0.5 does seem to be significantly faster - the latency is better and it > provides significantly more throughput. I'm updating my charts with new > values now. > > One thing that is puzzling is the scan performance. The scan experiment is to > scan between 1-100 records on each request. My 6 node Cassandra cluster is > only getting up to about 230 operations/sec, compared to >1400 ops/sec for > other systems. The latency is quite a bit higher. A chart with these results > is here: > > http://www.brianfrankcooper.net/pubs/scans.png > > Is this the expected performance? I'm using the OrderPreservingPartitioner > with InitialToken values that should evenly partition the data (and the > amount of data in /var/cassandra/data is about the same on all servers). I'm > using get_range_slice() from Java (code snippet below). > > At the max throughput (230 ops/sec), when latency is over 1.2 sec, CPU usage > varies from ~5% to ~72% on different boxes. Disk busy varies from 60% to 90% > (and the machine with the busiest disk is not the one with highest CPU > usage.) Network utilization (eth0 %util both in and out) varies from 15%-40% > on different boxes. So clearly there is some imbalance (and the workload > itself is skewed via a Zipfian distribution) but I'm surprised that the > latencies are so high even in this case. > > Code snippet - fields is a Set listing the columns I want; > recordcount is the number of records to return. > > SlicePredicate predicate; > if (fields==null) > { > predicate = new SlicePredicate(null,new SliceRange(new byte[0], new > byte[0],false,100)); > } > else > { > Vector fieldlist=new Vector(); > for (String s : fields) > { > fieldlist.add(s.getBytes("UTF-8")); > } > predicate = new SlicePredicate(fieldlist,null); > } > ColumnParent parent = new ColumnParent("data", null); > > List results = > client.get_range_slice(table,parent,predicate,startkey,"",recordcount,ConsistencyLevel.ONE); > > Thanks! > > Brian > > > From: Brian Frank Cooper > Sent: Saturday, January 30, 2010 7:56 AM > To: cassandra-user@incubator.apache.org > Subject: RE: Cassandra versus HBase performance study > > Good idea, we'll benchmark 0.5 next. > > brian > > -Original Message- > From: Jonathan Ellis [mailto:jbel...@gmail.com] > Sent: Friday, January 29, 2010 1:13 PM > To: cassandra-user@incubator.apache.org > Subject: Re: Cassandra versus HBase performance study > > Thanks for posting your results; it is an interesting read and we are > pleased to beat HBase in most workloads. :) > > Since you originally benchmarked 0.4.2, you might be interested in the > speed gains in 0.5. A couple graphs here: > http://spyced.blogspot.com/2010/01/cassandra-05.html > > 0.6 (beta in a few weeks?) is looking even better. :) > > -Jonathan -- Ian Holsman i...@holsman.net
Re: [VOTE] Graduation
+1. On Jan 26, 2010, at 8:11 AM, Eric Evans wrote: > > There was some additional discussion[1] concerning Cassandra's > graduation on the incubator list, and as a result we've altered the > initial resolution to expand the size of the PMC by three to include our > active mentors (new draft attached). > > I propose a vote for Cassandra's graduation to a top-level project. > > We'll leave this open for 72 hours, and assuming it passes, we can then > take it to a vote with the Incubator PMC. > > +1 from me! > > > [1] http://thread.gmane.org/gmane.comp.apache.incubator.general/24427 > > -- > Eric Evans > eev...@rackspace.com > -- Ian Holsman i...@holsman.net
Re: Data Model Index Text
Hi ML. this sounds more like a job for SOLR, but if you want to do this with cassandra, you should look at Jake's Lucandra http://github.com/tjake/Lucandra you should also look at http://nicklothian.com/blog/2009/10/27/solr-cassandra-solandra/ I wouldn't recommend you building your own IR engine, just use one of the ones out there. regards Ian On Jan 9, 2010, at 9:12 AM, ML_Seda wrote: > > Hey, > > I've been reading up on the Cassandra data model a bit, and would like to > get some input from this forum on different techniques for a particular > problem. > > Assume I need to index millions of text docs (e.g. research papers), and > allow the ability to query them by a given word inside or around any of the > indexed docs. meaning if i search for terms i would like to get a list of > docs in which these terms show up (e.g. Michael Jordan = Michael is the main > term, and Jordan is next term n1. The same can be applied by indicating > previous terms to Michael) > > How do I model this in Cassandra? > > Would my Keys be a concat of the middle term + docid? Will I be able to do > queries by wildcarding the docid? > > Thanks. > -- > View this message in context: > http://n2.nabble.com/Data-Model-Index-Text-tp4275199p4275199.html > Sent from the cassandra-user@incubator.apache.org mailing list archive at > Nabble.com. -- Ian Holsman i...@holsman.net
Re: Advise for choice
Things positive for Solr:
- mature and stable
- lots of documentation
- a swiss army knife that can be used for a LOT of things, especially if you are manipulating a lot of text
- the query language is easier to use (imho.. but I've been using Solr for years, so I am biased)
- lots of people know it
- fast caching
- faceting
Cons for Solr:
- hard to update a single field (you need to fetch & re-insert the entire row)
- commits/optimizes can slow things down to a crawl
- can't store structured data easily (for example a blog post has tags which have both a key and a value)
- scalability isn't as easy as with Cassandra. sharding works, but it requires a lot of manual effort
- it's easy to get started and get something running, but if you need to do something out of the ordinary, it gets hard fast. I think Cassandra is more flexible for ordinary things that don't involve text-matching.
- replication isn't instant (this is changing.. also look at Zoie, which may help)
Of course, if you tell us what you're trying to do, I can be more specific. FWIW.. we use Solr for some of our news content (see love.com and newsrunner.com) and it works fast enough for us. We have an incoming doc rate of about 8-10 news articles/second. On Jan 8, 2010, at 5:43 AM, Nathan McCall wrote: > Agreed that there is not much to go on here in the original question. > I will say that we very recently found a good fit with Solr and > Cassandra in how we deal with a very heavy write volume of news > article data. Cassandra is excellent with write throughput and high > availability, but our search use cases are with time-dependent news > content, so we need lots of term proximity, faceting and ordering > functionality. > > We probably could store everything in Solr, but the above approach > will allow us to make articles immediately available in a > fault-tolerant manner while being able to efficiently send batches at > regular intervals to Solr and therefore scale out our ingestion of > news articles a little smoother. Full disclosure: I am still getting > my head around the innards of Solr replication and clustering, but so > far I feel like we made a good choice. > > Hopefully the above will be helpful to folks during their evaluations. > > Cheers, > -Nate > > > On Thu, Jan 7, 2010 at 10:02 AM, Joseph Bowman > wrote: >> I have to agree with Tatu. If you're struggling to find reasons to validate >> that Cassandra is the better choice for your task than Solr, then perhaps >> Solr is the correct choice. I kind of went through the same thing recently, >> struggled to make Cassandra fit what I was doing, then realized I was doing >> it wrong and moved to MongoDB. >> Cassandra is great at what it tries to accomplish, which is managing >> gigantic datasets in a distributed way. The question is, is that really what >> you need? >> >> On Thu, Jan 7, 2010 at 12:58 PM, Tatu Saloranta >> wrote: >>> >>> On Thu, Jan 7, 2010 at 3:16 AM, Richard Grossman >>> wrote: >>>> Hi, >>>> >>>> This message is a little different than a support question. >>>> I'm confronted with a problem where people want to replace Cassandra with a >>>> Solr >>>> server. I really think that our problem is a great case for Cassandra, but I >>>> need more arguments. >>>> >>>> So please, if you've some time, just put some ideas on why to use Cassandra >>>> instead of Solr. >>> >>> A solution is generally applicable to a problem... so what is the (main) use >>> case? >>> >>> That would make it easier to find arguments for or against the proposed >>> solution.
>>> >>> -+ Tatu +- >> >> -- Ian Holsman i...@holsman.net
FWD: [protobuf] Captain Proto -- A Protobuf RPC system using capability-based security
There was a discussion about authorization for cassandra a while back. I thought this may be of interest. yes it is based on protobuf, but it should work equally as well with Thrift if someone was eager enough I would think. Regards Ian Original Message Subject:[protobuf] Captain Proto -- A Protobuf RPC system using capability-based security Date: Sun, 13 Dec 2009 03:18:54 -0800 From: Kenton Varda To: Protocol Buffers Hi all, As I've mentioned a couple times in other threads, last weekend I wrote up a simple RPC system based on Protocol Buffers which treats services as capabilities, in the sense of capability-based security. http://en.wikipedia.org/wiki/Capability-based_security Essentially what this means is that you can construct a service implementation and then embed a reference to it into an RPC message sent to or from some other service. So, for instance, if a client wants a server to be able to make calls back to the client, it can simply send the server a reference to a service implemented by the client. Or, for another example, a service which acts as a resource broker could grant a client access to a particular resource by sending it a reference to a service object representing that resource, to which the client can then make calls. Note that a particular service object cannot be accessed over a particular connection until that service object has actually been sent in an RPC over that connection. This property is useful for security, as described in the above link. In any case, the project is called Captain Proto and can be found here: http://code.google.com/p/capnproto/ Currently it only has Java support, though I hope it will eventually support other languages as well. The wire protocol is itself defined in terms of protocol buffers: http://code.google.com/p/capnproto/source/browse/proto/capnproto.proto There is basic documentation here: http://capnproto.googlecode.com/hg/doc/java/index.html You can also look at the test for an example: http://code.google.com/p/capnproto/source/browse/java/test.proto http://code.google.com/p/capnproto/source/browse/java/Test.java I expect the API to change quite a bit, so be warned that if you write code based on it, that code will have to change at some point. Future hopes/plans: - Improve API by taking advantage of code generator plugins. - Define a standard "ServiceDirectory" service which can serve as the default service on servers that export multiple services. The directory would have a method like Open() which takes the name of some particular service and returns a reference to the corresponding service object. - Provide a library of capability design pattern implementations, e.g. the revocable membrane. - Define a capnproto-over-HTTP protocol which can be used by AJAX clients. - Support C++ and Python. For the time being, this is not an official Google project. It's just something I wrote for fun -- or, more accurately, to support some other fun stuff that I want to work on. That said, due to the obviously wide applicability, I might try to make it more official at some point. -- You received this message because you are subscribed to the Google Groups "Protocol Buffers" group. To post to this group, send email to proto...@googlegroups.com . To unsubscribe from this group, send email to protobuf+unsubscr...@googlegroups.com . For more options, visit this group at http://groups.google.com/group/protobuf?hl=en . -- Ian Holsman i...@holsman.net
Re: Is Cassandra suitable for multi criteria search engine
Hi David. 3 million is a good size. I would say it is a 'medium' but it really depends on a lot of factors, and what exactly you are indexing. as a rule of thumb if you can fit the index in memory you'll be fine. It also depends on how much of a long tail you have, how often you update the index (each commit clears the caches) and how complex your queries are. I've found the number of commits plays a bigger part than the physical size. You should get a full size index up and benchmark it, in normal operation to be sure. you can also install Solr in 'distributed' mode, which lets you scale it out further. On Dec 19, 2009, at 12:30 AM, David MARTIN wrote: > Is a 3 million records set not a big deal for Solr? If I consider > about 30 properties per item, I have to give Solr 90 millions > properties to consider. Is that volume still correct for such a > solution? > > And regarding lucene on top of Cassandra, can people share their feed > back, if any, about such a solution. Pros & cons vs Solr for instance. > > Thank you. > > > 2009/12/17, Jake Luciani : >> True replication and scale. >> >> On Dec 17, 2009, at 4:56 PM, Josh wrote: >> >>> I've used solr a bunch (And I'd cosign gabriel: Solr's fantastic) and >>> I'm trying to work my head around Cassandra, but I'm really hazy on >>> what the Cassandra+Lucene combo gives you. What are you trying to >>> accomplish? (Meant earnestly: I'm really curious) >>> >>> josh >>> @schulz >>> http://schulzone.org >>> >>> >>> On Thu, Dec 17, 2009 at 2:52 PM, Jake Luciani >>> wrote: >>>> You can also put lucene on top of Cassandra by using. >>>> >>>> http://github.com/tjake/Lucandra >>>> >>>> On Dec 17, 2009, at 4:43 PM, gabriele renzi >>>> wrote: >>>> >>>>> On Thu, Dec 17, 2009 at 7:48 PM, David MARTIN >>>>> >>>>> wrote: >>>>>> >>>>>> Hi, >>>>>> That's what I was thinking. And I'm glad to read Apache solr in >>>>>> your >>>>>> answer as it is one of my main leads. >>>>> >>>>> as a happy solr user, I second the suggestion, lucene (the >>>>> technology >>>>> behind solr) handles a number of documents like that without a >>>>> sweat, >>>>> and solr gives your replication and a few other good things. >>>> >> -- Ian Holsman i...@holsman.net
Re: read latency creaping up
can you make it so that the client restarts the connection every 30m or so ? It could be an issue in thrift or something with long-lived connections. On Dec 15, 2009, at 10:16 AM, Brian Burruss wrote: > i agree. i don't know anything about thrift, and i don't know how it keeps > connections open or manages resources from a client or server perspective, > but this situation suggests that maybe killing the clients is forcing the > server to free something. > > how's that sound :) > > > From: Jonathan Ellis [jbel...@gmail.com] > Sent: Monday, December 14, 2009 3:12 PM > To: cassandra-user@incubator.apache.org > Subject: Re: read latency creaping up > > hmm, me neither > > but, I can't think how restarting the client would, either :) > > On Mon, Dec 14, 2009 at 4:59 PM, Brian Burruss wrote: >> Well not sure how that would affect he latency as reported by the Cassandra >> server using nodeprobe cfstats >> >> Jonathan Ellis wrote: >> >> >> possibly the clients are running into memory pressure? >> >> On Mon, Dec 14, 2009 at 4:27 PM, Brian Burruss wrote: >>> thx, i'm actually the "B. Todd Burruss" in that thread .. we changed our >>> email system and well now, i'm just Brian .. long story. >>> >>> anyway, in this case it isn't compaction pendings as i can kill the clients >>> and immediately restart and the latency is back to a reasonable number. >>> i'm still investigating. >>> >>> thx! >>> >>> From: Eric Evans [eev...@rackspace.com] >>> Sent: Monday, December 14, 2009 8:23 AM >>> To: cassandra-user@incubator.apache.org >>> Subject: RE: read latency creaping up >>> >>> On Sun, 2009-12-13 at 13:18 -0800, Brian Burruss wrote: >>>> if this isn't a known issue, lemme do some more investigating. my >>>> test client becomes "more random" with reads as time progresses, so >>>> possibly this is what causes the latency issue. however, all that >>>> being said, the performance really becomes bad after a while. >>> >>> Have a look at the following thread: >>> >>> http://thread.gmane.org/gmane.comp.db.cassandra.user/1402 >>> >>> >>> -- >>> Eric Evans >>> eev...@rackspace.com >>> >>> >> -- Ian Holsman i...@holsman.net
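[Editor's note] A rough sketch of the suggestion above, assuming a Python Thrift client and the standard Thrift transport classes; the host, port, and 30-minute interval are placeholders. Wrapping the connection so it is torn down and reopened periodically makes it easy to test whether long-lived connections are what is degrading:

import time
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra   # Thrift-generated bindings

class RecyclingClient(object):
    """Hands out a Cassandra.Client, reconnecting every max_age seconds."""

    def __init__(self, host='localhost', port=9160, max_age=30 * 60):
        self.host, self.port, self.max_age = host, port, max_age
        self.transport = None
        self.opened_at = 0

    def client(self):
        if self.transport is None or time.time() - self.opened_at > self.max_age:
            if self.transport is not None:
                self.transport.close()          # drop the stale connection
            self.transport = TTransport.TBufferedTransport(
                TSocket.TSocket(self.host, self.port))
            self.transport.open()
            self.opened_at = time.time()
            self._client = Cassandra.Client(
                TBinaryProtocol.TBinaryProtocol(self.transport))
        return self._client

# usage: conn = RecyclingClient(); conn.client().get_slice(...)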
Re: Cassandra vs HBase
This is slightly off-topic. There is a recent project called Hadoop Online (HOP) on Google Code that promises an online/continuous query ability on top of Hadoop, which should allow for near-real-time activities instead of the batch stuff that MapReduce does. --- Sent from my phone Ian Holsman - 703 879-3128 On 06/12/2009, at 3:12 PM, Joseph Bowman wrote: When I wrote my Why Cassandra article, I didn't get into why I didn't choose X platform because I didn't want to start a flame war by doing comparisons. For HBase, the primary reason I didn't choose it is that while there were benchmarks of what it could theoretically do, there weren't any real-world deployments proving it. My experience as a systems administrator is that it's best to go with a product that's been proven over time in real-world scenarios. I'll add to this though, that nothing NoSQL, even Cassandra, has reached the point where I feel it's a no-brainer to choose it over anything, including SQL-based solutions like MySQL and Oracle. It really comes down to your requirements. On Sat, Dec 5, 2009 at 11:04 PM, Matt Revelle wrote: On Dec 5, 2009, at 21:45, Joe Stump wrote: On Dec 5, 2009, at 7:41 PM, Bill Hastings wrote: [Is] HBase used for real timish applications and if so any ideas what the largest deployment is. I don't know of anyone off the top of my head who's using anything built on top of Hadoop for a real-time environment. Hadoop just wasn't built for that. It was built, like MapReduce, for crunching absurd amounts of data across hundreds of nodes in a "reasonable" amount of time. Just my $0.02. --Joe While Hadoop MapReduce isn't meant for realtime use, HBase can handle it. Over last summer there were some benchmarks included in HBase/Hadoop presentations that showed, IIRC, performance comparable to Cassandra.
Re: Persistently increasing read latency
hmm. doesn't that leave the trunk in a bad position in terms of new development? you may go through times when a major feature lands and trunk is broken/buggy. or are you planning on building new features on a branch and then merging into trunk when it's stable? On Dec 3, 2009, at 5:32 AM, Jonathan Ellis wrote: > We are using trunk. 0.5 beta / trunk is better than 0.4 at the 0.4 > functionality and IMO is production ready (although you should always > test first), but I would not yet rely on the new stuff (bootstrap, > loadbalance, and moving nodes around in general). > > -Jonathan > > On Wed, Dec 2, 2009 at 12:26 PM, Adam Fisk wrote: >> Helpful thread guys. In general, Jonathan, would you recommend >> building from trunk for new deployments at our current snapshot in >> time? Are you using trunk at Rackspace? >> >> Thanks. >> >> -Adam >> >> >> On Tue, Dec 1, 2009 at 6:18 PM, Jonathan Ellis wrote: >>> On Tue, Dec 1, 2009 at 7:31 PM, Freeman, Tim wrote: >>>> Looking at the Cassandra mbean's, the attributes of ROW-MUTATION-STAGE and >>>> ROW-READ-STAGE and RESPONSE-STAGE are all less than 10. >>>> MINOR-COMPACTION-POOL reports 1218 pending tasks. >>> >>> That's probably the culprit right there. Something is wrong if you >>> have 1200 pending compactions. >>> >>> This is something that upgrading to trunk will help with right away >>> since we parallelize compactions there. >>> >>> Another thing you can do is increase the memtable limits so you are >>> not flushing + compacting so often with your insert traffic. >>> >>> -Jonathan >>> >> >> >> >> -- >> Adam Fisk >> http://www.littleshoot.org | http://adamfisk.wordpress.com | >> http://twitter.com/adamfisk >> -- Ian Holsman i...@holsman.net
Re: Wish list [from "users survey" thread]
well. I'd like to see how many times a specific user hits the site, without having to add them up every time. On Nov 24, 2009, at 9:47 AM, Ted Zlatanov wrote: > On Mon, 23 Nov 2009 13:45:09 -0600 Jonathan Ellis wrote: > > JE> 1. Increment/decrement: "atomic" is a dirty word in a system > JE> emphasizing availability, but incr/decr can be provided in an > JE> "eventually consistent" manner with vector clocks. There are other > JE> possible approaches but this is probably the best fit for us. We'd > JE> want to allow ColumnFamilies with either traditional (for Cassandra) > JE> long timestamps, or vector clocks, but not mixed. The bad news is, > JE> this is a very substantial change and will probably not be in 0.9 > JE> unless someone steps up to do the work. (This would also cover > JE> "flexible conflict resolution," which came up as well.) > > Just for my benefit, can someone explain the reasons why atomic inc/dec > are needed inside Cassandra if 64-bit time stamps and UUIDs are > available? I have not needed them in my usage but am curious about > other schemas that do. > > Thanks > Ted > -- Ian Holsman i...@holsman.net
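[Editor's note] To make the hit-counting example concrete: without an increment primitive, a counter is typically modeled as one column per hit, and getting the total means counting columns on every read, which is exactly the "add them up every time" cost. A sketch against the 0.5-era Thrift API; the keyspace and column family names are invented, and get_count is assumed to be available in that interface:

import time, uuid
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra
from cassandra.ttypes import ColumnPath, ColumnParent, ConsistencyLevel

transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9160))
client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
transport.open()

KEYSPACE = 'Keyspace1'   # hypothetical keyspace / column family names

def record_hit(user_id):
    # One column per hit; a UUID column name keeps writes from colliding.
    client.insert(KEYSPACE, user_id,
                  ColumnPath('Hits', None, uuid.uuid1().bytes),
                  '', int(time.time() * 1e6), ConsistencyLevel.ONE)

def hit_count(user_id):
    # "Adding them up every time": the server counts the columns in the row.
    return client.get_count(KEYSPACE, user_id,
                            ColumnParent('Hits', None),
                            ConsistencyLevel.ONE)

A native incr/decr (as proposed in the wish list) would replace both functions with a single write.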
Re: Social network feed/wall question
One of the problems you may face is that the common operation is 'get last X'. You might want to look at Redis as an alternative as it supports this operation natively. I'm sure the Cassandra experts can help with your schema to optimize it as well. --- Sent from my phone Ian Holsman - 703 879-3128 On 23/11/2009, at 9:55 AM, Kristian Lunde wrote: I am currently building a social network application where one of the important features is a feed / wall (Something similar to the Facebook wall). We will have several feeds, one for each profile and one for each group and so on. I have looked into using Cassandra for storing this, but I am not sure I am on the right track regarding my "schema". My thoughts were that the schema would be similar to this:
Feed [SuperColumn]
- Row [user id as identifier]
- [Columns]: type, timestamp, message, url
Each user would have his own feed super column and store all feed items related to him in this super column. I am not sure this is the best idea, since it creates an insane amount of writes whenever someone writes to their wall (this will have to write to the feeds of all his friends). Also I read in this thread http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg00360.html that super columns are not suited for > 60k rows in a super column. What would be the optimal way of storing a set of feeds in cassandra? Thanks Kristian
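[Editor's note] For the "get last X" pattern in Cassandra itself, one common shape is a row per user whose column names sort by time, so the newest X items come back from a single reversed slice. A sketch under these assumptions: 0.5-era Thrift Python bindings, invented keyspace/column family names, and a 'Feed' column family configured with a time-ordered comparator such as LongType:

import time, struct
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra
from cassandra.ttypes import (ColumnPath, ColumnParent, SlicePredicate,
                              SliceRange, ConsistencyLevel)

transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9160))
client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
transport.open()

KEYSPACE = 'Keyspace1'        # hypothetical keyspace / column family names

def post_to_feed(user_id, message):
    # Column name = big-endian microsecond timestamp, so columns sort by time
    # (assumes the 'Feed' column family uses a LongType comparator).
    ts = int(time.time() * 1e6)
    client.insert(KEYSPACE, user_id,
                  ColumnPath('Feed', None, struct.pack('>q', ts)),
                  message, ts, ConsistencyLevel.ONE)

def last_n(user_id, n=20):
    # reversed=True walks columns newest-first: "get last X" in a single call.
    predicate = SlicePredicate(slice_range=SliceRange('', '', True, n))
    cols = client.get_slice(KEYSPACE, user_id, ColumnParent('Feed', None),
                            predicate, ConsistencyLevel.ONE)
    return [c.column.value for c in cols]

The fan-out on write that Kristian worries about (copying each post into every friend's feed row) is unchanged; this sketch only covers the read side.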
Re: Cassandra users survey
--- Sent from my phone Ian Holsman - 703 879-3128 On 21/11/2009, at 12:38 PM, Dan Di Spaltro wrote: At Cloudkick we are using Cassandra to store monitoring statistics and running analytics over the data. I would love to share some ideas about how we set up our data-model, if anyone is interested. This isn't the right thread to do it in, but I think it would be useful to show how we store billions of points of data in Cassandra (and maybe get some feedback). Wishlist -remove_slice_range -auto loadbalancing -inc/dev On Fri, Nov 20, 2009 at 1:17 PM, Jonathan Ellis wrote: Hi all, I'd love to get a better feel for who is using Cassandra and what kind of applications it is seeing. If you are using Cassandra, could you share what you're using it for and what stage you are at with it (evaluation / testing / production)? Also, what alternatives you evaluated/are evaluating would be useful. Finally, feel free to throw in "I'd love to use Cassandra if only it did X" wishes. :) I can start: Rackspace is using Cassandra for stats collection (testing, almost production) and as a backend for the Mail & Apps division (early testing). We evaluated HBase, Hypertable, dynomite, and Voldemort as well. Thanks, -Jonathan (If you're in stealth mode or don't want to say anything in public, feel free to reply to me privately and I will keep it off the record.) -- Dan Di Spaltro
Re: Cassandra users survey
We're looking at it to be part of a near real time Web analytics engine, which sounds similar to Ooyala. at the moment I'm pushing to get the thing open sourced if possible. we're looking at combining Cassandra + Esper, but we are still in the very early stages. On Nov 21, 2009, at 8:17 AM, Jonathan Ellis wrote: > Hi all, > > I'd love to get a better feel for who is using Cassandra and what kind > of applications it is seeing. If you are using Cassandra, could you > share what you're using it for and what stage you are at with it > (evaluation / testing / production)? Also, what alternatives you > evaluated/are evaluating would be useful. Finally, feel free to throw > in "I'd love to use Cassandra if only it did X" wishes. :) > > I can start: Rackspace is using Cassandra for stats collection > (testing, almost production) and as a backend for the Mail & Apps > division (early testing). We evaluated HBase, Hypertable, dynomite, > and Voldemort as well. > > Thanks, > > -Jonathan > > (If you're in stealth mode or don't want to say anything in public, > feel free to reply to me privately and I will keep it off the record.) -- Ian Holsman i...@holsman.net
Re: Meetup?
I'm in Melbourne, and frequently in DC and NY as well. On Nov 13, 2009, at 11:18 AM, Nick Lothian wrote: Where in Australia are you from? (Adelaide here) I might be interested in one down here. From: Chris Were [mailto:chris.w...@gmail.com] Sent: Friday, 13 November 2009 9:09 AM To: cassandra-user@incubator.apache.org Subject: OT: Meetup? Hi, I'm from Australia, but currently in SF for the next 2 weeks working on a startup. If any cassandra users want to meet up to discuss cassandra or any other tech, shoot me an email. Cheers, Chris -- Ian Holsman i...@holsman.net
Re: bandwidth limiting Cassandra's replication and access control
service layer, the java security manager isn’t going to suffice. What this snippet could do, though, and may be the rationale for the request, is to ensure that unauthorized users cannot instantiate a new Cassandra server. However, if a user has physical access to the machine on which Cassandra is installed, they could easily bypass that layer of security. What if Cassandra IS the application you're exposing? Imagine a large company that creates one large internal Cassandra deployment, and has multiple departments it wants to create separate keyspaces for. You can do that now, but there's nothing except a gentlemen's agreement to prevent one department from trashing another department's keyspace, and accidents do happen. You can front the service with some kind of application layer, but then you have another API to maintain, and you'll lose some performance this way. -Brandon -- Ian Holsman i...@holsman.net
using cassandra as a real time DW
hey guys. I was wondering if anyone is thinking of/is using cassandra to power a real time data warehouse. if so would you consider collaborating/open sourcing the effort so others could join in. TIA Ian. -- Ian Holsman i...@holsman.net
Re: Got Logo?
Let's go with what we have. We can get it fixed later. Usually groups run competitions and get professional-looking logos then. No need to pay money. On 9/19/09, Matt Kydd wrote: > I think the one up on the site by Makram Saleh is really quite good - > just needs a polish. > > http://issues.apache.org/jira/browse/CASSANDRA-231 > > MK > > 2009/9/19 Bill de hOra : >> David Pollak wrote: >> >>> I'll be happy to kick in $50 towards a 99Designs bounty for a Cassandra >>> logo >> >> Likewise. >> >> Bill >> > -- Sent from my mobile device
Re: New Features - Future releases
There was mention of lucene integration in the initial FB release. On Sep 18, 2009, at 9:59 PM, Jeffrey Damick wrote: Speaking of lucene, has anyone done any integration with lucene for cassandra or are there plans to provide full-text searches within cassandra? Thanks -jeff On 9/18/09 9:49 PM, "Joe Stump" wrote: On Sep 18, 2009, at 9:46 PM, wrote: Your idea is not bad: having a service layer in front of Cassandra. How about a separate opensource project or a standard/spec for ACL in the service layer? Sure. SOLR is kind of like this for Lucene. --Joe -- Ian Holsman i...@holsman.net
Re: Newbie's question
isn't there a way to use svn:external or svn:link to pull them in from their own repos? (not sure how legal it would be). On Aug 27, 2009, at 10:03 AM, Jonathan Ellis wrote: I thought about that, but I really don't want Cassandra committers to have to be in the business of updating them all when we make changes, and having them in the repo creates that expectation even in contrib. On Wed, Aug 26, 2009 at 6:57 PM, Ian Holsman wrote: would it be worthwhile to start including these clients in the core codebase? in some kind of 'client' or 'contrib' directory? maybe even mentioning the 'popular' clients that people use in the readme (with links to them) would be good. On Aug 27, 2009, at 9:18 AM, Sal Fuentes wrote: Just would like to say great job so far. On Wed, Aug 26, 2009 at 4:01 PM, Ian Eure wrote: On Aug 25, 2009, at 2:46 PM, Drew Schleck wrote: For anyone using my branch of Lazyboy, Ian Eure pulled my work, improved it, and more. You ought to switch back to his version. I'm doing some heavy refactoring all this week, to bring it up to Cassandra trunk and simplify/genericize it wherever possible. I should have something to show in a day or two. Feel free to contact me if you have questions or requests. - Ian -- Salvador Fuentes Jr. 323-540-4SAL -- Ian Holsman i...@holsman.net -- Ian Holsman i...@holsman.net
Re: Newbie's question
would it be worthwhile to start including these clients in the core codebase? in some kind of 'client' or 'contrib' directory? maybe even mentioning the 'popular' clients that people use in the readme (with links to them) would be good. On Aug 27, 2009, at 9:18 AM, Sal Fuentes wrote: Just would like to say great job so far. On Wed, Aug 26, 2009 at 4:01 PM, Ian Eure wrote: On Aug 25, 2009, at 2:46 PM, Drew Schleck wrote: For anyone using my branch of Lazyboy, Ian Eure pulled my work, improved it, and more. You ought to switch back to his version. I'm doing some heavy refactoring all this week, to bring it up to Cassandra trunk and simplify/genericize it wherever possible. I should have something to show in a day or two. Feel free to contact me if you have questions or requests. - Ian -- Salvador Fuentes Jr. 323-540-4SAL -- Ian Holsman i...@holsman.net
Re: Announcing 0.3.0
you need to give a tiny bit of time (say 24 hours) for the mirrors to catch up. On 21/07/2009, at 10:09 AM, Daniel Hengeveld wrote: I clicked on the link in my browser (Safari) - even copied the url and pasted it into the location bar of a new window, and was absolutely *not* greeted with a page of mirrors. Upon your reply, I tried an alternate browser (Firefox) and did get the links. Still doesn't work in Safari. Thanks for the help! ~d On Mon, Jul 20, 2009 at 17:05, Jeff Hodges wrote: Click through. It takes you to a page of mirror links. In the future, much debugging can be done by pointing your browser to the page. -- Jeff On Mon, Jul 20, 2009 at 4:56 PM, Daniel Hengeveld wrote: When I download this file, I get a 5KB file rather than the actual release. Is anyone else having this problem? On Mon, Jul 20, 2009 at 12:57, Eric Evans wrote: It is with great pleasure that I announce the very first release of Apache Cassandra, 0.3.0[1] A projects first release is a significant milestone and one that our burgeoning community should be proud of. Many thanks to everyone that submitted patches and bug reports, helped with testing, documented, organized, or just asked the important questions. Without further ado: The official download: http://www.apache.org/dyn/closer.cgi/incubator/cassandra/0.3.0/apache-cassandra-incubating-0.3.0-bin.tar.gz SVN Tag: https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.3.0-final/ [1] DISCLAIMER: Apache Cassandra is an effort undergoing incubation at The ASF, sponsored by the Apache Incubator Project Management Committee (PMC). Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF. -- Eric Evans eev...@rackspace.com -- ..[daniel hengeveld].. neoglam.com -- ..[daniel hengeveld].. neoglam.com -- Ian Holsman i...@holsman.net
Re: AttributeError: 'str' object has no attribute 'write'
hi Gasol. shouldn't regeneration of the interface be part of the build process? On 20/07/2009, at 3:29 AM, Gasol Wu wrote: hi, the cassandra.thrift has changed. u need to generate new python client and compile class again. On Mon, Jul 20, 2009 at 1:18 AM, wrote: Hi guys the new trunk cassandra doesnt work for a simple insert, how do we get this working client.insert('Table1', 'tofu', 'Super1:Related:tofu stew',pickle.dumps(dict(count=1)), time.time(), 0) --- AttributeErrorTraceback (most recent call last) /home/mark/work/cexperiments/ in () /home/mark/work/common/cassandra/Cassandra.py in insert(self, table, key, column_path, value, timestamp, block_for) 358 - block_for 359 """ --> 360 self.send_insert(table, key, column_path, value, timestamp, block_for) 361 self.recv_insert() 362 /home/mark/work/common/cassandra/Cassandra.py in send_insert(self, table, key, column_path, value, timestamp, block_for) 370 args.timestamp = timestamp 371 args.block_for = block_for --> 372 args.write(self._oprot) 373 self._oprot.writeMessageEnd() 374 self._oprot.trans.flush() /home/mark/work/common/cassandra/Cassandra.py in write(self, oprot) 1923 if self.column_path != None: 1924 oprot.writeFieldBegin('column_path', TType.STRUCT, 3) -> 1925 self.column_path.write(oprot) 1926 oprot.writeFieldEnd() 1927 if self.value != None: AttributeError: 'str' object has no attribute 'write' In [4]: client.insert('Table1', 'tofu', 'Super1:Related:tofu stew',pickle.dumps(dict(count=1)), time.time(), 0) -- Bidegg worlds best auction site http://bidegg.com -- Ian Holsman i...@holsman.net
Re: New cassandra in trunk - breaks python thrift interface (was AttributeError: 'str' object has no attribute 'write')
hi mobile. is it possible to put these as JIRA bugs ? instead of just mailing them on the list ? that way people can give them a bit more attention. and other people who have the same issue will be easily see what is going on. the URL is here :- https://issues.apache.org/jira/browse/CASSANDRA regards Ian On 20/07/2009, at 6:36 AM, mobiledream...@gmail.com wrote: ok so which is the version where cassandra python thrift works out of the box thanks On 7/19/09, Jonathan Ellis wrote: Don't run trunk if you're not going to read "svn log." The api changed with the commit of the 139 patches (and it will change again with the 185 ones). look at interface/cassandra.thrift to see what arguments are expected. On Sun, Jul 19, 2009 at 3:31 PM, wrote: > Hey Gasol wu > i regenerated the new thrift interface using > thrift -gen py cassandra.thrift > > > > client.insert('Table1', 'tofu', 'Super1:Related:tofu stew', > pickle.dumps(dict(count=1)), time.time(), 0) > --- > AttributeErrorTraceback (most recent call last) > > /home/mark/work/cexperiments/ in () > > /home/mark/work/common/cassandra/Cassandra.py in insert(self, table, key, > column_path, value, timestamp, block_for) > 358 - block_for > 359 """ > --> 360 self.send_insert(table, key, column_path, value, timestamp, > block_for) > 361 self.recv_insert() > 362 > > /home/mark/work/common/cassandra/Cassandra.py in send_insert(self, table, > key, column_path, value, timestamp, block_for) > 370 args.timestamp = timestamp > 371 args.block_for = block_for > --> 372 args.write(self._oprot) > 373 self._oprot.writeMessageEnd() > 374 self._oprot.trans.flush() > > /home/mark/work/common/cassandra/Cassandra.py in write(self, oprot) >1923 if self.column_path != None: >1924 oprot.writeFieldBegin('column_path', TType.STRUCT, 3) > -> 1925 self.column_path.write(oprot) >1926 oprot.writeFieldEnd() >1927 if self.value != None: > > AttributeError: 'str' object has no attribute 'write' > > > On Sun, Jul 19, 2009 at 10:29 AM, Gasol Wu wrote: >> >> hi, >> the cassandra.thrift has changed. >> u need to generate new python client and compile class again. 
>> >> >> On Mon, Jul 20, 2009 at 1:18 AM, wrote: >>> >>> Hi guys >>> the new trunk cassandra doesnt work for a simple insert, how do we get >>> this working >>> client.insert('Table1', 'tofu', 'Super1:Related:tofu >>> stew',pickle.dumps(dict(count=1)), time.time(), 0) >>> >>> --- >>> AttributeErrorTraceback (most recent call >>> last) >>> /home/mark/work/cexperiments/ in () >>> /home/mark/work/common/cassandra/Cassandra.py in insert(self, table, key, >>> column_path, value, timestamp, block_for) >>> 358 - block_for >>> 359 """ >>> --> 360 self.send_insert(table, key, column_path, value, timestamp, >>> block_for) >>> 361 self.recv_insert() >>> 362 >>> /home/mark/work/common/cassandra/Cassandra.py in send_insert(self, table, >>> key, column_path, value, timestamp, block_for) >>> 370 args.timestamp = timestamp >>> 371 args.block_for = block_for >>> --> 372 args.write(self._oprot) >>> 373 self._oprot.writeMessageEnd() >>> 374 self._oprot.trans.flush() >>> /home/mark/work/common/cassandra/Cassandra.py in write(self, oprot) >>>1923 if self.column_path != None: >>>1924 oprot.writeFieldBegin('column_path', TType.STRUCT, 3) >>> -> 1925 self.column_path.write(oprot) >>>1926 oprot.writeFieldEnd() >>>1927 if self.value != None: >>> AttributeError: 'str' object has no attribute 'write' >>> In [4]: client.insert('Table1', 'tofu', 'Super1:Related:tofu >>> stew',pickle.dumps(dict(count=1)), time.time(), 0) >>> >>> -- >>> Bidegg worlds best auction site >>> http://bidegg.com >> > > > > -- > Bidegg worlds best auction site > http://bidegg.com > -- Bidegg worlds best auction site http://bidegg.com -- Ian Holsman i...@holsman.net
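[Editor's note] For anyone landing on the same traceback: the failing calls pass the old colon-delimited string ('Super1:Related:tofu stew') where the regenerated bindings now expect a ColumnPath struct. A hedged sketch of the corrected call, assuming the post-CASSANDRA-139 trunk interface in which ColumnPath carries column_family/super_column/column and insert still takes a block_for integer (as the traceback itself shows):

import time, pickle
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra
from cassandra.ttypes import ColumnPath   # regenerated from cassandra.thrift

transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9160))
client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
transport.open()

# Old style -- fails with "'str' object has no attribute 'write'":
#   client.insert('Table1', 'tofu', 'Super1:Related:tofu stew', value, ts, 0)

column_path = ColumnPath(column_family='Super1',
                         super_column='Related',
                         column='tofu stew')
client.insert('Table1', 'tofu', column_path,
              pickle.dumps(dict(count=1)),
              int(time.time() * 1e6),   # i64 timestamp
              0)                        # block_for, as in the original call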
Re: Best way to use a Cassandra Client in a multi-threaded environment?
ugh. if this is a byproduct of thrift, we should have another way of getting to the backend. serialization is *not* a desired feature for most people ;-0 On 16/07/2009, at 11:06 AM, Jonathan Ellis wrote: What I mean is, if you have client.rpc1() it doesn't really matter if you can do client.rpc2() from another thread or not, since it's dumb. :) On Wed, Jul 15, 2009 at 7:41 PM, Ian Holsman wrote: On 16/07/2009, at 10:35 AM, Jonathan Ellis wrote: IIRC thrift makes no effort to generate threadsafe code. which makes sense in an rpc-oriented protocol really. hmm.. not really. you can have a webserver calling a thrift backend quite easily, and then you would have 100+ threads all calling the same code. On Wed, Jul 15, 2009 at 7:25 PM, Joel Meyer wrote: Hello, Are there any recommendations on how to use Cassandra Clients in a multi-threaded front-end application (java)? Is the Client thread- safe or is it best to do a client per thread (or object pool of some sort)? Thanks, Joel -- Ian Holsman i...@holsman.net -- Ian Holsman i...@holsman.net
Re: Best way to use a Cassandra Client in a multi-threaded environment?
On 16/07/2009, at 10:35 AM, Jonathan Ellis wrote: IIRC thrift makes no effort to generate threadsafe code. which makes sense in an rpc-oriented protocol really. hmm.. not really. you can have a webserver calling a thrift backend quite easily, and then you would have 100+ threads all calling the same code. On Wed, Jul 15, 2009 at 7:25 PM, Joel Meyer wrote: Hello, Are there any recommendations on how to use Cassandra Clients in a multi-threaded front-end application (java)? Is the Client thread- safe or is it best to do a client per thread (or object pool of some sort)? Thanks, Joel -- Ian Holsman i...@holsman.net
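[Editor's note] Since the generated Thrift client is not safe to share across threads (you would otherwise have to serialize access behind a lock), the usual answer to Joel's question is one connection and client per thread, or a pool. A minimal per-thread sketch in Python using threading.local; host and port are placeholders and the 'cassandra' package is assumed to be the Thrift-generated bindings. In Java the equivalent would be a ThreadLocal wrapper or a small connection pool.

import threading
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra   # Thrift-generated bindings

_local = threading.local()

def get_client(host='localhost', port=9160):
    # Each thread lazily opens its own socket + Cassandra.Client, so no two
    # threads ever share (and therefore corrupt) one Thrift connection.
    if getattr(_local, 'client', None) is None:
        transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
        transport.open()
        _local.client = Cassandra.Client(
            TBinaryProtocol.TBinaryProtocol(transport))
    return _local.client

# every worker thread just calls get_client() and uses the result as usual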
Re: Non relational db meetup - San Francisco, June 11th
It looks like it is sold-out. On 13/05/2009, at 4:37 PM, Jonas Bonér wrote: 2009/5/12 Jonathan Ellis : That's true, but 100 people is about the largest space you're going to find for free, so past that you'd have to start charging people and worrying about taxes and such. Messy. No worries. That makes sense. Good initiative. Have fun. Maybe next year... :) Hehe. Sounds good. -Jonathan On Tue, May 12, 2009 at 2:02 PM, Jonas Bonér wrote: Great initiative. Just sad that it is not the week before (during JavaOne). Then I think a lot of people (including me) could go. 2009/5/12 Johan Oskarsson : Cassandra will be represented by Avinash Lakshman on a free full day meetup covering "open source, distributed, non relational databases" on June 11th in San Francisco. The idea is that the event will give people interested in this area a great introduction and an easy way to compare the different projects out there as well as the opportunity to discuss them with the developers. Registration The event is free but space is limited, please register if you wish to attend: http://nosql.eventbrite.com/ Preliminary schedule, 2009-06-11 09.45: Doors open 10.00: Intro session (Todd Lipcon, Cloudera) 10.40: Voldemort (Jay Kreps, Linkedin) 11.20: Short break 11.30: Cassandra (Avinash Lakshman, Facebook) 12.10: Free lunch (sponsored by CBSi) 13.10: Dynomite (Cliff Moon, Powerset) 13.50: HBase (Ryan Rawson, Stumbleupon) 14.30: Short break 14.40: Hypertable (Doug Judd, Zvents) 15.20: Panel discussion 16.00: End of meetup, relocate to a pub called Kate O’Brien’s nearby Location Magma room, CBS interactive 235 Second Street San Francisco, CA 94105 Sponsor A big thanks to CBSi for providing the venue and free lunch. /Johan Oskarsson, developer @ last.fm -- Jonas Bonér twitter: @jboner blog:http://jonasboner.com work: http://crisp.se work: http://scalablesolutions.se code: http://github.com/jboner -- Jonas Bonér twitter: @jboner blog:http://jonasboner.com work: http://crisp.se work: http://scalablesolutions.se code: http://github.com/jboner -- Ian Holsman i...@holsman.net