CassandraFS in 1.0?
Hey folks, I am going to start prototyping our media tier using Cassandra as a file system (meaning: upload video/audio/images to a web server, save them in Cassandra, and then stream them out). Has anyone done this before?

I was thinking Brisk's CassandraFS might be a fantastic implementation for this, but then I feel I would need to run another/different Cassandra cluster outside of what our ops folks do with Apache Cassandra 0.8.X.

Am I best to just compress files uploaded to the web server and then start chunking and saving the chunks in rows and columns so the memory issue does not smack me in the face? And use our existing cluster and build it out accordingly? I am sure our ops people would like the command-line aspect of CassandraFS, but I am looking for something that makes the most sense all around.

It seems to me there is a REALLY great thing in CassandraFS and I would love to see it as part of 1.0 =8^), or at a minimum some streamlined implementation to do the same thing. If comparing to HDFS: that is part of the Hadoop project even though Cloudera has a distribution of Hadoop :) so maybe that can work here too _fingers_crossed_ (or mongodb-gridfs). Happy to help, as I am moving down this road in general. Thanks!

/* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop http://www.twitter.com/allthingshadoop */
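The chunking idea above can be sketched as plain splitting/reassembly logic. This is a hypothetical illustration, not any Cassandra client API: `CHUNK_SIZE`, `make_chunks`, and the `chunk:NNNNNNNN` column-name scheme are all assumptions; the actual writes would go through whatever client the cluster uses.

```python
# Hypothetical sketch: split an uploaded file into fixed-size pieces so each
# piece fits comfortably in one column value of a single row per file.
CHUNK_SIZE = 1024 * 1024  # 1 MB per chunk (assumed); tune to avoid memory pressure

def make_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a file's bytes into (column_name, value) pairs for one row."""
    return [
        ("chunk:%08d" % i, data[off:off + chunk_size])
        for i, off in enumerate(range(0, len(data), chunk_size))
    ]

def reassemble(columns):
    """Concatenate chunk values back into the original bytes, in column order."""
    return b"".join(value for _name, value in sorted(columns))
```

Zero-padded chunk names mean a column slice over the row streams the chunks back in order, which is what makes the streaming-out half of the idea work.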
Paging Columns from a Row
What are the best practices here to page and slice columns from a row?

So let's say I have 1,000,000 columns in a row. I read the row but want to have one thread read columns 0 - 9,999, a second thread (actor in my case) read 10,000 - 19,999, and so on, so I can have 100 workers processing 10,000 columns for each of my rows.

If there is no API for this, then is it something I should use a composite key on, and have to populate the rows with a counter? 000:myoriginalcolumnnameX 001:myoriginalcolumnnameY 002:myoriginalcolumnnameZ

Going the composite key route and doing a start/end predicate would work, but then it kind of makes the insertion/load have to go through a single synchronized point to generate the column names... I am not opposed to this but would prefer that both the load of my data and the processing of my data not be bound by any one single lock (even if distributed).

Thanks

/* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */
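The counter-prefix scheme described above can be sketched as computing start/finish column-name bounds per worker. This is a hypothetical sketch: the 6-digit zero-padded prefix, the `:` delimiter, and `slice_predicate` are illustrative assumptions, not an existing API.

```python
# Hypothetical sketch: carve a 1,000,000-column row into 100 contiguous slices
# using zero-padded counter prefixes like "000042:myoriginalcolumnname", so each
# worker issues a column slice bounded by string comparison on the prefix.
TOTAL_COLUMNS = 1_000_000
WORKERS = 100
PER_WORKER = TOTAL_COLUMNS // WORKERS  # 10,000 columns per worker

def slice_predicate(worker_id: int):
    """Return (start, finish) column-name bounds for one worker's slice."""
    start = worker_id * PER_WORKER
    finish = start + PER_WORKER - 1
    # '~' sorts after ASCII letters/digits, so the finish bound covers any
    # original column name that follows the last counter prefix.
    return ("%06d:" % start, "%06d:~" % finish)
```

Each worker then runs an ordinary ranged column slice with its own bounds, so no worker's read overlaps another's.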
Re: Paging Columns from a Row
So I can have one PagedIndex CF that holds a row for each data file I am processing. The columns for that row (in my example) would be X columns, and I can make each column's value be 100 strings that represent keys in another PagedData CF.

That other PagedData CF, for each row, would have 10,000 columns whose values hold my data, which I would loop through, parallelize, and scale on, so I can do this 100 times simultaneously.

This is really awesome because if I have 10 files each with a billion rows and I push them into this pattern, I can scale quite nicely, provided 10,000 is my magic number of columns to page. For 10,000,000,000 rows I would have 10,000 columns in my first PagedIndex CF (each representing 100 PagedData rows that have data); for each of those 100 rows, for each column, I can then pull that row, extracting 10,000 pieces of data to process, 100 at a time on different servers.

Got it, thanks! Awesome!

On Sun, Jun 5, 2011 at 4:36 PM, Jonathan Ellis jbel...@gmail.com wrote: If you need to parallelize (and scale) you need to distribute across multiple rows. One Big Row means all your 100 workers are hammering the same 3 (for instance) replicas at the same time. On Sun, Jun 5, 2011 at 1:43 PM, Joseph Stein crypt...@gmail.com wrote: What is the best practices here to page and slice columns from a row. So lets say I have 1,000,000 columns in a row I read the row but want to have 1 thread read columns 0 - , second thread (actor in my case) 1 - 1 ... and so on so i can have 100 workers processing 10,000 columns for each of my rows. If there is no API for this then is it something I should a composite key on and have to populate the rows with a counter 000:myoriginalcolumnnameX 001:myoriginalcolumnnameY 002:myoriginalcolumnnameZ Going the composite key route and doing a start/end predicate would work but then it kind of makes the insertion/load of this have to go through a single synchronized point to generate the columns names... 
I am not opposed to this but would prefer both the load of my data and processing of my data to not be bound by any 1 single lock (even if distributed). Thanks /* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */ -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com -- /* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */
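The two-level PagedIndex/PagedData layout from the reply above can be sketched as plain dictionaries. All names here (`PAGE_SIZE`, `build_pages`, the `page:NNNNNN` row keys) are illustrative assumptions, standing in for the column-family reads and writes a real client would do.

```python
# Hypothetical sketch of the two-level layout: an index row whose columns name
# data rows, and data rows each holding a fixed-size page of items. Workers can
# then fan out across data rows instead of hammering one big row's replicas.
PAGE_SIZE = 10_000  # the "magic number" of columns per PagedData row

def build_pages(items):
    """Return (index_columns, data_rows).

    index_columns maps page number -> PagedData row key (the PagedIndex row);
    data_rows maps row key -> that page's items (the PagedData rows).
    """
    index, data = {}, {}
    for offset in range(0, len(items), PAGE_SIZE):
        page_no = offset // PAGE_SIZE
        key = "page:%06d" % page_no
        index[page_no] = key
        data[key] = items[offset:offset + PAGE_SIZE]
    return index, data
```

A worker reads one index column, fetches the named PagedData row, and processes its 10,000 items independently of every other worker, which is what lets the load spread across replicas.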
Re: [RELEASE] 0.8.0
Awesome!

On Thu, Jun 2, 2011 at 7:36 PM, Eric Evans eev...@rackspace.com wrote:

I am very pleased to announce the official release of Cassandra 0.8.0. If you haven't been paying attention to this release, this is your last chance, because by this time tomorrow all your friends are going to be raving, and you don't want to look silly.

So why am I resorting to hyperbole? Well, for one because this is the release that debuts the Cassandra Query Language (CQL). In one fell swoop Cassandra has become more than NoSQL, it's MoSQL.

Cassandra also has distributed counters now. With counters, you can count stuff, and counting stuff rocks.

A kickass use-case for Cassandra is spanning data-centers for fault-tolerance and locality, but doing so has always meant sending data in the clear, or tunneling over a VPN. New for 0.8.0, encryption of intranode traffic.

If you're not motivated to go upgrade your clusters right now, you're either not easily impressed, or you're very lazy. If it's the latter, would it help knowing that rolling upgrades between releases is now supported? Yeah. You can upgrade your 0.7 cluster to 0.8 without shutting it down. You see what I mean?

Then go read the release notes[1] to learn about the full range of awesomeness, then grab a copy[2] and become a (fashionably) early adopter. Drivers for CQL are available in Python[3], Java[3], and Node.js[4]. As usual, a Debian package is available from the project's APT repository[5].

Enjoy!

[1]: http://goo.gl/CrJqJ (NEWS.txt)
[2]: http://cassandra.debian.org/download
[3]: http://www.apache.org/dist/cassandra/drivers
[4]: https://github.com/racker/node-cassandra-client
[5]: http://wiki.apache.org/cassandra/DebianPackaging

-- Eric Evans eev...@rackspace.com

-- /* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */
Re: Cassandra Hackathon?
Awesome! This week I am (finally) getting Cassandra (0.8) going for existing projects we have in production.

Looking at https://issues.apache.org/jira/browse/CASSANDRA-2495 as a place I was thinking maybe I could start to help out, but I am not sure that is the best starting point, though it is a sore spot for some of what we want to be doing in future projects. Any thoughts from folks on this? Should I come up with an approach and comment in JIRA? Start smaller? Bigger? Some other tickets good for the gander? More help for 0.8.0 in some specific places while I am working on it? I want the project I am working on to go well with Cassandra, so the more I jump into it the better.

If it is just the two of us, we will have more than enough pizza and beer (Medialets' treat), but hopefully we can get some others too.

On Tue, May 17, 2011 at 12:04 AM, Edward Capriolo edlinuxg...@gmail.com wrote: I had it on our list of ideas for the Cassandra NYC meetup. I am down for action. On Mon, May 16, 2011 at 9:40 PM, Joseph Stein crypt...@gmail.com wrote: Any interest for a Cassandra Hackathon evening in NYC? Any committer(s) going to be in the NYC area together that can lead/guide this? http://www.meetup.com/NYC-Cassandra-User-Group/events/18635801/ I have a thumbs up to use our office www.medialets.com in the Milk Studios building. It is a big open space with 12 tables (2-3 people per table) all in one big room + a conference room we can gather around a big screen if/when need be too. I would love to start contributing code myself and think this is a great way to get it going for others too to get over the hump (and simply make time) to knock out tickets together with good guidance growing the community. /* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */ -- /* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */
Cassandra Hackathon?
Any interest in a Cassandra Hackathon evening in NYC? Any committer(s) going to be in the NYC area together that can lead/guide this? http://www.meetup.com/NYC-Cassandra-User-Group/events/18635801/

I have a thumbs up to use our office www.medialets.com in the Milk Studios building. It is a big open space with 12 tables (2-3 people per table) all in one big room, plus a conference room we can gather around a big screen in if/when need be too.

I would love to start contributing code myself and think this is a great way to get it going for others too: to get over the hump (and simply make time) to knock out tickets together with good guidance, growing the community.

/* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */
GeoIndexing in Cassandra, Open Sourced?
I hear that a bunch of folks have GeoIndexing built on top of Cassandra and running in production. Any of them open sourced (Twitter? SimpleGeo? Bueller?) planning on it? /* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */
Re: GeoIndexing in Cassandra, Open Sourced?
On Fri, Jan 21, 2011 at 1:49 PM, Mike Malone m...@simplegeo.com wrote: A more recent preso I gave about the SimpleGeo architecture is up at http://strangeloop2010.com/system/talks/presentations/000/014/495/Malone-DimensionalDataDHT.pdf Mike On Fri, Jan 21, 2011 at 10:02 AM, Joseph Stein crypt...@gmail.com wrote: I hear that a bunch of folks have GeoIndexing built on top of Cassandra and running in production. Any of them open sourced (Twitter? SimpleGeo? Bueller?) planning on it? /* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */ -- /* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */
Re: [RELEASE] 0.7.0 (and 0.6.9)
Many thanks to those that put in all the hard work, time, dedication, etc. for another awesome release!!!

/* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */

On Tue, Jan 11, 2011 at 12:23 PM, Eric Evans eev...@rackspace.com wrote:

As some of you may already be aware, 0.7.0 has been officially released. You are free to start your upgrades, though not all at once, you'll spoil your supper!

I apologize to anyone that might have noticed artifacts published as early as Sunday and were confused by the lack of announcement; I was waiting for an Official ASF Press Release and my timing sucks. https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces9

There is way too much hotness in 0.7.0 to cover here, so instead I will refer you to the following articles.

http://www.rackspacecloud.com/blog/2010/10/27/new-features-in-cassandra-0-7
http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes
http://www.riptano.com/blog/whats-new-cassandra-07-hadoop-output-cassandra
http://www.riptano.com/blog/whats-new-cassandra-07-expiring-columns
http://www.riptano.com/blog/whats-new-cassandra-07-live-schema-updates

And of course, let's not forget the release notes. http://goo.gl/Bi8LD

As usual you can download 0.7.0 from the website: http://cassandra.apache.org/download

Users of Debian and derivatives can install from our repository: http://wiki.apache.org/cassandra/DebianPackaging

New for 0.7.0, Cassandra is also available from Maven Central (thanks Stephen Connolly).

But wait, there's more! If you're not in a hurry to upgrade, we have a new 0.6 release as well, 0.6.9. It's an easy upgrade for anyone running 0.6.8 and contains a number of useful changes (http://goo.gl/6NIPG). The Debian repository has been extended to support an extra version, so if you're accustomed to installing 0.6 from our repository, then be sure to change your suite name to 06x in sources.list.
For example:

deb http://www.apache.org/dist/cassandra/debian 06x main
deb-src http://www.apache.org/dist/cassandra/debian 06x main

That's it, thanks everyone!

-- Eric Evans eev...@rackspace.com
Re: Cassandra vs MongoDB
If you are looking to store web logs and then do ad hoc queries, you might/should be using Hadoop (depending on how big your logs are).

While MongoDB has MapReduce (built in), it is there to simulate SQL GROUP BY and is not for large-scale analytics by any means. MongoDB uses a global read/write lock per operation: general and index-assisted reads are ultra-fast in Mongo, but a bigger map/reduce or group call will block other requests until complete, possibly causing traffic to back up. Because of that global lock, *all writes block*, too.

Cassandra is much more durable, but from an architecture perspective key store vs. document store could be weighed (on smaller-traffic systems that do not need higher-level big data scale durability). If you have lots of data, then MongoDB will eventually become a consistency problem. Here is a nice article on MongoDB in a larger scale of implementation, http://www.mikealrogers.com/2010/07/mongodb-performance-durability/, with some conclusions, which also talks about Cassandra, Redis, and CouchDB.

MongoDB has made a lot of improvements over time, but Cassandra is *VERY* active also and continues to deliver great features, and it is not driven by a corporation but rather the community. MongoDB was backed and started by a company for them to make money using the open source model, whereas Cassandra started to solve a difficult problem at Facebook, was then supported completely open source, and THEN a company later popped up (Riptano) to support it, making their money using the open source model... I say this to express that the drives of the two open source projects/communities are different. You might see Google Trends for MongoDB going up because folks jump in because of the marketing and then have issues and try to find solutions =8^)

Now, I am not bashing MongoDB in any way; it is a good database (so is MySQL), but it is all about use cases AND the implementation/use/load. Apply the right solution to the problem it fits in all respects!
For logs (speaking with my architect hat on) I see no reason why you would want to hold them in a document structure, but at the same time you might not have that many logs, so you could get a lot of benefit from MongoDB M/R and such. But honestly, if it is less than 1TB you might be fine JUST using MySQL. It is all relative.

Lastly, and back to Hadoop: Cassandra has a nice implementation so that when you load your data into Cassandra you can pull it out to MapReduce it. http://allthingshadoop.com/2010/04/24/running-hadoop-mapreduce-with-cassandra-nosql/

/* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */

On Tue, Jul 27, 2010 at 4:05 PM, Mark static.void@gmail.com wrote: On 7/27/10 12:42 PM, Dave Gardner wrote: There are quite a few differences. Ultimately it depends on your use case! For example Mongo has a limit on the maximum document size of 4MB, whereas with Cassandra you are not really limited in the volume of data/columns per row (I think there may be a limit of 2GB perhaps; basically none). Another point re: search volumes is that Mongo has been actively promoting over the last few months. I recently attended an excellent conference day in London which was very cheap; tickets probably didn't cover the costs. I guess this is part of their strategy, e.g. encourage adoption. Dave

On Tuesday, July 27, 2010, Jonathan Shook jsh...@gmail.com wrote: Also, Google Trends is only a measure of what terms people are searching for. To equate this directly to growth would be misleading.

On Tue, Jul 27, 2010 at 12:27 PM, Drew Dahlke drew.dah...@bronto.com wrote: There's a good post on stackoverflow comparing the two, http://stackoverflow.com/questions/2892729/mongodb-vs-cassandra It seems to me that both projects have pretty vibrant communities behind them.

On Tue, Jul 27, 2010 at 11:14 AM, Mark static.void@gmail.com wrote: Can someone quickly explain the differences between the two? 
Other than the fact that MongoDB supports ad-hoc querying, I don't know what's different. It also appears (using Google Trends) that MongoDB seems to be growing while Cassandra is dying off. Is this the case? Thanks for the help.

Well, my initial use case would be to store our search logs and perform some ad-hoc querying, which I know is a win for Mongo. However, I don't think I fully understand how to build indexes in Cassandra, so maybe it's just an issue of ignorance. I know going forward, though, that we would be expanding it to house our per-item translations. --
geo distance calculations
I believe I have asked before, but now that I am really getting into the weeds with this it seems I am about to go down the MongoDB path... before I do, let me ask again (as I would prefer to stick with Cassandra for this app).

Has anyone implemented geo (long/lat) calculations (distance) using Cassandra (something like GeoKit for Rails or such)? I would be using this in Lift (not that it would matter, but I figured I would mention it).

Regards, /* Joe Stein http://www.linkedin.com/in/charmalloc */
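For the distance half of the question, the calculation itself is independent of the datastore: the standard haversine formula gives great-circle distance between two long/lat points. A minimal sketch (the function name and the spherical-Earth radius of 6371 km are assumptions):

```python
import math

# Haversine great-circle distance between two points given in degrees.
# Assumes a spherical Earth; 6371 km is a common mean-radius approximation.
EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Distance in kilometers between (lat1, lon1) and (lat2, lon2)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

The harder part with Cassandra is the indexing side (finding nearby rows), which is what the geo libraries mentioned in this thread layer on top.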
Re: timeout while running simple hadoop job
You can manage the number of map tasks per node with mapred.tasktracker.map.tasks.maximum=1

On Fri, May 7, 2010 at 9:53 AM, gabriele renzi rff@gmail.com wrote: On Fri, May 7, 2010 at 2:44 PM, Jonathan Ellis jbel...@gmail.com wrote: Sounds like you need to configure Hadoop to not create a whole bunch of Map tasks at once. Interesting, from a quick check it seems there are a dozen threads running. Yet setNumMapTasks seems to be deprecated (together with JobConf), and while I guess -Dmapred.map.tasks=N may still work, it seems the only way to manage the number of map tasks is via a custom subclass of ColumnFamilyInputFormat. But of course you have a point that on a single box this does not add anything.

-- /* Joe Stein http://www.linkedin.com/in/charmalloc */
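For pre-YARN Hadoop, that per-node cap is set in mapred-site.xml on each TaskTracker; a sketch of the fragment (the value of 1 matches the suggestion above, and this is a config illustration rather than a complete file):

```xml
<!-- mapred-site.xml fragment (pre-YARN Hadoop): cap the number of map tasks
     that may run concurrently on each TaskTracker node. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
```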
Re: Cassandra use cases: as a datagrid ? as a distributed cache ?
great talk tonight in NYC I attended in regards to using Cassandra as a Lucene Index store (really great idea nicely implemented) http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/ so Lucinda uses Cassandra as a distributed cache of indexes =8^) On Mon, Apr 26, 2010 at 9:47 PM, Jonathan Ellis jbel...@gmail.com wrote: On Mon, Apr 26, 2010 at 9:04 AM, Dominique De Vito dominique.dev...@thalesgroup.com wrote: (1) has anyone already used Cassandra as an in-memory data grid ? If no, does anyone know how far such a database is from, let's say, Oracle Coherence ? Does Cassandra provide, for example, a (synchronized) cache on the client side ? If you mean an in-process cache on the client side, no. (2) has anyone already used Cassandra as a distributed cache ? Are there some testimonials somewhere about this use case ? That's basically what reddit is using it for. http://blog.reddit.com/2010/03/she-who-entangles-men.html -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com -- /* Joe Stein http://www.linkedin.com/in/charmalloc */
Re: Cassandra use cases: as a datagrid ? as a distributed cache ?
(sp) Lucandra http://github.com/tjake/Lucandra On Mon, Apr 26, 2010 at 11:08 PM, Joseph Stein crypt...@gmail.com wrote: great talk tonight in NYC I attended in regards to using Cassandra as a Lucene Index store (really great idea nicely implemented) http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/ so Lucinda uses Cassandra as a distributed cache of indexes =8^) On Mon, Apr 26, 2010 at 9:47 PM, Jonathan Ellis jbel...@gmail.com wrote: On Mon, Apr 26, 2010 at 9:04 AM, Dominique De Vito dominique.dev...@thalesgroup.com wrote: (1) has anyone already used Cassandra as an in-memory data grid ? If no, does anyone know how far such a database is from, let's say, Oracle Coherence ? Does Cassandra provide, for example, a (synchronized) cache on the client side ? If you mean an in-process cache on the client side, no. (2) has anyone already used Cassandra as a distributed cache ? Are there some testimonials somewhere about this use case ? That's basically what reddit is using it for. http://blog.reddit.com/2010/03/she-who-entangles-men.html -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com -- /* Joe Stein http://www.linkedin.com/in/charmalloc */ -- /* Joe Stein http://www.linkedin.com/in/charmalloc */
Re: The Difference Between Cassandra and HBase
It is kind of the classic distinction between OLTP and OLAP. Cassandra is to OLTP as HBase is to OLAP (for those SAT nutz). Both are useful and valuable in their own right, agreed.

On Sun, Apr 25, 2010 at 12:20 PM, Jeff Hodges jhod...@twitter.com wrote: HBase is awesome when you need high throughput and don't care so much about latency. Cassandra is generally the opposite. They are wonderfully complementary. -- Jeff

On Sun, Apr 25, 2010 at 8:19 AM, Lenin Gali galile...@gmail.com wrote: I second Joe. Lenin Sent from my BlackBerry® wireless handheld

-----Original Message----- From: Joe Stump j...@joestump.net Date: Sun, 25 Apr 2010 13:04:50 To: user@cassandra.apache.org Subject: Re: The Difference Between Cassandra and HBase

On Apr 25, 2010, at 11:40 AM, Mark Robson wrote: For me an important difference is that Cassandra is operationally much more straightforward - there is only one type of node, and it is fully redundant (depending what consistency level you're using). This seems to be an advantage in Cassandra vs most other distributed storage systems, which almost all seem to require some master nodes which have different operational requirements (e.g. cannot fail, need to be failed over manually or have another HA solution installed for them)

These two remain the #1 and #2 reasons I recommend Cassandra over HBase. At the end of the day, Cassandra is an *absolute* dream to manage across multiple data centers. I could go on and on about the voodoo that is expanding, contracting, and rebalancing a Cassandra cluster. It's pretty awesome.

That being said, we're getting ready to spin up an HBase cluster. If you're wanting increment/decrement, more complex range scans, etc. then HBase is a great candidate. Especially if you don't need it to span multiple data centers. We're using Cassandra for our main things, and then HBase+Hive for analytics. There's room for both. Especially if you're using Hadoop with Cassandra. --Joe

-- /* Joe Stein http://www.linkedin.com/in/charmalloc */
download links 404 on main site
So I just moved to a new dev machine and went to download 0.5.1. I was excited to see, when googling Cassandra, it coming up #1 (under the top-level site now), but upset when EVERY mirror I tried came up with a 404 not-found error =8^(

From http://cassandra.apache.org/ try to download 0.5.1: no luck... not sure if this is a known issue; I did not see anyone mailing about it.

/* Joe Stein http://www.linkedin.com/in/charmalloc */