CassandraFS in 1.0?

2011-07-06 Thread Joseph Stein
Hey folks, I am going to start prototyping our media tier using Cassandra as
a file system (meaning uploading video/audio/images to a web server, saving
them in Cassandra, and then streaming them out).

Has anyone done this before?

I was thinking Brisk's CassandraFS might be a fantastic implementation for
this, but then I would need to run another/different Cassandra cluster
outside of what our ops folks run with Apache Cassandra 0.8.X.

Am I best off just compressing files uploaded to the web server and then
chunking them, saving the chunks in rows and columns so the memory issue does
not smack me in the face?  And using our existing cluster, building it out
accordingly?
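For what it's worth, here is a minimal sketch of what I mean by chunking, assuming zlib compression and an arbitrary 1 MB chunk size; `chunk_file` and `reassemble` are illustrative names, and actually writing the columns with a real client (Thrift, Hector, etc.) is left out:

```python
import zlib

CHUNK_SIZE = 1024 * 1024  # 1 MB per column; tune to keep columns well under memtable/heap limits

def chunk_file(file_key, data, chunk_size=CHUNK_SIZE):
    """Compress a blob and split it into zero-padded, ordered chunk columns.

    Returns (row key, dict of column name -> chunk bytes); the zero padding
    keeps the columns in byte order so a slice reads chunks back in sequence.
    """
    compressed = zlib.compress(data)
    columns = {
        "chunk:%08d" % (offset // chunk_size): compressed[offset:offset + chunk_size]
        for offset in range(0, len(compressed), chunk_size)
    }
    return file_key, columns

def reassemble(columns):
    """Concatenate chunk columns in column-name order and decompress."""
    return zlib.decompress(b"".join(columns[name] for name in sorted(columns)))
```

Streaming out would then just be a column slice over `chunk:*` in order, so no single read ever has to materialize the whole file.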

I am sure our ops people would like the command-line aspect of CassandraFS,
but I am looking for something that makes the most sense all around.

It seems to me there is a REALLY great thing in CassandraFS and I would love
to see it as part of 1.0 =8^)  or at a minimum some streamlined
implementation to do the same thing.

Comparing to HDFS, which is part of the Hadoop project even though Cloudera
has a distribution of Hadoop :), maybe that can work here too
_fingers_crossed_ (or MongoDB's GridFS).

Happy to help, as I am moving down this road in general.

Thanks!

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop http://www.twitter.com/allthingshadoop
*/


Paging Columns from a Row

2011-06-05 Thread Joseph Stein
What are the best practices here for paging and slicing columns from a row?

So let's say I have 1,000,000 columns in a row.

I read the row but want to have one thread read columns 0 - 9,999, a second
thread (an actor in my case) read 10,000 - 19,999, and so on, so I can have
100 workers each processing 10,000 columns for each of my rows.

If there is no API for this, is it something I should use a composite key
for, populating the rows with a counter?

000:myoriginalcolumnnameX
001:myoriginalcolumnnameY
002:myoriginalcolumnnameZ

Going the composite key route and doing a start/end predicate would work,
but then the insertion/load would have to go through a single synchronized
point to generate the column names... I am not opposed to this, but I would
prefer that neither the load nor the processing of my data be bound by any
one lock (even if distributed).
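One way to avoid the shared counter entirely (just a sketch, not anything Cassandra provides; `NUM_BUCKETS` and the naming scheme are my own assumptions) is to derive the prefix from a hash of the original column name, so writers need no coordination and each worker still gets a sliceable, zero-padded range:

```python
import hashlib

NUM_BUCKETS = 100  # one bucket per worker

def bucketed_column(name):
    """Prefix a column name with a deterministic, zero-padded bucket id.

    The bucket comes from a hash of the name itself, so any writer can
    compute it independently (no shared counter or distributed lock), and
    worker N can still read its share with a start/end slice on "NNN:".
    """
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return "%03d:%s" % (bucket, name)
```

The trade-off versus a counter is that buckets are only statistically even rather than exactly 10,000 columns each.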

Thanks

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/


Re: Paging Columns from a Row

2011-06-05 Thread Joseph Stein
So I can have one PagedIndex CF that holds a row for each data file I am
processing.

That row (in my example) would have X columns, and I can make those columns'
values be 100 strings that represent keys in another PagedData CF.

This other PagedData CF would have 10,000 columns per row, and their values
would hold my data, which I can loop through, parallelize, and scale on, so I
can do this 100 times simultaneously.

This is really awesome because if I have 10 files, each with a billion rows,
and I push them into this pattern, I can scale quite nicely, provided 10,000
is my magic number of columns to page.  For 10,000,000,000 rows I would have
10,000 columns in my first PagedIndex CF (each representing 100s of PagedData
rows that have data); for each of those columns I can then pull the
corresponding row, pulling out 10,000 pieces of data to process 100 at a time
on different servers.
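The key arithmetic for that two-level layout is simple enough to sketch; a hypothetical helper (`PAGE_SIZE`, `paged_keys`, and the row/column naming are all illustrative, not an existing API):

```python
PAGE_SIZE = 10000  # columns per PagedData row

def paged_keys(file_key, item_index):
    """Map a global item index to (PagedData row key, column name).

    Each PagedData row holds PAGE_SIZE items; the PagedIndex row for the
    file would list one column per generated data-row key, so workers can
    each grab a page key and slice that row independently.
    """
    page = item_index // PAGE_SIZE
    row_key = "%s:page:%08d" % (file_key, page)
    column = "%08d" % (item_index % PAGE_SIZE)
    return row_key, column
```

Because the mapping is pure arithmetic on the load-time item index, loaders on different machines that own disjoint index ranges never contend on the same page row.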

got it, thanks! awesome!

On Sun, Jun 5, 2011 at 4:36 PM, Jonathan Ellis jbel...@gmail.com wrote:

 If you need to parallelize (and scale) you need to distribute across
 multiple rows. One Big Row means all your 100 workers are hammering
 the same 3 (for instance) replicas at the same time.

 On Sun, Jun 5, 2011 at 1:43 PM, Joseph Stein crypt...@gmail.com wrote:
  What are the best practices here for paging and slicing columns from a row?
  So let's say I have 1,000,000 columns in a row.
  I read the row but want to have one thread read columns 0 - 9,999, a second
  thread (an actor in my case) read 10,000 - 19,999, and so on, so I can have
  100 workers each processing 10,000 columns for each of my rows.
  If there is no API for this, is it something I should use a composite key
  for, populating the rows with a counter?
  000:myoriginalcolumnnameX
  001:myoriginalcolumnnameY
  002:myoriginalcolumnnameZ
  Going the composite key route and doing a start/end predicate would work,
  but then the insertion/load would have to go through a single synchronized
  point to generate the column names... I am not opposed to this, but I would
  prefer that neither the load nor the processing of my data be bound by any
  one lock (even if distributed).
  Thanks
  /*
  Joe Stein
  http://www.linkedin.com/in/charmalloc
  Twitter: @allthingshadoop
  */
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com




-- 

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/


Re: [RELEASE] 0.8.0

2011-06-02 Thread Joseph Stein
Awesome!

On Thu, Jun 2, 2011 at 7:36 PM, Eric Evans eev...@rackspace.com wrote:


 I am very pleased to announce the official release of Cassandra 0.8.0.

 If you haven't been paying attention to this release, this is your last
 chance, because by this time tomorrow all your friends are going to be
 raving, and you don't want to look silly.

 So why am I resorting to hyperbole?  Well, for one because this is the
 release that debuts the Cassandra Query Language (CQL).  In one fell
 swoop Cassandra has become more than NoSQL, it's MoSQL.

 Cassandra also has distributed counters now.  With counters, you can
 count stuff, and counting stuff rocks.

 A kickass use-case for Cassandra is spanning data-centers for
 fault-tolerance and locality, but doing so has always meant sending data
 in the clear, or tunneling over a VPN.   New for 0.8.0, encryption of
 intranode traffic.

 If you're not motivated to go upgrade your clusters right now, you're
 either not easily impressed, or you're very lazy.  If it's the latter,
 would it help knowing that rolling upgrades between releases are now
 supported?  Yeah.  You can upgrade your 0.7 cluster to 0.8 without
 shutting it down.

 You see what I mean?  Then go read the release notes[1] to learn about
 the full range of awesomeness, then grab a copy[2] and become a
 (fashionably) early adopter.

 Drivers for CQL are available in Python[3], Java[3], and Node.js[4].

 As usual, a Debian package is available from the project's APT
 repository[5].

 Enjoy!


 [1]: http://goo.gl/CrJqJ (NEWS.txt)
 [2]: http://cassandra.debian.org/download
 [3]: http://www.apache.org/dist/cassandra/drivers
 [4]: https://github.com/racker/node-cassandra-client
 [5]: http://wiki.apache.org/cassandra/DebianPackaging

 --
 Eric Evans
 eev...@rackspace.com




-- 

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/


Re: Cassandra Hackathon?

2011-05-17 Thread Joseph Stein
awesome!

this week I am (finally) getting cassandra (0.8) going for existing projects
we have in production.

Looking at https://issues.apache.org/jira/browse/CASSANDRA-2495, that is a
place I was thinking maybe I could start to help out, but I am not sure it is
the best starting point, though it is a sore spot for some of what we want to
be doing in future projects.  Any thoughts from folks on this?  Is it best
for me to come up with an approach and comment in the JIRA?  Start smaller?
Bigger?  Are some other tickets good for the gander?  More help for 0.8.0 in
some specific places while I am working on it?

I want the project I am working on to go well with Cassandra so the more I
jump into it the better.

If it is just the two of us we will have more than enough pizza and beer
(Medialets' treat), but hopefully we can get some others too.

On Tue, May 17, 2011 at 12:04 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

 I had it on our list of ideas for the Cassandra NYC meetup. I am down for
 action.




 On Mon, May 16, 2011 at 9:40 PM, Joseph Stein crypt...@gmail.com wrote:

 Any interest for a Cassandra Hackathon evening in NYC?  Any committer(s)
 going to be in the NYC area together that can lead/guide this?

 http://www.meetup.com/NYC-Cassandra-User-Group/events/18635801/

 I have a thumbs up to use our office www.medialets.com in the Milk
 Studios building. It is a big open space with 12 tables (2-3 people per
 table) all in one big room + a conference room we can gather around a big
 screen if/when need be too..

 I would love to start contributing code myself and think this is a great
 way to get it going for others too to get over the hump (and simply make
 time) to knock out tickets together with good guidance growing the
 community.


 /*
 Joe Stein
 http://www.linkedin.com/in/charmalloc
 Twitter: @allthingshadoop
 */





-- 

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/


Cassandra Hackathon?

2011-05-16 Thread Joseph Stein
Any interest in a Cassandra Hackathon evening in NYC?  Any committer(s)
going to be in the NYC area together who can lead/guide this?

http://www.meetup.com/NYC-Cassandra-User-Group/events/18635801/

I have a thumbs-up to use our office www.medialets.com in the Milk Studios
building.  It is a big open space with 12 tables (2-3 people per table), all
in one big room, plus a conference room where we can gather around a big
screen if/when need be.

I would love to start contributing code myself, and I think this is a great
way for others to get going too: to get over the hump (and simply make time)
to knock out tickets together, with good guidance, growing the community.


/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/


GeoIndexing in Cassandra, Open Sourced?

2011-01-21 Thread Joseph Stein
I hear that a bunch of folks have GeoIndexing built on top of Cassandra and
running in production.

Are any of them open sourced (Twitter? SimpleGeo? Bueller?), or planning on it?

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/


Re: GeoIndexing in Cassandra, Open Sourced?

2011-01-21 Thread Joseph Stein
On Fri, Jan 21, 2011 at 1:49 PM, Mike Malone m...@simplegeo.com wrote:

 A more recent preso I gave about the SimpleGeo architecture is up at
 http://strangeloop2010.com/system/talks/presentations/000/014/495/Malone-DimensionalDataDHT.pdf

 Mike

 On Fri, Jan 21, 2011 at 10:02 AM, Joseph Stein crypt...@gmail.com wrote:

 I hear that a bunch of folks have GeoIndexing built on top of Cassandra
 and running in production.

 Any of them open sourced (Twitter? SimpleGeo? Bueller?) planning on it?

 /*
 Joe Stein
 http://www.linkedin.com/in/charmalloc
 Twitter: @allthingshadoop
 */





-- 

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/


Re: [RELEASE] 0.7.0 (and 0.6.9)

2011-01-11 Thread Joseph Stein
Many thanks to those that put in all the hard work, time, dedication, etc
for another awesome release !!!

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/

On Tue, Jan 11, 2011 at 12:23 PM, Eric Evans eev...@rackspace.com wrote:


 As some of you may already be aware, 0.7.0 has been officially released.
 You are free to start your upgrades, though not all at once, you'll
 spoil your supper!

 I apologize to anyone that might have noticed artifacts published as
 early as Sunday and were confused by the lack of announcement, I was
 waiting for an Official ASF Press Release and my timing sucks.


 https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces9

 There is way too much hotness in 0.7.0 to cover here, so instead I will
 refer you to the following articles.

 http://www.rackspacecloud.com/blog/2010/10/27/new-features-in-cassandra-0-7
 http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes
 http://www.riptano.com/blog/whats-new-cassandra-07-hadoop-output-cassandra
 http://www.riptano.com/blog/whats-new-cassandra-07-expiring-columns
 http://www.riptano.com/blog/whats-new-cassandra-07-live-schema-updates

 And of course, let's not forget the release notes.

 http://goo.gl/Bi8LD

 As usual you can download 0.7.0 from the website:
 http://cassandra.apache.org/download

 Users of Debian and derivatives can install from our repository:
 http://wiki.apache.org/cassandra/DebianPackaging

 New for 0.7.0, Cassandra is also available from Maven Central (thanks
 Stephen Connolly).


 But wait, there's more! If you're not in a hurry to upgrade, we have a
 new 0.6 release as well, 0.6.9.  It's an easy upgrade for anyone running
 0.6.8 and contains a number of useful changes (http://goo.gl/6NIPG).

 The Debian repository has been extended to support an extra version so
 if you're accustomed to installing 0.6 from our repository, then be sure
 to change your suite name to 06x in sources.list.  For example:

  deb http://www.apache.org/dist/cassandra/debian 06x main
  deb-src http://www.apache.org/dist/cassandra/debian 06x main


 That's it, thanks everyone!

 --
 Eric Evans
 eev...@rackspace.com




Re: Cassandra vs MongoDB

2010-07-28 Thread Joseph Stein
If you are looking to store web logs and then do ad hoc queries you
might/should be using Hadoop (depending on how big your logs are)

While MongoDB has MapReduce (built in) it is there to simulate SQL GROUP BY
and not for large scale analytics by any means.

MongoDB uses a global read/write lock per operation.  General and
index-assisted reads are ultra-fast in Mongo, but a bigger map/reduce or
group call will block other requests until complete, possibly causing
traffic to back up.  Because of that global lock, *all writes block*, too.

Cassandra is much more durable, but from an architecture perspective
key-value store vs document store could be weighed (on smaller-traffic
systems that do not need higher-level big-data scale & durability).

If you have lots of data then MongoDB will eventually become a consistent
problem.

Here is a nice article on MongoDB in a larger-scale implementation,
http://www.mikealrogers.com/2010/07/mongodb-performance-durability/, with
some conclusions; it also talks about Cassandra, Redis & CouchDB.

MongoDB has made a lot of improvements over time, but Cassandra is *VERY*
active also and continues to deliver great features, driven not by a
corporation but rather by the community.

MongoDB is backed and started by a company for them to make money using the
open source model, whereas Cassandra started to solve a difficult problem at
Facebook, was then supported completely as open source, and THEN had a
company pop up later (Riptano) to support it, making their money using the
open source model... I say this to express that the drives of the two
servers & open source projects/communities are different.

You might see Google Trends for MongoDB going up because folks jump in
because of the marketing and then have issues and try to find solutions =8^)

Now, I am not bashing MongoDB in any sense; it is a good database (so is
MySQL), but it is all about use cases AND the implementation/use/load.
Apply the right solution to the problem it fits in all respects!

For logs (speaking with my architect hat on) I see no reason why you would
want to hold that in a document structure, but at the same time you might
not have that many logs, so you could get a lot of benefit from MongoDB M/R
and such.  But honestly, if it is less than 1TB you might be fine JUST using
MySQL.

It is all relative.

Lastly, and back to Hadoop: Cassandra has a nice implementation so that once
you load your data into Cassandra, you can pull it out to MapReduce it:
http://allthingshadoop.com/2010/04/24/running-hadoop-mapreduce-with-cassandra-nosql/

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/

On Tue, Jul 27, 2010 at 4:05 PM, Mark static.void@gmail.com wrote:

 On 7/27/10 12:42 PM, Dave Gardner wrote:

 There are quite a few differences. Ultimately it depends on your use
 case! For example Mongo has a limit on the maximum document size of
 4MB, whereas with Cassandra you are not really limited in the volume
 of data/columns per row (I think there may be a limit of 2GB perhaps;
 basically none).

 Another point re: search volumes is that mongo has been actively
 promoting over the last few months. I recently attended an excellent
 conference day in London which was very cheap; tickets probably didn't
 cover the costs. I guess this is part of their strategy. Eg: encourage
 adoption.

 Dave

 On Tuesday, July 27, 2010, Jonathan Shook jsh...@gmail.com wrote:


 Also, google trends is only a measure of what terms people are
 searching for. To equate this directly to growth would be misleading.

  On Tue, Jul 27, 2010 at 12:27 PM, Drew Dahlke drew.dah...@bronto.com
  wrote:


 There's a good post on stackoverflow comparing the two
 http://stackoverflow.com/questions/2892729/mongodb-vs-cassandra

 It seems to me that both projects have pretty vibrant communities behind
 them.

 On Tue, Jul 27, 2010 at 11:14 AM, Mark static.void@gmail.com
  wrote:


 Can someone quickly explain the differences between the two? Other than
 the
 fact that MongoDB supports ad-hoc querying I don't know whats
 different. It
 also appears (using google trends) that MongoDB seems to be growing
 while
 Cassandra is dying off. Is this the case?

 Thanks for the help







 Well, my initial use case would be to store our search logs and perform
 some ad-hoc querying, which I know is a win for Mongo. However, I don't
 think I fully understand how to build indexes in Cassandra, so maybe it's
 just an issue of ignorance. I know going forward, though, we would be
 expanding it to house our per-item translations.






geo distance calculations

2010-06-26 Thread Joseph Stein
I believe I have asked before, but now that I am really getting into the
weeds with this, it seems I am about to go down the MongoDB path... Before I
do, let me ask again (as I would prefer to stick with Cassandra for this
app):

Has anyone implemented geo (long & lat) distance calculations using
Cassandra (something like geokit for Rails or such)?
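For what it's worth, the distance part is just math and can live in the application regardless of the store; here is a standard haversine great-circle sketch (the function name and the spherical-Earth radius constant are my own, not from any Cassandra library):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumes a spherical Earth

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points,
    computed with the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

The hard Cassandra-specific piece is the indexing (deciding which rows to even look at); prefix-sortable geohash row keys are one common approach people pair with a distance function like this.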

I would be using this in LIFT (not that it would matter but figure I would
mention it).

Regards,

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/


Re: timeout while running simple hadoop job

2010-05-07 Thread Joseph Stein
you can manage the number of map tasks by node

mapred.tasktracker.map.tasks.maximum=1


On Fri, May 7, 2010 at 9:53 AM, gabriele renzi rff@gmail.com wrote:
 On Fri, May 7, 2010 at 2:44 PM, Jonathan Ellis jbel...@gmail.com wrote:
 Sounds like you need to configure Hadoop to not create a whole bunch
 of Map tasks at once

 interesting, from a quick check it seems there are a dozen threads running.
 Yet, setNumMapTasks seems to be deprecated (together with JobConf),
 and while I guess
   -Dmapred.map.tasks=N
 may still work, it seems the only way to manage the
 number of map tasks is via a custom subclass of
 ColumnFamilyInputFormat.

 But of course you have a point that in a single box this does not add
 anything.




-- 
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/


Re: Cassandra use cases: as a datagrid ? as a distributed cache ?

2010-04-26 Thread Joseph Stein
Great talk tonight in NYC that I attended, regarding using Cassandra as
a Lucene index store (really great idea, nicely implemented):
http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/

so Lucinda uses Cassandra as a distributed cache of indexes =8^)


On Mon, Apr 26, 2010 at 9:47 PM, Jonathan Ellis jbel...@gmail.com wrote:
 On Mon, Apr 26, 2010 at 9:04 AM, Dominique De Vito
 dominique.dev...@thalesgroup.com wrote:
 (1) has anyone already used Cassandra as an in-memory data grid ?
 If no, does anyone know how far such a database is from, let's say, Oracle
 Coherence ?
 Does Cassandra provide, for example, a (synchronized) cache on the client
 side ?

 If you mean an in-process cache on the client side, no.

 (2) has anyone already used Cassandra as a distributed cache ?
 Are there some testimonials somewhere about this use case ?

 That's basically what reddit is using it for.
 http://blog.reddit.com/2010/03/she-who-entangles-men.html

 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com




-- 
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/


Re: Cassandra use cases: as a datagrid ? as a distributed cache ?

2010-04-26 Thread Joseph Stein
(sp) Lucandra http://github.com/tjake/Lucandra

On Mon, Apr 26, 2010 at 11:08 PM, Joseph Stein crypt...@gmail.com wrote:
 great talk tonight in NYC I attended in regards to using Cassandra as
 a Lucene Index store (really great idea nicely implemented)
 http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/

 so Lucinda uses Cassandra as a distributed cache of indexes =8^)


 On Mon, Apr 26, 2010 at 9:47 PM, Jonathan Ellis jbel...@gmail.com wrote:
 On Mon, Apr 26, 2010 at 9:04 AM, Dominique De Vito
 dominique.dev...@thalesgroup.com wrote:
 (1) has anyone already used Cassandra as an in-memory data grid ?
 If no, does anyone know how far such a database is from, let's say, Oracle
 Coherence ?
 Does Cassandra provide, for example, a (synchronized) cache on the client
 side ?

 If you mean an in-process cache on the client side, no.

 (2) has anyone already used Cassandra as a distributed cache ?
 Are there some testimonials somewhere about this use case ?

 That's basically what reddit is using it for.
 http://blog.reddit.com/2010/03/she-who-entangles-men.html

 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com




 --
 /*
 Joe Stein
 http://www.linkedin.com/in/charmalloc
 */




-- 
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/


Re: The Difference Between Cassandra and HBase

2010-04-25 Thread Joseph Stein
It is kind of the classic distinction between OLTP & OLAP.

Cassandra is to OLTP as HBase is to OLAP (for those SAT nutz).

Both are useful and valuable in their own right, agreed.

On Sun, Apr 25, 2010 at 12:20 PM, Jeff Hodges jhod...@twitter.com wrote:
 HBase is awesome when you need high throughput and don't care so much
 about latency. Cassandra is generally the opposite. They are
 wonderfully complementary.
 --
 Jeff

 On Sun, Apr 25, 2010 at 8:19 AM, Lenin Gali galile...@gmail.com wrote:
 I second Joe.

 Lenin
 Sent from my BlackBerry® wireless handheld

 -Original Message-
 From: Joe Stump j...@joestump.net
 Date: Sun, 25 Apr 2010 13:04:50
 To: user@cassandra.apache.org
 Subject: Re: The Difference Between Cassandra and HBase


 On Apr 25, 2010, at 11:40 AM, Mark Robson wrote:

 For me an important difference is that Cassandra is operationally much more 
 straightforward - there is only one type of node, and it is fully redundant 
 (depending what consistency level you're using).

 This seems to be an advantage in Cassandra vs most other distributed 
 storage systems, which almost all seem to require some master nodes which 
 have different operational requirements (e.g. cannot fail, need to be 
 failed over manually or have another HA solution installed for them)

 These two remain the #1 and #2 reasons I recommend Cassandra over HBase. At 
 the end of the day, Cassandra is an *absolute* dream to manage across 
 multiple data centers. I could go on and on about the voodoo that is 
 expanding, contracting, and rebalancing a Cassandra cluster. It's pretty 
 awesome.

 That being said, we're getting ready to spin up an HBase cluster. If you're 
 wanting increment/decrement, more complex range scans, etc. then HBase is a 
 great candidate. Especially if you don't need it to span multiple data 
 centers. We're using Cassandra for our main things, and then HBase+Hive for 
 analytics.

 There's room for both. Especially if you're using Hadoop with Cassandra.

 --Joe






-- 
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/


download links 404 on main site

2010-03-15 Thread Joseph Stein
So I just moved to a new dev machine and went to download 0.5.1.

I was excited to see, when googling Cassandra, it coming up #1 (under the
top-level site now),

but upset when EVERY mirror I tried came up with a 404 not-found error =8^(

http://cassandra.apache.org/

I tried to download 0.5.1, no luck... Not sure if this is a known issue; I
did not see anyone mailing about it.

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/