Re: Ball is rolling on High Performance Cassandra Cookbook second edition

2012-06-27 Thread Courtney Robinson
Sounds good.
One thing I'd like to see is more coverage of Cassandra internals. Out of
the box Cassandra's great, but having a little inside knowledge can be very
useful because it helps you design your applications to work with
Cassandra, rather than having to make endless optimizations later that
could probably have been avoided had you done your implementation slightly
differently.

Another thing that may be worth adding is a recipe showing an approach to
evaluating Cassandra for your organization/use case. I realize that's going
to vary on a case-by-case basis, but one thing I've noticed is that some
people dive in without really thinking through whether Cassandra is actually
the right fit for what they're doing. It sort of becomes a hammer for
anything that looks like a nail.

On Tue, Jun 26, 2012 at 10:25 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 Hello all,

 It has not been very long since the first book was published, but
 several things have been added to Cassandra and a few things have
 changed. I am putting together a list of changed content, for example
 features like the old per-column-family memtable flush settings versus
 the new system with the global setting.

 My editors have given me the green light to grow the second edition
 from ~200 pages currently up to 300 pages! This gives us the ability
 to add more items/sections to the text.

 Some things were missing from the first edition, such as Hector
 support. Nate has offered to help me in this area. Please feel free to
 contact me with any ideas and suggestions of recipes you would like to
 see in the book. Also get in touch if you want to write a recipe. Several
 people added content to the first edition and it would be great to see
 that type of participation again.

 Thank you,
 Edward




-- 
Courtney Robinson
court...@crlog.info
http://crlog.info
07535691628 (No private #s)


CF design

2011-10-06 Thread Courtney Robinson

I was hoping someone could share their opinions on the following CF designs or
suggest a better way of doing it.
My app is constantly receiving new data that contains URLs. I was
thinking of hashing the URL to form a key. The data is a JSON object with
several properties. For now many of its properties will be ignored and only 4
are of interest: URL, title, username, user_rating. Oftentimes the same URL
is received but shared by a different user. I'm wondering if anyone can suggest
a better approach to what I propose below, which will be able to answer the
following.


Queries:

I'll be asking the following questions:

1) Give me the N most frequently shared items over:
   a) the last 30 minutes
   b) the last 24 hours
   c) between D1 and D2 (where D1 and D2 represent the start and end dates of interest)

2) Give me the N most shared items over the 3 time periods above WHERE the average
   user rating is above 5

3) Give me X for the item with the ID 123 (where X is a property of the item with
   the ID 123)

Proposed approach

Use timestamps as keys in the CF; that should take care of the queries under 1 and
partially handle 2. Each column stores the JSON data, minus the common fields such as
the title, which will be the same no matter how many users share the same link (those
fields get their own columns in the row). The other column names will be each sharing
user's username, and the value of those columns will be any JSON left over that isn't
common across users (i.e. the user-specific parts).

For the rest of 2, I can get the N items we're interested in and calculate the average
user rating for each item client side. Of course, using a timestamp as the key means we
need to maintain an index of the "real" keys/IDs for each item, which would allow us to
answer "Give me the item with the ID 123".

Finally, to address 3, I was thinking: using the index, get the timestamp of the item,
and on the client side find the property of interest.

CF1

  Timestamp1: { Title: value,  ID: ID1, Username3: {"rating":5},
                Username2: {"rating":0}, Username2: {"rating":4} }

  Timestamp2: { Title: Value1, ID: ID2, Username24: {"rating":1},
                Username87: {"rating":9}, Username7: {"rating":2} }

CF2

  ID1: { Timestamp1 }
  ID2: { Timestamp2 }

In the Username columns, I'd ideally like to avoid storing the other properties
as JSON, but I couldn't think of a way of doing it sensibly once that JSON
grows to having 10 other properties. Does this sound like a sensible approach
to designing my CFs?
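
For what it's worth, here's a minimal plain-Java sketch of the client-side part of
query 2 under this model: given one CF1 row as a map of column name to value, average
the per-user ratings and decide whether the item qualifies. The column names and the
regex-based rating extraction are just illustrative assumptions.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RatingFilterSketch {
    private static final Pattern RATING = Pattern.compile("\"rating\"\\s*:\\s*(\\d+)");

    // "Title" and "ID" are the shared columns; every other column is assumed to be a
    // username whose value is that user's JSON, e.g. {"rating":5}.
    static double averageRating(Map<String, String> row) {
        long sum = 0, count = 0;
        for (Map.Entry<String, String> col : row.entrySet()) {
            if (col.getKey().equals("Title") || col.getKey().equals("ID")) continue;
            Matcher m = RATING.matcher(col.getValue());
            if (m.find()) { sum += Long.parseLong(m.group(1)); count++; }
        }
        return count == 0 ? 0 : (double) sum / count;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<String, String>();
        row.put("Title", "value");
        row.put("ID", "ID1");
        row.put("Username3", "{\"rating\":5}");
        row.put("Username2", "{\"rating\":4}");
        // Query 2 would keep this item only if the average is above 5.
        System.out.println("average=" + averageRating(row));
    }
}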

Re: CQL DELETE statement

2011-04-18 Thread Courtney Robinson
Cool... Okay, the plan is to eventually not use Thrift underneath for the
CQL stuff, right?
Once that's done and the new transport is in place, or even while
designing the new transport, is this not something that's worth looking
into again? I think it'd be a nice feature.


-Original Message- 
From: Jonathan Ellis

Sent: Monday, April 18, 2011 3:24 AM
To: user@cassandra.apache.org
Cc: Tyler Hobbs
Subject: Re: CQL DELETE statement

Very old. https://issues.apache.org/jira/browse/CASSANDRA-494

On Sun, Apr 17, 2011 at 7:49 PM, Tyler Hobbs ty...@datastax.com wrote:

You are correct, but this is also a limitation with the Thrift API -- it's
not CQL specific.  It turns out that deleting a slice of columns is
difficult.  There's an old JIRA ticket somewhere that describes the 
issues.


On Sun, Apr 17, 2011 at 7:45 PM, Courtney Robinson sa...@live.co.uk 
wrote:


Looking at the CQL spec, it doesn't seem to be possible to delete a range
of columns for a given key without specifying the individual columns to be
removed, e.g.
DELETE col1 .. col20 FROM CF WHERE KEY=key|(key1,key2)
Am I correct in thinking so or have I missed that somewhere?




--
Tyler Hobbs
Software Engineer, DataStax
Maintainer of the pycassa Cassandra Python client library






--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com 



codeigniter+phpcassa

2011-04-07 Thread Courtney Robinson
For anyone using CodeIgniter and interested: I've written a little library
to integrate CodeIgniter with phpcassa and, consequently, Cassandra.
It provides you with access to CodeIgniter's $this->db instance that only
has the library's methods and phpcassa's.
Follow-up tutorial:
http://crlog.info/2011/04/07/apache-cassandra-phpcassa-code-igniter-large-scale-php-app-in-5-minutes/

Library download from
https://github.com/zcourts/cassandraci



Re: Designing a decent data model for an online music shop...confused/stuck on decisions

2011-03-07 Thread Courtney
Thanks for the response. I haven't checked on the status of phpcassa in a while,
but does it now work with 0.7?
That was one of the main reasons I switched to Pandra; it seemed more up to date.


From: Tyler Hobbs 
Sent: Monday, March 07, 2011 2:40 AM
To: user@cassandra.apache.org 
Subject: Re: Designing a decent data model for an online music 
shop...confused/stuck on decisions


Regarding PHP performance with Cassandra, THRIFT-638 was recently resolved and 
it shows some big performance improvements.  I'll be upgrading the Thrift 
package that ships with phpcassa soon to include this fix, so you may want to 
compare performance numbers before and after.


On Sun, Mar 6, 2011 at 8:03 PM, Courtney e-mailadr...@hotmail.com wrote:

  We're in a bit of a predicament: we have an e-music store currently built in
PHP using CodeIgniter/MySQL...
  The current system has 100K+ users and a decent song collection. Over the
last few months I've been playing with
  Cassandra... needless to say I'm impressed, but I have a few questions.
  Firstly, I want to avoid re-writing the entire site if possible, so my
instincts have made me inclined to replace the database layer
  in CodeIgniter... is this something anyone would recommend, and are there any
gotchas in doing that?

  I can't say I've been terribly happy with PHP accessing Cassandra. When
sample data of the same size/type was put into MySQL and into Cassandra,
  the pages with PHP connecting to Cassandra took longer to load (30K records
in the table).
  I thought maybe it was my setup that needed tweaking and I've played with
as many options as I could, but the best I've gotten is matching query time.
  The query speed test was simply taking timestamps right before and after the
query call returned...

  Is this something anyone else has seen? Any comments or suggestions? I've tried
using Thrift, phpcassa and Pandra with pretty similar numbers.

  My other thought was that maybe it was the way I designed my CFs: at first I
used super columns to model the user account CF, based on a post I read
  by Arin (WTF is a super column), but I later changed to using normal CFs.

  I'm trying to make this work but I get the feeling my approach is
somewhat... I don't know, misguided.

  Here's a breakdown of the current model.
  CF:Users{
  uid
  fname
  lname
  username
  password
  street
  
  }
  Some additional columns in place for a user but keeping it simple...
  CF:Library{
  uid
  songid
  ...
  other info about user library
  }

  CF:Songs{
  songid
  title
  artistid
  }

  This is all still very relational-like (considering I go on to have a CF for
playlists and artists) and I'm not sure if this is a good design for the data,
but... when I looked into
  combining some of the info and removing some CFs I ran into the issue of
replicating data all over the place. If, for example, I stored the artist name in
the library for each record,
  then the artist would be replicated for every song they have, for
every user who has that song in their library.

  Where do you draw the line on deciding how much is okay to be
replicated?

  As much as I don't like the idea of building the application from scratch,
I'm considering the possibility of rebuilding in Java/JSP just to
get the benefit of using
  the Hector client. (The effort from the guys doing the PHP libs is much
appreciated, but PHP doesn't seem to go too well with Cassandra.)

  I'm in the process of making decisions now because the upgrade/rebuild needs to
have a fairly steady working version for October, and I don't want to go wrong
before even starting.

  Recommendations, suggestions and advice are all welcome. (Any experience with
PHP and Cassandra is also welcome; since all my favourite libraries are in PHP,
I'm reluctant to turn away.)



-- 
Tyler Hobbs
Software Engineer, DataStax
Maintainer of the pycassa Cassandra Python client library



Designing a decent data model for an online music shop...confused/stuck on decisions

2011-03-06 Thread Courtney
We're in a bit of a predicament: we have an e-music store currently built in
PHP using CodeIgniter/MySQL...
The current system has 100K+ users and a decent song collection. Over the last
few months I've been playing with
Cassandra... needless to say I'm impressed, but I have a few questions.
Firstly, I want to avoid re-writing the entire site if possible, so my instincts
have made me inclined to replace the database layer
in CodeIgniter... is this something anyone would recommend, and are there any
gotchas in doing that?

I can't say I've been terribly happy with PHP accessing Cassandra. When sample
data of the same size/type was put into MySQL and into Cassandra,
the pages with PHP connecting to Cassandra took longer to load (30K records in
the table).
I thought maybe it was my setup that needed tweaking and I've played with as
many options as I could, but the best I've gotten is matching query time.
The query speed test was simply taking timestamps right before and after the
query call returned...

Is this something anyone else has seen? Any comments or suggestions? I've tried
using Thrift, phpcassa and Pandra with pretty similar numbers.

My other thought was that maybe it was the way I designed my CFs: at first I
used super columns to model the user account CF, based on a post I read
by Arin (WTF is a super column), but I later changed to using normal CFs.

I'm trying to make this work but I get the feeling my approach is
somewhat... I don't know, misguided.

Here's a breakdown of the current model.
CF:Users{
uid
fname
lname
username
password
street

}
Some additional columns in place for a user but keeping it simple...
CF:Library{
uid
songid
...
other info about user library
}

CF:Songs{
songid
title
artistid
}

This is all still very relational-like (considering I go on to have a CF for
playlists and artists) and I'm not sure if this is a good design for the data,
but... when I looked into
combining some of the info and removing some CFs I ran into the issue of
replicating data all over the place. If, for example, I stored the artist name in
the library for each record,
then the artist would be replicated for every song they have, for
every user who has that song in their library.

Where do you draw the line on deciding how much is okay to be
replicated?

As much as I don't like the idea of building the application from scratch,
I'm considering the possibility of rebuilding in Java/JSP just to
get the benefit of using
the Hector client. (The effort from the guys doing the PHP libs is much
appreciated, but PHP doesn't seem to go too well with Cassandra.)

I'm in the process of making decisions now because the upgrade/rebuild needs to
have a fairly steady working version for October, and I don't want to go wrong
before even starting.

Recommendations, suggestions and advice are all welcome. (Any experience with PHP
and Cassandra is also welcome; since all my favourite libraries are in PHP, I'm
reluctant to turn away.)

Re: Fwd: SPA2011 - June 12th-15th - BCS London, UK - Call for Sessions

2011-02-14 Thread Courtney Robinson

Anyone else in London interested in this?


--
From: Jonathan Ellis jbel...@gmail.com
Sent: Monday, February 14, 2011 10:30 PM
To: user user@cassandra.apache.org
Subject: Fwd: SPA2011 - June 12th-15th - BCS London, UK - Call for Sessions


In case any of the London crowd is interested:


-- Forwarded message --
From: Mike Hill mikewh...@gmail.com
Date: Mon, Feb 14, 2011 at 4:14 PM
Subject: SPA2011 - June 12th-15th - BCS London, UK - Call for Sessions
To: mikewh...@gmail.com mikewh...@gmail.com


SPA2011 - June 12th-15th - BCS London, UK - Call for Sessions

Submissions Deadline:  28th February 2011.

To find out more, and submit a proposal, visit
http://www.spaconference.org

We would like to invite you to present a session at this leading
software development conference. SPA2011 will continue the well
established SPA tradition of learning through interaction, with sessions
exploring the latest advancements in software development practice.

We're looking for sessions which are interactive, thought-provoking and
have not been seen before in this form (it may be a topic you've covered
before, but it must be truly interactive).

They can be about technology or teams, practice or process - in fact
anything to do with advancing the state of the practice in software
development. We welcome submissions from everyone; if you're not
experienced with presenting sessions at SPA you'll be supported by our
well-established shepherding process, which has ensured the standard
of sessions at SPA is exceptionally high.

Presenters will receive free attendance to the conference.
See the website for conditions.

Don't be shy! This year you can submit a rough proposal and get your
peers to give you feedback!

To find out more, and submit a proposal, visit
http://www.spaconference.org

The submission deadline is 28th February 2011.

Willem van den Ende & Rob Bowley
Programme Chairs SPA 2011
progra...@spaconference.org

--
You received this message because you are subscribed to the Google
Groups NOSQL group.
To post to this group, send email to nosql-discuss...@googlegroups.com.
To unsubscribe from this group, send email to
nosql-discussion+unsubscr...@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/nosql-discussion?hl=en.



--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com



0.7 PHP thrift example

2011-02-12 Thread Courtney Robinson
Does anyone have a working 0.7 Thrift example in PHP...?
I compiled the 0.5 version of Thrift and built the PHP bindings,
but when I try to run the PHP example on the wiki I get:

TException: Error: Attempt to send non-object type as a T_STRUCT


Re: TSocket timing out

2011-01-29 Thread Courtney Robinson
It may also be an idea to check the node's memory usage. I encountered this on
a few occasions and I simply killed
any unneeded processes that were eating away at my node's memory. In each instance it
worked fine once there was about 300MB of free memory.


From: Patricio Echagüe 
Sent: Sunday, January 30, 2011 12:46 AM
To: user@cassandra.apache.org 
Subject: Re: TSocket timing out


The recommendation is to wait a few milliseconds and retry.


For example, if you use Hector (I don't think that's your case), Hector will
retry against different nodes in your cluster, and the retry mechanism is tunable as
well.
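
A generic sketch of that wait-and-retry idea (plain Java, not tied to Hector or any
particular client; the helper name and its parameters are just illustrative):

import java.util.concurrent.Callable;

public class RetrySketch {
    // Runs the given operation, waiting a little and retrying on failure,
    // up to maxAttempts times; the delay doubles after each failed attempt.
    static <T> T withRetry(Callable<T> op, int maxAttempts) throws Exception {
        long delayMs = 20;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) throw e;
                Thread.sleep(delayMs);
                delayMs *= 2;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: wrap a batch of inserts in the retry helper.
        String result = withRetry(() -> "inserted 10k rows", 5);
        System.out.println(result);
    }
}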


On Sat, Jan 29, 2011 at 2:20 PM, buddhasystem potek...@bnl.gov wrote:


  When I do a lot of inserts into my cluster (10k at a time) I get timeouts
  from Thrift, the TSocket.py module.

  What do I do?

  Thanks,

  Maxim

  --
  View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/TSocket-timing-out-tp5973548p5973548.html
  Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.




Re: Cassandra for graph data structure

2010-09-27 Thread Courtney Robinson

Efficient, I'm not totally sure about yet; I would need to do some testing.
A quick mock-up should highlight any tradeoffs being made.
I think it's been decided we'll test it out by creating a simple client using
Thrift, and if the results look good we'll have an attempt using Hector as a layer.

--
From: Ran Tavory ran...@gmail.com
Sent: Saturday, September 25, 2010 1:41 PM
To: user@cassandra.apache.org
Subject: Re: Cassandra for graph data structure


Courtney this certainly sounds interesting and as Nate suggested we're
always looking for valuable contributions.
A few things to keep in mind:
- I'm curious, as Lucas has asked - is it possible to create an
efficient graph API over cassandra and what are the tradeoffs?
- If the API is general enough and the functionality is reusable then
we'd be happy to add it to hector. If not, you can create a library
that uses hector as a layer.

On Friday, September 24, 2010, Courtney Robinson sa...@live.co.uk wrote:

Nate & Lucas, thanks for the responses.
Nate, I think it would be asking a bit much to suggest the Hector team
implement convenience methods for a graph representation. But if we went
ahead and forked Hector, I'd be sure to contribute back what I can and just
release it as another client, or if the final product can be merged with
Hector...
I'd like thoughts on any features outside my own use case though, so that
we can build it to handle other use cases as well.

Lucas, I understand what you're saying, but I've had a quick play with
Neo4j, and the expense we'd pay for reads offsets a lot of the setbacks
I'd run into using Neo4j, not to mention having to learn it...

--
From: Nate McCall n...@riptano.com
Sent: Friday, September 24, 2010 4:14 PM
To: user@cassandra.apache.org
Subject: Re: Cassandra for graph data structure


My idea, however, was to fork Hector, remove all the stuff I don't need and
turn it into a graph API sitting on top of Cassandra.


We are always looking for ideas and design feedback regarding Hector.
Please feel free to make suggestions or fork and send pull requests.
http://groups.google.com/group/hector-users







Cassandra for graph data structure

2010-09-24 Thread Courtney Robinson
Apologies for the first e-mail with the misleading subject; I was reading a
thread and mistakenly replied.
I've been using Cassandra for a while now with no problems. I have a new
project coming up now that we're penciling out the data structure for.

The best we've come up with has turned into a graph structure. I'm just
wanting to know what people think, because I
know there are graph DBs out there like Neo4j etc.
My idea, however, was to fork Hector, remove all the stuff I don't need and
turn it into a graph API sitting on top of Cassandra.
The main reason for this approach is that I am already very familiar with
Cassandra and it will be faster to write a client or modify an existing
one than to learn a new API.

Do you think there are any gotchas in this approach? Any tips, pointers?
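
For context, one common way (though not necessarily what we'd end up with) to lay a
graph over Cassandra's model is an adjacency list: one row per vertex, one column per
outgoing edge, with any edge properties in the column value. A plain-Java sketch of
that shape, with made-up vertex ids and edge data:

import java.util.LinkedHashMap;
import java.util.Map;

public class AdjacencySketch {
    // Mirrors one CF: row key (vertex id) -> { column name (neighbour id) -> edge data }
    static final Map<String, Map<String, String>> edges = new LinkedHashMap<>();

    static void addEdge(String from, String to, String properties) {
        edges.computeIfAbsent(from, k -> new LinkedHashMap<>()).put(to, properties);
    }

    public static void main(String[] args) {
        addEdge("user:1", "user:2", "{\"type\":\"follows\"}");
        addEdge("user:1", "user:3", "{\"type\":\"follows\"}");
        // Walking a vertex's neighbours then maps to a single-row column slice.
        System.out.println(edges.get("user:1").keySet());
    }
}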


Re: Cassandra for graph data structure

2010-09-24 Thread Courtney Robinson

Nate & Lucas, thanks for the responses.
Nate, I think it would be asking a bit much to suggest the Hector team
implement convenience methods for a graph representation. But if we went
ahead and forked Hector, I'd be sure to contribute back what I can and just
release it as another client, or if the final product can be merged with
Hector...
I'd like thoughts on any features outside my own use case though, so that we
can build it to handle other use cases as well.

Lucas, I understand what you're saying, but I've had a quick play with Neo4j,
and the expense we'd pay for reads offsets a lot of the setbacks
I'd run into using Neo4j, not to mention having to learn it...

--
From: Nate McCall n...@riptano.com
Sent: Friday, September 24, 2010 4:14 PM
To: user@cassandra.apache.org
Subject: Re: Cassandra for graph data structure


My idea, however, was to fork Hector, remove all the stuff I don't need and
turn it into a graph API sitting on top of Cassandra.


We are always looking for ideas and design feedback regarding Hector.
Please feel free to make suggestions or fork and send pull requests.
http://groups.google.com/group/hector-users



column limit on multiget_slice or get_slice

2010-09-14 Thread Courtney Robinson
Is it possible to get the first x columns from a row without knowing the
column names?
So far I've been working with just grabbing all the columns in a row, or just
getting a specific column that I know the name of.
If it is possible, can anyone point me in the right direction on how to do
this?
I'm using 0.6.4 with the Thrift interface in Java. I use Hector, but I'd much
prefer knowing how it's done via Thrift first :)
thanks



Re: column limit on multiget_slice or get_slice

2010-09-14 Thread Courtney Robinson
Ahhh, excellent.
thank you


From: Chen Xinli 
Sent: Tuesday, September 14, 2010 10:51 AM
To: user@cassandra.apache.org 
Subject: Re: column limit on multiget_slice or get_slice


you can use get_slice:
public List<ColumnOrSuperColumn> get_slice(String keyspace, String key,
ColumnParent column_parent, SlicePredicate predicate, ConsistencyLevel
consistency_level) throws InvalidRequestException, UnavailableException,
TimedOutException, TException;

In the SlicePredicate.SliceRange, set start and finish to empty, count to x
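
For example, a rough sketch against the 0.6-era Thrift API (the keyspace, column
family and key names are just placeholders, and this assumes the default unframed
transport):

import java.util.List;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

public class FirstColumnsSketch {
    public static void main(String[] args) throws Exception {
        TSocket socket = new TSocket("localhost", 9160);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
        socket.open();

        // Empty start and finish mean "from the beginning of the row to the end";
        // count caps how many columns come back, in comparator order.
        SliceRange range = new SliceRange(new byte[0], new byte[0], false, 10);
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(range);

        List<ColumnOrSuperColumn> first10 = client.get_slice(
                "Keyspace1", "somekey", new ColumnParent("Standard1"),
                predicate, ConsistencyLevel.ONE);

        for (ColumnOrSuperColumn cosc : first10) {
            System.out.println(new String(cosc.getColumn().getName(), "UTF-8"));
        }
        socket.close();
    }
}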


2010/9/14 Courtney Robinson sa...@live.co.uk

  Is it possible to get the first x columns from a row without knowing the
column names?
  So far I've been working with just grabbing all the columns in a row, or just
getting a specific column that I know the name of.
  If it is possible, can anyone point me in the right direction on how to do
this?
  I'm using 0.6.4 with the Thrift interface in Java. I use Hector, but I'd much
prefer knowing how it's done via Thrift first :)
  thanks




-- 
Best Regards,
Chen Xinli


Row limits

2010-09-08 Thread Courtney Robinson
Are there any limits (implied or otherwise) on how many columns there can be
in a single row?
My understanding has always been that there is no limit on how many columns you
can have in a single row,
but I've just read Arin's WTF is a super column post again and I got the
impression he was saying that if it's a normal
row, i.e. not super, there is a limit, but if the said row is within a structure
of type Super then there is a potentially unbounded
number of columns to be had.

Is my original understanding correct, or have I just misinterpreted his article?
regards,
Courtney

Re: Few questions regarding cassandra deployment on windows

2010-09-07 Thread Courtney
I haven't looked at your previous e-mail(s) or the responses to them, but have a
look at
http://prettyprint.me/2010/02/14/running-cassandra-as-an-embedded-service/
The post was written by one of the guys who maintains the Hector Cassandra
client.

In any case, the simple and short answer is yes, he did it, so ...


From: kannan chandrasekaran 
Sent: Wednesday, September 08, 2010 1:20 AM
To: user@cassandra.apache.org 
Subject: Re: Few questions regarding cassandra deployment on windows


Can you please elaborate on why you think Cassandra would not be suitable for
this?

The main reasons why we're thinking of Cassandra:
1) We are focusing on moving to a distributed architecture very soon, and
using Cassandra as a backend naturally lends itself to this.
2) Our schema is relatively simple and we wanted quick read and write access.
Cassandra response times were faster than MySQL and we expect it to satisfy our
requirements (without the need for a cache layer).
3) I believe with 0.7's live schema updates, the need for changing the XML
files and restarting the service will go away, so I believe use case 2 is only
difficult in the 0.6 versions...

I am more interested in knowing if we can start/run/stop Cassandra as an
embedded service within a JVM.

Thanks
Kannan








From: Benjamin Black b...@b3k.us
To: user@cassandra.apache.org
Sent: Tue, September 7, 2010 4:38:41 PM
Subject: Re: Few questions regarding cassandra deployment on windows

This does not sound like a good application for Cassandra at all.  Why
are you using it?

On Tue, Sep 7, 2010 at 3:42 PM, kannan chandrasekaran
ckanna...@yahoo.com wrote:
 Hi All,

 We are currently considering Cassandra for our application.

 Platform:
 * a single-node cluster.
 * windows '08
 * 64-bit jvm

 For the sake of brevity, let
 Cassandra service = a single-node cassandra server running as an embedded
 service inside a JVM


 My use cases:
 1) Start with a schema (keyspace and set of column families under it) in a
 cassandra service.
 2) Need to be able to replicate the same schema structure (add new
 keyspaces/column families with different names, of course).
 3) Because of some existing limitations in my application, I need to be able
 to write to the keyspace/column families from one cassandra service and read
 the written changes from a different cassandra service. Both the write and
 the read cassandra services are sharing the same data directory. I
 understand that the application has to take care of any naming collisions.


 Couple of questions related to the above-mentioned use cases:
 1) I want to spawn a new JVM and launch Cassandra as an embedded service
 programmatically instead of using the startup.bat. I would like to know if
 that is possible, and any pointers in that direction would be really helpful.
 (use-case 1)
 2) I understand that there are provisions for live schema changes in 0.7
 (thank you guys!!!), but since I can't use a beta version in production, I am
 restricted to 0.6 for now. Is it possible to support use-case 2 in 0.6.5?
 More specifically, I am planning to make runtime changes to the
 storage-conf.xml file followed by a cassandra service restart.
 3) Can I switch the data directory at run-time? (use-case 3) In order to
 not disrupt reads while the writes are in progress, I am thinking something
 like: copy the existing data dir into a new location; write to a new data
 directory; once the write is complete, switch pointers and restart the
 cassandra service to read from the new directory to pick up the updated
 changes.

 Any help is greatly appreciated.

 Thanks
 Kannan






indexing methods

2010-09-03 Thread Courtney Robinson
A few of us are working on a book for Cassandra and got to the point where we (well,
I did anyway) wanted to include an example of a non-trivial inverted index.

I've been playing around with different ideas on how I could store the data,
and I've had a look at the previous threads that touched on the subject, but
with the 2 or 3 ideas I've seen on the list someone always points out something
in the approach that punches a hole in it.

I've been playing around with the idea of using a column family for the index
where I store the terms as keys; each column name is a 64-bit long and
its value is the doc id. The column name represents a ranking for the doc id
it stores, and since the compare-with option is LongType, once a term's row is
retrieved the first x columns would represent the most related docs for that
term.

I'd go on in more detail but I'm using my phone to write this, and I think that
gets the idea across.
Of course my first thought about this is: is it scalable? In a system where
possibly millions of docs are related to one term, is it a good idea to have
potentially that many columns in one row, all associated with the one row key,
which is the term?
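
For reference, here's roughly what adding one posting to such a row might look like
against the 0.6-era Thrift API. The keyspace/CF names, and encoding the rank as the
8-byte LongType column name, are illustrative assumptions, not settled decisions.

import java.nio.ByteBuffer;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

public class TermIndexSketch {
    // Adds docId under the given term. The 8-byte big-endian rank becomes the
    // LongType column name, so a slice over the row comes back in rank order
    // (ascending; read with reversed=true if a higher rank should come first).
    static void addPosting(Cassandra.Client client, String term, long rank, String docId)
            throws Exception {
        byte[] columnName = ByteBuffer.allocate(8).putLong(rank).array();
        ColumnPath path = new ColumnPath("TermIndex");
        path.setColumn(columnName);
        client.insert("Keyspace1", term, path, docId.getBytes("UTF-8"),
                System.currentTimeMillis(), ConsistencyLevel.QUORUM);
    }

    public static void main(String[] args) throws Exception {
        TSocket socket = new TSocket("localhost", 9160);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
        socket.open();
        addPosting(client, "cassandra", 1L, "doc-42");
        socket.close();
    }
}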

I just want to know what others think, if you have any suggestions or have a
similar thing implemented that you're able to share.

On a side note, there has been a bit of talk about secondary indexes in
0.7. Can anyone shed some light on that, or point me to a presentation or the
like where it's mentioned, so I can get a better idea of what it's for?

Thanks,
Courtney