Re: Help for choice

2010-02-24 Thread Nathan McCall
The workload you originally described does not sound like a difficult
job for a relational database. Do you have any more information on the
specifics of your access patterns and where you feel that an RDBMS
might fall short?

-Nate

On Tue, Feb 23, 2010 at 11:27 PM, Cemal cemalettin@gmail.com wrote:
 I was not really expecting such an answer. :)
 Any other idea?

 On Wed, Feb 24, 2010 at 2:51 AM, Tatu Saloranta tsalora...@gmail.com
 wrote:

 Very funny! I assume this is related to MySQL's somewhat spotty record
 of actually conforming to the SQL standard, right? ;-D
 (the NoSQL solution part)



Re: Help for choice

2010-02-24 Thread Cemal
Hi,

Maybe I should mention that we are very eager to evaluate NoSQL approaches, and
for a simple case we want to evaluate and compare each approach.

In our case, our data has not been denormalized yet and we are
suffering from a lot of joins. Because of very frequent updates to the joined
tables, we have great performance problems in some situations. Another
difficulty we are dealing with is scaling. So far we have been using a
master-slave model, but in the near future it seems we will run into a
lot of problems.

By the way, I tried to find an article about the use cases, pros, and cons of
each NoSQL solution, but I could not find a detailed explanation of them.

Thanks



On Wed, Feb 24, 2010 at 10:15 AM, Nathan McCall n...@vervewireless.com wrote:

 The workload you originally described does not sound like a difficult
 job for a relational database. Do you have any more information on the
 specifics of your access patterns and where you feel that an RDBMS
 might fall short?

 -Nate




Anti-compaction Diskspace issue even when latest patch applied

2010-02-24 Thread shiv shivaji
For about 6 TB of total data with a replication factor of 2 (6TB x 2) on a
five-node cluster, I see about 4.6 TB on one machine (due to potential past
problems with other machines). The machine has a 6 TB disk.

The data folder on this machine has 59,289 files totaling 4.6 TB. The files are
the data, filter, and index files. I see that anti-compaction is running. I
applied a recent patch which skips anti-compaction if disk space is limited,
but I still see it happening. I have also called nodetool loadbalance on this
machine. It seems like it will run out of disk space anyway.

The disk space consumed per machine is as follows (each machine has a 6 TB
hard drive on RAID):

Machine   Space Consumed
M1        4.47 TB
M2        2.93 TB
M3        1.83 GB
M4        56.19 GB
M5        398.01 GB

How can I force M1 to immediately move its load to M3 and M4, for instance (or
any other machine)? The nodetool move command moves all data; is there instead
a way to force-move 50% of the data to M3 and the remaining 50% to M4, and
resume anti-compaction after the move?

Thanks, Shiv

Re: Help for choice

2010-02-24 Thread Nathan McCall
I found the following helpful:
http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/
http://00f.net/2009/an-overview-of-modern-sql-free-databases/comments/507
http://cacm.acm.org/blogs/blog-cacm/50678-the-nosql-discussion-has-nothing-to-do-with-sql/fulltext

There is enough variation in the designs of NoSQL systems that the
only way to really compare them is to take some realistic sample of
your data and how it is accessed and see how each system performs.

I like Cassandra because of its focus on partition tolerance and
availability in exchange for eventual consistency (see
http://camelcase.blogspot.com/2007/08/cap-theorem.html for more on
this concept).

Cheers,
-Nate

On Wed, Feb 24, 2010 at 12:53 AM, Cemal cemalettin@gmail.com wrote:
 Hi,
 Maybe I should mention that we are very eager to evaluate NoSQL approaches,
 and for a simple case we want to evaluate and compare each approach.
 In our case, our data has not been denormalized yet and we are
 suffering from a lot of joins. Because of very frequent updates to the joined
 tables, we have great performance problems in some situations. Another
 difficulty we are dealing with is scaling. So far we have been using a
 master-slave model, but in the near future it seems we will run into a
 lot of problems.
 By the way, I tried to find an article about the use cases, pros, and cons of
 each NoSQL solution, but I could not find a detailed explanation of them.
 Thanks


 On Wed, Feb 24, 2010 at 10:15 AM, Nathan McCall n...@vervewireless.com
 wrote:

 The workload you originally described does not sound like a difficult
 job for a relational database. Do you have any more information on the
 specifics of your access patterns and where you feel that an RDBMS
 might fall short?

 -Nate




Re: reads are slow

2010-02-24 Thread kevin
On Tue, Feb 23, 2010 at 10:06 AM, Jonathan Ellis jbel...@gmail.com wrote:

 the standard workaround is to change your data model to use non-super
 columns instead.

 supercolumns are really only for relatively small numbers of
 subcolumns until 598 is addressed.

Is there any limit on the number of supercolumns I can have?


Re: Help for choice

2010-02-24 Thread alex kamil
Cemal,

I've found the following analysis very helpful; it compares various NoSQL
options and gives the pros and cons of RDBMS vs. NoSQL:
No Relation: The Mixed Blessings of Non-Relational Databases by Ian Varley
http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf

-Alex

On Wed, Feb 24, 2010 at 6:06 AM, Francois Orsini
francois.ors...@gmail.com wrote:

 Chris's suggestion of MySQL does make a lot of sense, indeed. Based on the
 data you provided:

 - 5-6 million rows is not considered a very large database.

 - 1,000 row updates per minute (even with 4 indexes) should not be a
 problem for sure. You can actually achieve 1.5-2k updates per sec easily
 with MySQL and 2+ indexes.

 - MySQL Master-Slave replication works quite well - sure, slaves can get
 behind, but with 5.1 this is even less of a problem (replication is no longer
 single-threaded). In 5.0, you can compensate by using SSD drives on the
 slaves and using prefetch techniques (e.g. google for
 'mk-slave-prefetch').

 - Be aware that you'd better have a good case for moving to Cassandra, as you
 will be giving up the declarative and expressive power of SQL.
   . data model paradigm shift (think in terms of queries (NoSQL) rather
 than relations in the case of SQL)
   . No free lunch in terms of multi-indexing, complex queries, etc.
   . Eventual consistency vs strict consistency and the difference in
 performance cost in Cassandra. I suspect you understand this issue if you
 are dealing with slaves falling behind with MySQL ;-)

 - On the other hand, Cassandra is great for:
   . Very intensive write(s) applications
   . No single point of failure / automatic fail-over
   . Load balancing
   . Great read throughput - keep in mind that with a great set-up, you can
 achieve 10k reads / sec with MySQL.
   . Horizontal scaling

 Disclaimer: I'm not MySQL-biased, and not in love with it either; we use
 both Cassandra and MySQL (as in NotOnlySQL), but there is a point where MySQL
 (and sharding) will be too darn challenging and difficult to maintain and
 evolve. The move comes with a price and some trade-offs, so just be certain
 you really need to make that jump (and/or use both) based on your requirements
 (in the short and long terms).


 On Wed, Feb 24, 2010 at 12:53 AM, Cemal cemalettin@gmail.com wrote:

 Hi,

 Maybe I should mention that we are very eager to evaluate NoSQL approaches,
 and for a simple case we want to evaluate and compare each approach.

 In our case, our data has not been denormalized yet and we are
 suffering from a lot of joins. Because of very frequent updates to the joined
 tables, we have great performance problems in some situations. Another
 difficulty we are dealing with is scaling. So far we have been using a
 master-slave model, but in the near future it seems we will run into a
 lot of problems.

 By the way, I tried to find an article about the use cases, pros, and cons of
 each NoSQL solution, but I could not find a detailed explanation of them.

 Thanks



 On Wed, Feb 24, 2010 at 10:15 AM, Nathan McCall 
 n...@vervewireless.com wrote:

 The workload you originally described does not sound like a difficult
 job for a relational database. Do you have any more information on the
 specifics of your access patterns and where you feel that an RDBMS
 might fall short?

 -Nate





Re: reads are slow

2010-02-24 Thread Jonathan Ellis
only the total row size limit (must fit in memory during compaction)
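
(If you do switch to standard columns, the usual trick is to fold the two
levels into one by concatenating the names with a separator. A rough sketch,
with made-up names, against the 0.5-era Thrift interface; double-check the
generated signatures:

    // instead of row -> supercolumn "user123" -> subcolumn "email",
    // write a single standard column named "user123:email"
    ColumnPath path = new ColumnPath("Standard1", null,
        "user123:email".getBytes("UTF-8"));
    client.insert("Keyspace1", rowKey, path, value,
        System.currentTimeMillis(), ConsistencyLevel.ONE);

A column slice from "user123:" to "user123:~" then reads back everything that
used to live in that one supercolumn.)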

On Wed, Feb 24, 2010 at 7:47 AM, kevin kevincastigli...@gmail.com wrote:
 On Tue, Feb 23, 2010 at 10:06 AM, Jonathan Ellis jbel...@gmail.com wrote:

 the standard workaround is to change your data model to use non-super
 columns instead.

 supercolumns are really only for relatively small numbers of
 subcolumns until 598 is addressed.

 Is there any limit on the number of supercolumns I can have?



Getting the keys in your system?

2010-02-24 Thread Erik Holstad
Suppose you have a system set up using the RandomPartitioner, with a couple of
indexes set up for your data, but then you realize that you need to add another
index. How do you get the keys for your data, so that you know where to point
your indexes? I guess what I'm really asking is: is there a way to get your
keys when using the RP, or how do people out there deal with something like
this?

-- 
Regards Erik


Re: Getting the keys in your system?

2010-02-24 Thread Jonathan Ellis
0.6 adds Hadoop support for exactly this scenario (among others).

You can also use get_range_slice to iterate all keys against RP in
0.6, but it will be slow since it is difficult to parallelize
manually.
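
The iteration itself is just a loop that feeds the last key seen back in as
the next start key. A rough sketch against the 0.5-style interface (names
from memory; verify against your generated client):

    SlicePredicate pred = new SlicePredicate();
    pred.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 1));
    String start = "";
    while (true) {
        List<KeySlice> batch = client.get_range_slice("Keyspace1",
            new ColumnParent("Standard1", null), pred, start, "", 1000,
            ConsistencyLevel.ONE);
        for (KeySlice ks : batch)
            reindex(ks.getKey());   // your new-index logic goes here
        if (batch.size() < 1000)
            break;
        // the start key is inclusive, so the last key of this batch
        // will come back as the first key of the next one; dedupe it
        start = batch.get(batch.size() - 1).getKey();
    }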

-Jonathan

On Wed, Feb 24, 2010 at 9:23 AM, Erik Holstad erikhols...@gmail.com wrote:
 Suppose you have a system set up using the RandomPartitioner, with a couple of
 indexes set up for your data, but then you realize that you need to add
 another index. How do you get the keys for your data, so that you know where
 to point your indexes? I guess what I'm really asking is: is there a way to
 get your keys when using the RP, or how do people out there deal with
 something like this?

 --
 Regards Erik



Re: Getting the keys in your system?

2010-02-24 Thread Erik Holstad
Thanks Jonathan!
We are thinking about moving over to the OPP to be able to do this,
and to use an md5 hash for some of the data, just to get the data written to
different nodes in the cases where order is not really needed. Is there
anything we need to think about when making the switch, or any big drawbacks
in doing so?
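
(By using an md5 I mean something like prefixing the natural key with its
hash, so that OPP still spreads those rows across nodes; an illustrative
sketch, with Hex.encodeHexString from commons-codec:

    MessageDigest md5 = MessageDigest.getInstance("MD5");
    String storedKey =
        Hex.encodeHexString(md5.digest(naturalKey.getBytes("UTF-8")))
        + ":" + naturalKey;

Data that needs range scans would keep its natural key, and everything else
would get the hashed form.)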

-- 
Regards Erik


Re: Getting the keys in your system?

2010-02-24 Thread Jonathan Ellis
Other than you'll have to completely reload all your data when
changing partitioners, no, not much to think about. :)

On Wed, Feb 24, 2010 at 9:38 AM, Erik Holstad erikhols...@gmail.com wrote:
 Thanks Jonathan!
 We are thinking about moving over to the OPP to be able to do this,
 and to use an md5 hash for some of the data, just to get the data written to
 different nodes in the cases where order is not really needed. Is there
 anything we need to think about when making the switch, or any big drawbacks
 in doing so?

 --
 Regards Erik



Re: Getting the keys in your system?

2010-02-24 Thread Erik Holstad
Haha!
Yeah, fortunately we are only in the testing phase so this is not that big
of a deal.
Thanks a lot!

-- 
Regards Erik


Re: Cassandra paging, gathering stats

2010-02-24 Thread Wojciech Kaczmarek
Btw,

does get_range_slice support reversed=true for keys (not column
predicates)? In 0.5 it seems not.

On Tue, Feb 23, 2010 at 21:28, Jonathan Ellis jbel...@gmail.com wrote:
 you'd actually use first column as start, empty finish,
 count=pagesize, and reversed=True, unless I'm misunderstanding
 something.

 On Tue, Feb 23, 2010 at 1:57 PM, Brandon Williams dri...@gmail.com wrote:
 On Tue, Feb 23, 2010 at 11:54 AM, Sonny Heer sonnyh...@gmail.com wrote:

  Columns can easily be paginated via the 'start' and 'finish' parameters.
   You can't jump to a random page, but you can provide next/previous
  behavior.

 Do you have an example of this?  From a client, they can pass in the
 last key, which can then be used as the start with some predefined
 count.  But how can you do previous?

 To go backwards, you pass the first column seen as the finish parameter and
 use an empty start parameter with an appropriate count.
 -Brandon



Re: Cassandra paging, gathering stats

2010-02-24 Thread Jonathan Ellis
It does not.  Someone would need it badly enough to code it first. :)
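
Reversal at the column level does work today, though. A rough sketch of the
backward paging described below, against the Thrift API (names from memory):

    // previous page: the first column seen becomes the start, reversed=true
    SlicePredicate pred = new SlicePredicate();
    pred.setSlice_range(new SliceRange(firstColumnSeen, new byte[0],
        true, pageSize));
    List<ColumnOrSuperColumn> page = client.get_slice("Keyspace1", rowKey,
        new ColumnParent("Standard1", null), pred, ConsistencyLevel.ONE);
    // columns come back in reverse comparator order; flip them for display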

On Wed, Feb 24, 2010 at 10:26 AM, Wojciech Kaczmarek
kaczmare...@gmail.com wrote:
 Btw,

 does get_range_slice support reversed=true for keys (not column
 predicates)? In 0.5 it seems not.

 On Tue, Feb 23, 2010 at 21:28, Jonathan Ellis jbel...@gmail.com wrote:
 you'd actually use first column as start, empty finish,
 count=pagesize, and reversed=True, unless I'm misunderstanding
 something.

 On Tue, Feb 23, 2010 at 1:57 PM, Brandon Williams dri...@gmail.com wrote:
 On Tue, Feb 23, 2010 at 11:54 AM, Sonny Heer sonnyh...@gmail.com wrote:

  Columns can easily be paginated via the 'start' and 'finish' parameters.
   You can't jump to a random page, but you can provide next/previous
  behavior.

 Do you have an example of this?  From a client, they can pass in the
 last key, which can then be used as the start with some predefined
 count.  But how can you do previous?

 To go backwards, you pass the first column seen as the finish parameter and
 use an empty start parameter with an appropriate count.
 -Brandon




Bulk Ingestion Issues

2010-02-24 Thread Sonny Heer
I have a single box and am trying to ingest some data into a single
keyspace with 5 CFs.  Basically it reads text files from a directory
and inserts them into Cassandra.  I've set the BinaryMemtableSizeInMB to
64. For some reason I'm not getting all my data into Cassandra.  I get
some ingested, but very little.  Is this because I'm only using a
single box and it can't handle the load? There is an exception when
the ingest is about to finish. Here is the output from a clean startup
to the end:


:~/apache-cassandra-incubating-0.5.0$ bin/cassandra -f
Listening for transport dt_socket at address: 
INFO - Saved Token not found. Using eGsC7VsC6xz0uskJ
INFO - Starting up server gossip
INFO - Cassandra starting up...
INFO - Node /127.0.0.1 is now part of the cluster
INFO - InetAddress /127.0.0.1 is now UP
INFO - Enqueuing flush of org.apache.cassandra.db.binarymemta...@17b650a
INFO - Sorting org.apache.cassandra.db.binarymemta...@17b650a
INFO - Writing org.apache.cassandra.db.binarymemta...@17b650a
INFO - Enqueuing flush of org.apache.cassandra.db.binarymemta...@ec44cb
INFO - Sorting org.apache.cassandra.db.binarymemta...@ec44cb
INFO - Completed flushing
/var/lib/cassandra/data/Keyspace1/ColumnFamily4-1-Data.db
INFO - Writing org.apache.cassandra.db.binarymemta...@ec44cb
INFO - Completed flushing
/var/lib/cassandra/data/Keyspace1/ColumnFamily3-1-Data.db
INFO - Enqueuing flush of org.apache.cassandra.db.binarymemta...@b11287
INFO - Sorting org.apache.cassandra.db.binarymemta...@b11287
INFO - Writing org.apache.cassandra.db.binarymemta...@b11287
INFO - Completed flushing
/var/lib/cassandra/data/Keyspace1/ColumnFamily2-1-Data.db
INFO - Enqueuing flush of org.apache.cassandra.db.binarymemta...@1687dcd
INFO - Sorting org.apache.cassandra.db.binarymemta...@1687dcd
INFO - Writing org.apache.cassandra.db.binarymemta...@1687dcd
INFO - Completed flushing
/var/lib/cassandra/data/Keyspace1/ColumnFamily1-1-Data.db
INFO - Enqueuing flush of org.apache.cassandra.db.binarymemta...@137bc9
INFO - Sorting org.apache.cassandra.db.binarymemta...@137bc9
INFO - Writing org.apache.cassandra.db.binarymemta...@137bc9
INFO - Completed flushing
/var/lib/cassandra/data/Keyspace1/ColumnFamily4-2-Data.db
INFO - Enqueuing flush of org.apache.cassandra.db.binarymemta...@1d4f6b4
INFO - Sorting org.apache.cassandra.db.binarymemta...@1d4f6b4
INFO - Writing org.apache.cassandra.db.binarymemta...@1d4f6b4
INFO - Completed flushing
/var/lib/cassandra/data/Keyspace1/ColumnFamily3-2-Data.db
INFO - Enqueuing flush of org.apache.cassandra.db.binarymemta...@1ac9cff
INFO - Sorting org.apache.cassandra.db.binarymemta...@1ac9cff
INFO - Writing org.apache.cassandra.db.binarymemta...@1ac9cff
INFO - Enqueuing flush of org.apache.cassandra.db.binarymemta...@11c9fcc
INFO - Sorting org.apache.cassandra.db.binarymemta...@11c9fcc
INFO - Completed flushing
/var/lib/cassandra/data/Keyspace1/ColumnFamily5-1-Data.db
INFO - Writing org.apache.cassandra.db.binarymemta...@11c9fcc
INFO - Completed flushing
/var/lib/cassandra/data/Keyspace1/ColumnFamily2-2-Data.db
INFO - Enqueuing flush of org.apache.cassandra.db.binarymemta...@16a90c9
INFO - Sorting org.apache.cassandra.db.binarymemta...@16a90c9
INFO - Writing org.apache.cassandra.db.binarymemta...@16a90c9
INFO - Completed flushing
/var/lib/cassandra/data/Keyspace1/ColumnFamily1-2-Data.db
INFO - Enqueuing flush of org.apache.cassandra.db.binarymemta...@118bd3c
INFO - Sorting org.apache.cassandra.db.binarymemta...@118bd3c
INFO - Writing org.apache.cassandra.db.binarymemta...@118bd3c
INFO - Completed flushing
/var/lib/cassandra/data/Keyspace1/ColumnFamily4-3-Data.db
INFO - Enqueuing flush of org.apache.cassandra.db.binarymemta...@7a9bff
INFO - Sorting org.apache.cassandra.db.binarymemta...@7a9bff
INFO - Writing org.apache.cassandra.db.binarymemta...@7a9bff
INFO - Completed flushing
/var/lib/cassandra/data/Keyspace1/ColumnFamily3-3-Data.db
INFO - Enqueuing flush of org.apache.cassandra.db.binarymemta...@106bcba
INFO - Sorting org.apache.cassandra.db.binarymemta...@106bcba
INFO - Writing org.apache.cassandra.db.binarymemta...@106bcba
INFO - Completed flushing
/var/lib/cassandra/data/Keyspace1/ColumnFamily2-3-Data.db
INFO - Enqueuing flush of org.apache.cassandra.db.binarymemta...@1ad552c
INFO - Sorting org.apache.cassandra.db.binarymemta...@1ad552c
INFO - Writing org.apache.cassandra.db.binarymemta...@1ad552c
INFO - Completed flushing
/var/lib/cassandra/data/Keyspace1/ColumnFamily1-3-Data.db
INFO - Enqueuing flush of org.apache.cassandra.db.binarymemta...@272111
INFO - Sorting org.apache.cassandra.db.binarymemta...@272111
INFO - Writing org.apache.cassandra.db.binarymemta...@272111
INFO - Completed flushing
/var/lib/cassandra/data/Keyspace1/ColumnFamily4-4-Data.db
INFO - Compacting

Re: import data into cassandra

2010-02-24 Thread Jonathan Ellis
I suggest getting it working via plain Thrift calls before trying
anything fancy.  Otherwise it's probably premature optimization.
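
A bare-bones insert looks something like this (0.5-era Java interface, from
memory, so double-check the generated names):

    TTransport tr = new TSocket("localhost", 9160);
    Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(tr));
    tr.open();
    // one column, in row "row1", in the col1 column family
    client.insert("Keyspace1", "row1",
        new ColumnPath("col1", null, "field1".getBytes("UTF-8")),
        "some value".getBytes("UTF-8"), System.currentTimeMillis(),
        ConsistencyLevel.ONE);
    tr.close();

Once that works for your CSV rows you can move to batch_insert, and only then
to sstable-level loading if it's still too slow.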

On Wed, Feb 24, 2010 at 11:43 AM, Martin Probst ser...@preisroboter.de wrote:
 Hi,

 I'm playing around a little bit with Cassandra and trying to load some data
 into it. I've found the sstable2json and json2sstable scripts inside the /bin
 dir and tried to work with these scripts. I wrote a wrapper which transforms
 CSVs into a JSON file, and the JSON validator throws no failures. But every
 time I try to import the JSON, an exception is thrown:

 host:/opt/cassandra# bin/json2sstable -K Keyspace1 -c col1 
 ../utf8_cassandra.json data/Keyspace1/col1-2-Data.db
 Exception in thread "main" java.lang.NumberFormatException: For input string:
 "PR"
        at 
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Integer.parseInt(Integer.java:447)
        at 
 org.apache.cassandra.utils.FBUtilities.hexToBytes(FBUtilities.java:255)
        at 
 org.apache.cassandra.tools.SSTableImport.addToStandardCF(SSTableImport.java:89)
        at 
 org.apache.cassandra.tools.SSTableImport.importJson(SSTableImport.java:156)
        at 
 org.apache.cassandra.tools.SSTableImport.main(SSTableImport.java:207)

 The Keyspace is configured as follows:
    <Keyspace Name="Keyspace1">
      <ColumnFamily CompareWith="UTF8Type" Name="col1" Comment="some data"/>
    </Keyspace>

 Is there another way to import some data, maybe a tool or something? I've
 used the latest stable Cassandra version (0.5.0).

 Thanks
 Martin


Re: Bulk Ingestion Issues

2010-02-24 Thread Sonny Heer
Sorry for being unclear.  Yes, I have flushed and compacted the data
in that keyspace.  I'm still not getting all the results expected.
Any idea what that exception is about?

On Wed, Feb 24, 2010 at 9:50 AM, Jonathan Ellis jbel...@gmail.com wrote:
 Okay, so you are using BinaryMemtable; that wasn't 100% clear.

 With BMT you need to manually flush when you are done loading; the
 data isn't live until it has been converted to an sstable.

 On Wed, Feb 24, 2010 at 11:45 AM, Sonny Heer sonnyh...@gmail.com wrote:

 On what symptom are you basing that conclusion?


 I've ingested the same data using the Java Thrift API, ran queries
 against that set, and I'm getting different results when I ingest it
 using the StorageService (CassandraBulkLoader without Hadoop) method.
 The result set is much smaller.  The reason I'm using the bulk load
 is that it is considerably faster.




Re: import data into cassandra

2010-02-24 Thread Eric Evans
On Wed, 2010-02-24 at 18:43 +0100, Martin Probst wrote:
 host:/opt/cassandra# bin/json2sstable -K Keyspace1 -c
 col1 ../utf8_cassandra.json data/Keyspace1/col1-2-Data.db 
 Exception in thread "main" java.lang.NumberFormatException: For input
 string: "PR"
 at
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
 at java.lang.Integer.parseInt(Integer.java:447)
 at
 org.apache.cassandra.utils.FBUtilities.hexToBytes(FBUtilities.java:255)
 at
 org.apache.cassandra.tools.SSTableImport.addToStandardCF(SSTableImport.java:89)
 at
 org.apache.cassandra.tools.SSTableImport.importJson(SSTableImport.java:156)
 at
 org.apache.cassandra.tools.SSTableImport.main(SSTableImport.java:207) 
 
 The Keyspace is configured as follows:
    <Keyspace Name="Keyspace1">
      <ColumnFamily CompareWith="UTF8Type" Name="col1" Comment="some data"/>
    </Keyspace>

This is because hex strings are used to represent byte arrays in the
JSON format (i.e., the ASCII string 'PR' would be turned into
'5052').
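
In Java terms, your CSV-to-JSON wrapper needs to emit the output of
something like this little helper (hypothetical, just to show the encoding)
instead of the raw strings:

    // 'P' is 0x50 and 'R' is 0x52, so toHex("PR".getBytes()) returns "5052"
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes)
            sb.append(String.format("%02x", b));
        return sb.toString();
    }

Run the column names and values through that and json2sstable should stop
throwing NumberFormatException.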

 Is there another way to import some data, maybe a tool or something?
 I've used the latest stable cassandra version (0.5.0).

As Jonathan stated, your best bet is to tackle this using the Thrift
interface first.

-- 
Eric Evans
eev...@rackspace.com



Re: Wiki permission denied

2010-02-24 Thread Jonathan Ellis
Pinged #asfinfra; looks like they fixed it.

On Wed, Feb 24, 2010 at 11:09 AM, Mark Robson mar...@gmail.com wrote:
 Hiya,

 I'm looking at

 http://wiki.apache.org/cassandra/RecentChanges

 And there's an error.

 Can someone look into it please?

 Ta

 Mark



Understanding Bootstrapping

2010-02-24 Thread Anthony Molinaro
Hi,

  I had to add a few more nodes to my cluster yesterday; so far 2 of the 3
have finished bootstrapping (at least as far as I can tell: they show up
via a ring command in the UP state; the 3rd does not show up at all in the
ring command).  I'm curious when the 3rd will finish, so I was wondering if
there is any way to gauge this.

From what I can tell, on some nodes I have a stream directory which has
4 files in it, and running tpstats against that node shows the STREAM-STATE
pool with 1 active and 3 pending, so I'm assuming this means those 4
files are being streamed from the machine somewhere.  However, I don't
see any corresponding files on the bootstrapping machine, so I can't
be sure they are going there.  I do see some commit log activity on
the bootstrapping machine (i.e., the file is growing slowly).  So do all
bootstrapped entries flow through the commit log?  If not, where is the
data streamed to?

Thanks,

-Anthony

-- 

Anthony Molinaro   antho...@alumni.caltech.edu


full text search

2010-02-24 Thread Mohammad Abed
Any suggestions on how to pursue full text search with Cassandra, what
options are out there?

Thanks.


Adjusting Token Spaces and Rebalancing Data

2010-02-24 Thread Jon Graham
Hello,

I have a 6-node Cassandra 0.5.0 cluster
using org.apache.cassandra.dht.OrderPreservingPartitioner with replication
factor 3.

I mistakenly set my tokens to the wrong values, and have all the data being
stored on the first node (with replicas on the second and third nodes).

Does Cassandra have any tools to reset the token values and re-distribute
the data?

Thanks for your help,
Jon


Re: full text search

2010-02-24 Thread Mohammad Abed
Are either of these solutions used in any production environment?



On Wed, Feb 24, 2010 at 3:54 PM, Brandon Williams dri...@gmail.com wrote:

 On Wed, Feb 24, 2010 at 5:41 PM, Mohammad Abed mohammad.a...@gmail.com wrote:

 Any suggestions on how to pursue full text search with Cassandra, what
 options are out there?


 Also: http://github.com/tjake/Lucandra

 -Brandon



Re: full text search

2010-02-24 Thread Nathan McCall
The following paper on the Articles and Presentations section of the
Cassandra wiki describes Facebook's inbox search implementation:
http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf

-Nate

On Wed, Feb 24, 2010 at 4:45 PM, Mohammad Abed mohammad.a...@gmail.com wrote:
 Are either of these solutions used in any production environment?



 On Wed, Feb 24, 2010 at 3:54 PM, Brandon Williams dri...@gmail.com wrote:

 On Wed, Feb 24, 2010 at 5:41 PM, Mohammad Abed mohammad.a...@gmail.com
 wrote:

 Any suggestions on how to pursue full text search with Cassandra, what
 options are out there?

 Also: http://github.com/tjake/Lucandra
 -Brandon



Re: full text search

2010-02-24 Thread Mohammad Abed
You might want to keep an eye on the thread

http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg02674.html

Also, somebody wrote:

Lucandra powers http://sparse.ly





On Wed, Feb 24, 2010 at 5:00 PM, Brandon Williams dri...@gmail.com wrote:

 On Wed, Feb 24, 2010 at 6:45 PM, Mohammad Abed mohammad.a...@gmail.comwrote:

 Are either of these solutions used in any production environment?


 Lucandra powers http://sparse.ly

 -Brandon



Re: cassandra freezes

2010-02-24 Thread Jonathan Ellis
On Wed, Feb 24, 2010 at 8:46 PM, Santal Li santal...@gmail.com wrote:
 BTW: somebody on my team told me that if the data Cassandra manages is too
 huge (>15x the heap space), it will cause performance issues; is this true?

It really has more to do with what your hot data set is than with absolute
size.

Once any system becomes I/O-bound because the hot set can't be cached
in OS buffers, you're going to be in trouble; there's nothing magic
about that. :)

-Jonathan


Re: Adjusting Token Spaces and Rebalancing Data

2010-02-24 Thread Jonathan Ellis
nodeprobe loadbalance and/or nodeprobe move

http://wiki.apache.org/cassandra/Operations
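
With OPP your tokens are key strings, so it's something along these lines
(host names and tokens purely illustrative; pick tokens that actually split
your key range evenly):

    bin/nodeprobe -host node2 move d
    bin/nodeprobe -host node3 move h
    bin/nodeprobe -host node4 move m

loadbalance does roughly the same thing automatically, one node at a time.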

On Wed, Feb 24, 2010 at 6:17 PM, Jon Graham sjclou...@gmail.com wrote:
 Hello,

 I have a 6-node Cassandra 0.5.0 cluster
 using org.apache.cassandra.dht.OrderPreservingPartitioner with replication
 factor 3.

 I mistakenly set my tokens to the wrong values, and have all the data being
 stored on the first node (with replicas on the second and third nodes).

 Does Cassandra have any tools to reset the token values and re-distribute
 the data?

 Thanks for your help,
 Jon


Re: Understanding Bootstrapping

2010-02-24 Thread Jonathan Ellis
Bootstrap files are streamed directly to data locations as .tmp files
and renamed when complete.

One of the problems w/ 0.5's bootstrap is indeed that it doesn't give
you any visibility into what is going on.  This is addressed in 0.6 w/
additional JMX reporting.

On Wed, Feb 24, 2010 at 5:06 PM, Anthony Molinaro
antho...@alumni.caltech.edu wrote:
 Hi,

  I had to add a few more nodes to my cluster yesterday; so far 2 of the 3
 have finished bootstrapping (at least as far as I can tell: they show up
 via a ring command in the UP state; the 3rd does not show up at all in the
 ring command).  I'm curious when the 3rd will finish, so I was wondering if
 there is any way to gauge this.

 From what I can tell, on some nodes I have a stream directory which has
 4 files in it, and running tpstats against that node shows the STREAM-STATE
 pool with 1 active and 3 pending, so I'm assuming this means those 4
 files are being streamed from the machine somewhere.  However, I don't
 see any corresponding files on the bootstrapping machine, so I can't
 be sure they are going there.  I do see some commit log activity on
 the bootstrapping machine (i.e., the file is growing slowly).  So do all
 bootstrapped entries flow through the commit log?  If not, where is the
 data streamed to?

 Thanks,

 -Anthony

 --
 
 Anthony Molinaro                           antho...@alumni.caltech.edu



Re: 3 node installation

2010-02-24 Thread Jonathan Ellis
Is the configuration identical on all nodes?  Specifically, is
ReplicationFactor set to 2 on all nodes?

On Wed, Feb 24, 2010 at 10:07 PM, Masood Mortazavi
masoodmortaz...@gmail.com wrote:
 I wonder if anyone can provide an explanation for the following behavior
 observed in a three-node cluster:

 1. In a three-node (A, B and C) installation, I use the cli, connected to
 node A, to set 10 data items.

 2. On cli connected to node A, I do get, and can see all 10 data items.

 3. I take node C down, do step 2, and only see some of the 10 data items.
 Some of the data items are unavailable, as follows:
 cassandra> get Keyspace1.Standard1['test6']
 Exception null
 UnavailableException()
     at
 org.apache.cassandra.service.Cassandra$get_slice_result.read(Cassandr
 a.java:3274)
     at
 org.apache.cassandra.service.Cassandra$Client.recv_get_slice(Cassandr
 a.java:296)
     at
 org.apache.cassandra.service.Cassandra$Client.get_slice(Cassandra.jav
 a:270)
     at org.apache.cassandra.cli.CliClient.doSlice(CliClient.java:241)
     at org.apache.cassandra.cli.CliClient.executeGet(CliClient.java:300)
     at
 org.apache.cassandra.cli.CliClient.executeCLIStmt(CliClient.java:57)
     at org.apache.cassandra.cli.CliMain.processCLIStmt(CliMain.java:131)
     at org.apache.cassandra.cli.CliMain.main(CliMain.java:172)

 4. Following step 3, with no change other than connecting the same
 cli instance to the other remaining node, node B (which is the node with the
 largest memory, by the way, although I don't think it matters here), I
 can see all 10 test data items.

 The replica number is 2.





RE: Strategy to delete/expire keys in cassandra

2010-02-24 Thread Weijun Li
Hi Sylvain, I just noticed that you are the one who implemented the
Expiring Column feature. Could you please help with my questions?

 

Should I just run a command (in the Cassandra 0.5 source folder) like:

 

patch -p1 -i 0001-Add-new-ExpiringColumn-class.patch

 

for all of the five patches in your ticket?
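
i.e., from the source root, something like:

    for p in 000?-*.patch; do patch -p1 -i "$p"; done

applying them in numeric order?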

 

Also, what's your opinion on extending ExpiringColumn to expire a key
completely? Otherwise it will be difficult to track which rows are expired or
old in Cassandra.

 

Thanks,

-Weijun

 

From: Weijun Li [mailto:weiju...@gmail.com] 
Sent: Tuesday, February 23, 2010 6:18 PM
To: cassandra-user@incubator.apache.org
Subject: Re: Strategy to delete/expire keys in cassandra

 

Thanks for the answer.  A dumb question: how did you apply the patch file to
the 0.5 source? The link you gave doesn't mention that the patch is for 0.5.

Also, this ExpiringColumn feature doesn't seem to expire the key/row, meaning
the number of keys will keep growing (even if you drop their columns) unless
you delete them. In your case, how do you manage deleting/expiring keys from
Cassandra? Do you keep a list of keys somewhere and go through them once in a
while?

Thanks,

-Weijun

On Tue, Feb 23, 2010 at 2:26 AM, Sylvain Lebresne sylv...@yakaz.com wrote:

Hi,

Maybe the following ticket/patch may be what you are looking for:
https://issues.apache.org/jira/browse/CASSANDRA-699

It's flagged for 0.7, but as it breaks the API (and, if I understand correctly,
the release plan) it may not make it into Cassandra before 0.8 (and the
patch will have to change to accommodate the changes that will be
made to the internals in 0.7).

Anyway, what I can at least tell you is that I'm using the patch against
0.5 in a test cluster without problems so far.


 3) Once keys are deleted, do you have to wait till the next GC to clean
 them from disk or memory (suppose you don't run cleanup manually)? What's
 the strategy for Cassandra to handle deleted items (notify other replica
 nodes, clean up memory/disk, defrag/rebuild disk files, rebuild bloom
 filter, etc.)? I'm asking this because if the keys refresh very fast (i.e.,
 high-volume write/read and expiration is kind of short), how will the data
 file grow and how does this impact the system performance?

Items are deleted only during compaction, and you may actually have to
wait for GCGraceSeconds before deletion. This value is configurable in
storage-conf.xml, but it is 10 days by default. You can decrease this value,
but because of consistency (and the fact that you have to at least wait for
compaction to occur) you will always have a delay before the actual delete
(all this is also true for the patch I mention above, by the way). But when
an item is deleted, compaction simply skips it, so it's really
cheap.
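
For reference, the knob looks like this in storage-conf.xml (864000 seconds
being the 10-day default):

    <GCGraceSeconds>864000</GCGraceSeconds>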

--
Sylvain

 



Re: 3 node installation

2010-02-24 Thread Masood Mortazavi
Yes.
Identical, with a replication factor of 2.
m.

On Wed, Feb 24, 2010 at 8:33 PM, Jonathan Ellis jbel...@gmail.com wrote:

 Is the configuration identical on all nodes?  Specifically, is
 ReplicationFactor set to 2 on all nodes?

 On Wed, Feb 24, 2010 at 10:07 PM, Masood Mortazavi
 masoodmortaz...@gmail.com wrote:
  I wonder if anyone can provide an explanation for the following behavior
  observed in a three-node cluster:
 
  1. In a three-node (A, B and C) installation, I use the cli, connected to
  node A, to set 10 data items.
 
  2. On cli connected to node A, I do get, and can see all 10 data items.
 
  3. I take node C down, I do step 2, and only see some of the 10 data
 items.
  Some of the data items are unavailable as follows:
  cassandra> get Keyspace1.Standard1['test6']
  Exception null
  UnavailableException()
  at
  org.apache.cassandra.service.Cassandra$get_slice_result.read(Cassandr
  a.java:3274)
  at
  org.apache.cassandra.service.Cassandra$Client.recv_get_slice(Cassandr
  a.java:296)
  at
  org.apache.cassandra.service.Cassandra$Client.get_slice(Cassandra.jav
  a:270)
  at org.apache.cassandra.cli.CliClient.doSlice(CliClient.java:241)
  at
 org.apache.cassandra.cli.CliClient.executeGet(CliClient.java:300)
  at
  org.apache.cassandra.cli.CliClient.executeCLIStmt(CliClient.java:57)
  at
 org.apache.cassandra.cli.CliMain.processCLIStmt(CliMain.java:131)
  at org.apache.cassandra.cli.CliMain.main(CliMain.java:172)
 
  4. Following step 3, with no other changes other than connecting the same
  cli instance to the other remaining node, meaning node B (which is a node
  with largest memory, by the way, although I don't think it matters here),
 I
  can see all 10 test data items.
 
  The replica number is 2.
 
 
 



Re: A configuration and step-by-step procedure for production deployment ...

2010-02-24 Thread Masood Mortazavi
On Wed, Feb 24, 2010 at 8:29 PM, Jonathan Ellis jbel...@gmail.com wrote:

 On Wed, Feb 24, 2010 at 9:29 PM, Masood Mortazavi
 masoodmortaz...@gmail.com wrote:
  Is there a configuration and step-by-step *procedure* for production
  deployments of Cassandra?

 Not really.  As w/ any cluster deployment, some basic sysadmin kung fu
 is required, and we don't go into that (although I suppose maybe we
 should).

 For the Cassandra side you should read

 http://wiki.apache.org/cassandra/CassandraHardware
 http://wiki.apache.org/cassandra/Operations

  By the way, I've noticed that not all potentially configurable settings may
  actually be included in the storage-conf.xml that's distributed with
  the releases.

 I think we've exposed all the useful ones now. :)

  [For example, there seems to be some default setting for R
  (the number of necessary reads, in the W+R > N formula a la the Dynamo
  paper), and it is not clear to me how to override it in the config XML.]

 If there is, it's dead code.  R and W in the Dynamo paper become
 ConsistencyLevel in Thrift requests.
 (http://wiki.apache.org/cassandra/API)


I realize that ConsistencyLevel has replaced R and W.
However, is there a way to set this in storage-conf.xml?
Shouldn't it be possible to set it there?

- m.


Re: 3 node installation

2010-02-24 Thread Masood Mortazavi
Besides what I just said below, I should have also added that in the
scenario discussed here:

While RackUnawareStrategy is used ...

Node B, which seems to have a copy of all the data at all times, has an IP
address whose third octet is different from the IP addresses of both nodes A
and C, which share the same third octet.

A, B, and C are all listed as seeds in the Seeds section.

Bootstrap is set to true for all of them.


In storage-conf.xml, the only thing that differs between the three nodes is
their own interfaces.
As just noted, the replication factor is 2.
That's it.

On Wed, Feb 24, 2010 at 11:18 PM, Masood Mortazavi 
masoodmortaz...@gmail.com wrote:

 Yes.
 Identical, with a replication factor of 2.
 m.


 On Wed, Feb 24, 2010 at 8:33 PM, Jonathan Ellis jbel...@gmail.com wrote:

 Is the configuration identical on all nodes?  Specifically, is
 ReplicationFactor set to 2 on all nodes?

 On Wed, Feb 24, 2010 at 10:07 PM, Masood Mortazavi
 masoodmortaz...@gmail.com wrote:
  I wonder if anyone can provide an explanation for the following behavior
  observed in a three-node cluster:
 
  1. In a three-node (A, B and C) installation, I use the cli, connected
 to
  node A, to set 10 data items.
 
  2. On cli connected to node A, I do get, and can see all 10 data items.
 
  3. I take node C down, I do step 2, and only see some of the 10 data
 items.
  Some of the data items are unavailable as follows:
 cassandra> get Keyspace1.Standard1['test6']
  Exception null
  UnavailableException()
  at
  org.apache.cassandra.service.Cassandra$get_slice_result.read(Cassandr
  a.java:3274)
  at
  org.apache.cassandra.service.Cassandra$Client.recv_get_slice(Cassandr
  a.java:296)
  at
  org.apache.cassandra.service.Cassandra$Client.get_slice(Cassandra.jav
  a:270)
  at
 org.apache.cassandra.cli.CliClient.doSlice(CliClient.java:241)
  at
 org.apache.cassandra.cli.CliClient.executeGet(CliClient.java:300)
  at
  org.apache.cassandra.cli.CliClient.executeCLIStmt(CliClient.java:57)
  at
 org.apache.cassandra.cli.CliMain.processCLIStmt(CliMain.java:131)
  at org.apache.cassandra.cli.CliMain.main(CliMain.java:172)
 
  4. Following step 3, with no other changes other than connecting the
 same
  cli instance to the other remaining node, meaning node B (which is a
 node
  with largest memory, by the way, although I don't think it matters
 here), I
  can see all 10 test data items.
 
  The replica number is 2.