Duplicate result of get_indexed_slices, depending on indexClause.count

2011-04-14 Thread sam_
Hi All,

I have been using Cassandra 0.7.2 and 0.7.4 with Thrift API (using Java).

I noticed that if I am querying a Column Family with indexed columns
sometimes I get a duplicate result in get_indexed_slices depending on the
number of rows in the CF and the count that I set in IndexClause.count.
It also depends on the order of rows in CF.

For example consider the following CF that I call Attributes:

create column family Attributes with comparator=UTF8Type
and column_metadata=[
{column_name: range_id, validation_class: LongType, index_type: KEYS},
{column_name: attr_key, validation_class: UTF8Type, index_type: KEYS},
{column_name: attr_val, validation_class: BytesType, index_type: KEYS}
];

And suppose I have the following rows in the CF:

key        range_id   attr_key   attr_val
"1/@1/0"   1          "A"        "1"
"1/5/0"    1          "B"        "1000"
"3/@1/0"   2          "A"        "1"
"3/5/0"    2          "B"        "1001"
"5/@1/0"   3          "A"        "2"
"5/5/0"    3          "B"        "1002"
"7/@1/0"   4          "A"        "2"
"7/5/0"    4          "B"        "1003"

Now if I have a query with IndexClause like this (in pseudo code):

attr_key == "A" AND attr_val == "1"

with indexClause.count = 4;
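
For reference, this is roughly how such a clause is built over the raw Thrift
API (a sketch only; the bytes() helper, variable names and consistency level
are illustrative placeholders, not the exact code that produced the result
below):

// classes from org.apache.cassandra.thrift.*, java.nio.ByteBuffer, java.util.Arrays
// bytes(s) here just means ByteBuffer.wrap(s.getBytes("UTF-8"))
IndexExpression keyExpr = new IndexExpression(bytes("attr_key"), IndexOperator.EQ, bytes("A"));
IndexExpression valExpr = new IndexExpression(bytes("attr_val"), IndexOperator.EQ, bytes("1"));

IndexClause clause = new IndexClause(
        Arrays.asList(keyExpr, valExpr),
        ByteBuffer.wrap(new byte[0]),   // start_key: empty, i.e. start from the beginning
        4);                             // count = 4, the value that triggers the duplicate

SlicePredicate predicate = new SlicePredicate();
predicate.setSlice_range(new SliceRange(
        ByteBuffer.wrap(new byte[0]), ByteBuffer.wrap(new byte[0]), false, 100));

List<KeySlice> rows = client.get_indexed_slices(
        new ColumnParent("Attributes"), clause, predicate, ConsistencyLevel.ONE);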

Then I will get the rows with the following keys from get_indexed_slices:

"1/@1/0", "3/@1/0", "3/@1/0"

The last key is a duplicate!

This is very sensitive to the order of rows in the CF, the number of rows,
and the value set in indexClause.count. I noticed that when the number of
rows in the CF is twice indexClause.count this issue can happen, depending
on the order of rows in the CF!

This seems to be a bug, and it occurs in both 0.7.2 and 0.7.4. 

Is there a solution to this problem? 

Many Thanks,
Sam







Re: Cassandra Database Modeling

2011-04-14 Thread csharpplusproject
Aaron,

Thank you so much.

So, the way things appear, it is definitely possible that I could be
making queries that would return all 10M particle pairs (at least, I
should plan for it). What would be the best design in such a case?
I read somewhere that the recommended maximum size of a row (meaning,
including all columns) should be around 10 MB, and that it's better not to
exceed that. Is that correct?

As per packing data "efficiently", what would be the best way? would
packing the data using say (in python terms) struct.pack( ... ) be at
all helpful?

Thanks,
Shalom.

-Original Message-
From: aaron morton 
Reply-to: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Cassandra Database Modeling
Date: Thu, 14 Apr 2011 20:54:43 +1200

WRT your query, it depends on how big a slice you want to get and how time
critical it is. e.g. Could you be making queries that would return all
10M pairs ? Or would the queries generally want to get some small
fraction of the data set? Again, depends on how the sim runs.


If your sim has stop-the-world pauses where you have a full view of the
data space, then you could grab all the points at a certain distance and
efficiently pack them up, where "efficiently" means not using JSON.


http://wiki.apache.org/cassandra/LargeDataSetConsiderations
http://wiki.apache.org/cassandra/CassandraLimitations
 
Aaron


On 13 Apr 2011, at 15:48, csharpplusproject wrote:

> Aaron,
> 
> Thank you so much for your help. It is greatly appreciated!
> 
> Looking at the design of the particle pairs:
> 
> > 
> > - key: expriement_id.time_interval 
> > - column name: pair_id 
> > - column value: distance, angle, other data packed together as JSON
> > or some other format
> 
> 
> You wrote that retrieving millions of columns (I will have about
> 10,000,000 particles pairs) would be slow. You are also right that the
> retrieval of millions of columns into Python, won't be fast.
> 
> If my desired query is to get "all particle pairs on time interval
> [ Tn..T(n+1) ] where the distance between the two particles is smaller
> than X and the angle between the two particles is greater than Y".
> 
> In such a query (as the above), given the fact that retrieving
> millions of columns could be slow, would it be best to say
> 'concatenate' all values for all particle pairs for a given
> 'expriement_id.time_interval' into one column?
> 
> If data is stored in this way, I will be getting from Cassandra a
> binary string / JSON Object that I will have to 'unpack' in my
> application. Is this a recommended approach? are there better
> approaches?
> 
> Is there a limit to the size that can be stored in one 'cell' (by
> 'cell' I mean the intersection between a key and a data column)? is
> there a limit to the size of data of one key?  one data column?
> 
> Thanks in advance for any help / guidance.
> 
> -Original Message-
> From: aaron morton 
> Reply-to: user@cassandra.apache.org
> To: user@cassandra.apache.org
> Subject: Re: Cassandra Database Modeling
> Date: Wed, 13 Apr 2011 10:14:21 +1200
> 
> Yes for  interactive == real time queries.  Hadoop based techniques
> are non time critical queries, but they do have greater analytical
> capabilities.  
> 
> particle_pairs: 1) Yes and no and sort of. Under the hood the
> get_slice api call will be used by your client library to pull back
> chunks of (ordered) columns. Most client libraries abstract away the
> chunking for you.  
> 
> 2) If you are using a packed structure like JSON then no, Cassandra
> will have no idea what you've put in the columns other than bytes . It
> really depends on how much data you have per pair, but generally it's
> easier to pull back more data than try to get exactly what you need.
> Downside is you have to update all the data.  
> 
> 3) No, you would need to update all the data for the pair. I was
> assuming most of the data was written once, and that your simulation
> had something like a stop-the-world phase between time slices where
> state was dumped and then read to start the next interval. You could
> either read it first, or we can come up with something else. 
> 
> distance_cf 1) the query would return an list of columns, which have a
> name and value (as well as a timestamp and ttl). 2) depends on the
> client library, if using python go
> for https://github.com/pycassa/pycassa It will return objects  3)
> returning millions of columns is going to be slow, would also be slow
> using a RDBMS. Creating millions objects in python is going to be
> slow. You would need to have a better idea of what queries you will
> actually want to run to see if it's *too* slow. If it is one approach
> is to store the particles at the same distance in the same column, so
> you need to read less columns. Again depends on how your sim works.
> Time complexity depends on the number of columns read. Finding a row
> will not be O(1) as it it may have to read from several files. Writes
> are more constant than reads. But remember, yo

Re: What's the best modeling approach for ordering events by date?

2011-04-14 Thread Guillermo Winkler
Hi Ethan,

I want to present the events ordered by time, always in pages of 20/40
events. If the events are tweets, you can have 1000 tweets from the same
second or you can have 30 tweets in a 10-minute range. But I always want to be
able to page through the results in an orderly fashion.

I think that using seconds since epoch is what I'm doing, that is, dividing
time into a fixed series of intervals. Each second is an interval, and all of
the events for that particular second are columns of that row.

Again with tweets, for easier visualization:

TweetsBySecond : {
 12121121212 : {   -> seconds since epoch
 id1, id2, id3     -> all the tweet ids that occurred in that particular second
 },
 12121212123 : {
 id4, id5
 },
 12121212124 : {
 id6
 }
}

The problem is you can't do that using OPP in Cassandra 0.7, or is it just me
missing something?

Thanks for your answer,
Guille

On Thu, Apr 14, 2011 at 4:49 PM, Ethan Rowe  wrote:

> How do you plan to read the data?  Entire histories, or in relatively
> confined slices of time?  Do the events have any attributes by which you
> might segregate them, apart from time?
>
> If you can divide time into a fixed series of intervals, you can insert
> members of a given interval as columns (or supercolumns) in a row.  But it
> depends how you want to use the data on the read side.
>
>
> On Thu, Apr 14, 2011 at 12:25 PM, Guillermo Winkler <
> gwink...@inconcertcc.com> wrote:
>
>> I have a huge number of events I need to consume later, ordered by the
>> date the event occured.
>>
>> My first approach to this problem was to use seconds since epoch as row
>> key, and event ids as column names (empty value), this way:
>>
>> EventsByDate : {
>> SecondsSinceEpoch: {
>> evid:"", evid:"", evid:""
>> }
>> }
>>
>> And use OPP as partitioner. Using GetRangeSlices to retrieve ordered
>> events secuentially.
>>
>> Now I have two problems to solve:
>>
>> 1) The system is realtime, so all the events in a given moment are hitting
>> the same box
>> 2) Migrating from cassandra 0.6 to cassandra 0.7 OPP doesn't seem to like
>> LongType for row keys, was this purposedly deprecated?
>>
>> I was thinking about secondary indexes, but it does not assure the order
>> the rows are coming out of cassandra.
>>
>> Anyone has a better approach to model events by date given that
>> restrictions?
>>
>> Thanks,
>> Guille
>>
>>
>>
>




Wildcard character for CF in access.properties?

2011-04-14 Thread Mike Heffner

Is there a wildcard for the COLUMNFAMILY field in `access.properties`?
I'd like to split read-write and read-only access between my backend and 
frontend users, respectively; however, the full list of CFs is not known 
a priori.


I'm using 0.7.4.


Cheers,


Mike


--

  Mike Heffner 
  Librato, Inc.


Re: RE: batch_mutate failed: out of sequence response

2011-04-14 Thread Dan Washusen
I've looked over the Pelops code again and I really can't see how it could be 
at fault here...

-- 
Dan Washusen
On Wednesday, 13 April 2011 at 3:20 AM, Stephen McKamey wrote: 
> [I wrote this Apr 10, 2011 at 12:09 but my message seems to have gotten lost 
> along the way.]
> 
> I use Pelops (the 1.0-0.7.x build from the Github Maven repo) and have 
> occasionally seen this message (under load or during GC). I have a test app 
> running in two separate single-threaded processes doing a slow trickle insert 
> into a single Cassandra 0.7.4 node all on the same box (Mac OS X).
> 
> This has been running off and on for over a week with no exceptions, and I 
> just saw this same error about two hours ago. Both client processes experienced 
> it at about the same time, and it seemed related to a GC/compaction on the 
> Cassandra instance. 
> 
> I'm guessing that it is either actually a read timeout on the clients, or 
> (less likely) somehow the Cassandra instance mixed up the two responses?
> 
>  On Fri, Apr 8 2011 at 07:28, Dan Washusen  wrote:
> > Dan Hendry mentioned that he sees these errors. Is he also using Pelops? 
> > From his comment about retrying I'd assume not...
> > 
> > -- 
> > Dan Washusen
> >  On Thursday, 7 April 2011 at 7:39 PM, Héctor Izquierdo Seliva wrote:
> > > On Wed, 06-04-2011 at 21:04 -0500, Jonathan Ellis wrote:
> > > > "out of sequence response" is thrift's way of saying "I got a response
> > > > for request Y when I expected request X."
> > > > 
> > > > my money is on using a single connection from multiple threads. don't 
> > > > do that.
> > > 
> > > I'm not using thrift directly, and my application is single-threaded, so I
> > > guess this is Pelops' fault somehow. Since I managed to tame memory
> > > consumption the problem has not appeared again, but it always happened
> > > during a stop-the-world GC. Could it be that the message was sent
> > > instead of being dropped by the server when the client assumed it had
> > > timed out? 



Re: Pyramid Organization of Data

2011-04-14 Thread Patrick Julien
On Thu, Apr 14, 2011 at 4:47 PM, Adrian Cockcroft
 wrote:
> What you are asking for breaks the eventual consistency model, so you need to 
> create a separate cluster in NYC that collects the same updates but has a 
> much longer setting to timeout the data for deletion, or doesn't get the 
> deletes.
>
> One way is to have a trigger on writes on your pyramid nodes in NY that 
> copies data over to the long term analysis cluster. The two clusters won't be 
> eventually consistent in the presence of failures, but with RF=3 you will get 
> up to three triggers for each write, so you get three chances to get the copy 
> done.
>


Yes, that's one of the scenarios we're contemplating.  However, there
aren't any triggers at the Cassandra level, and even if there were, we
would get them multiple times.

So far, I believe my best bet is to run 2 clusters: one global cluster that
has NY and the satellite sites, and another that is NY-specific and is the
archive site.

We would then write a placement strategy in NY that decorates the
configured placement strategy so that it copies the row over to the
archive site before passing it on to the non-archive NY cluster.


Re: Starting the Cassandra server from Java (without command line)

2011-04-14 Thread Jason Pell
You can make use of the embedded Cassandra server. Both Hector and Pelops have 
classes that you can use in your own unit tests; the Pelops one is quite good.

Sent from my iPhone

On Apr 15, 2011, at 3:59, sam_  wrote:

> Hello there,
> 
> To start the Cassandra server we can use the following command in command
> prompt:
> cassandra -f
> 
> I am wondering if it is possible to directly start the server inside a Java
> program using thrift API or a lower level class inside Cassandra
> implementation.
> 
> The purpose of this is to be able to run JUnit tests that need to start
> Cassandra server in SetUp(), without the need to create a process and run
> "cassandra" from command line.
> 
> Thanks,
> Sam 
> 


Re: Pyramid Organization of Data

2011-04-14 Thread Adrian Cockcroft
What you are asking for breaks the eventual consistency model, so you need to 
create a separate cluster in NYC that collects the same updates but has a much 
longer setting to timeout the data for deletion, or doesn't get the deletes. 

One way is to have a trigger on writes on your pyramid nodes in NY that copies 
data over to the long term analysis cluster. The two clusters won't be 
eventually consistent in the presence of failures, but with RF=3 you will get 
up to three triggers for each write, so you get three chances to get the copy 
done. 

Adrian

On Apr 14, 2011, at 10:18 AM, "Patrick Julien"  wrote:

> Thanks for your input Adrian, we've pretty much settled on this too.
> What I'm trying to figure out is how we do deletes.
> 
> We want to do deletes in the satellites because:
> 
> a) we'll run out of disk space very quickly with the amount of data we have
> b) we don't need more than 3 days worth of history in the satellites,
> we're currently planning for 7 days of capacity
> 
> However, the deletes will get replicated back to NY.  In NY, we don't
> want that, we want to run hadoop/pig over all that data dating back to
> several months/years.  Even if we set the replication factor of the
> satellites to 1 and NY to 3, we'll run out of space very quickly in
> the satellites.
> 
> 
> On Thu, Apr 14, 2011 at 11:23 AM, Adrian Cockcroft
>  wrote:
>> We have similar requirements for wide area backup/archive at Netflix.
>> I think what you want is a replica with RF of at least 3 in NY for all the
>> satellites, then each satellite could have a lower RF, but if you want safe
>> local quorum I would use 3 everywhere.
>> Then NY is the sum of all the satellites, so that makes most use of the disk
>> space.
>> For archival storage I suggest you use snapshots in NY and save compressed
>> tar files of each keyspace in NY. We've been working on this to allow full
>> and incremental backup and restore from our EC2 hosted Cassandra clusters
>> to/from S3. Full backup/restore works fine, incremental and per-keyspace
>> restore is being worked on.
>> Adrian
>> From: Patrick Julien 
>> Reply-To: "user@cassandra.apache.org" 
>> Date: Thu, 14 Apr 2011 05:38:54 -0700
>> To: "user@cassandra.apache.org" 
>> Subject: Re: Pyramid Organization of Data
>> 
>> Thanks,  I'm still working the problem so anything I find out I will post
>> here.
>> 
>> Yes, you're right, that is the question I am asking.
>> 
>> No, adding more storage is not a solution since new york would have several
>> hundred times more storage.
>> 
>> On Apr 14, 2011 6:38 AM, "aaron morton"  wrote:
>>> I think your question is "NY is the archive, after a certain amount of
>>> time we want to delete the row from the original DC but keep it in the
>>> archive in NY."
>>> 
>>> Once you delete a row, it's deleted as far as the client is concerned.
>>> GCGaceSeconds is only concerned with when the tombstone marker can be
>>> removed. If NY has a replica of a row from Tokyo and the row is deleted in
>>> either DC, it will be deleted in the other DC as well.
>>> 
>>> Some thoughts...
>>> 1) Add more storage in the satellite DC's, then tilt you chair to
>>> celebrate a job well done :)
>>> 2) Run two clusters as you say.
>>> 3) Just thinking out loud, and I know this does not work now. Would it be
>>> possible to support per CF strategy options, so an archive CF only
>>> replicates to NY ? Can think of possible problems with repair and
>>> LOCAL_QUORUM, out of interest what else would it break?
>>> 
>>> Hope that helps.
>>> Aaron
>>> 
>>> 
>>> 
>>> On 14 Apr 2011, at 10:17, Patrick Julien wrote:
>>> 
 We have been successful in implementing, at scale, the comments you
 posted here. I'm wondering what we can do about deleting data
 however.
 
 The way I see it, we have considerably more storage capacity in NY,
 but not in the other sites. Using this technique here, it occurs to
 me that we would replicate non-NY deleted rows back to NY. Is there a
 way to tell NY not to tombstone rows?
 
 The ideas I have so far:
 
 - Set GCGracePeriod to be much higher in NY than in the other sites.
 This way we can get to tombstone'd rows well beyond their disk life in
 other sites.
 - A variant on this solution is to set the TTL on rows in non NY sites
 and again, set the GCGracePeriod to be considerably higher in NY
 - break this up to multiple clusters and do one write from the client
 to the its 'local' cluster and one write to the NY cluster.
 
 
 
 On Fri, Apr 8, 2011 at 7:15 PM, Jonathan Ellis  wrote:
> No, I'm suggesting you have a Tokyo keyspace that gets replicated as
> {Tokyo: 2, NYC:1}, a London keyspace that gets replicated to {London:
> 2, NYC: 1}, for example.
> 
> On Fri, Apr 8, 2011 at 5:59 PM, Patrick Julien 
> wrote:
>> I'm familiar with this material. I hadn't thought of it from this
>> angle but I believe what you're suggesting is that the

Re: Indexes on heterogeneous rows

2011-04-14 Thread aaron morton
> (This is a case were 1/3 of the rows are of type 2, but, say only a few 
> hundred rows of type 2 have e=5.)

How many rows would have e=5 without worrying about their type value?
 
Aaron
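
To make the "make your own inverted index" suggestion quoted below concrete,
the write path is just one extra insert per indexed combination. A sketch only
(the CF name, key layout, "client" handle and consistency level are
illustrative):

// classes from org.apache.cassandra.thrift.*; "client" is a connected Cassandra.Client
String indexRowKey = "e=5-type=2";                     // built from the indexed values
Column entry = new Column(
        ByteBuffer.wrap(objectKey.getBytes("UTF-8")),  // the indexed row's key
        ByteBuffer.wrap(new byte[0]),                  // empty value (or the object itself)
        System.currentTimeMillis() * 1000);
client.insert(ByteBuffer.wrap(indexRowKey.getBytes("UTF-8")),
        new ColumnParent("MyIndexes"), entry, ConsistencyLevel.QUORUM);
// reading "everything with e=5 and type=2" back is then a single row slice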

On 14 Apr 2011, at 23:48, David Boxenhorn wrote:

> Thanks. I'm aware that I can roll my own. I wanted to avoid that, for ease of 
> use, but especially for atomicity concerns. 
> 
> I thought that the secondary index would bring into memory all keys where 
> type=2, and then iterate over them to find keys where=5. (This is a case were 
> 1/3 of the rows are of type 2, but, say only a few hundred rows of type 2 
> have e=5.) The reason why I put "type" first is that queries on type will 
> always be an exact match, whereas the other clauses might be inequalities. 
> 
> On Thu, Apr 14, 2011 at 2:07 PM, aaron morton  wrote:
> You could make your own inverted index by using keys like  "e=5-type=2" where 
> the columns are either the keys for the object or the objects themselves. 
> Then just grab the full row back. If you know you always want to run queries 
> like that. 
> 
> This recent discussion and blog post from Ed is good background 
> http://www.mail-archive.com/user@cassandra.apache.org/msg12136.html
> 
> I'm not sure how efficient the join from "e" to type would be. AFAIK it will 
> iterate all keys where e=5 and lookup corresponding rows to find out if type 
> = 2. 
> 
> If know how you want to read things back and need to deal with lots-o-data I 
> would start testing with custom indexes. Then compare to the built in ones, 
> it should be reasonably simple add them for a test.   
> 
> Hope that helps. 
> Aaron
>
> On 14 Apr 2011, at 22:33, David Boxenhorn wrote:
> 
>> Thank you for your answer, and sorry about the sloppy terminology.
>> 
>> I'm thinking of the scenario where there are a small number of results in 
>> the result set, but there are billions of rows in the first of your 
>> secondary indexes.
>> 
>> That is, I want to do something like (not sure of the CQL syntax):
>> 
>> select * where type=2 and e=5
>> 
>> where there are billions of rows of type 2, but some manageable number of 
>> those rows have e=5.
>> 
>> As I understand it, secondary indexes are like column families, where each 
>> value is a column. So the billions of rows where type=2 would go into a 
>> single row of the secondary index. This sounds like a problem to me, is it?  
>> 
>> I'm assuming that the billions of rows that don't have column "e" at all 
>> (those rows of other types) are not a problem at all...
>> 
>> On Thu, Apr 14, 2011 at 12:12 PM, aaron morton  
>> wrote:
>> Need to clear up some terminology here. 
>> 
>> Rows have a key and can be retrieved by key. This is *sort of* the primary 
>> index, but not primary in the normal RDBMS sense. 
>> Rows can have different columns and the column names are sorted and can be 
>> efficiently selected.
>> There are "secondary indexes" in cassandra 0.7 based on column values 
>> http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes
>> 
>> So you could create secondary indexes on the a,e, and h columns and get rows 
>> that have specific values. There are some limitations to secondary indexes, 
>> read the linked article. 
>> 
>> Or you can make your own secondary indexes using row keys as the index 
>> values.
>> 
>> If you have billions of rows, how many do you need to read back at once?
>> 
>> Hope that helps
>> Aaron
>> 
>> On 14 Apr 2011, at 04:23, David Boxenhorn wrote:
>> 
>>> Is it possible in 0.7.x to have indexes on heterogeneous rows, which have 
>>> different sets of columns?
>>> 
>>> For example, let's say you have three types of objects (1, 2, 3) which each 
>>> had three members. If your rows had the following pattern
>>> 
>>> type=1 a=? b=? c=?
>>> type=2 d=? e=? f=?
>>> type=3 g=? h=? i=?
>>> 
>>> could you index "type" as your primary index, and also index "a", "e", "h" 
>>> as secondary indexes, to get the objects of that type that you are looking 
>>> for?
>>> 
>>> Would it work if you had billions of rows of each type?
>> 
>> 
> 
> 



Re: which nodes contain the data? column family as indexes?

2011-04-14 Thread aaron morton
By default all data for a keyspace, and its CFs, is spread out over all nodes 
in the cluster.

If you want to see which endpoints have data for a particular key there is an 
operation on the o.a.c.db.StorageService MBean you can use via JConsole. Some 
(out of date) docs here may help 
http://wiki.apache.org/cassandra/JmxInterface#org.apache.cassandra.service.StorageService.Operations.getNaturalEndpoints
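
If you want the same thing programmatically rather than through JConsole,
something along these lines should work. A sketch only: the JMX port is the
0.7 default (8080), and the operation's exact signature (keyspace name plus
key bytes here) is worth confirming in JConsole for your version first.

// javax.management.*, javax.management.remote.*
JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://node1:8080/jmxrmi");
MBeanServerConnection mbs = JMXConnectorFactory.connect(url).getMBeanServerConnection();
ObjectName ss = new ObjectName("org.apache.cassandra.db:type=StorageService");
Object endpoints = mbs.invoke(ss, "getNaturalEndpoints",
        new Object[] { "ks", "some_row_key".getBytes("UTF-8") },
        new String[] { String.class.getName(), byte[].class.getName() });
System.out.println(endpoints);  // the list of replica endpoints for that key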

Lots of articles here 
http://wiki.apache.org/cassandra/ArticlesAndPresentations

Aaron


On 15 Apr 2011, at 04:32, tinhuty he wrote:

> I just started 5 nodes in a cluster and set a replica factor of 3 on a 
> keyspace ks. My question is how do I know which 3 nodes will contain the data 
> of this keyspace(or particular column family in this keyspace)?
>  
> I often read that “use a separate column family for storing our indexes”, 
> here is link for 
> that(http://www.datastax.com/docs/0.7/data_model/cfs_as_indexes), but it 
> doesn’t have more details. Could some one point me to more detailed 
> documentation and samples for this?
>  
> Thanks
>  



Re: pycassa + celery

2011-04-14 Thread aaron morton
This is going to be a bug in your code, so it's a bit tricky to know but...

How / when is the email added to the DB?
What does the rawEmail function do?
Set a breakpoint: what are the two strings you are feeding into the hash 
functions? 

Aaron
On 15 Apr 2011, at 03:50, pob wrote:

> Hello,
> 
> I'm experiencing really strange problem. I wrote data into cassandra cluster. 
> I'm trying to check if data inserted then fetched are equally to source data 
> (file).  Code below is the task for celery that does the comparison with 
> sha1(). The problem is that celery worker returning since time to time during 
> the comparison output like that:
> 
> 2011-04-14 17:24:33,225: INFO/PoolWorker-134] 
> tasks.insertData[f377efdb-33a2-48f4-ab00-52b1898e216c]: [Error/EmailTest] 
> Email corrupted.] 
> 
> If i execute the task code manually the output is correct ,[Email data test: 
> OK].
> 
> I thought that possible bug is in multi threading but i start celery workers 
> with only one thread to  remove that case. 
> 
> 
> Another problem that is occurring often is :
> 
> [2011-04-14 12:46:49,682: INFO/PoolWorker-1] Connection 1781 (IP:9160) in 
> ConnectionPool (id = 15612176) failed: timed out
> [2011-04-14 12:46:49,844: INFO/PoolWorker-1] Connection 1781 (IP:9160) in 
> ConnectionPool (id = 15612176) failed: UnavailableException()
> 
> 
> I'm using pycassa connection pooling  with parameters pool_size=15 (5* number 
> of nodes), max_retries=30, max_overflow=5, timeout=4
> 
> 
> Any ideas where should be problems? The client is pycassa 1.0.8, and I tried 
> it with 1.0.6 too.
> 
> 
> Thanks
> 
> 
> Best,
> Peter
> 
> ###
> 
> @task(ignore_result=True)
> def checkData(key):
> 
> 
> logger = insertData.get_logger()
> logger.info("Reading email %s" % key)
> logger.info("Task id %s" %  checkData.request.id)
> 
> f = open(key, 'r')
> sEmail = f.readlines()
> f.close()
> 
> m = hashlib.sha1()
> m.update(''.join(sEmail))
> sHash = m.hexdigest()
> 
> #fetch email from DB
> email = rawEmail(key)
> 
> 
> m = hashlib.sha1()
> m.update(email)
> dHash = m.hexdigest()
>  
> if sHash != dHash:
> logger.info("[Error/EmailTest] Email corrupted.] < %s >" % key)
> else:
> logger.info("[Email data test: OK]")
> 



Re: What's the best modeling approach for ordering events by date?

2011-04-14 Thread Ethan Rowe
How do you plan to read the data?  Entire histories, or in relatively
confined slices of time?  Do the events have any attributes by which you
might segregate them, apart from time?

If you can divide time into a fixed series of intervals, you can insert
members of a given interval as columns (or supercolumns) in a row.  But it
depends how you want to use the data on the read side.
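
The write path for that layout is just one column insert per event. A sketch in
Thrift terms (the CF name, one-second bucket and consistency level are
illustrative placeholders, not a recommendation):

// org.apache.cassandra.thrift.*; "client" is a connected Cassandra.Client
long bucket = eventTimeMillis / 1000;  // the interval this event falls into
ByteBuffer rowKey = ByteBuffer.wrap(Long.toString(bucket).getBytes("UTF-8"));
Column col = new Column(
        ByteBuffer.wrap(eventId.getBytes("UTF-8")),  // event id as the column name
        ByteBuffer.wrap(new byte[0]),                // empty value
        eventTimeMillis * 1000);                     // timestamp in microseconds
client.insert(rowKey, new ColumnParent("EventsByDate"), col, ConsistencyLevel.QUORUM);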

On Thu, Apr 14, 2011 at 12:25 PM, Guillermo Winkler <
gwink...@inconcertcc.com> wrote:

> I have a huge number of events I need to consume later, ordered by the date
> the event occured.
>
> My first approach to this problem was to use seconds since epoch as row
> key, and event ids as column names (empty value), this way:
>
> EventsByDate : {
> SecondsSinceEpoch: {
> evid:"", evid:"", evid:""
> }
> }
>
> And use OPP as partitioner. Using GetRangeSlices to retrieve ordered events
> secuentially.
>
> Now I have two problems to solve:
>
> 1) The system is realtime, so all the events in a given moment are hitting
> the same box
> 2) Migrating from cassandra 0.6 to cassandra 0.7 OPP doesn't seem to like
> LongType for row keys, was this purposedly deprecated?
>
> I was thinking about secondary indexes, but it does not assure the order
> the rows are coming out of cassandra.
>
> Anyone has a better approach to model events by date given that
> restrictions?
>
> Thanks,
> Guille
>
>
>


All nodes down even though ring shows up

2011-04-14 Thread mcasandra
I ran a stress test to read 50K rows and since then I am getting the error
below, even though ring shows all nodes are up:

ERROR 12:40:29,999 Exception:
me.prettyprint.hector.api.exceptions.HectorException: All host pools marked
down. Retry burden pushed out to client.
at
me.prettyprint.cassandra.connection.HConnectionManager.getClientFromLBPolicy(HConnectionManager.java:308)
at
me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:213)
at
me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:129)
at
me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:100)
at
me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:106)
at
me.prettyprint.cassandra.model.MutatorImpl$2.doInKeyspace(MutatorImpl.java:203)
at
me.prettyprint.cassandra.model.MutatorImpl$2.doInKeyspace(MutatorImpl.java:200)
at
me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
at
me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
at
me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:200)
at
com.riptano.cassandra.stress.InsertCommand.call(InsertCommand.java:117)
at
com.riptano.cassandra.stress.InsertCommand.call(InsertCommand.java:1)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)


---

No errors are logged in the system.log and tpstats shows nothing.

nodetool -h dsdb1 tpstats
Pool Name                    Active   Pending   Completed
ReadStage                         0         0       50176
RequestResponseStage              0         0      207223
MutationStage                     0         0      199473
ReadRepairStage                   0         0       14615
GossipStage                       0         0       39835
AntiEntropyStage                  0         0           0
MigrationStage                    0         0         207
MemtablePostFlusher               0         0         386
StreamStage                       0         0           0
FlushWriter                       0         0         385
FILEUTILS-DELETE-POOL             0         0        1446
MiscStage                         0         0           0
FlushSorter                       0         0           0
InternalResponseStage             0         0        1230
HintedHandoff                     0         0           7


compaction stats say pending: 0





Problem with running test example with Cassandra 0.7.4

2011-04-14 Thread Anda P
Hello, 

I have set up my single node Cassandra 0.7.4 and started the service with
bin/cassandra -f. Now I am trying to use the Hector API (v. 0.7.0) to manage the
DB. 
The Cassandra CLI works fine and I can create keyspaces and so on.

I tried to run the test example and create a single keyspace:

Cluster cluster = HFactory.getOrCreateCluster("TestCluster",
new 
CassandraHostConfigurator("localhost:9160"));

Keyspace keyspace = HFactory.createKeyspace("Keyspace1", 
cluster);


But all I get is this:

2011-04-14 22:20:27,469 [main  ] INFO  me.prettyprint.cassandra.connection.CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
2011-04-14 22:20:27,492 [main  ] DEBUG me.prettyprint.cassandra.connection.HThriftClient - Transport open status false for client CassandraClient
(this again, about 20 times)
me.prettyprint.cassandra.service.JmxMonitor - Registering JMX me.prettyprint.cassandra.service_TestCluster:ServiceType=hector,MonitorType=hector
2011-04-14 22:20:27,636 [Thread-0  ] INFO  me.prettyprint.cassandra.connection.CassandraHostRetryService - Downed Host retry shutdown hook called
2011-04-14 22:20:27,646 [Thread-0  ] INFO  me.prettyprint.cassandra.connection.CassandraHostRetryService - Downed Host retry shutdown complete


Can you please tell me what I'm doing wrong? 

Thank you, 
Anda Popovici



Re: Starting the Cassandra server from Java (without command line)

2011-04-14 Thread Narendra Sharma
The write-up is a year old but will still give you a fair idea of how to do it.

http://prettyprint.me/2010/02/14/running-cassandra-as-an-embedded-service/
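
In 0.7.x the gist is the same: the in-tree
org.apache.cassandra.service.EmbeddedCassandraService can be started from a
JUnit fixture. A minimal sketch (the config path is illustrative; in 0.6 the
class was a Runnable you init() and run in a thread, as the write-up above
shows):

import org.apache.cassandra.service.EmbeddedCassandraService;
import org.junit.BeforeClass;

public class EmbeddedCassandraTest {
    private static EmbeddedCassandraService cassandra;

    @BeforeClass
    public static void startCassandra() throws Exception {
        // point at a test cassandra.yaml whose data/commitlog dirs are a scratch area
        System.setProperty("cassandra.config", "file:test/conf/cassandra.yaml");
        cassandra = new EmbeddedCassandraService();
        cassandra.start();  // runs in this JVM; no external process needed
    }
}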


Thanks,
Naren

On Thu, Apr 14, 2011 at 10:59 AM, sam_  wrote:

> Hello there,
>
> To start the Cassandra server we can use the following command in command
> prompt:
> cassandra -f
>
> I am wondering if it is possible to directly start the server inside a Java
> program using thrift API or a lower level class inside Cassandra
> implementation.
>
> The purpose of this is to be able to run JUnit tests that need to start
> Cassandra server in SetUp(), without the need to create a process and run
> "cassandra" from command line.
>
> Thanks,
> Sam
>
>



-- 
Narendra Sharma
Solution Architect
*http://www.persistentsys.com*
*http://narendrasharma.blogspot.com/*


Starting the Cassandra server from Java (without command line)

2011-04-14 Thread sam_
Hello there,

To start the Cassandra server we can use the following command in command
prompt:
cassandra -f

I am wondering if it is possible to directly start the server inside a Java
program using thrift API or a lower level class inside Cassandra
implementation.

The purpose of this is to be able to run JUnit tests that need to start
Cassandra server in SetUp(), without the need to create a process and run
"cassandra" from command line.

Thanks,
Sam 



Re: Pyramid Organization of Data

2011-04-14 Thread Patrick Julien
Thanks for your input Adrian, we've pretty much settled on this too.
What I'm trying to figure out is how we do deletes.

We want to do deletes in the satellites because:

a) we'll run out of disk space very quickly with the amount of data we have
b) we don't need more than 3 days worth of history in the satellites,
we're currently planning for 7 days of capacity

However, the deletes will get replicated back to NY.  In NY, we don't
want that, we want to run hadoop/pig over all that data dating back to
several months/years.  Even if we set the replication factor of the
satellites to 1 and NY to 3, we'll run out of space very quickly in
the satellites.


On Thu, Apr 14, 2011 at 11:23 AM, Adrian Cockcroft
 wrote:
> We have similar requirements for wide area backup/archive at Netflix.
> I think what you want is a replica with RF of at least 3 in NY for all the
> satellites, then each satellite could have a lower RF, but if you want safe
> local quorum I would use 3 everywhere.
> Then NY is the sum of all the satellites, so that makes most use of the disk
> space.
> For archival storage I suggest you use snapshots in NY and save compressed
> tar files of each keyspace in NY. We've been working on this to allow full
> and incremental backup and restore from our EC2 hosted Cassandra clusters
> to/from S3. Full backup/restore works fine, incremental and per-keyspace
> restore is being worked on.
> Adrian
> From: Patrick Julien 
> Reply-To: "user@cassandra.apache.org" 
> Date: Thu, 14 Apr 2011 05:38:54 -0700
> To: "user@cassandra.apache.org" 
> Subject: Re: Pyramid Organization of Data
>
> Thanks,  I'm still working the problem so anything I find out I will post
> here.
>
> Yes, you're right, that is the question I am asking.
>
> No, adding more storage is not a solution since new york would have several
> hundred times more storage.
>
> On Apr 14, 2011 6:38 AM, "aaron morton"  wrote:
>> I think your question is "NY is the archive, after a certain amount of
>> time we want to delete the row from the original DC but keep it in the
>> archive in NY."
>>
>> Once you delete a row, it's deleted as far as the client is concerned.
>> GCGaceSeconds is only concerned with when the tombstone marker can be
>> removed. If NY has a replica of a row from Tokyo and the row is deleted in
>> either DC, it will be deleted in the other DC as well.
>>
>> Some thoughts...
>> 1) Add more storage in the satellite DC's, then tilt you chair to
>> celebrate a job well done :)
>> 2) Run two clusters as you say.
>> 3) Just thinking out loud, and I know this does not work now. Would it be
>> possible to support per CF strategy options, so an archive CF only
>> replicates to NY ? Can think of possible problems with repair and
>> LOCAL_QUORUM, out of interest what else would it break?
>>
>> Hope that helps.
>> Aaron
>>
>>
>>
>> On 14 Apr 2011, at 10:17, Patrick Julien wrote:
>>
>>> We have been successful in implementing, at scale, the comments you
>>> posted here. I'm wondering what we can do about deleting data
>>> however.
>>>
>>> The way I see it, we have considerably more storage capacity in NY,
>>> but not in the other sites. Using this technique here, it occurs to
>>> me that we would replicate non-NY deleted rows back to NY. Is there a
>>> way to tell NY not to tombstone rows?
>>>
>>> The ideas I have so far:
>>>
>>> - Set GCGracePeriod to be much higher in NY than in the other sites.
>>> This way we can get to tombstone'd rows well beyond their disk life in
>>> other sites.
>>> - A variant on this solution is to set the TTL on rows in non NY sites
>>> and again, set the GCGracePeriod to be considerably higher in NY
>>> - break this up to multiple clusters and do one write from the client
>>> to the its 'local' cluster and one write to the NY cluster.
>>>
>>>
>>>
>>> On Fri, Apr 8, 2011 at 7:15 PM, Jonathan Ellis  wrote:
 No, I'm suggesting you have a Tokyo keyspace that gets replicated as
 {Tokyo: 2, NYC:1}, a London keyspace that gets replicated to {London:
 2, NYC: 1}, for example.

 On Fri, Apr 8, 2011 at 5:59 PM, Patrick Julien 
 wrote:
> I'm familiar with this material. I hadn't thought of it from this
> angle but I believe what you're suggesting is that the different data
> centers would hold a different properties file for node discovery
> instead of using auto-discovery.
>
> So Tokyo, and others, would have a configuration that make it
> oblivious to the non New York data centers.
> New York would have a configuration that would give it knowledge of no
> other data center.
>
> Would that work? Wouldn't the NY data center wonder where these other
> writes are coming from?
>
> On Fri, Apr 8, 2011 at 6:38 PM, Jonathan Ellis 
> wrote:
>> On Fri, Apr 8, 2011 at 12:17 PM, Patrick Julien 
>> wrote:
>>> The problem is this: we would like the historical data from Tokyo to
>>> stay in Tokyo and only be replicated to New York. The one in Lo

Stress testing disk configurations. Your thoughts?

2011-04-14 Thread Nathan Milford
Ahoy,

I'm building out a new 0.7.4 cluster to migrate our 0.6.6 cluster to.

While I'm waiting for the dev-side to get time to work on their side of the
project I have a 10 node cluster evenly split across two data centers (NY &
LA) and was looking to do some testing while I could.

My primary focus is on disk configurations.  Space isn't a huge issue, our
current data set is ~30G on each node and I imagine that'll go up since I
intend on tweaking the RF on the new cluster.

Each node has 6 x 146G 10K SAS drives.  I want to test:

1) 6 disks in R0 where everything is written to the same stripe
2) 1 disk for OS+Commitlog and 5 disks in R0 for data.
3) 1 disk for OS+Commitlog and 5 individual disks defined
as separate data_file_directories.

I suspect I'll see the best performance with option 3, but the issue has become
political/religious and there are internal doubts that separating the commit
log and data will truly improve performance, despite documentation and logic
indicating otherwise.  Thus the test :)

Right now I've been tinkering and not being very scientific while I work out
a testing methodology and get used to the tools.  I've just been running
zznate's cassandra-stress against a single node and measuring the time it
takes to read and write N rows.

Unscientifically I've found that they all perform about the same. It is hard
to judge because, when writing to a single node, reads take exponentially
longer.  Writing 10M rows may take ~500 seconds, but reading will take ~5000
seconds.  I'm sure this will even out when I test across more than one node.

Early next week I'll be able to test against all 10 nodes with a realistic
replication factor.

I'd really love to hear some people's thoughts on methodologies and what I
should be looking at/for, other than iostat and the time for the test to
insert/read.

Thanks,
nathan


Re: Indexes on heterogeneous rows

2011-04-14 Thread Jonathan Ellis
On Thu, Apr 14, 2011 at 6:48 AM, David Boxenhorn  wrote:
> The reason why I put "type" first is that queries on type will
> always be an exact match, whereas the other clauses might be inequalities.

Expression order doesn't matter, but as you imply, non-equalities
can't be used in an index lookup and have to be checked in a nested
loop phase afterwards.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


which nodes contain the data? column family as indexes?

2011-04-14 Thread tinhuty he
I just started 5 nodes in a cluster and set a replication factor of 3 on a keyspace 
ks. My question is: how do I know which 3 nodes will contain the data of this 
keyspace (or a particular column family in this keyspace)? 

I often read that one should “use a separate column family for storing our indexes”; here 
is a link for that (http://www.datastax.com/docs/0.7/data_model/cfs_as_indexes), 
but it doesn’t have more details. Could someone point me to more detailed 
documentation and samples for this?

Thanks


What's the best modeling approach for ordering events by date?

2011-04-14 Thread Guillermo Winkler
I have a huge number of events I need to consume later, ordered by the date
the event occurred.

My first approach to this problem was to use seconds since epoch as row key,
and event ids as column names (empty value), this way:

EventsByDate : {
SecondsSinceEpoch: {
evid:"", evid:"", evid:""
}
}

And use OPP as the partitioner, using GetRangeSlices to retrieve ordered events
sequentially.

Now I have two problems to solve:

1) The system is realtime, so all the events in a given moment are hitting
the same box
2) Migrating from Cassandra 0.6 to Cassandra 0.7, OPP doesn't seem to like
LongType for row keys; was this purposely deprecated?

I was thinking about secondary indexes, but they do not guarantee the order the
rows come out of Cassandra.

Does anyone have a better approach to modeling events by date, given those
restrictions?

Thanks,
Guille




Re: Indexes on heterogeneous rows

2011-04-14 Thread Jonathan Ellis
This should work reasonably well w/ 0.7 indexes. Cassandra tracks
statistics on index selectivity, so it would plan that query as "index
lookup on e=5, then iterate over those results and return only rows
that also have type=2."

On Thu, Apr 14, 2011 at 5:33 AM, David Boxenhorn  wrote:
> Thank you for your answer, and sorry about the sloppy terminology.
>
> I'm thinking of the scenario where there are a small number of results in
> the result set, but there are billions of rows in the first of your
> secondary indexes.
>
> That is, I want to do something like (not sure of the CQL syntax):
>
> select * where type=2 and e=5
>
> where there are billions of rows of type 2, but some manageable number of
> those rows have e=5.
>
> As I understand it, secondary indexes are like column families, where each
> value is a column. So the billions of rows where type=2 would go into a
> single row of the secondary index. This sounds like a problem to me, is it?
>
> I'm assuming that the billions of rows that don't have column "e" at all
> (those rows of other types) are not a problem at all...
>
> On Thu, Apr 14, 2011 at 12:12 PM, aaron morton 
> wrote:
>>
>> Need to clear up some terminology here.
>> Rows have a key and can be retrieved by key. This is *sort of* the primary
>> index, but not primary in the normal RDBMS sense.
>> Rows can have different columns and the column names are sorted and can be
>> efficiently selected.
>> There are "secondary indexes" in cassandra 0.7 based on column
>> values http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes
>> So you could create secondary indexes on the a,e, and h columns and get
>> rows that have specific values. There are some limitations to secondary
>> indexes, read the linked article.
>> Or you can make your own secondary indexes using row keys as the index
>> values.
>> If you have billions of rows, how many do you need to read back at once?
>> Hope that helps
>> Aaron
>>
>> On 14 Apr 2011, at 04:23, David Boxenhorn wrote:
>>
>> Is it possible in 0.7.x to have indexes on heterogeneous rows, which have
>> different sets of columns?
>>
>> For example, let's say you have three types of objects (1, 2, 3) which
>> each had three members. If your rows had the following pattern
>>
>> type=1 a=? b=? c=?
>> type=2 d=? e=? f=?
>> type=3 g=? h=? i=?
>>
>> could you index "type" as your primary index, and also index "a", "e", "h"
>> as secondary indexes, to get the objects of that type that you are looking
>> for?
>>
>> Would it work if you had billions of rows of each type?
>>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


pycassa + celery

2011-04-14 Thread pob
Hello,

I'm experiencing a really strange problem. I wrote data into a Cassandra
cluster, and I'm trying to check whether data that is inserted and then fetched
is equal to the source data (a file). The code below is the celery task that
does the comparison with sha1(). The problem is that the celery worker, from
time to time during the comparison, returns output like this:

2011-04-14 17:24:33,225: INFO/PoolWorker-134]
tasks.insertData[f377efdb-33a2-48f4-ab00-52b1898e216c]: [Error/EmailTest]
Email corrupted.]

If I execute the task code manually, the output is correct: [Email data test:
OK].

I thought the possible bug was in multithreading, but I start celery workers
with only one thread to rule that case out.


Another problem that is occurring often is :

[2011-04-14 12:46:49,682: INFO/PoolWorker-1] Connection 1781 (IP:9160)
in ConnectionPool (id = 15612176) failed: timed out
[2011-04-14 12:46:49,844: INFO/PoolWorker-1] Connection 1781 (IP:9160)
in ConnectionPool (id = 15612176) failed: UnavailableException()


I'm using pycassa connection pooling  with parameters pool_size=15 (5*
number of nodes), max_retries=30, max_overflow=5, timeout=4


Any ideas where the problem could be? The client is pycassa 1.0.8, and I tried
it with 1.0.6 too.


Thanks


Best,
Peter

###

# Note: import hashlib is required by this snippet; the celery @task decorator,
# insertData and rawEmail are assumed to be defined/imported elsewhere in the module.
import hashlib

@task(ignore_result=True)
def checkData(key):

    logger = insertData.get_logger()
    logger.info("Reading email %s" % key)
    logger.info("Task id %s" % checkData.request.id)

    # read the source email from disk
    f = open(key, 'r')
    sEmail = f.readlines()
    f.close()

    m = hashlib.sha1()
    m.update(''.join(sEmail))
    sHash = m.hexdigest()

    # fetch email from DB
    email = rawEmail(key)

    m = hashlib.sha1()
    m.update(email)
    dHash = m.hexdigest()

    if sHash != dHash:
        logger.info("[Error/EmailTest] Email corrupted.] < %s >" % key)
    else:
        logger.info("[Email data test: OK]")


Re: Pyramid Organization of Data

2011-04-14 Thread Adrian Cockcroft
We have similar requirements for wide area backup/archive at Netflix.

I think what you want is a replica with RF of at least 3 in NY for all the 
satellites, then each satellite could have a lower RF, but if you want safe 
local quorum I would use 3 everywhere.

Then NY is the sum of all the satellites, so that makes most use of the disk 
space.

For archival storage I suggest you use snapshots in NY and save compressed tar 
files of each keyspace in NY. We've been working on this to allow full and 
incremental backup and restore from our EC2 hosted Cassandra clusters to/from 
S3. Full backup/restore works fine, incremental and per-keyspace restore is 
being worked on.

Adrian

From: Patrick Julien <pjul...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thu, 14 Apr 2011 05:38:54 -0700
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Pyramid Organization of Data


Thanks,  I'm still working the problem so anything I find out I will post here.

Yes, you're right, that is the question I am asking.

No, adding more storage is not a solution since new york would have several 
hundred times more storage.

On Apr 14, 2011 6:38 AM, "aaron morton" 
mailto:aa...@thelastpickle.com>> wrote:
> I think your question is "NY is the archive, after a certain amount of time 
> we want to delete the row from the original DC but keep it in the archive in 
> NY."
>
> Once you delete a row, it's deleted as far as the client is concerned. 
> GCGaceSeconds is only concerned with when the tombstone marker can be 
> removed. If NY has a replica of a row from Tokyo and the row is deleted in 
> either DC, it will be deleted in the other DC as well.
>
> Some thoughts...
> 1) Add more storage in the satellite DC's, then tilt you chair to celebrate a 
> job well done :)
> 2) Run two clusters as you say.
> 3) Just thinking out loud, and I know this does not work now. Would it be 
> possible to support per CF strategy options, so an archive CF only replicates 
> to NY ? Can think of possible problems with repair and LOCAL_QUORUM, out of 
> interest what else would it break?
>
> Hope that helps.
> Aaron
>
>
>
> On 14 Apr 2011, at 10:17, Patrick Julien wrote:
>
>> We have been successful in implementing, at scale, the comments you
>> posted here. I'm wondering what we can do about deleting data
>> however.
>>
>> The way I see it, we have considerably more storage capacity in NY,
>> but not in the other sites. Using this technique here, it occurs to
>> me that we would replicate non-NY deleted rows back to NY. Is there a
>> way to tell NY not to tombstone rows?
>>
>> The ideas I have so far:
>>
>> - Set GCGracePeriod to be much higher in NY than in the other sites.
>> This way we can get to tombstone'd rows well beyond their disk life in
>> other sites.
>> - A variant on this solution is to set the TTL on rows in non NY sites
>> and again, set the GCGracePeriod to be considerably higher in NY
>> - break this up to multiple clusters and do one write from the client
>> to the its 'local' cluster and one write to the NY cluster.
>>
>>
>>
>> On Fri, Apr 8, 2011 at 7:15 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>> No, I'm suggesting you have a Tokyo keyspace that gets replicated as
>>> {Tokyo: 2, NYC:1}, a London keyspace that gets replicated to {London:
>>> 2, NYC: 1}, for example.
>>>
>>> On Fri, Apr 8, 2011 at 5:59 PM, Patrick Julien <pjul...@gmail.com> wrote:
 I'm familiar with this material. I hadn't thought of it from this
 angle but I believe what you're suggesting is that the different data
 centers would hold a different properties file for node discovery
 instead of using auto-discovery.

 So Tokyo, and others, would have a configuration that make it
 oblivious to the non New York data centers.
 New York would have a configuration that would give it knowledge of no
 other data center.

 Would that work? Wouldn't the NY data center wonder where these other
 writes are coming from?

 On Fri, Apr 8, 2011 at 6:38 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> On Fri, Apr 8, 2011 at 12:17 PM, Patrick Julien <pjul...@gmail.com> wrote:
>> The problem is this: we would like the historical data from Tokyo to
>> stay in Tokyo and only be replicated to New York. The one in London
>> to be in London and only be replicated to New York and so on for all
>> data centers.
>>
>> Is this currently possible with Cassandra? I believe we would need to
>> run multiple clusters and migrate data manually from data centers to
>> North America to achieve this. Also, any suggestions would also be
>> welcomed.
>
> NetworkTopologyStrategy allows configuration replicas per-keyspace,
> per-datacenter:
> http://www.datastax.com/dev/blog/deploying-cassandr

Re: Update the Keyspace replication factor online

2011-04-14 Thread Yudong Gao
Thanks, Aaron! I will try the scenario at a small scale first.

I would appreciate it if anyone else who has tried this before could share
their experience with us.

Thanks!

Yudong

On Thu, Apr 14, 2011 at 4:26 AM, aaron morton  wrote:
> It looks like you are dropping DC1, in that case perhaps you could just move 
> the nodes from DC1 into DC 3.
>
> I *think* in your case if you made the RF change, ran repair on them,  and 
> worked at Quorum or ALL your clients would be ok. *BUT* I've not done this 
> myself, please take care or ask for a grown up to help.
>
> The warning about down time during repair have to do with the potential 
> impact of repair slowing nodes way down.
>
> Hope that helps
> Aaron
>
> On 13 Apr 2011, at 16:00, Yudong Gao wrote:
>
>> Thanks for the reply, Aaron!
>>
>> On Tue, Apr 12, 2011 at 10:52 PM, aaron morton  
>> wrote:
>>> Are you changing the replication factor or moving nodes ?
>>
>> I am just changing the replication factor, without touching the node
>> configuration.
>>
>>>
>>> To change the RF you need to repair and then once all repairing is done run 
>>> cleanup to remove the hold data.
>>
>> Do I need to shutdown the cluster when running the repair? If I just
>> repair the nodes one by one, will some users get the error of no data
>> exists, if the node responsible for the new replica is not yet
>> repaired?
>>
>> Yudong
>>
>>>
>>> You can move whole nodes by moving all their data with them, assigning a 
>>> new ip, and updating the topology file if used.
>>>
>>> Aaron
>>>
>>> On 13 Apr 2011, at 07:56, Yudong Gao wrote:
>>>
 Hi,

 What operations will be executed (and what is the associated overhead)
 when the Keyspace replication factor is changed online, in a
 multi-datacenter setup with NetworkTopologyStrategy?

 I checked the wiki and the archive of the mailing list and find this,
 but it is not very complete.

 http://wiki.apache.org/cassandra/Operations
 "
 Replication factor is not really intended to be changed in a live
 cluster either, but increasing it may be done if you (a) use
 ConsistencyLevel.QUORUM or ALL (depending on your existing replication
 factor) to make sure that a replica that actually has the data is
 consulted, (b) are willing to accept downtime while anti-entropy
 repair runs (see below), or (c) are willing to live with some clients
 potentially being told no data exists if they read from the new
 replica location(s) until repair is done.
 "

 More specifically, in this scenario:

 {DC1:1, DC2:1} -> {DC2:1, DC3:1}

 1. Can this be done online without shutting down the cluster? I
 thought there is an "update keyspace" command in the cassandra-cli.

 2. If so, what operations will be executed? Will new replicas be
 created in new locations (in DC3) and existing replicas be deleted in
 old locations (in DC1)?

 3. Or they will be updated only with read with ConssitencyLevel.QUORUM
 or All, or "nodetool repair"?

 Thanks!

 Yudong
>>>
>>>
>
>


Re: Pyramid Organization of Data

2011-04-14 Thread Patrick Julien
Thanks,  I'm still working the problem so anything I find out I will post
here.

Yes, you're right, that is the question I am asking.

No, adding more storage is not a solution since new york would have several
hundred times more storage.
On Apr 14, 2011 6:38 AM, "aaron morton"  wrote:
> I think your question is "NY is the archive, after a certain amount of
time we want to delete the row from the original DC but keep it in the
archive in NY."
>
> Once you delete a row, it's deleted as far as the client is concerned.
GCGaceSeconds is only concerned with when the tombstone marker can be
removed. If NY has a replica of a row from Tokyo and the row is deleted in
either DC, it will be deleted in the other DC as well.
>
> Some thoughts...
> 1) Add more storage in the satellite DC's, then tilt you chair to
celebrate a job well done :)
> 2) Run two clusters as you say.
> 3) Just thinking out loud, and I know this does not work now. Would it be
possible to support per CF strategy options, so an archive CF only
replicates to NY? I can think of possible problems with repair and
LOCAL_QUORUM, out of interest what else would it break?
>
> Hope that helps.
> Aaron
>
>
>
> On 14 Apr 2011, at 10:17, Patrick Julien wrote:
>
>> We have been successful in implementing, at scale, the comments you
>> posted here. I'm wondering what we can do about deleting data
>> however.
>>
>> The way I see it, we have considerably more storage capacity in NY,
>> but not in the other sites. Using this technique here, it occurs to
>> me that we would replicate non-NY deleted rows back to NY. Is there a
>> way to tell NY not to tombstone rows?
>>
>> The ideas I have so far:
>>
>> - Set GCGracePeriod to be much higher in NY than in the other sites.
>> This way we can get to tombstone'd rows well beyond their disk life in
>> other sites.
>> - A variant on this solution is to set the TTL on rows in non NY sites
>> and again, set the GCGracePeriod to be considerably higher in NY
>> - break this up to multiple clusters and do one write from the client
>> to its 'local' cluster and one write to the NY cluster.
>>
>>
>>
>> On Fri, Apr 8, 2011 at 7:15 PM, Jonathan Ellis  wrote:
>>> No, I'm suggesting you have a Tokyo keyspace that gets replicated as
>>> {Tokyo: 2, NYC:1}, a London keyspace that gets replicated to {London:
>>> 2, NYC: 1}, for example.
>>>
>>> On Fri, Apr 8, 2011 at 5:59 PM, Patrick Julien 
wrote:
 I'm familiar with this material. I hadn't thought of it from this
 angle but I believe what you're suggesting is that the different data
 centers would hold a different properties file for node discovery
 instead of using auto-discovery.

 So Tokyo, and others, would have a configuration that makes it
 oblivious to the non New York data centers.
 New York would have a configuration that would give it knowledge of no
 other data center.

 Would that work? Wouldn't the NY data center wonder where these other
 writes are coming from?

 On Fri, Apr 8, 2011 at 6:38 PM, Jonathan Ellis 
wrote:
> On Fri, Apr 8, 2011 at 12:17 PM, Patrick Julien 
wrote:
>> The problem is this: we would like the historical data from Tokyo to
>> stay in Tokyo and only be replicated to New York. The one in London
>> to be in London and only be replicated to New York and so on for all
>> data centers.
>>
>> Is this currently possible with Cassandra? I believe we would need to
>> run multiple clusters and migrate data manually from data centers to
>> North America to achieve this. Also, any suggestions would also be
>> welcomed.
>
> NetworkTopologyStrategy allows configuring replicas per-keyspace,
> per-datacenter:
>
http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>

>>>
>>>
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>>>
>


Re: Indexes on heterogeneous rows

2011-04-14 Thread David Boxenhorn
Thanks. I'm aware that I can roll my own. I wanted to avoid that, for ease
of use, but especially for atomicity concerns.

I thought that the secondary index would bring into memory all keys where
type=2, and then iterate over them to find keys where=5. (This is a case
were 1/3 of the rows are of type 2, but, say only a few hundred rows of type
2 have e=5.) The reason why I put "type" first is that queries on type will
always be an exact match, whereas the other clauses might be inequalities.

On Thu, Apr 14, 2011 at 2:07 PM, aaron morton wrote:

> You could make your own inverted index by using keys like  "e=5-type=2"
> where the columns are either the keys for the object or the objects
> themselves. Then just grab the full row back. If you know you always want to
> run queries like that.
>
> This recent discussion and blog post from Ed is good background
> http://www.mail-archive.com/user@cassandra.apache.org/msg12136.html
>
> I'm not sure how efficient the join from "e" to type would be. AFAIK it
> will iterate all keys where e=5 and look up corresponding rows to find out if
> type = 2.
>
> If you know how you want to read things back and need to deal with lots-o-data
> I would start testing with custom indexes. Then compare to the built in
> ones, it should be reasonably simple to add them for a test.
>
> Hope that helps.
> Aaron
>
> On 14 Apr 2011, at 22:33, David Boxenhorn wrote:
>
> Thank you for your answer, and sorry about the sloppy terminology.
>
> I'm thinking of the scenario where there are a small number of results in
> the result set, but there are billions of rows in the first of your
> secondary indexes.
>
> That is, I want to do something like (not sure of the CQL syntax):
>
> select * where type=2 and e=5
>
> where there are billions of rows of type 2, but some manageable number of
> those rows have e=5.
>
> As I understand it, secondary indexes are like column families, where each
> value is a column. So the billions of rows where type=2 would go into a
> single row of the secondary index. This sounds like a problem to me, is it?
>
>
> I'm assuming that the billions of rows that don't have column "e" at all
> (those rows of other types) are not a problem at all...
>
> On Thu, Apr 14, 2011 at 12:12 PM, aaron morton wrote:
>
>> Need to clear up some terminology here.
>>
>> Rows have a key and can be retrieved by key. This is *sort of* the primary
>> index, but not primary in the normal RDBMS sense.
>> Rows can have different columns and the column names are sorted and can be
>> efficiently selected.
>> There are "secondary indexes" in cassandra 0.7 based on column values
>> http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes
>>
>> So you could create secondary indexes on the a,e, and h columns and get
>> rows that have specific values. There are some limitations to secondary
>> indexes, read the linked article.
>>
>> Or you can make your own secondary indexes using row keys as the index
>> values.
>>
>> If you have billions of rows, how many do you need to read back at once?
>>
>> Hope that helps
>> Aaron
>>
>> On 14 Apr 2011, at 04:23, David Boxenhorn wrote:
>>
>> Is it possible in 0.7.x to have indexes on heterogeneous rows, which have
>> different sets of columns?
>>
>> For example, let's say you have three types of objects (1, 2, 3) which
>> each had three members. If your rows had the following pattern
>>
>> type=1 a=? b=? c=?
>> type=2 d=? e=? f=?
>> type=3 g=? h=? i=?
>>
>> could you index "type" as your primary index, and also index "a", "e", "h"
>> as secondary indexes, to get the objects of that type that you are looking
>> for?
>>
>> Would it work if you had billions of rows of each type?
>>
>>
>>
>
>


Re: Indexes on heterogeneous rows

2011-04-14 Thread aaron morton
You could make your own inverted index by using keys like  "e=5-type=2" where 
the columns are either the keys for the object or the objects themselves. Then 
just grab the full row back. If you know you always want to run queries like 
that. 
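
A rough pycassa sketch of that kind of hand-rolled index (untested; the keyspace
and CF names are made up):

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
objects = pycassa.ColumnFamily(pool, 'Objects')
obj_index = pycassa.ColumnFamily(pool, 'ObjectIndex')

# write: store the object, then record its key under the inverted index row
objects.insert('obj42', {'type': '2', 'e': '5'})
obj_index.insert('e=5-type=2', {'obj42': ''})  # column name = object key

# read: grab the whole index row, then multiget the matching objects
keys = obj_index.get('e=5-type=2', column_count=10000).keys()
rows = objects.multiget(keys)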

This recent discussion and blog post from Ed is good background 
http://www.mail-archive.com/user@cassandra.apache.org/msg12136.html

I'm not sure how efficient the join from "e" to type would be. AFAIK it will 
iterate all keys where e=5 and look up corresponding rows to find out if type = 
2. 

If you know how you want to read things back and need to deal with lots-o-data I 
would start testing with custom indexes. Then compare to the built in ones, it 
should be reasonably simple to add them for a test.

Hope that helps. 
Aaron
   
On 14 Apr 2011, at 22:33, David Boxenhorn wrote:

> Thank you for your answer, and sorry about the sloppy terminology.
> 
> I'm thinking of the scenario where there are a small number of results in the 
> result set, but there are billions of rows in the first of your secondary 
> indexes.
> 
> That is, I want to do something like (not sure of the CQL syntax):
> 
> select * where type=2 and e=5
> 
> where there are billions of rows of type 2, but some manageable number of 
> those rows have e=5.
> 
> As I understand it, secondary indexes are like column families, where each 
> value is a column. So the billions of rows where type=2 would go into a 
> single row of the secondary index. This sounds like a problem to me, is it?  
> 
> I'm assuming that the billions of rows that don't have column "e" at all 
> (those rows of other types) are not a problem at all...
> 
> On Thu, Apr 14, 2011 at 12:12 PM, aaron morton  
> wrote:
> Need to clear up some terminology here. 
> 
> Rows have a key and can be retrieved by key. This is *sort of* the primary 
> index, but not primary in the normal RDBMS sense. 
> Rows can have different columns and the column names are sorted and can be 
> efficiently selected.
> There are "secondary indexes" in cassandra 0.7 based on column values 
> http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes
> 
> So you could create secondary indexes on the a,e, and h columns and get rows 
> that have specific values. There are some limitations to secondary indexes, 
> read the linked article. 
> 
> Or you can make your own secondary indexes using row keys as the index values.
> 
> If you have billions of rows, how many do you need to read back at once?
> 
> Hope that helps
> Aaron
> 
> On 14 Apr 2011, at 04:23, David Boxenhorn wrote:
> 
>> Is it possible in 0.7.x to have indexes on heterogeneous rows, which have 
>> different sets of columns?
>> 
>> For example, let's say you have three types of objects (1, 2, 3) which each 
>> had three members. If your rows had the following pattern
>> 
>> type=1 a=? b=? c=?
>> type=2 d=? e=? f=?
>> type=3 g=? h=? i=?
>> 
>> could you index "type" as your primary index, and also index "a", "e", "h" 
>> as secondary indexes, to get the objects of that type that you are looking 
>> for?
>> 
>> Would it work if you had billions of rows of each type?
> 
> 



Re: choose which Hector's serializer for Cassandra performance?

2011-04-14 Thread aaron morton
1. Order for RP is, well, random'ish for some large value of ish; see 
http://wiki.apache.org/cassandra/FAQ#range_rp

2. Not really. 

Aaron

On 14 Apr 2011, at 14:45, 박용욱 wrote:

> Thanks very much Aaron!
> 
> 4. Not sure how Hector handles it, try 
> https://github.com/zznate/cassandra-tutorial or 
> https://github.com/zznate/hector-examples
> 
> 
> Let me ask question 4 again. :)
> 
> 
> 1. RandomPartitioner and OrderPreservingPartitioner return the same result for 
> get_range_slices API.
> 
>   Is the only difference efficiency?
> 2.  RF & CL don't influence the get_range_slices API, do they?
> 
> 
> 
> 
> Best Regards.
> Colin.



Re: Pyramid Organization of Data

2011-04-14 Thread aaron morton
I think your question is "NY is the archive, after a certain amount of time we 
want to delete the row from the original DC but keep it in the archive in NY."

Once you delete a row, it's deleted as far as the client is concerned. 
GCGraceSeconds is only concerned with when the tombstone marker can be removed. 
If NY has a replica of a row from Tokyo and the row is deleted in either DC, it 
will be deleted in the other DC as well. 

Some thoughts...
1) Add more storage in the satellite DC's, then tilt your chair to celebrate a 
job well done :)
2) Run two clusters as you say. 
3) Just thinking out loud, and I know this does not work now. Would it be 
possible to support per CF strategy options, so an archive CF only replicates 
to NY? I can think of possible problems with repair and LOCAL_QUORUM, out of 
interest what else would it break?

Hope that helps.
Aaron


 
On 14 Apr 2011, at 10:17, Patrick Julien wrote:

> We have been successful in implementing, at scale, the comments you
> posted here.  I'm wondering what we can do about deleting data
> however.
> 
> The way I see it, we have considerably more storage capacity in NY,
> but not in the other sites.  Using this technique here, it occurs to
> me that we would replicate non-NY deleted rows back to NY.  Is there a
> way to tell NY not to tombstone rows?
> 
> The ideas I have so far:
> 
> - Set GCGracePeriod to be much higher in NY than in the other sites.
> This way we can get to tombstone'd rows well beyond their disk life in
> other sites.
> - A variant on this solution is to set the TTL on rows in non NY sites
> and again, set the GCGracePeriod to be considerably higher in NY
> - break this up to multiple clusters and do one write from the client
> to its 'local' cluster and one write to the NY cluster.
> 
> 
> 
> On Fri, Apr 8, 2011 at 7:15 PM, Jonathan Ellis  wrote:
>> No, I'm suggesting you have a Tokyo keyspace that gets replicated as
>> {Tokyo: 2, NYC:1}, a London keyspace that gets replicated to {London:
>> 2, NYC: 1}, for example.
>> 
>> On Fri, Apr 8, 2011 at 5:59 PM, Patrick Julien  wrote:
>>> I'm familiar with this material.  I hadn't thought of it from this
>>> angle but I believe what you're suggesting is that the different data
>>> centers would hold a different properties file for node discovery
>>> instead of using auto-discovery.
>>> 
>>> So Tokyo, and others, would have a configuration that makes it
>>> oblivious to the non New York data centers.
>>> New York would have a configuration that would give it knowledge of no
>>> other data center.
>>> 
>>> Would that work?  Wouldn't the NY data center wonder where these other
>>> writes are coming from?
>>> 
>>> On Fri, Apr 8, 2011 at 6:38 PM, Jonathan Ellis  wrote:
 On Fri, Apr 8, 2011 at 12:17 PM, Patrick Julien  wrote:
> The problem is this: we would like the historical data from Tokyo to
> stay in Tokyo and only be replicated to New York.  The one in London
> to be in London and only be replicated to New York and so on for all
> data centers.
> 
> Is this currently possible with Cassandra?  I believe we would need to
> run multiple clusters and migrate data manually from data centers to
> North America to achieve this.  Also, any suggestions would also be
> welcomed.
 
 NetworkTopologyStrategy allows configuring replicas per-keyspace,
 per-datacenter:
 http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers
 
 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com
 
>>> 
>> 
>> 
>> 
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>> 



Re: Indexes on heterogeneous rows

2011-04-14 Thread David Boxenhorn
Thank you for your answer, and sorry about the sloppy terminology.

I'm thinking of the scenario where there are a small number of results in
the result set, but there are billions of rows in the first of your
secondary indexes.

That is, I want to do something like (not sure of the CQL syntax):

select * where type=2 and e=5

where there are billions of rows of type 2, but some manageable number of
those rows have e=5.

As I understand it, secondary indexes are like column families, where each
value is a column. So the billions of rows where type=2 would go into a
single row of the secondary index. This sounds like a problem to me, is it?


I'm assuming that the billions of rows that don't have column "e" at all
(those rows of other types) are not a problem at all...

On Thu, Apr 14, 2011 at 12:12 PM, aaron morton wrote:

> Need to clear up some terminology here.
>
> Rows have a key and can be retrieved by key. This is *sort of* the primary
> index, but not primary in the normal RDBMS sense.
> Rows can have different columns and the column names are sorted and can be
> efficiently selected.
> There are "secondary indexes" in cassandra 0.7 based on column values
> http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes
>
> So you could create secondary indexes on the a,e, and h columns and get
> rows that have specific values. There are some limitations to secondary
> indexes, read the linked article.
>
> Or you can make your own secondary indexes using row keys as the index
> values.
>
> If you have billions of rows, how many do you need to read back at once?
>
> Hope that helps
> Aaron
>
> On 14 Apr 2011, at 04:23, David Boxenhorn wrote:
>
> Is it possible in 0.7.x to have indexes on heterogeneous rows, which have
> different sets of columns?
>
> For example, let's say you have three types of objects (1, 2, 3) which each
> had three members. If your rows had the following pattern
>
> type=1 a=? b=? c=?
> type=2 d=? e=? f=?
> type=3 g=? h=? i=?
>
> could you index "type" as your primary index, and also index "a", "e", "h"
> as secondary indexes, to get the objects of that type that you are looking
> for?
>
> Would it work if you had billions of rows of each type?
>
>
>


Re: forced index creation?

2011-04-14 Thread Sasha Dolgy
Aha, that would be it. Cheers.
On Apr 14, 2011 11:47 AM, "aaron morton"  wrote:
> Checked the code, build_indexes comes from the JMX services and is only
shown if the client can connect to JMX.
>
> If it cannot connect it should print "WARNING: Could not connect to the
JMX on %s:%d, information won't be shown.%n%n"
>
> If you are using a non default JMX port use --jmxport when starting the
CLI.
>
> Hope that helps.
> Aaron
>
> On 14 Apr 2011, at 07:12, Sasha Dolgy wrote:
>
>> odd ... checked again today. still not there. will dig around the
>> logs a bit. my indexes work ... just not seeing anything in the CLI
>> ... are you also on 0.7.4 ?
>>
>> ColumnFamily: applications
>> Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>> Row cache size / save period: 0.0/0
>> Key cache size / save period: 20.0/14400
>> Memtable thresholds: 0.248437498/53/1440
>> GC grace seconds: 864000
>> Compaction min/max thresholds: 4/32
>> Read repair chance: 1.0
>> Column Metadata:
>> Column Name: app_name (app_name)
>> Validation Class: org.apache.cassandra.db.marshal.UTF8Type
>> Column Name: app_id (app_id)
>> Validation Class: org.apache.cassandra.db.marshal.UTF8Type
>> Column Name: app_uri (app_uri)
>> Validation Class: org.apache.cassandra.db.marshal.UTF8Type
>> Index Type: KEYS
>>
>>
>> On Wed, Apr 13, 2011 at 4:35 AM, aaron morton 
wrote:
>>> Built indexes are there for me
>>>
>>> [default@unknown] describe keyspace Keyspace1;
>>> Keyspace: Keyspace1:
>>> Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
>>> Replication Factor: 1
>>> Column Families:
>>> ColumnFamily: Indexed1
>>> default_validation_class: org.apache.cassandra.db.marshal.LongType
>>> Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>>> Row cache size / save period in seconds: 0.0/0
>>> Key cache size / save period in seconds: 20.0/14400
>>> Memtable thresholds: 0.145312498/31/1440 (millions of
ops/minutes/MB)
>>> GC grace seconds: 864000
>>> Compaction min/max thresholds: 4/32
>>> Read repair chance: 1.0
>>> Built indexes: [Indexed1.birthdate_idx]
>>> Column Metadata:
>>> Column Name: birthdate
>>> Validation Class: org.apache.cassandra.db.marshal.LongType
>>> Index Name: birthdate_idx
>>> Index Type: KEYS
>>>
>>> When the index is created existing data is indexed async, and any new
data is indexed as part of the write. Not sure how to force/check things
though.
>>>
>>> Can you turn logging up to DEBUG and compare the requests between the
two clusters ?
>>>
>>> Aaron
>>>
>>> On 13 Apr 2011, at 05:46, Sasha Dolgy wrote:
>>>
 hi, just deployed a new keyspace on 0.7.4 and added the following
column family:

 create column family applications with comparator=UTF8Type and
column_metadata=[
 {column_name: app_name, validation_class: UTF8Type},
 {column_name: app_uri, validation_class: UTF8Type,index_type: KEYS},
 {column_name: app_id, validation_class: UTF8Type}
 ];

 I then proceeded to add two new rows of data to it. When i try and
 query the secondary index on app_uri, my query with phpcassa fails.
 on the same CF in a different cluster, it works fine. when comparing
 the CF between clusters, see there's a difference: --- Built indexes:
 --- shows up when i run --> describe keyspace foobar;



 Column Metadata:
 Column Name: app_name (app_name)
 Validation Class: org.apache.cassandra.db.marshal.UTF8Type
 Column Name: app_id (app_id)
 Validation Class: org.apache.cassandra.db.marshal.UTF8Type
 Column Name: app_uri (app_uri)
 Validation Class: org.apache.cassandra.db.marshal.UTF8Type
 Index Type: KEYS

 Checking out a bit further:

 get applications where 'app_uri' = 'get-test';
 ---
 RowKey: 9d699733-9afe-4a41-83ca-c60d040dacc0


 get applications where 'app_id' =
'9d699733-9afe-4a41-83ca-c60d040dacc0';
 No indexed columns present in index clause with operator EQ

 So .. I can see that the secondary indexes are working.

 Question 1: Has "Built indexes" been removed from the "describe
 keyspace" output? Or have i done something 
 Question 2: Is there a way to force secondary index creation?





 --
 Sasha Dolgy
 sasha.do...@gmail.com
>>>
>>>
>>
>>
>>
>> --
>> Sasha Dolgy
>> sasha.do...@gmail.com
>


Re: forced index creation?

2011-04-14 Thread aaron morton
Checked the code, build_indexes comes from the JMX services and is only shown 
if the client can connect to JMX. 

If it cannot connect it should print "WARNING: Could not connect to the JMX on 
%s:%d, information won't be shown.%n%n"

If you are using a non default JMX port use --jmxport when starting the CLI.  

Hope that helps. 
Aaron

On 14 Apr 2011, at 07:12, Sasha Dolgy wrote:

> odd ... checked again today.   still not there.  will dig around the
> logs a bit.  my indexes work ... just not seeing anything in the CLI
> ... are you also on 0.7.4 ?
> 
>ColumnFamily: applications
>  Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>  Row cache size / save period: 0.0/0
>  Key cache size / save period: 20.0/14400
>  Memtable thresholds: 0.248437498/53/1440
>  GC grace seconds: 864000
>  Compaction min/max thresholds: 4/32
>  Read repair chance: 1.0
>  Column Metadata:
>Column Name: app_name (app_name)
>  Validation Class: org.apache.cassandra.db.marshal.UTF8Type
>Column Name: app_id (app_id)
>  Validation Class: org.apache.cassandra.db.marshal.UTF8Type
>Column Name: app_uri (app_uri)
>  Validation Class: org.apache.cassandra.db.marshal.UTF8Type
>  Index Type: KEYS
> 
> 
> On Wed, Apr 13, 2011 at 4:35 AM, aaron morton  wrote:
>> Built indexes are there for me
>> 
>> [default@unknown] describe keyspace Keyspace1;
>> Keyspace: Keyspace1:
>>  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
>>Replication Factor: 1
>>  Column Families:
>>ColumnFamily: Indexed1
>>  default_validation_class: org.apache.cassandra.db.marshal.LongType
>>  Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>>  Row cache size / save period in seconds: 0.0/0
>>  Key cache size / save period in seconds: 20.0/14400
>>  Memtable thresholds: 0.145312498/31/1440 (millions of 
>> ops/minutes/MB)
>>  GC grace seconds: 864000
>>  Compaction min/max thresholds: 4/32
>>  Read repair chance: 1.0
>>  Built indexes: [Indexed1.birthdate_idx]
>>  Column Metadata:
>>Column Name: birthdate
>>  Validation Class: org.apache.cassandra.db.marshal.LongType
>>  Index Name: birthdate_idx
>>  Index Type: KEYS
>> 
>> When the index is created existing data is indexed async, and any new data 
>> is indexed as part of the write. Not sure how to force/check things though.
>> 
>> Can you turn logging up to DEBUG and compare the requests between the two 
>> clusters ?
>> 
>> Aaron
>> 
>> On 13 Apr 2011, at 05:46, Sasha Dolgy wrote:
>> 
>>> hi, just deployed a new keyspace on 0.7.4 and added the following column 
>>> family:
>>> 
>>> create column family applications with comparator=UTF8Type and 
>>> column_metadata=[
>>>{column_name: app_name, validation_class: UTF8Type},
>>>{column_name: app_uri, validation_class: UTF8Type,index_type: KEYS},
>>>{column_name: app_id, validation_class: UTF8Type}
>>> ];
>>> 
>>> I then proceeded to add two new rows of data to it.  When i try and
>>> query the secondary index on app_uri, my query with phpcassa fails.
>>> on the same CF in a different cluster, it works fine.  when comparing
>>> the CF between clusters, see there's a difference: ---  Built indexes:
>>> --- shows up when i run --> describe keyspace foobar;
>>> 
>>> 
>>> 
>>>  Column Metadata:
>>>Column Name: app_name (app_name)
>>>  Validation Class: org.apache.cassandra.db.marshal.UTF8Type
>>>Column Name: app_id (app_id)
>>>  Validation Class: org.apache.cassandra.db.marshal.UTF8Type
>>>Column Name: app_uri (app_uri)
>>>  Validation Class: org.apache.cassandra.db.marshal.UTF8Type
>>>  Index Type: KEYS
>>> 
>>> Checking out a bit further:
>>> 
>>> get applications where 'app_uri' = 'get-test';
>>> ---
>>> RowKey: 9d699733-9afe-4a41-83ca-c60d040dacc0
>>> 
>>> 
>>> get applications where 'app_id' = '9d699733-9afe-4a41-83ca-c60d040dacc0';
>>> No indexed columns present in index clause with operator EQ
>>> 
>>> So .. I can see that the secondary indexes are working.
>>> 
>>> Question 1:  Has "Built indexes" been removed from the "describe
>>> keyspace" output?  Or have i done something 
>>> Question 2:  Is there a way to force secondary index creation?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Sasha Dolgy
>>> sasha.do...@gmail.com
>> 
>> 
> 
> 
> 
> -- 
> Sasha Dolgy
> sasha.do...@gmail.com



Re: Indexes on heterogeneous rows

2011-04-14 Thread aaron morton
Need to clear up some terminology here. 

Rows have a key and can be retrieved by key. This is *sort of* the primary 
index, but not primary in the normal RDBMS sense. 
Rows can have different columns and the column names are sorted and can be 
efficiently selected.
There are "secondary indexes" in cassandra 0.7 based on column values 
http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes

So you could create secondary indexes on the a,e, and h columns and get rows 
that have specific values. There are some limitations to secondary indexes, 
read the linked article. 
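
For what it's worth, a query against the built-in indexes looks roughly like this
from pycassa (untested sketch, assuming a CF with KEYS indexes on the columns used):

import pycassa
from pycassa.index import create_index_expression, create_index_clause

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'Objects')

# both expressions are equality; at least one column needs a KEYS index
clause = create_index_clause([create_index_expression('type', '2'),
                              create_index_expression('e', '5')],
                             count=100)
rows = cf.get_indexed_slices(clause)  # matching keys with their columns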

Or you can make your own secondary indexes using row keys as the index values.

If you have billions of rows, how many do you need to read back at once?

Hope that helps
Aaron

On 14 Apr 2011, at 04:23, David Boxenhorn wrote:

> Is it possible in 0.7.x to have indexes on heterogeneous rows, which have 
> different sets of columns?
> 
> For example, let's say you have three types of objects (1, 2, 3) which each 
> had three members. If your rows had the following pattern
> 
> type=1 a=? b=? c=?
> type=2 d=? e=? f=?
> type=3 g=? h=? i=?
> 
> could you index "type" as your primary index, and also index "a", "e", "h" as 
> secondary indexes, to get the objects of that type that you are looking for?
> 
> Would it work if you had billions of rows of each type?



Re: database design

2011-04-14 Thread aaron morton
In some cases you may be able to add a secondary index. 

Aaron

On 14 Apr 2011, at 03:11, Edward Capriolo wrote:

> On Wed, Apr 13, 2011 at 10:39 AM, Jean-Yves LEBLEU  wrote:
>> Hi all,
>> 
>> Just some thoughts and question I have about cassandra data modeling.
>> 
>> If I understand well, Cassandra is better at writing than at reading.
>> So you have to think about your queries to design cassandra schema. We
>> are doing incremental design, and already have our system in
>> production and we have to develop new queries.
>> What do you usually do when you have new queries? Do you write a
>> specific job to update data in the database to match the new query you
>> are writing?
>> 
>> Thanks for your help.
>> 
>> Jean-Yves
>> 
> 
> Good point. Generally you will need to write some type of range
> scanning/map reduce application to process and back-fill your data.
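
A rough pycassa sketch of that kind of range-scanning back-fill job (untested;
Users and UsersByCountry are made-up CF names):

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
users = pycassa.ColumnFamily(pool, 'Users')
by_country = pycassa.ColumnFamily(pool, 'UsersByCountry')

# walk every existing row and populate the new index for the new query
for key, columns in users.get_range():
    country = columns.get('country')
    if country:
        by_country.insert(country, {key: ''})  # column name = user row key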



Re: Cassandra Database Modeling

2011-04-14 Thread aaron morton
WRT your query, it depends on how big a slice you want to get and how time critical 
it is. e.g. Could you be making queries that would return all 10M pairs ? Or 
would the queries generally want to get some small fraction of the data set? 
Again, depends on how the sim runs.

If your sim has stop-the-world pauses where you have a full view of the data 
space, then you could grab all the points at a certain distance and efficiently 
pack them up. Where "efficiently" means not using JSON.
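
For example, a fixed binary layout packs much tighter than JSON; a tiny Python
sketch (the field layout here is made up):

import struct

# one pair as two 4-byte floats: distance, angle (8 bytes total)
packed = struct.pack('>ff', 12.5, 0.785)

# many pairs can be concatenated into a single column value and sliced apart later
distance, angle = struct.unpack('>ff', packed)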

http://wiki.apache.org/cassandra/LargeDataSetConsiderations
http://wiki.apache.org/cassandra/CassandraLimitations
 
Aaron

On 13 Apr 2011, at 15:48, csharpplusproject wrote:

> Aaron,
> 
> Thank you so much for your help. It is greatly appreciated!
> 
> Looking at the design of the particle pairs:
>> 
>> - key: experiment_id.time_interval 
>> - column name: pair_id 
>> - column value: distance, angle, other data packed together as JSON or some 
>> other format
> 
> You wrote that retrieving millions of columns (I will have about 10,000,000 
> particle pairs) would be slow. You are also right that the retrieval of 
> millions of columns into Python, won't be fast.
> 
> If my desired query is to get "all particle pairs on time interval [ 
> Tn..T(n+1) ] where the distance between the two particles is smaller than X 
> and the angle between the two particles is greater than Y".
> 
> In such a query (as the above), given the fact that retrieving millions of 
> columns could be slow, would it be best to say 'concatenate' all values for 
> all particle pairs for a given 'experiment_id.time_interval' into one column?
> 
> If data is stored in this way, I will be getting from Cassandra a binary 
> string / JSON Object that I will have to 'unpack' in my application. Is this 
> a recommended approach? are there better approaches?
> 
> Is there a limit to the size that can be stored in one 'cell' (by 'cell' I 
> mean the intersection between a key and a data column)? is there a limit to 
> the size of data of one key?  one data column?
> 
> Thanks in advance for any help / guidance.
> 
> -Original Message-
> From: aaron morton 
> Reply-to: user@cassandra.apache.org
> To: user@cassandra.apache.org
> Subject: Re: Cassandra Database Modeling
> Date: Wed, 13 Apr 2011 10:14:21 +1200
> 
> Yes, 'interactive' == real-time queries. Hadoop-based techniques are for non-
> time-critical queries, but they do have greater analytical capabilities.  
> 
> particle_pairs: 1) Yes and no and sort of. Under the hood the get_slice api 
> call will be used by your client library to pull back chunks of (ordered) 
> columns. Most client libraries abstract away the chunking for you.  
> 
> 2) If you are using a packed structure like JSON then no, Cassandra will have 
> no idea what you've put in the columns other than bytes. It really depends 
> on how much data you have per pair, but generally it's easier to pull back 
> more data than try to get exactly what you need. Downside is you have to 
> update all the data.  
> 
> 3) No, you would need to update all the data for the pair. I was assuming 
> most of the data was written once, and that your simulation had something 
> like a stop-the-world phase between time slices where state was dumped and 
> then read to start the next interval. You could either read it first, or we 
> can come up with something else. 
> 
> distance_cf 1) the query would return a list of columns, which have a name 
> and value (as well as a timestamp and ttl). 2) depends on the client library, 
> if using python go for https://github.com/pycassa/pycassa It will return 
> objects  3) returning millions of columns is going to be slow, would also be 
> slow using a RDBMS. Creating millions objects in python is going to be slow. 
> You would need to have a better idea of what queries you will actually want 
> to run to see if it's *too* slow. If it is one approach is to store the 
> particles at the same distance in the same column, so you need to read less 
> columns. Again depends on how your sim works. Time complexity depends on 
> the number of columns read. Finding a row will not be O(1) as it may have 
> to read from several files. Writes are more constant than reads. But 
> remember, you can have a lot of io and cpu power in your cluster. 
> 
> Best advice is to jump in and see if the data model works for you at a small 
> single-node scale; most performance issues can be solved.  
> 
> Aaron 
> On 12 Apr 2011, at 15:34, csharpplusproject wrote: 
>> Hi Aaron,
>> 
>> Yes, of course it helps, I am starting to get a flavor of Cassandra -- thank 
>> you very much!
>> 
>> First of all, by 'interactive' queries, are you referring to 'real-time' 
>> queries? (meaning, where experiment data is 'streaming', data needs to be 
>> stored and following that, the query needs to be run in real time)?
>> 
>> Looking at the design of the particle pairs:
>> 
>> - key: experiment_id.time_interval 
>> - column name: pair_id 
>> - colum

Re: Update the Keyspace replication factor online

2011-04-14 Thread aaron morton
It looks like you are dropping DC1, in that case perhaps you could just move 
the nodes from DC1 into DC 3. 

I *think* in your case if you made the RF change, ran repair on them, and 
worked at Quorum or ALL, your clients would be ok. *BUT* I've not done this 
myself, please take care or ask for a grown-up to help.
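
(For example, with pycassa you can pin a column family handle to QUORUM while the
change settles; untested sketch with made-up keyspace/CF names:)

import pycassa
from pycassa.cassandra.ttypes import ConsistencyLevel

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'MyCF',
                          read_consistency_level=ConsistencyLevel.QUORUM,
                          write_consistency_level=ConsistencyLevel.QUORUM)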

The warning about down time during repair has to do with the potential impact 
of repair slowing nodes way down.

Hope that helps 
Aaron
 
On 13 Apr 2011, at 16:00, Yudong Gao wrote:

> Thanks for the reply, Aaron!
> 
> On Tue, Apr 12, 2011 at 10:52 PM, aaron morton  
> wrote:
>> Are you changing the replication factor or moving nodes ?
> 
> I am just changing the replication factor, without touching the node
> configuration.
> 
>> 
>> To change the RF you need to repair and then once all repairing is done run 
>> cleanup to remove the old data.
> 
> Do I need to shutdown the cluster when running the repair? If I just
> repair the nodes one by one, will some users get the error of no data
> exists, if the node responsible for the new replica is not yet
> repaired?
> 
> Yudong
> 
>> 
>> You can move whole nodes by moving all their data with them, assigning a new 
>> ip, and updating the topology file if used.
>> 
>> Aaron
>> 
>> On 13 Apr 2011, at 07:56, Yudong Gao wrote:
>> 
>>> Hi,
>>> 
>>> What operations will be executed (and what is the associated overhead)
>>> when the Keyspace replication factor is changed online, in a
>>> multi-datacenter setup with NetworkTopologyStrategy?
>>> 
>>> I checked the wiki and the archive of the mailing list and find this,
>>> but it is not very complete.
>>> 
>>> http://wiki.apache.org/cassandra/Operations
>>> "
>>> Replication factor is not really intended to be changed in a live
>>> cluster either, but increasing it may be done if you (a) use
>>> ConsistencyLevel.QUORUM or ALL (depending on your existing replication
>>> factor) to make sure that a replica that actually has the data is
>>> consulted, (b) are willing to accept downtime while anti-entropy
>>> repair runs (see below), or (c) are willing to live with some clients
>>> potentially being told no data exists if they read from the new
>>> replica location(s) until repair is done.
>>> "
>>> 
>>> More specifically, in this scenario:
>>> 
>>> {DC1:1, DC2:1} -> {DC2:1, DC3:1}
>>> 
>>> 1. Can this be done online without shutting down the cluster? I
>>> thought there is an "update keyspace" command in the cassandra-cli.
>>> 
>>> 2. If so, what operations will be executed? Will new replicas be
>>> created in new locations (in DC3) and existing replicas be deleted in
>>> old locations (in DC1)?
>>> 
>>> 3. Or they will be updated only with read with ConssitencyLevel.QUORUM
>>> or All, or "nodetool repair"?
>>> 
>>> Thanks!
>>> 
>>> Yudong
>> 
>> 



Re: flush_largest_memtables_at messages in 7.4

2011-04-14 Thread Peter Schuller
And btw I've been assuming your reads are running without those 250k
column inserts going at the same time. It would be difficult to see
what's going on if you have both of those traffic patterns at the same
time.


-- 
/ Peter Schuller


Re: flush_largest_memtables_at messages in 7.4

2011-04-14 Thread Peter Schuller
And btw you can also directly read out the average wait time column
and average latency column of iostat too, to confirm that individual
i/o requests are taking longer than on a non-saturated drive.

-- 
/ Peter Schuller


Re: flush_largest_memtables_at messages in 7.4

2011-04-14 Thread Peter Schuller
> Actually when I run 2 stress clients in parallel I see Read Latency stay the
> same. I wonder if cassandra is reporting accurate nos.

Or you're just bottlenecking on something else. Are you running the
extra stress client on different machines for example, so that the
client isn't just saturating?

> I understand your analogy but for some reason I don't see that happening
> with the results I am seeing with multiple stress clients running. So I am
> just confused where the real bottleneck is.

If your queue size to your device is consistently high (you were
mentioning numbers in the ~100 range), you're saturating on disk,
period. Unless your disk is a 500-drive RAID volume and 100 requests
represent 1/5 of capacity... (If you have a RAID volume with a few
disks or an ssd, you want to switch to the noop or deadline scheduler
btw.)

-- 
/ Peter Schuller


Re: raid 0 and ssd

2011-04-14 Thread Terje Marthinussen
Hm...

You should notice that unless you have TRIM, which I don't think any OS
supports with any RAID functionality yet, then once you have written once to
the whole SSD, it is always full!

That is, when you delete a file, you don't "clear" the blocks on the SSD, so
as far as the SSD goes, the data is still there.

The latest SSDs are pretty good at dealing with this though, and some can be
made a lot better by allocating extra spare block area for GC.

Also be careful with RAID and things like scrubbing or initialization of
the RAID. This may very well "fill it 100%" :)

Terje

On Thu, Apr 14, 2011 at 2:02 PM, Drew Kutcharian  wrote:

> RAID 0 is the fastest, but you'll lose the whole array if you lose a drive.
> One thing to keep in mind is that SSDs get slower as they get filled up and
> closer to their capacity due to garbage collection.
>
> If you want more info on how SSDs perform in general, the Percona guys have
> done extensive tests (in addition to comparing all the RAID levels, etc.).
>
> http://www.percona.com/docs/wiki/benchmark:ssd:start
>
> http://www.mysqlperformanceblog.com/2009/05/01/raid-vs-ssd-vs-fusionio/ (see 
> the "RELATED SEARCHES" on the right side too)
>
> - Drew
>
>
> On Apr 13, 2011, at 9:42 PM, Anurag Gujral wrote:
>
> > Hi All,
> >We are using three SSD disks with Cassandra 0.7.3; should we
> > set them as RAID 0? What are the advantages and disadvantages of doing this?
> > Please advise.
> >
> > Thanks
> > Anurag
>
>