Re: Tombstone lifespan after multiple deletions

2011-01-17 Thread Ryan King
On Sun, Jan 16, 2011 at 6:53 AM, David Boxenhorn  wrote:
> If I delete a row, and later on delete it again, before GCGraceSeconds has
> elapsed, does the tombstone live longer?

Each delete is a new tombstone, which should answer your question.
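Concretely (a rough timeline, assuming GCGraceSeconds = 10 days and that a
compaction actually runs after the grace period expires):

day 1:  first delete  -> tombstone A (timestamp day 1)
day 5:  second delete -> tombstone B (timestamp day 5)
day 11: tombstone A is eligible to be dropped at the next compaction
day 15: tombstone B is eligible to be dropped at the next compaction

So the marker from the later delete sticks around until day 15.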

-ryan

> In other words, if I have the following scenario:
>
> GCGraceSeconds = 10 days
> On day 1 I delete a row
> On day 5 I delete the row again
>
> Will the tombstone be removed on day 10 or day 15?
>


Re: quorum calculation seems to depend on previous selected nodes

2011-01-17 Thread Samuel Benz
On 01/17/2011 09:28 PM, Jonathan Ellis wrote:
> On Mon, Jan 17, 2011 at 2:10 PM, Samuel Benz  wrote:
 Case1:
 If 'TEST' was previous stored on Node1, Node2, Node3 -> The update will
 succeed.

 Case2:
 If 'TEST' was previous stored on Node2, Node3, Node4 -> The update will
 not work.
>>>
>>> If you have RF=2 then it will be stored on 2 nodes, not 3.  I think
>>> this is the source of the confusion.
>>>
>>
>> I checked for the existence of the row on the different servers with
>> sstablekeys after flushing. So I saw three copies of every key in the
>> cluster.
> 
> If you want to be guaranteed to be able to read with two nodes down
> and RF=3, you have to read at CL.ONE, since if the two nodes that are
> down are replicas of the data you are reading (as in the 2nd case
> here) Cassandra will be unable to achieve quorum (quorum of 3 is 2
> live nodes).
> 

Now it seems clear to me. Thanks!

I was confused by the fact that "live nodes" != "live replica nodes"

Correct me if I'm wrong, but even in a cluster with 1000 nodes and RF=3,
if I shut down the wrong two nodes, I have the same problem as in my
mini cluster.
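For reference, the arithmetic behind this: quorum = (RF / 2) + 1 (integer
division), so RF=3 gives a quorum of 2. The count only ever considers the RF
replicas that own the key in question, so with RF=3, losing the two nodes
holding a given key's replicas makes QUORUM operations on that key fail no
matter how many other nodes in the cluster are up.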


--
Sam


Re: What is the best possible client option available to a PHP developer for implementing an application ready for production environments?

2011-01-17 Thread Tyler Hobbs
>
> 1. )  Is it developed enough to support all the
> necessary features to take full advantage of Cassandra?
>

Yes.  There aren't some of the niceties of pycassa yet, but you can do
everything that Cassandra offers with it.


> 2. )  Is it used in production by anyone?
>

Yes, I've talked to a few people at least who are using it in production.
It tends to play a limited role instead of a central one, though.


> 3. )  What are its limitations?
>

Being written in PHP.  Seriously.  The lack of universal 64-bit integer
support can be problematic if you don't have a fully 64-bit system.  PHP is
fairly slow.  PHP makes a few other things less easy to do.  If you're doing
some pretty lightweight interaction with Cassandra through PHP, these might
not be a problem for you.

- Tyler


Re: What is the best possible client option available to a PHP developer for implementing an application ready for production environments?

2011-01-17 Thread Rajkumar Gupta
Hey Brandon,

1. )  Is it developed enough to support all the
necessary features to take full advantage of Cassandra?
2. )  Is it used in production by anyone?
3. )  What are its limitations?

Thanks.

On Tue, Jan 18, 2011 at 7:11 AM, Brandon Williams  wrote:
> On Mon, Jan 17, 2011 at 7:22 PM, Ertio Lew  wrote:
>>
>> What would be the best client option to go with in order to use
>
> Pycassa. https://github.com/thobbs/pycassa
>
>>
>> Cassandra through an application to be implemented in PHP.
>
> Oh.  Then https://github.com/thobbs/phpcassa
> -Brandon


Re: What is the best possible client option available to a PHP developer for implementing an application ready for production environments?

2011-01-17 Thread Brandon Williams
On Mon, Jan 17, 2011 at 7:22 PM, Ertio Lew  wrote:

> What would be the best client option to go with in order to use
>

Pycassa. https://github.com/thobbs/pycassa


> Cassandra through an application to be implemented in PHP.
>

Oh.  Then https://github.com/thobbs/phpcassa

-Brandon


What is the best possible client option available to a PHP developer for implementing an application ready for production environments?

2011-01-17 Thread Ertio Lew
What would be the best client option to go with in order to use
Cassandra through an application implemented in PHP?

It seems that PHP developers face a high barrier to entry into
Cassandra's world because of the lack of relatively mature,
well-developed and well-proven client options (like Hector for Java
developers) that fit the requirements and provide the features needed
for production environments.

In this case, what would be the best option for using Cassandra in
production? Implementing the application in a different language such
as Java, or using the Thrift library directly?

I know most of the Cassandra implementations are Java based, so would
that be preferable? It's not very easy for small companies with limited
manpower to go with a Java-based application.

I am unable to make an easy decision. Please help me make a more
performance-centered decision for the application.

Thanks.
Ertio Lew


P.S. In any case, if you suggest a client option please list any major
implementations of that.


Re: please help with multiget

2011-01-17 Thread Aaron Morton
If you can provide some more information on a specific use case we may be able to help with the modelling.

The general approach is to denormalise the data to the point where each request/activity/feature in your application results in a call to get data from one or more rows in one CF. It's not always possible, it's just the goal I use when modelling. I also lean towards making fewer calls that return more data, rather than more calls that return the exact amount of data. IMHO additional filtering and ordering on the client side will reduce server load at scale.

You may be able to use a multiget for a super_column for multiple rows, which will return the super columns and (their potentially different) list of columns. Or if the rows have only a few standard columns, pull back all columns for the rows.

Hope that helps.
Aaron

On 18 Jan, 2011, at 01:53 PM, Shu Zhang wrote:

Here's the method declaration for quick reference:
map<string, list<ColumnOrSuperColumn>> multiget_slice(string keyspace, list<string> keys, ColumnParent column_parent, SlicePredicate predicate, ConsistencyLevel consistency_level)

It looks like you must have the same SlicePredicate for every key in your batch retrieval, so what are you supposed to do when you need to retrieve different columns for different keys? I mean, it seems like to fully take advantage of Cassandra's data structure, you often want to put dynamic data as column names, and different rows may have totally different column names. That's pretty standard practice, right? Then it seems like you should be able to batch get-requests mapping different slice predicates to different keys in an efficient way.

The only way I can think of to retrieve different columns for different keys (besides breaking them into individual requests) is to set the SlicePredicate so that you retrieve entire rows and then parse it on the client side... but that seems a little inefficient and a bit of a pain. Is that what people do? I can see this not being TOO much more inefficient since a single row is always kept together physically.

I haven't found a lot of other complaints about this so maybe I am missing something. But a get request takes a key and a column path, so it seems like a batch-get should allow you to specify any combination of key-columnPath or key-slicePredicate pairs. I mean, intuitive design-wise, for any batch operation, it makes sense to allow for batching together any number of corresponding non-batch operations. I.e. if I can make a non-batch get request for (key1, colName1), and I can make a non-batch get request for (key2, colName2), then I should be able to make a batch request for (key1, colName1) and (key2, colName2).

Furthermore, a batch-get method signature like

map<string, list<ColumnOrSuperColumn>> multiget_slice(string keyspace, map<string, SlicePredicate> predicate_map, ConsistencyLevel consistency_level)

looks a lot more symmetrical to the batch_mutate method
void batch_mutate(string keyspace, map<string, map<string, list<Mutation>>> mutation_map, ConsistencyLevel consistency_level)

Thoughts? 

Thanks,
Shu



Re: please help with multiget

2011-01-17 Thread Brandon Williams
On Mon, Jan 17, 2011 at 6:53 PM, Shu Zhang  wrote:

> Here's the method declaration for quick reference:
> map<string, list<ColumnOrSuperColumn>> multiget_slice(string keyspace,
> list<string> keys, ColumnParent column_parent, SlicePredicate predicate,
> ConsistencyLevel consistency_level)
>
> It looks like you must have the same SlicePredicate for every key in your
> batch retrieval, so what are you supposed to do when you need to retrieve
> different columns for different keys?


Issue multiple gets in parallel yourself.  Keep in mind that multiget is not
an optimization, in fact, it can work against you when one key exceeds the
rpc timeout, because you get nothing back.
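As a minimal sketch of "issue multiple gets in parallel yourself" against the
raw Thrift API (0.7-style calls; the host, keyspace, column family and column
names are placeholders, and a real application would normally reuse pooled
connections instead of opening one per task as done here for brevity):

import java.nio.ByteBuffer;
import java.util.*;
import java.util.concurrent.*;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnPath;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class ParallelGets {
    public static void main(String[] args) throws Exception {
        // Each key asks for a different column, which a single
        // multiget_slice call cannot express.
        Map<String, String> keyToColumn = new HashMap<String, String>();
        keyToColumn.put("key1", "colName1");
        keyToColumn.put("key2", "colName2");

        ExecutorService pool = Executors.newFixedThreadPool(keyToColumn.size());
        List<Future<ColumnOrSuperColumn>> results =
            new ArrayList<Future<ColumnOrSuperColumn>>();

        for (final Map.Entry<String, String> e : keyToColumn.entrySet()) {
            results.add(pool.submit(new Callable<ColumnOrSuperColumn>() {
                public ColumnOrSuperColumn call() throws Exception {
                    // Thrift clients are not thread safe, so each task uses
                    // its own connection (a pool would be used in practice).
                    TFramedTransport transport =
                        new TFramedTransport(new TSocket("localhost", 9160));
                    transport.open();
                    try {
                        Cassandra.Client client =
                            new Cassandra.Client(new TBinaryProtocol(transport));
                        client.set_keyspace("MyKeyspace");
                        ColumnPath path = new ColumnPath("MyColumnFamily");
                        path.setColumn(ByteBuffer.wrap(e.getValue().getBytes("UTF-8")));
                        return client.get(ByteBuffer.wrap(e.getKey().getBytes("UTF-8")),
                                          path, ConsistencyLevel.QUORUM);
                    } finally {
                        transport.close();
                    }
                }
            }));
        }

        for (Future<ColumnOrSuperColumn> f : results)
            System.out.println(f.get().getColumn());
        pool.shutdown();
    }
}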

-Brandon


please help with multiget

2011-01-17 Thread Shu Zhang
Here's the method declaration for quick reference:
map<string, list<ColumnOrSuperColumn>> multiget_slice(string keyspace, 
list<string> keys, ColumnParent column_parent, SlicePredicate predicate, 
ConsistencyLevel consistency_level)

It looks like you must have the same SlicePredicate for every key in your batch 
retrieval, so what are you supposed to do when you need to retrieve different 
columns for different keys? I mean, it seems like to fully take advantage of 
Cassandra's data structure, you often want to put dynamic data as column names, 
and different rows may have totally different column names. That's pretty 
standard practice, right? Then it seems like you should be able to batch 
get-requests mapping different slice predicates to different keys in an 
efficient way.

The only way I can think of to retrieve different columns for different keys 
(besides breaking them into individual requests) is to set the SlicePredicate 
so that you retrieve entire rows and then parse it on the client side... but 
that seems a little inefficient and a bit of a pain. Is that what people do? I 
can see this not being TOO much more inefficient since a single row is always 
kept together physically.

I haven't found a lot of other complaints about this so maybe I am missing 
something. But a get request takes a key and a column path, so it seems like a 
batch-get should allow you to specify any combination of key-columnPath or 
key-slicePredicate pairs. I mean, intuitive design-wise, for any batch 
operation, it makes sense to allow for batching together any number of 
corresponding non-batch operations. I.e. if I can make a non-batch get request 
for (key1, colName1), and I can make a non-batch get request for (key2, 
colName2), then I should be able to make a batch request for (key1, colName1) 
and (key2, colName2).

Furthermore, a batch-get method signature like

map<string, list<ColumnOrSuperColumn>> multiget_slice(string keyspace, 
map<string, SlicePredicate> predicate_map, ConsistencyLevel 
consistency_level)

looks a lot more symmetrical to the batch_mutate method
void batch_mutate(string keyspace, map<string, map<string, list<Mutation>>> 
mutation_map, ConsistencyLevel consistency_level)

Thoughts? 

Thanks,
Shu



Re: Super CF or two CFs?

2011-01-17 Thread Brandon Williams
On Mon, Jan 17, 2011 at 5:12 PM, Steven Mac  wrote:

>  I guess I was maybe trying to simplify the question too much. In reality I
> do not have one volatile part, but multiple ones (say all the trading data
> of the day). Each would be a supercolumn identified by the time slot, with
> the individual fields as subcolumns.
>

If you're always going to write these attributes in one shot, then just
serialize them and use a simple CF, there's no need for a SCF.

-Brandon


RE: Super CF or two CFs?

2011-01-17 Thread Steven Mac

I guess I was maybe trying to simplify the question too much. In reality I do 
not have one volatile part, but multiple ones (say all the trading data of the 
day). Each would be a supercolumn identified by the time slot, with the 
individual fields as subcolumns.

Of course, I could prefix the time slot identifier to the field names and make 
do with a normal CF, but couldn't this be done for any super column? In other 
words, why have it at all?
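As a quick illustration of that prefixing approach (the row key, time slots
and field names below are made up):

Row key "IBM" in a single standard CF:
  "static:company_name"   -> "International Business Machines"
  "static:address"        -> "..."
  "20110117-2135:price"   -> "150.12"
  "20110117-2135:volume"  -> "1834"
  "20110117-2140:price"   -> "150.31"

With a comparator that sorts the prefixes sensibly, the volatile columns stay
grouped by time slot, and a single get_slice on the row still returns
everything.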

Steven.

> Date: Mon, 17 Jan 2011 22:58:14 +
> Subject: Re: Super CF or two CFs?
> From: stephen.alan.conno...@gmail.com
> To: user@cassandra.apache.org
> 
> On 17 January 2011 22:36, Steven Mac  wrote:
> > Sure, consider stock data, where the stock symbol is the row key. The stock
> > data consists of a rather stable part and a very volatile part, both of
> > which would be a super column. The stable super column would contain
> > subcolumns such as company name, address, and some annual or quarterly data.
> > The volatile super column would contain periodic stock data, such as current
> > price, last trade times, volumes, buyers, sellers, etc.
> >
> > The volatile super columns would be updated every few minutes, many rows at
> > once using a single batch_mutate. The data would be read using a get on a
> > single row key, returning both supercolumns and all subcolumns.
> >
> > The data could also be split over two column families, one for the stable
> > part and one for the volatile part. The updates would be the same, while a
> > read would require two get operations.
> 
> I'm not seeing why you need to use supercolumns for this at all.
> 
> Standard columns would seem just fine in this case (as long as you
> have good naming for your columns)
> 
> And you probably only need one column family... but people more expert
> than me could advise better...
> 
> I guess the question I have is why you feel the solution should
> involve supercolumns
> 
> -Stephen
> 
> >
> > Regards, Steven.
> >
> > 
> > Date: Mon, 17 Jan 2011 12:20:46 -0800
> > Subject: Re: Super CF or two CFs?
> > From: davevi...@gmail.com
> > To: user@cassandra.apache.org
> >
> > can you give an example of the data and how you'd access it?
> > what would your expected columns (and/or supercolumns) be?
> >
> > Dave Viner
> > On Mon, Jan 17, 2011 at 11:05 AM, Steven Mac  wrote:
> >
> > How can I best map an object containing two maps, one of which is updated
> > very frequently and the other only occasionally?
> >
> > a) As one super CF, with each map in a separate supercolumn and the map
> > entries being the subcolumns?
> > b) As two CFs, one for each map.
> >
> > I'd like to discuss the why behind a choice, in order to learn about the
> > impact of a design choice on performance, SStable size/disk usage,
> > compactions, etc.
> >
> > Regards, Steven.
> >
> > PS: Objects will always be read as a whole.
> >
  

Re: Super CF or two CFs?

2011-01-17 Thread Stephen Connolly
On 17 January 2011 22:36, Steven Mac  wrote:
> Sure, consider stock data, where the stock symbol is the row key. The stock
> data consists of a rather stable part and a very volatile part, both of
> which would be a super column. The stable super column would contain
> subcolumns such as company name, address, and some annual or quarterly data.
> The volatile super column would contain periodic stock data, such as current
> price, last trade times, volumes, buyers, sellers, etc.
>
> The volatile super columns would be updated every few minutes, many rows at
> once using a single batch_mutate. The data would be read using a get on a
> single row key, returning both supercolumns and all subcolumns.
>
> The data could also be split over two column families, one for the stable
> part and one for the volatile part. The updates would be the same, while a
> read would require two get operations.

I'm not seeing why you need to use supercolumns for this at all.

Standard columns would seem just fine in this case (as long as you
have good naming for your columns)

And you probably only need one column family... but people more expert
than me could advise better...

I guess the question I have is why you feel the solution should
involve supercolumns

-Stephen

>
> Regards, Steven.
>
> 
> Date: Mon, 17 Jan 2011 12:20:46 -0800
> Subject: Re: Super CF or two CFs?
> From: davevi...@gmail.com
> To: user@cassandra.apache.org
>
> can you give an example of the data and how you'd access it?
> what would your expected columns (and/or supercolumns) be?
>
> Dave Viner
> On Mon, Jan 17, 2011 at 11:05 AM, Steven Mac  wrote:
>
> How can I best map an object containing two maps, one of which is updated
> very frequently and the other only occasionally?
>
> a) As one super CF, with each map in a separate supercolumn and the map
> entries being the subcolumns?
> b) As two CFs, one for each map.
>
> I'd like to discuss the why behind a choice, in order to learn about the
> impact of a design choice on performance, SStable size/disk usage,
> compactions, etc.
>
> Regards, Steven.
>
> PS: Objects will always be read as a whole.
>


RE: Super CF or two CFs?

2011-01-17 Thread Steven Mac

Sure, consider stock data, where the stock symbol is the row key. The stock 
data consists of a rather stable part and a very volatile part, both of which 
would be a super column. The stable super column would contain subcolumns such 
as company name, address, and some annual or quarterly data. The volatile super 
column would contain periodic stock data, such as current price, last trade 
times, volumes, buyers, sellers, etc.

The volatile super columns would be updated every few minutes, many rows at 
once using a single batch_mutate. The data would be read using a get on a 
single row key, returning both supercolumns and all subcolumns.

The data could also be split over two column families, one for the stable part 
and one for the volatile part. The updates would be the same, while a read 
would require two get operations.

Regards, Steven.

Date: Mon, 17 Jan 2011 12:20:46 -0800
Subject: Re: Super CF or two CFs?
From: davevi...@gmail.com
To: user@cassandra.apache.org

can you give an example of the data and how you'd access it? what would your 
expected columns (and/or supercolumns) be?

Dave Viner
On Mon, Jan 17, 2011 at 11:05 AM, Steven Mac  wrote:






How can I best map an object containing two maps, one of which is updated very 
frequently and the other only occasionally?

a) As one super CF, with each map in a separate supercolumn and the map 
entries being the subcolumns?

b) As two CFs, one for each map.

I'd like to discuss the why behind a choice, in order to learn about the impact 
of a design choice on performance, SStable size/disk usage, compactions, etc.

Regards, Steven.


PS: Objects will always be read as a whole. 
  

  

Re: Cassandra GC Settings

2011-01-17 Thread Peter Schuller
> Now, a full stop of the application was what I was seeing extensively before
> (100-200 times over the course of a major compaction as reported by
> gossipers on other nodes). I have also just noticed that the previous
> instability (i.e. application stops) correlated with the compaction of a few
> column families characterized by fairly fat rows (10 MB mean size, max sizes
> of 150-200 MB, up to a million+ columns per row). My theory is that each row
> being compacted with the old settings was being promoted to the old
> generation, thereby running the heap out of space and causing a stop the
> world gc. With the new settings, rows being compacted typically remain in
> the young generation, allowing them to be cleaned up more quickly with less
> effort on the part of the garbage collector. Does this theory sound
> reasonable?

Sounds reasonable I think. In addition to sizing the young gen, decreasing:

   in_memory_compaction_limit_in_mb: 64

from the default of 64 might help here I suppose.

-- 
/ Peter Schuller


Re: Cassandra GC Settings

2011-01-17 Thread SriSatish Ambati
Thanks, Dan:

Yes, -Xmn512M/1G sizes the Young Generation explicitly and takes the
adaptive resizing out of the picture. (If at all possible send your gc log
over & we can analyze the promotion failure a little bit more finely.)
The low load implies that you are able to use the parallel threads
effectively.

cheers,
Sri

On Mon, Jan 17, 2011 at 9:05 PM, Dan Hendry wrote:

> Thanks for all the info, I think I have been able to sort out my issue. The
> new settings I am using are:
>
> -Xmn512M (Very important I think)
> -XX:SurvivorRatio=5 (Not very important I think)
> -XX:MaxTenuringThreshold=5
> -XX:ParallelGCThreads=8
> -XX:CMSInitiatingOccupancyFraction=75
>
> Since applying these settings, the one time I saw the same type of behavior
> as before, the following appeared in the GC log.
>
>   Total time for which application threads were stopped: 0.6830080 seconds
>   1368.201: [GC 1368.201: [ParNew (promotion failed)
>   Desired survivor size 38338560 bytes, new threshold 1 (max 5)
>   - age   1:   55799736 bytes,   55799736 total
>   : 449408K->449408K(449408K), 0.2618690 secs]1368.463: [CMS1372.459:
> [CMS-concurrent-mark: 7.930/9.109 secs] [Times: user=28.31 sys=0.66, real=9.11 secs]
>(concurrent mode failure): 9418431K->6267709K(11841536K), 26.4973750
> secs] 9777393K->6267709K(12290944K), [CMS Perm : 20477K->20443K(34188K)],
> 26.7595510 secs] [Times: user=31.75 sys=0.00, real=26.76 secs]
>   Total time for which application threads were stopped: 26.7617560 seconds
>
> Now, a full stop of the application was what I was seeing extensively
> before (100-200 times over the course of a major compaction as reported by
> gossipers on other nodes). I have also just noticed that the previous
> instability (i.e. application stops) correlated with the compaction of a few
> column families characterized by fairly fat rows (10 MB mean size, max sizes
> of 150-200 MB, up to a million+ columns per row). My theory is that each row
> being compacted with the old settings was being promoted to the old
> generation, thereby running the heap out of space and causing a stop the
> world gc. With the new settings, rows being compacted typically remain in
> the young generation, allowing them to be cleaned up more quickly with less
> effort on the part of the garbage collector. Does this theory sound
> reasonable?
>
> Answering some of the other questions:
>
> > disk bound or CPU bound during compaction?
>
> ... Neither (?). Iowait is 10-20%, disk utilization rarely jumps above 60%,
> CPU %idle is about 60%. I would have said that I was memory bound but now I
> think compaction is bounded by being single threaded.
>
> > are you sure you're not swapping a bit?
>
> Only if JNA is not doing its job
>
> > Number of cores on your system. How busy is the system?
>
> 8, load factors typically < 4 so not terribly busy I would say.
>
> On Mon, Jan 17, 2011 at 12:58 PM, Peter Schuller <
> peter.schul...@infidyne.com> wrote:
>
>> > very quickly from the young generation to the old generation".
>> Furthermore,
>> > the CMSInitiatingOccupancyFraction of 75 (from a JVM default of 68)
>> means
>> > "start gc in the old generation later", presumably to allow Cassandra to
>> use
>> > more of the old generation heap without needlessly trying to free up
>> used
>> > space (?). Please correct me if I am misinterpreting these settings.
>>
>> Note the use of -XX:+UseCMSInitiatingOccupancyOnly which causes the
>> JVM to always trigger on that occupancy fraction rather than only do
>> it for the first trigger (or something along those lines) and then
>> switch to heuristics. Presumably (though I don't specifically know the
>> history of this particular option being added) it is more important to
>> avoid doing Full GC:s at all than super-optimally tweaking the trigger
>> for maximum throughput.
>>
>> The heuristics tend to cut it pretty close, and setting a conservative
>> fixed occupancy trigger probably greatly lessens the chance of falling
>> back to a full gc in production.
>>
>> > One of the issues I have been having is extreme node instability when
>> > running a major compaction. After 20-30 seconds of operation, the node
>> > spends 30+ seconds in (what I believe to be) GC. Now I have tried
>> halving
>> > all memtable thresholds to reduce overall heap memory usage but that has
>> not
>> > seemed to help with the instability. After one of these blips, I often
>> see
>> > log entries as follows:
>> >  INFO [ScheduledTasks:1] 2011-01-17 10:41:21,961 GCInspector.java (line
>> 133)
>> > GC for ParNew: 215 ms, 45084168 reclaimed leaving 11068700368 used; max
>> is
>> > 12783583232
>> >  INFO [ScheduledTasks:1] 2011-01-17 10:41:28,033 GCInspector.java (line
>> 133)
>> > GC for ParNew: 234 ms, 40401120 reclaimed leaving 12144504848 used; max
>> is
>> > 12783583232
>> >  INFO [ScheduledTasks:1] 2011-01-17 10:42:15,911 GCInspector.java (line
>> 133)
>> > GC for ConcurrentMarkSweep: 45828 ms, 3350764696 reclaimed leaving
>> > 9224

Re: Do you have a site in production environment with Cassandra? What client do you use?

2011-01-17 Thread Colin Vipurs
Java + Pelops

On Sat, Jan 15, 2011 at 10:58 PM, Dave Viner  wrote:
> Perl using the thrift interface directly.
> On Sat, Jan 15, 2011 at 6:10 AM, Daniel Lundin  wrote:
>>
>> python + pycassa
>> scala + Hector
>>
>> On Fri, Jan 14, 2011 at 6:24 PM, Ertio Lew  wrote:
>> > Hey,
>> >
>> > If you have a site in production environment or considering so, what
>> > is the client that you use to interact with Cassandra. I know that
>> > there are several clients available out there according to the
>> > language you use but I would love to know what clients are being used
>> > widely in production environments and are best to work with(support
>> > most required features for performance).
>> >
>> > Also preferably tell about the technology stack for your applications.
>> >
>> > Any suggestions, comments appreciated ?
>> >
>> > Thanks
>> > Ertio
>> >
>
>



-- 
Maybe she awoke to see the roommate's boyfriend swinging from the
chandelier wearing a boar's head.

Something which you, I, and everyone else would call "Tuesday", of course.


Re: Cassandra GC Settings

2011-01-17 Thread Dan Hendry
Thanks for all the info, I think I have been able to sort out my issue. The
new settings I am using are:

-Xmn512M (Very important I think)
-XX:SurvivorRatio=5 (Not very important I think)
-XX:MaxTenuringThreshold=5
-XX:ParallelGCThreads=8
-XX:CMSInitiatingOccupancyFraction=75

Since applying these settings, the one time I saw the same type of behavior
as before, the following appeared in the GC log.

  Total time for which application threads were stopped: 0.6830080 seconds
  1368.201: [GC 1368.201: [ParNew (promotion failed)
  Desired survivor size 38338560 bytes, new threshold 1 (max 5)
  - age   1:   55799736 bytes,   55799736 total
  : 449408K->449408K(449408K), 0.2618690 secs]1368.463: [CMS1372.459:
[CMS-concurrent-mark: 7.930/9.109 secs] [Times: user=28.31 sys=0.66, real=9.11 secs]
   (concurrent mode failure): 9418431K->6267709K(11841536K), 26.4973750
secs] 9777393K->6267709K(12290944K), [CMS Perm : 20477K->20443K(34188K)],
26.7595510 secs] [Times: user=31.75 sys=0.00, real=26.76 secs]
  Total time for which application threads were stopped: 26.7617560 seconds

Now, a full stop of the application was what I was seeing extensively before
(100-200 times over the course of a major compaction as reported by
gossipers on other nodes). I have also just noticed that the previous
instability (i.e. application stops) correlated with the compaction of a few
column families characterized by fairly fat rows (10 MB mean size, max sizes
of 150-200 MB, up to a million+ columns per row). My theory is that each row
being compacted with the old settings was being promoted to the old
generation, thereby running the heap out of space and causing a stop the
world gc. With the new settings, rows being compacted typically remain in
the young generation, allowing them to be cleaned up more quickly with less
effort on the part of the garbage collector. Does this theory sound
reasonable?
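Some rough arithmetic on the settings and log above (assuming -Xmn512M and
-XX:SurvivorRatio=5): the young generation splits into an eden of about
512 * 5/7 ~= 366 MB plus two survivor spaces of about 73 MB each, and the
"Desired survivor size 38338560 bytes" in the log is the default 50% target
of one 73 MB survivor space. "new threshold 1 (max 5)" means the surviving
objects overflowed that target, so the JVM dropped the tenuring threshold to
1 and promoted nearly everything into the old generation on the next
collection - exactly the promotion pressure described above, and when the
old generation cannot absorb it the result is the "promotion failed" /
"concurrent mode failure" stop seen in the log.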

Answering some of the other questions:

> disk bound or CPU bound during compaction?

... Neither (?). Iowait is 10-20%, disk utilization rarely jumps above 60%,
CPU %idle is about 60%. I would have said that I was memory bound but now I
think compaction is bounded by being single threaded.

> are you sure you're not swapping a bit?

Only if JNA is not doing its job

> Number of cores on your system. How busy is the system?

8, load factors typically < 4 so not terribly busy I would say.

On Mon, Jan 17, 2011 at 12:58 PM, Peter Schuller <
peter.schul...@infidyne.com> wrote:

> > very quickly from the young generation to the old generation".
> Furthermore,
> > the CMSInitiatingOccupancyFraction of 75 (from a JVM default of 68) means
> > "start gc in the old generation later", presumably to allow Cassandra to
> use
> > more of the old generation heap without needlessly trying to free up used
> > space (?). Please correct me if I am misinterpreting these settings.
>
> Note the use of -XX:+UseCMSInitiatingOccupancyOnly which causes the
> JVM to always trigger on that occupancy fraction rather than only do
> it for the first trigger (or something along those lines) and then
> switch to heuristics. Presumably (though I don't specifically know the
> history of this particular option being added) it is more important to
> avoid doing Full GC:s at all than super-optimally tweaking the trigger
> for maximum throughput.
>
> The heuristics tend to cut it pretty close, and setting a conservative
> fixed occupancy trigger probably greatly lessens the chance of falling
> back to a full gc in production.
>
> > One of the issues I have been having is extreme node instability when
> > running a major compaction. After 20-30 seconds of operation, the node
> > spends 30+ seconds in (what I believe to be) GC. Now I have tried halving
> > all memtable thresholds to reduce overall heap memory usage but that has
> not
> > seemed to help with the instability. After one of these blips, I often
> see
> > log entries as follows:
> >  INFO [ScheduledTasks:1] 2011-01-17 10:41:21,961 GCInspector.java (line
> 133)
> > GC for ParNew: 215 ms, 45084168 reclaimed leaving 11068700368 used; max
> is
> > 12783583232
> >  INFO [ScheduledTasks:1] 2011-01-17 10:41:28,033 GCInspector.java (line
> 133)
> > GC for ParNew: 234 ms, 40401120 reclaimed leaving 12144504848 used; max
> is
> > 12783583232
> >  INFO [ScheduledTasks:1] 2011-01-17 10:42:15,911 GCInspector.java (line
> 133)
> > GC for ConcurrentMarkSweep: 45828 ms, 3350764696 reclaimed leaving
> > 9224048472 used; max is 12783583232
>
> 45 seconds is pretty significant even for a 12 gig heap unless you're
> really CPU loaded so that there is heavy contention over the CPU.
> While I don't see anything obviously extreme; are you sure you're not
> swapping a bit?
>
> Also, what do you mean by node instability - does it *completely* stop
> responding during these periods or does it flap in and out of the
> cluster but is still responding?
>
> Are you nodes disk bound or CPU bound during compaction?
>
> --
> / Peter Schuller
>


Re: quorum calculation seems to depend on previous selected nodes

2011-01-17 Thread Jonathan Ellis
On Mon, Jan 17, 2011 at 2:10 PM, Samuel Benz  wrote:
>>> Case1:
>>> If 'TEST' was previous stored on Node1, Node2, Node3 -> The update will
>>> succeed.
>>>
>>> Case2:
>>> If 'TEST' was previous stored on Node2, Node3, Node4 -> The update will
>>> not work.
>>
>> If you have RF=2 then it will be stored on 2 nodes, not 3.  I think
>> this is the source of the confusion.
>>
>
> I checked for the existence of the row on the different servers with
> sstablekeys after flushing. So I saw three copies of every key in the
> cluster.

If you want to be guaranteed to be able to read with two nodes down
and RF=3, you have to read at CL.ONE, since if the two nodes that are
down are replicas of the data you are reading (as in the 2nd case
here) Cassandra will be unable to achieve quorum (quorum of 3 is 2
live nodes).

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Super CF or two CFs?

2011-01-17 Thread Dave Viner
can you give an example of the data and how you'd access it?
what would your expected columns (and/or supercolumns) be?

Dave Viner

On Mon, Jan 17, 2011 at 11:05 AM, Steven Mac  wrote:

>  How can I best map an object containing two maps, one of which is updated
> very frequently and the other only occasionally?
>
> a) As one super CF, with each map in a separate supercolumn and the map
> entries being the subcolumns?
> b) As two CFs, one for each map.
>
> I'd like to discuss the why behind a choice, in order to learn about the
> impact of a design choice on performance, SStable size/disk usage,
> compactions, etc.
>
> Regards, Steven.
>
> PS: Objects will always be read as a whole.
>


Re: about the consistency level

2011-01-17 Thread Aaron Morton
Have you added a Jira for this? Or does anyone else want or not want this 
feature?

I'll try to add it as practice.
Aaron

On 17/01/2011, at 10:15 PM, "raoyixuan (Shandy)"  wrote:

> Thanks a lot.
> 
> From: aaron morton [mailto:aa...@thelastpickle.com] 
> Sent: Monday, January 17, 2011 5:01 PM
> To: user@cassandra.apache.org
> Subject: Re: about the consistency level
> 
>  
> 
> The cassandra-cli works at CL.ONE; currently it cannot be changed. I'm not 
> sure if there is a reason for this, but if it's a feature you would like, add 
> a request to JIRA https://issues.apache.org/jira/browse/CASSANDRA
> 
>  
> 
> In Hector it's part of the m.p.h.api.Keyspace interface as 
> setConsistencyLevelPolicy(), implemented in m.p.c.model.Keyspace and looks 
> like the way to set it is via one of the overloads for 
> m.p.h.api.factory.HFactory.createKeyspace() .  It defaults to the 
> m.p.c.model.QuorumAllConsistencyLevelPolicy which uses Quorum for all 
> operations.
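For illustration, a minimal Hector sketch of the above. The class names are
the ones given in the message; the exact createKeyspace overload may differ
between Hector versions, and the cluster name, host and keyspace name are
placeholders:

import me.prettyprint.cassandra.model.QuorumAllConsistencyLevelPolicy;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

public class ConsistencyExample {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("MyCluster", "localhost:9160");
        // Passing the policy explicitly; omitting it gives the same default
        // (QuorumAllConsistencyLevelPolicy, i.e. QUORUM for all operations).
        Keyspace ks = HFactory.createKeyspace("MyKeyspace", cluster,
                                              new QuorumAllConsistencyLevelPolicy());
    }
}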
> 
>  
> 
> Hope that helps. 
> 
> Aaron
> 
>   
> 
>  
> 
> On 17/01/2011, at 9:23 PM, raoyixuan (Shandy) wrote:
> 
> 
> 
> 
> Both Hector and cassandra-cli.  Can you tell me for each? Thanks a lot.
> 
>  
> 
> From: aaron morton [mailto:aa...@thelastpickle.com] 
> Sent: Monday, January 17, 2011 4:17 PM
> To: user@cassandra.apache.org
> Subject: Re: about the consistency level
> 
>  
> 
> The ConsistenyLevel is passed with each read and write command. 
> 
>  
> 
> How you set it will depend on the client you are using. Which one are you 
> using ? 
> 
>  
> 
> Aaron
> 
>  
> 
> On 17/01/2011, at 8:50 PM, raoyixuan (Shandy) wrote:
> 
> 
> 
> 
> 
> How to set the consistency level in Cassandra 0.7? I mean what command?
> 
>  
> 
>  
> 


Re: quorum calculation seems to depend on previous selected nodes

2011-01-17 Thread Samuel Benz
On 01/17/2011 05:08 PM, Jonathan Ellis wrote:
> On Mon, Jan 17, 2011 at 9:55 AM, Samuel Benz  wrote:
>> We have a cluster with 4 nodes. ReplicationFactor is 2, ReplicaPlacment
>> is the RackAwareStrategy and the EndpointSnitch is the
>> PropertyFileEndpointSnitch (with two data center and two racks each).
>>
>> My understanding is, that with this parameter of the cluster, it should
>> be possible to update with consistency level quorum while one data
>> center (two nodes) are shutdown completely.
> 
> No.  Quorum of 2 is 2.
>

I made a big mistake in my Email! Sorry for the confusion. Of course the
ReplicationFactor of the cluster is 3! (but the rest of the email is
correct)

>> Case1:
>> If 'TEST' was previous stored on Node1, Node2, Node3 -> The update will
>> succeed.
>>
>> Case2:
>> If 'TEST' was previous stored on Node2, Node3, Node4 -> The update will
>> not work.
> 
> If you have RF=2 then it will be stored on 2 nodes, not 3.  I think
> this is the source of the confusion.
> 

I checked for the existence of the row on the different servers with
sstablekeys after flushing. So I saw three copies of every key in the
cluster.

--
Sam


Super CF or two CFs?

2011-01-17 Thread Steven Mac

How can I best map an object containing two maps, one of which is updated very 
frequently and the other only occasionally?

a) As one super CF, with each map in a separate supercolumn and the map 
entries being the subcolumns?
b) As two CFs, one for each map.

I'd like to discuss the why behind a choice, in order to learn about the impact 
of a design choice on performance, SStable size/disk usage, compactions, etc.

Regards, Steven.

PS: Objects will always be read as a whole. 
  

Re: quorum calculation seems to depend on previous selected nodes

2011-01-17 Thread Peter Schuller
> Adding CL.TWO would be easy enough. :)

True, but the obvious generalization is to be able to select an
arbitrary replica count and that seemed like a bigger change to the
API. But if CL.TWO would be considered clean enough... I may submit a
jira/patch.

-- 
/ Peter Schuller


Re: quorum calculation seems to depend on previous selected nodes

2011-01-17 Thread Jonathan Ellis
Adding CL.TWO would be easy enough. :)

On Mon, Jan 17, 2011 at 12:12 PM, Peter Schuller
 wrote:
>> I think you should just tell everybody that if you want to use QUORUM you
>> need RF >= 3 for it to be meaningful.
>>
>> No one would use QUORUM with RF < 3 except in error.
>
> Well, strictly speaking you could have an application designed to talk
> to Cassandra at QUORUM and an operator may choose to deploy it against
> an RF=2 cluster, accepting the fact that the application won't survive
> a node going down. For example, maybe write's are done at QUORUM (for
> durability rather than consistency) and reads at ONE.
>
> (I"ve been meaning to suggest it at some point btw, but consistency
> and durability are different concerns. For durability purposes it
> would be nice to say "require two nodes", rather than having to choose
> between ONE and QUORUM.)
>
> --
> / Peter Schuller
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Cassandra GC Settings

2011-01-17 Thread Jonathan Ellis
On Mon, Jan 17, 2011 at 11:58 AM, Peter Schuller
 wrote:
> 45 seconds is pretty significant even for a 12 gig heap

Note that you really need to uncomment the -XX:PrintGC* arguments to
get a detailed GC log from the jvm before taking guesses at this; the
numbers GCInspector can get are NOT pause times.  High numbers for CMS
are perfectly normal there, but in a healthy system they are
Concurrent (the C in CMS) meaning Cassandra can continue running for
almost all of it.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Internal error when using SimpleSnitch and dynamic_snitch: true

2011-01-17 Thread Jonathan Ellis
Already fixed for 0.7.1 in CASSANDRA-1530.

On Mon, Jan 17, 2011 at 11:29 AM, Jim Ancona  wrote:
> We accidentally configured our cluster with SimpleSnitch (instead of
> PropertyFileSnitch) and dynamic_snitch: true. This is with version
> 0.7.0.
>
> We saw the errors below on get_slice and batch_mutate calls. The
> errors went away when we switched to PropertyFileSnitch.
>
> Should dynamic_snitch work with SimpleSnitch? Should I open a Jira issue?
>
> Jim
>
> ERROR [pool-1-thread-55] 2011-01-14 15:53:45,998 Cassandra.java (line
> 2707) Internal error processing get_slice
> java.lang.UnsupportedOperationException
>        at 
> org.apache.cassandra.locator.SimpleSnitch.getDatacenter(SimpleSnitch.java:40)
>        at 
> org.apache.cassandra.locator.DynamicEndpointSnitch.getDatacenter(DynamicEndpointSnitch.java:94)
>        at 
> org.apache.cassandra.locator.NetworkTopologyStrategy.calculateNaturalEndpoints(NetworkTopologyStrategy.java:87)
>        at 
> org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:99)
>        at 
> org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:1354)
>        at 
> org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:1337)
>        at 
> org.apache.cassandra.service.StorageService.findSuitableEndpoint(StorageService.java:1388)
>        at 
> org.apache.cassandra.service.StorageProxy.weakRead(StorageProxy.java:248)
>        at 
> org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:224)
>        at 
> org.apache.cassandra.thrift.CassandraServer.readColumnFamily(CassandraServer.java:98)
>        at 
> org.apache.cassandra.thrift.CassandraServer.getSlice(CassandraServer.java:195)
>        at 
> org.apache.cassandra.thrift.CassandraServer.multigetSliceInternal(CassandraServer.java:271)
>        at 
> org.apache.cassandra.thrift.CassandraServer.get_slice(CassandraServer.java:233)
>        at 
> org.apache.cassandra.thrift.Cassandra$Processor$get_slice.process(Cassandra.java:2699)
>        at 
> org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
>        at 
> org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>        at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>        at java.lang.Thread.run(Thread.java:636)
>
> ERROR [pool-1-thread-58] 2011-01-12 12:18:58,721 Cassandra.java (line
> 3044) Internal error processing batch_mutate
> java.lang.UnsupportedOperationException
>        at 
> org.apache.cassandra.locator.SimpleSnitch.getDatacenter(SimpleSnitch.java:40)
>        at 
> org.apache.cassandra.locator.DynamicEndpointSnitch.getDatacenter(DynamicEndpointSnitch.java:94)
>        at 
> org.apache.cassandra.locator.NetworkTopologyStrategy.calculateNaturalEndpoints(NetworkTopologyStrategy.java:87)
>        at 
> org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:99)
>        at 
> org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:1354)
>        at 
> org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:1337)
>        at 
> org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:109)
>        at 
> org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:412)
>        at 
> org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:385)
>        at 
> org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3036)
>        at 
> org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
>        at 
> org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>        at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>        at java.lang.Thread.run(Thread.java:636)
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: balancing load

2011-01-17 Thread Edward Capriolo
On Mon, Jan 17, 2011 at 1:20 PM, Karl Hiramoto  wrote:
> On 01/17/11 15:54, Edward Capriolo wrote:
>> Just to head the next possible problem. If you run 'nodetool cleanup'
>> on each node and some of your nodes still have more data than others,
>> then it probably means you are writing the majority of data to a few
>> keys. ( you probably do not want to do that )
>>
>> If that happens, you can use nodetool cfstats on each node and ensure
>> that the 'max row compacted size' is roughly the same on all nodes. If
>> you have one or two really big rows that could explain your imbalance.
>
>
> Well, I did a lengthy repair/cleanup  on each node.  but still have data
> mainly on two nodes (I have RF=2)
>  ./apache-cassandra-0.7.0/bin/nodetool --host host3 ring
> Address         Status State   Load            Owns
> Token
>
> 119098828422328462212181112601118874007
> 10.1.4.10     Up     Normal  347.11 MB       30.00%
> 0
> 10.1.4.12     Up     Normal  49.41 KB        20.00%
> 34028236692093846346337460743176821145
> 10.1.4.13     Up     Normal  54.35 KB        20.00%
> 68056473384187692692674921486353642290
> 10.1.4.15     Up     Normal  19.09 GB        16.21%
> 95643579558861158157614918209686336369
> 10.1.4.14     Up     Normal  15.62 GB        13.79%
> 119098828422328462212181112601118874007
>
>
> in "cfstats" i see:
> Compacted row minimum size: 1131752
> Compacted row maximum size: 8582860529
> Compacted row mean size: 1402350749
>
> on the lowest used node i see:
> Compacted row minimum size: 0
> Compacted row maximum size: 0
> Compacted row mean size: 0
>
> I basically have MyKeyspace.Offer[UID] = value; my "value" is at most
> 500 bytes.
>
> For the UID I just use 12 random alphanumeric characters [A-Z][0-9].
>
> Should I try to adjust my tokens to fix the imbalance, or something else?
>
> I'm using Redhat EL  5.5
>
> java -version
> java version "1.6.0_17"
> OpenJDK Runtime Environment (IcedTea6 1.7.5) (rhel-1.16.b17.el5-x86_64)
> OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
>
> I have some errors in my logs:
>
> ERROR [ReadStage:1747] 2011-01-17 18:13:53,988
> DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor
> java.lang.AssertionError
>        at
> org.apache.cassandra.db.columniterator.SSTableNamesIterator.readIndexedColumns(SSTableNamesIterator.java:178)
>        at
> org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:132)
>        at
> org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNamesIterator.java:70)
>        at
> org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59)
>        at
> org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
>        at
> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1215)
>        at
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1107)
>        at
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1077)
>        at org.apache.cassandra.db.Table.getRow(Table.java:384)
>        at
> org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60)
>        at
> org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:68)
>        at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
>        at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>        at java.lang.Thread.run(Thread.java:636)
> ERROR [ReadStage:1747] 2011-01-17 18:13:53,989
> AbstractCassandraDaemon.java (line 91) Fatal exception in thread
> Thread[ReadStage:1747,5,main]
> java.lang.AssertionError
>        at
> org.apache.cassandra.db.columniterator.SSTableNamesIterator.readIndexedColumns(SSTableNamesIterator.java:178)
>        at
> org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:132)
>        at
> org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNamesIterator.java:70)
>        at
> org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59)
>        at
> org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
>        at
> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1215)
>        at
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1107)
>        at
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1077)
>        at org.apache.cassandra.db.Table.getRow(Table.java:384)
>        at
> org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60)
>        at
> org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:68)
>        at
> org.apache.cassandra.net.MessageDeliveryTask.run(

Re: balancing load

2011-01-17 Thread Karl Hiramoto
On 01/17/11 15:54, Edward Capriolo wrote:
> Just to head the next possible problem. If you run 'nodetool cleanup'
> on each node and some of your nodes still have more data than others,
> then it probably means you are writing the majority of data to a few
> keys. ( you probably do not want to do that )
>
> If that happens, you can use nodetool cfstats on each node and ensure
> that the 'max row compacted size' is roughly the same on all nodes. If
> you have one or two really big rows that could explain your imbalance.


Well, I did a lengthy repair/cleanup  on each node.  but still have data
mainly on two nodes (I have RF=2)
 ./apache-cassandra-0.7.0/bin/nodetool --host host3 ring
Address         Status State   Load            Owns    Token
                                                        119098828422328462212181112601118874007
10.1.4.10       Up     Normal  347.11 MB       30.00%  0
10.1.4.12       Up     Normal  49.41 KB        20.00%  34028236692093846346337460743176821145
10.1.4.13       Up     Normal  54.35 KB        20.00%  68056473384187692692674921486353642290
10.1.4.15       Up     Normal  19.09 GB        16.21%  95643579558861158157614918209686336369
10.1.4.14       Up     Normal  15.62 GB        13.79%  119098828422328462212181112601118874007


in "cfstats" i see:
Compacted row minimum size: 1131752
Compacted row maximum size: 8582860529
Compacted row mean size: 1402350749

on the lowest used node i see:
Compacted row minimum size: 0
Compacted row maximum size: 0
Compacted row mean size: 0

I basically have MyKeyspace.Offer[UID] = value; my "value" is at most
500 bytes.

For the UID I just use 12 random alphanumeric characters [A-Z][0-9].

Should I try to adjust my tokens to fix the imbalance, or something else?
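For reference, with RandomPartitioner the evenly spaced tokens for a 5-node
ring are i * 2^127 / 5. Node 2's token (34028...821145) is exactly 2^127 / 5
and node 3's is twice that, but nodes 4 and 5 (95643...336369 and
119098...874007) fall short of 3x and 4x that value (roughly 1.021e38 and
1.361e38), which is why nodetool reports 16.21% and 13.79% ownership for them.
That said, the cfstats above (compacted row maximum size of ~8.5 GB) suggest
a few very large rows are the bigger cause of the load imbalance, as the
quoted advice points out.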

I'm using Redhat EL  5.5

java -version
java version "1.6.0_17"
OpenJDK Runtime Environment (IcedTea6 1.7.5) (rhel-1.16.b17.el5-x86_64)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)

I have some errors in my logs:

ERROR [ReadStage:1747] 2011-01-17 18:13:53,988
DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor
java.lang.AssertionError
at
org.apache.cassandra.db.columniterator.SSTableNamesIterator.readIndexedColumns(SSTableNamesIterator.java:178)
at
org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:132)
at
org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNamesIterator.java:70)
at
org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59)
at
org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
at
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1215)
at
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1107)
at
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1077)
at org.apache.cassandra.db.Table.getRow(Table.java:384)
at
org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60)
at
org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:68)
at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
ERROR [ReadStage:1747] 2011-01-17 18:13:53,989
AbstractCassandraDaemon.java (line 91) Fatal exception in thread
Thread[ReadStage:1747,5,main]
java.lang.AssertionError
at
org.apache.cassandra.db.columniterator.SSTableNamesIterator.readIndexedColumns(SSTableNamesIterator.java:178)
at
org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:132)
at
org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNamesIterator.java:70)
at
org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59)
at
org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
at
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1215)
at
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1107)
at
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1077)
at org.apache.cassandra.db.Table.getRow(Table.java:384)
at
org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60)
at
org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:68)
at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
at
java.util.concurrent.ThreadPoolExe

Re: quorum calculation seems to depend on previous selected nodes

2011-01-17 Thread Peter Schuller
> I think you should just tell everybody that if you want to use QUORUM you
> need RF >= 3 for it to be meaningful.
>
> No one would use QUORUM with RF < 3 except in error.

Well, strictly speaking you could have an application designed to talk
to Cassandra at QUORUM and an operator may choose to deploy it against
an RF=2 cluster, accepting the fact that the application won't survive
a node going down. For example, maybe writes are done at QUORUM (for
durability rather than consistency) and reads at ONE.
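For reference, the usual arithmetic here: a read is guaranteed to see the
latest write only when (replicas read) + (replicas written) > RF. With RF=3,
QUORUM writes (2) plus ONE reads (1) gives 2 + 1 = 3, which is not greater
than 3, so a read can still miss the latest write; the QUORUM write buys
durability on two nodes, not read-your-writes consistency. QUORUM on both
sides gives 2 + 2 = 4 > 3 and does overlap.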

(I"ve been meaning to suggest it at some point btw, but consistency
and durability are different concerns. For durability purposes it
would be nice to say "require two nodes", rather than having to choose
between ONE and QUORUM.)

-- 
/ Peter Schuller


Re: Cassandra in less than 1G of memory?

2011-01-17 Thread Peter Schuller
> Peter: What do you recommend? Using Aaron Morton's solution with JNA, or
> just disabling mmap? (Or is it the same and I missed something?)

I suggested disabling mmap() just to give you confidence in what the
actual memory usage is, without it being "seemingly" higher than it is
due to mmap(). I don't recommend turning it off for real use.

-- 
/ Peter Schuller


Re: Cassandra in less than 1G of memory?

2011-01-17 Thread Victor Kabdebon
Peter: What do you recommend? Using Aaron Morton's solution with JNA, or
just disabling mmap? (Or is it the same and I missed something?)

Thank you all for your advice. I am surprised to be the only one to have
this problem even though I'm using a pretty standard distribution.

Best regards,
Victor K.

2011/1/16 Peter Schuller 

> > bigger and bigger) that Cassandra's RAM consumption is going through
> > the roof.
>
> mmap():ed memory will be counted as virtual address space.
>
> Disable mmap() and use standard I/O if you want to see how it behaves
> for real;' then if you want mmap() for performance you can re-enable
> it.
>
> --
> / Peter Schuller
>


Re: Cassandra GC Settings

2011-01-17 Thread Peter Schuller
> very quickly from the young generation to the old generation". Furthermore,
> the CMSInitiatingOccupancyFraction of 75 (from a JVM default of 68) means
> "start gc in the old generation later", presumably to allow Cassandra to use
> more of the old generation heap without needlessly trying to free up used
> space (?). Please correct me if I am misinterpreting these settings.

Note the use of -XX:+UseCMSInitiatingOccupancyOnly which causes the
JVM to always trigger on that occupancy fraction rather than only do
it for the first trigger (or something along those lines) and then
switch to heuristics. Presumably (though I don't specifically know the
history of this particular option being added) it is more important to
avoid doing Full GC:s at all than super-optimally tweaking the trigger
for maximum throughput.

The heuristics tend to cut it pretty close, and setting a conservative
fixed occupancy trigger probably greatly lessens the chance of falling
back to a full gc in production.
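As rough arithmetic on the trigger (using the ~12 GB heap from the log lines
quoted below, and assuming nearly all of it is old generation):
CMSInitiatingOccupancyFraction=75 starts a concurrent cycle once old-gen
occupancy reaches about 0.75 * 12 GB ~= 9 GB, leaving roughly 3 GB of headroom
for objects promoted while the cycle runs. If promotion during a compaction
outruns that headroom before the cycle finishes, the collector falls back to
a stop-the-world full GC (a "concurrent mode failure").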

> One of the issues I have been having is extreme node instability when
> running a major compaction. After 20-30 seconds of operation, the node
> spends 30+ seconds in (what I believe to be) GC. Now I have tried halving
> all memtable thresholds to reduce overall heap memory usage but that has not
> seemed to help with the instability. After one of these blips, I often see
> log entries as follows:
>  INFO [ScheduledTasks:1] 2011-01-17 10:41:21,961 GCInspector.java (line 133)
> GC for ParNew: 215 ms, 45084168 reclaimed leaving 11068700368 used; max is
> 12783583232
>  INFO [ScheduledTasks:1] 2011-01-17 10:41:28,033 GCInspector.java (line 133)
> GC for ParNew: 234 ms, 40401120 reclaimed leaving 12144504848 used; max is
> 12783583232
>  INFO [ScheduledTasks:1] 2011-01-17 10:42:15,911 GCInspector.java (line 133)
> GC for ConcurrentMarkSweep: 45828 ms, 3350764696 reclaimed leaving
> 9224048472 used; max is 12783583232

45 seconds is pretty significant even for a 12 gig heap unless you're
really CPU loaded so that there is heavy contention over the CPU.
While I don't see anything obviously extreme, are you sure you're not
swapping a bit?

Also, what do you mean by node instability - does it *completely* stop
responding during these periods or does it flap in and out of the
cluster but is still responding?

Are your nodes disk bound or CPU bound during compaction?

-- 
/ Peter Schuller


Re: quorum calculation seems to depend on previous selected nodes

2011-01-17 Thread David Boxenhorn
I think you should just tell everybody that if you want to use QUORUM you
need RF >= 3 for it to be meaningful.

No one would use QUORUM with RF < 3 except in error.

On Mon, Jan 17, 2011 at 6:08 PM, Jonathan Ellis  wrote:

> On Mon, Jan 17, 2011 at 9:55 AM, Samuel Benz 
> wrote:
> > We have a cluster with 4 nodes. ReplicationFactor is 2, ReplicaPlacement
> > is the RackAwareStrategy, and the EndpointSnitch is the
> > PropertyFileEndpointSnitch (with two data centers and two racks each).
> >
> > My understanding is that, with these cluster parameters, it should
> > be possible to update at consistency level quorum while one data
> > center (two nodes) is shut down completely.
>
> No.  Quorum of 2 is 2.
>
> > Case1:
> > If 'TEST' was previously stored on Node1, Node2, Node3 -> The update will
> > succeed.
> >
> > Case2:
> > If 'TEST' was previously stored on Node2, Node3, Node4 -> The update will
> > not work.
>
> If you have RF=2 then it will be stored on 2 nodes, not 3.  I think
> this is the source of the confusion.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>


Re: Cassandra GC Settings

2011-01-17 Thread SriSatish Ambati
Dan,

Please kindly attach your:
1) java -version
2) full commandline settings, heap sizes.
3) gc log from one of the nodes via:

-XX:+PrintTenuringDistribution \
-XX:+PrintGCDetails \
-XX:+PrintGCTimeStamps \
-Xloggc:/var/log/cassandra/gc.log \

4) number of cores on your system. How busy is the system?
5) Any workload specifics for your particular usecase?

While some of this is workload specific:

If you are seeing too frequent & very long CMS collection times:

C1) Upping the MaxTenuringThreshold to 5/10/15 will reduce the frequent promotion
forced by the current setting.

C2) Increasing the young generation (-Xmn512m or -Xmn1g) will help induce more ParNew activity.

C3) If you have enough cores to handle a multi-threaded ParNew - I'd also
add -XX:ParallelGCThreads=4 (or 8) depending on your situation.

Reducing CMSInitiatingOccupancyFraction (& other thresholds) will trigger CMS at a
lower occupancy (counteracting some of the measures above) but might help offset
concurrent mode failures or promotion failures if you are seeing them in the logs.

Anyways that is a simplistic analysis: I'd try changes (C1), (C2), (C3) &
then revisit further tuning as necessary.
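
Putting C1-C3 together, the edits to cassandra-env.sh would look something like
this (values are illustrative, not a recommendation -- measure first):

JVM_OPTS="$JVM_OPTS -Xmn1g"                        # C2: larger new generation
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=5"    # C1: up from the default of 1
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=8"       # C3: only if the cores exist
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps -Xloggc:/var/log/cassandra/gc.log"
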
thanks,
Sri

On Mon, Jan 17, 2011 at 5:03 PM, Dan Hendry wrote:

> I am having some reliability problems in my Cassandra cluster which I am
> almost certain is due to GC. I was about to start delving into the guts of
> the problem by turning on GC logging but I have never done any serious java
> GC tuning before (time to learn I guess). As a first step however, I was
> hoping to gain some insight into the GC settings shipped with Cassandra 0.7.
> I realize it's a pretty complicated problem but I was specifically interested
> in knowing about:
>
> -XX:SurvivorRatio=8
> -XX:MaxTenuringThreshold=1
> -XX:CMSInitiatingOccupancyFraction=75
>
> Why are these set the way they are? What specifically was used to determine
> these settings? Was it purely experimental, or was there a specific,
> undesirable behavior that adding these settings corrected? From my various
> web wanderings, I read the survivor ratio and tenuring threshold settings as
> "Cassandra creates mostly long lived objects, with objects being promoted
> very quickly from the young generation to the old generation". Furthermore,
> the CMSInitiatingOccupancyFraction of 75 (from a JVM default of 68) means
> "start gc in the old generation later", presumably to allow Cassandra to use
> more of the old generation heap without needlessly trying to free up used
> space (?). Please correct me if I am misinterpreting these settings.
>
> One of the issues I have been having is extreme node instability when
> running a major compaction. After 20-30 seconds of operation, the node
> spends 30+ seconds in (what I believe to be) GC. Now I have tried halving
> all memtable thresholds to reduce overall heap memory usage but that has not
> seemed to help with the instability. After one of these blips, I often see
> log entries as follows:
>
>  INFO [ScheduledTasks:1] 2011-01-17 10:41:21,961 GCInspector.java (line
> 133) GC for ParNew: 215 ms, 45084168 reclaimed leaving 11068700368 used; max
> is 12783583232
>  INFO [ScheduledTasks:1] 2011-01-17 10:41:28,033 GCInspector.java (line
> 133) GC for ParNew: 234 ms, 40401120 reclaimed leaving 12144504848 used; max
> is 12783583232
>  INFO [ScheduledTasks:1] 2011-01-17 10:42:15,911 GCInspector.java (line
> 133) GC for ConcurrentMarkSweep: 45828 ms, 3350764696 reclaimed leaving
> 9224048472 used; max is 12783583232
>
> Given that the 3 GB of garbage collected via ConcurrentMarkSweep was
> generated in < 30 seconds, one of the first things I was going to try was
> increasing the survivor ratio (to 16) and increasing the MaxTenuringThreshold
> (to 5) to try and keep more objects in the young generation and therefore
> cleaned up faster. As a more general approach to solving my problem, I was
> also going to reduce the CMSInitiatingOccupancyFraction to 65. Does this
> seem reasonable? Obviously, the best answer is to just try it but I hesitate
> to start playing with settings when I have only the vaguest notions of what they
> do and little concept of why they are there in the first place.
>
> Thanks for any help
>


Re: balancing load

2011-01-17 Thread Peter Schuller
> @Peter Isn't cleanup a special case of compaction? I.e., it works as a
> major compaction + removes data not belonging to the node?

Yes, sorry. Brain lapse. Ignore my comment.

-- 
/ Peter Schuller


Internal error when using SimpleSnitch and dynamic_snitch: true

2011-01-17 Thread Jim Ancona
We accidentally configured our cluster with SimpleSnitch (instead of
PropertyFileSnitch) and dynamic_snitch: true. This is with version
0.7.0.

We saw the errors below on get_slice and batch_mutate calls. The
errors went away when we switched to PropertyFileSnitch.

Should dynamic_snitch work with SimpleSnitch? Should I open a Jira issue?
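
For reference, the combination in question, in cassandra.yaml terms (0.7 key
names assumed), was roughly:

endpoint_snitch: org.apache.cassandra.locator.SimpleSnitch
dynamic_snitch: true

Note that the traces below go through NetworkTopologyStrategy.calculateNaturalEndpoints,
which asks the snitch for a datacenter -- judging by the trace, that is where
SimpleSnitch throws the UnsupportedOperationException.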

Jim

ERROR [pool-1-thread-55] 2011-01-14 15:53:45,998 Cassandra.java (line
2707) Internal error processing get_slice
java.lang.UnsupportedOperationException
at 
org.apache.cassandra.locator.SimpleSnitch.getDatacenter(SimpleSnitch.java:40)
at 
org.apache.cassandra.locator.DynamicEndpointSnitch.getDatacenter(DynamicEndpointSnitch.java:94)
at 
org.apache.cassandra.locator.NetworkTopologyStrategy.calculateNaturalEndpoints(NetworkTopologyStrategy.java:87)
at 
org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:99)
at 
org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:1354)
at 
org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:1337)
at 
org.apache.cassandra.service.StorageService.findSuitableEndpoint(StorageService.java:1388)
at 
org.apache.cassandra.service.StorageProxy.weakRead(StorageProxy.java:248)
at 
org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:224)
at 
org.apache.cassandra.thrift.CassandraServer.readColumnFamily(CassandraServer.java:98)
at 
org.apache.cassandra.thrift.CassandraServer.getSlice(CassandraServer.java:195)
at 
org.apache.cassandra.thrift.CassandraServer.multigetSliceInternal(CassandraServer.java:271)
at 
org.apache.cassandra.thrift.CassandraServer.get_slice(CassandraServer.java:233)
at 
org.apache.cassandra.thrift.Cassandra$Processor$get_slice.process(Cassandra.java:2699)
at 
org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
at 
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)

ERROR [pool-1-thread-58] 2011-01-12 12:18:58,721 Cassandra.java (line
3044) Internal error processing batch_mutate
java.lang.UnsupportedOperationException
at 
org.apache.cassandra.locator.SimpleSnitch.getDatacenter(SimpleSnitch.java:40)
at 
org.apache.cassandra.locator.DynamicEndpointSnitch.getDatacenter(DynamicEndpointSnitch.java:94)
at 
org.apache.cassandra.locator.NetworkTopologyStrategy.calculateNaturalEndpoints(NetworkTopologyStrategy.java:87)
at 
org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:99)
at 
org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:1354)
at 
org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:1337)
at 
org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:109)
at 
org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:412)
at 
org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:385)
at 
org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3036)
at 
org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
at 
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)


Cassandra GC Settings

2011-01-17 Thread Dan Hendry
I am having some reliability problems in my Cassandra cluster which I am
almost certain is due to GC. I was about to start delving into the guts of
the problem by turning on GC logging but I have never done any serious java
GC tuning before (time to learn I guess). As a first step however, I was
hoping to gain some insight into the GC settings shipped with Cassandra 0.7.
I realize it's a pretty complicated problem but I was specifically interested
in knowing about:

-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75

Why are these set the way they are? What specifically was used to determine
these settings? Was it purely experimental, or was there a specific,
undesirable behavior that adding these settings corrected? From my various
web wanderings, I read the survivor ratio and tenuring threshold settings as
"Cassandra creates mostly long lived objects, with objects being promoted
very quickly from the young generation to the old generation". Furthermore,
the CMSInitiatingOccupancyFraction of 75 (from a JVM default of 68) means
"start gc in the old generation later", presumably to allow Cassandra to use
more of the old generation heap without needlessly trying to free up used
space (?). Please correct me if I am misinterpreting these settings.

One of the issues I have been having is extreme node instability when
running a major compaction. After 20-30 seconds of operation, the node
spends 30+ seconds in (what I believe to be) GC. Now I have tried halving
all memtable thresholds to reduce overall heap memory usage but that has not
seemed to help with the instability. After one of these blips, I often see
log entries as follows:

 INFO [ScheduledTasks:1] 2011-01-17 10:41:21,961 GCInspector.java (line 133)
GC for ParNew: 215 ms, 45084168 reclaimed leaving 11068700368 used; max is
12783583232
 INFO [ScheduledTasks:1] 2011-01-17 10:41:28,033 GCInspector.java (line 133)
GC for ParNew: 234 ms, 40401120 reclaimed leaving 12144504848 used; max is
12783583232
 INFO [ScheduledTasks:1] 2011-01-17 10:42:15,911 GCInspector.java (line 133)
GC for ConcurrentMarkSweep: 45828 ms, 3350764696 reclaimed leaving
9224048472 used; max is 12783583232

Given that the 3 GB of garbage collected via ConcurrentMarkSweep was
generated in < 30 seconds, one of the first things I was going to try was
increasing the survivor ratio (to 16) and increasing the MaxTenuringThreshold
(to 5) to try and keep more objects in the young generation and therefore
cleaned up faster. As a more general approach to solving my problem, I was
also going to reduce the CMSInitiatingOccupancyFraction to 65. Does this
seem reasonable? Obviously, the best answer is to just try it but I hesitate
to start playing with settings when I have only the vaguest notions of what they
do and little concept of why they are there in the first place.

Thanks for any help


Re: quorum calculation seems to depend on previous selected nodes

2011-01-17 Thread Jonathan Ellis
On Mon, Jan 17, 2011 at 9:55 AM, Samuel Benz  wrote:
> We have a cluster with 4 nodes. ReplicationFactor is 2, ReplicaPlacement
> is the RackAwareStrategy, and the EndpointSnitch is the
> PropertyFileEndpointSnitch (with two data centers and two racks each).
>
> My understanding is that, with these cluster parameters, it should
> be possible to update at consistency level quorum while one data
> center (two nodes) is shut down completely.

No.  Quorum of 2 is 2.

> Case1:
> If 'TEST' was previously stored on Node1, Node2, Node3 -> The update will
> succeed.
>
> Case2:
> If 'TEST' was previously stored on Node2, Node3, Node4 -> The update will
> not work.

If you have RF=2 then it will be stored on 2 nodes, not 3.  I think
this is the source of the confusion.
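
For anyone tripped up by the arithmetic, a minimal sketch of the majority rule
being applied here (this just mirrors the rule, it is not Cassandra's source):

public class QuorumMath {
    // Quorum requires a strict majority of the replicas for the key.
    static int quorumFor(int replicationFactor) {
        return (replicationFactor / 2) + 1;
    }

    public static void main(String[] args) {
        System.out.println(quorumFor(2)); // 2: any down replica blocks QUORUM
        System.out.println(quorumFor(3)); // 2: one replica may be down
        System.out.println(quorumFor(5)); // 3: two replicas may be down
    }
}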

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: balancing load

2011-01-17 Thread Edward Capriolo
On Mon, Jan 17, 2011 at 10:51 AM, Peter Schuller
 wrote:
>> Just to head off the next possible problem: if you run 'nodetool cleanup'
>> on each node and some of your nodes still have more data than others,
>> then it probably means you are writing the majority of data to a few
>> keys. ( you probably do not want to do that )
>
> It may also be that a compact is needed if the discrepancies are
> within the variation expected during normal operation due to
> compaction (this assumes overwrites/deletions in write traffic).
>
> --
> / Peter Schuller
>

@Peter Isn't cleanup a special case of compaction? I.e., it works as a
major compaction + removes data not belonging to the node?


quorum calculation seems to depend on previous selected nodes

2011-01-17 Thread Samuel Benz
Dear List

I found a strange behavior on our mini cluster during update with
consistency level quorum.

We have a cluster with 4 nodes. ReplicationFactor is 2, ReplicaPlacement
is the RackAwareStrategy, and the EndpointSnitch is the
PropertyFileEndpointSnitch (with two data centers and two racks each).
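
(For completeness, that snitch's property file maps each node's address to
DC:rack; something like the following, with made-up addresses -- the file name
and loading details differ between the 0.6 contrib snitch and 0.7's
PropertyFileSnitch:

192.168.1.1=DC1:RAC1
192.168.1.2=DC1:RAC2
192.168.2.1=DC2:RAC1
192.168.2.2=DC2:RAC2
)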

My understanding is that, with these cluster parameters, it should
be possible to update at consistency level quorum while one data
center (two nodes) is shut down completely.

During testing I observed that this is not true in every case. It
depends on the node to which you are connected for the update. Let me
explain this in more detail:

During the tests only Node1 (D1R1) and Node2 (D1R2) are up and running.
Node3 (D2R1) and Node4 (D2R2) are shut down. In both tests I'm connected
to Node1 and update the row: 'TEST'.

Case1:
If 'TEST' was previously stored on Node1, Node2, Node3 -> The update will
succeed.

Case2:
If 'TEST' was previously stored on Node2, Node3, Node4 -> The update will
not work.

In both cases at least two nodes, which are required for quorum, are
alive. So is this a bug or is my interpretation of the cluster
parameters wrong?

Best
Sam


Re: Cassandra-Maven-Plugin

2011-01-17 Thread Stephen Connolly
https://issues.apache.org/jira/browse/CASSANDRA-1997

On 16 January 2011 19:59, Stephen Connolly
 wrote:
> it will be an attachment to an as-yet-unraised JIRA. Look out for it
> tomorrow/Tuesday
>
> - Stephen
>
> ---
> Sent from my Android phone, so random spelling mistakes, random nonsense
> words and other nonsense are a direct result of using swype to type on the
> screen
>
> On 16 Jan 2011 17:52, "Hellmut Adolphs"  wrote:
>


Re: balancing load

2011-01-17 Thread Peter Schuller
> Just to head off the next possible problem: if you run 'nodetool cleanup'
> on each node and some of your nodes still have more data than others,
> then it probably means you are writing the majority of data to a few
> keys. ( you probably do not want to do that )

It may also be that a compact is needed if the discrepancies are
within the variation expected during normal operation due to
compaction (this assumes overwrites/deletions in write traffic).

-- 
/ Peter Schuller


Re: balancing load

2011-01-17 Thread Edward Capriolo
On Mon, Jan 17, 2011 at 2:44 AM, aaron morton  wrote:
> The nodes will not automatically delete stale data; to do that you need to
> run nodetool cleanup.
>
> See step 3 in the Range Changes > Bootstrap 
> http://wiki.apache.org/cassandra/Operations#Range_changes
>
> If you are feeling paranoid before hand, you could run nodetool repair on 
> each node in turn to make sure they have the correct data. 
> http://wiki.apache.org/cassandra/Operations#Repairing_missing_or_inconsistent_data
>
> You may also have some tombstones in there, they will not be deleted until 
> after GCGraceSeconds
> http://wiki.apache.org/cassandra/DistributedDeletes
>
> Hope that helps.
> Aaron
>
> On 17 Jan 2011, at 20:34, Karl Hiramoto wrote:
>
>> Thanks for the help.  I used "nodetool move", so now each node owns 20%
>> of the space, but it seems that the data load is still mostly on 2 nodes.
>>
>>
>> nodetool --host slave4 ring
>> Address      Status State   Load       Owns     Token
>>                                                 136112946768375385385349842972707284580
>> 10.1.4.10    Up     Normal  335.9 MB   20.00%   0
>> 10.1.4.12    Up     Normal  54.42 KB   20.00%   34028236692093846346337460743176821145
>> 10.1.4.13    Up     Normal  59.32 KB   20.00%   68056473384187692692674921486353642290
>> 10.1.4.14    Up     Normal  6.33 GB    20.00%   102084710076281539039012382229530463435
>> 10.1.4.15    Up     Normal  6.36 GB    20.00%   136112946768375385385349842972707284580
>>
>>
>>
>>
>> --
>> Karl
>
>

Just to head the next possible problem. If you run 'nodetool cleanup'
on each node and some of your nodes still have more data then others,
then it probably means your are writing the majority of data to a few
keys. ( you probably do not want to do that )

If that happens, you can use nodetool cfstats on each node and ensure
that the 'max row compacted size' is roughly the same on all nodes. If
you have one or two really big rows, that could explain your imbalance.
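
In other words, per node, something along these lines (host taken from the ring
output above; the exact stat name in cfstats output varies a little by version):

nodetool --host 10.1.4.14 cleanup
nodetool --host 10.1.4.14 compact    # if overwrites/deletes are in play
nodetool --host 10.1.4.14 cfstats    # look for the compacted row maximum size per CF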


Re: Between Clause

2011-01-17 Thread kh jo
Another example: generating visit statistics, given that the start and end dates are
dynamic.

--- On Mon, 1/17/11, kh jo  wrote:

From: kh jo 
Subject: Re: Between Clause
To: user@cassandra.apache.org
Date: Monday, January 17, 2011, 12:40 PM


example: finding country from IP address

MySQL: I have a table with 140,000 rows, each with ipNumStart, ipNumEnd, Country

so to find the country I use:
WHERE ipNum BETWEEN ipNumStart AND ipNumEnd

ipNumStart   ipNumEnd    Country
16777216     17301503    Australia
18939904     19005439    Japan
etc.

--- On Mon, 1/17/11, aaron morton  wrote:

From: aaron morton 
Subject: Re: Between Clause
To: user@cassandra.apache.org
Date: Monday, January 17, 2011, 12:24 PM

Can you provide some more information ?
Aaron
On 17/01/2011, at 11:55 PM, kh jo wrote:
What is the best way to model a query with between clause.. given that you have 
a large number of entries... 

thanks
Jo


Re: Between Clause

2011-01-17 Thread kh jo

example: finding country from IP address

MySQL: I have a table with 140,000 rows, each with ipNumStart, ipNumEnd, Country

so to find the country I use:
WHERE ipNum BETWEEN ipNumStart AND ipNumEnd

ipNumStart   ipNumEnd    Country
16777216     17301503    Australia
18939904     19005439    Japan
etc.

--- On Mon, 1/17/11, aaron morton  wrote:

From: aaron morton 
Subject: Re: Between Clause
To: user@cassandra.apache.org
Date: Monday, January 17, 2011, 12:24 PM

Can you provide some more information ?
Aaron
On 17/01/2011, at 11:55 PM, kh jo wrote:
What is the best way to model a query with between clause.. given that you have 
a large number of entries... 

thanks
Jo

Re: Between Clause

2011-01-17 Thread Donal Zang

On 17/01/2011 11:55, kh jo wrote:
What is the best way to model a query with between clause.. given that 
you have a large number of entries...


thanks
Jo




In my experience, for a row-based 'between clause' with a random
partitioner, you should design the column family carefully, so that you
can get all the row keys.
In this case you can use a multi_get() instead of get_range(), and you
can do a get_range() between columns within a row.
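
To make that concrete, a rough sketch (Hector 0.7.x API; the CF name "IpRanges",
the row key "all", and packing ipNumEnd plus country into the value are all my
own assumptions, not a tested design): store each range as one column in a wide
row, with the column name = ipNumStart as a long, then a reversed slice of one
column finds the candidate range for a lookup.

import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.SliceQuery;

public class IpLookup {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "localhost:9160");
        Keyspace ksp = HFactory.createKeyspace("Keyspace1", cluster);

        long ipNum = 16778000L; // the address being looked up, as a number

        SliceQuery<String, Long, String> q = HFactory.createSliceQuery(
                ksp, StringSerializer.get(), LongSerializer.get(), StringSerializer.get());
        q.setColumnFamily("IpRanges");
        q.setKey("all");
        // Reversed slice starting at ipNum: the first column returned is the
        // largest ipNumStart <= ipNum.
        q.setRange(ipNum, null, true, 1);

        ColumnSlice<Long, String> slice = q.execute().get();
        if (!slice.getColumns().isEmpty()) {
            // Value holds "ipNumEnd:country" (my convention); still check
            // that ipNum <= ipNumEnd before trusting the match.
            System.out.println("candidate range: " + slice.getColumns().get(0).getValue());
        }
    }
}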

--

Re: Between Clause

2011-01-17 Thread aaron morton
Can you provide some more information ?

Aaron

On 17/01/2011, at 11:55 PM, kh jo wrote:

> What is the best way to model a query with between clause.. given that you 
> have a large number of entries... 
> 
> thanks
> Jo
> 
> 



Between Clause

2011-01-17 Thread kh jo
What is the best way to model a query with between clause.. given that you have 
a large number of entries... 

thanks
Jo


RE: about the consistency level

2011-01-17 Thread raoyixuan (Shandy)
Thanks a lot.
From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Monday, January 17, 2011 5:01 PM
To: user@cassandra.apache.org
Subject: Re: about the consistency level

The cassandra-cli works at CL.ONE; currently it cannot be changed. I'm not
sure if there is a reason for this, but if it's a feature you would like, add a
request to JIRA https://issues.apache.org/jira/browse/CASSANDRA

In Hector it's part of the m.p.h.api.Keyspace interface as 
setConsistencyLevelPolicy(), implemented in m.p.c.model.Keyspace, and it looks like
the way to set it is via one of the overloads for
m.p.h.api.factory.HFactory.createKeyspace(). It defaults to the
m.p.c.model.QuorumAllConsistencyLevelPolicy which uses Quorum for all 
operations.

Hope that helps.
Aaron


On 17/01/2011, at 9:23 PM, raoyixuan (Shandy) wrote:


Both hector and cassandra-cli .  Can you tell me respectively? Thanks a lot.

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Monday, January 17, 2011 4:17 PM
To: user@cassandra.apache.org
Subject: Re: about the consistency level

The ConsistencyLevel is passed with each read and write command.

How you set it will depend on the client you are using. Which one are you using 
?

Aaron

On 17/01/2011, at 8:50 PM, raoyixuan (Shandy) wrote:



How to set the consistency level in Cassandra 0.7? I mean what command?


华为技术有限公司 Huawei Technologies Co., Ltd.
Phone: 28358610
Mobile: 13425182943
Email: raoyix...@huawei.com
Huawei Technologies Co., Ltd.
Bantian, Longgang District, Shenzhen 518129, P.R.China
http://www.huawei.com





Re: about the consistency level

2011-01-17 Thread aaron morton
The cassandra-cli works at CL.ONE; currently it cannot be changed. I'm not
sure if there is a reason for this, but if it's a feature you would like, add a
request to JIRA https://issues.apache.org/jira/browse/CASSANDRA

In Hector it's part of the m.p.h.api.Keyspace interface as 
setConsistencyLevelPolicy(), implemented in m.p.c.model.Keyspace, and it looks like
the way to set it is via one of the overloads for
m.p.h.api.factory.HFactory.createKeyspace(). It defaults to the
m.p.c.model.QuorumAllConsistencyLevelPolicy which uses Quorum for all 
operations.

Hope that helps. 
Aaron
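
A rough sketch of the Hector side (the ConfigurableConsistencyLevel and
HConsistencyLevel names are from Hector 0.7.x as I recall them, not something
quoted above -- verify against your Hector version):

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

public class ConsistencyExample {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "localhost:9160");

        // Without a policy, the keyspace defaults to QuorumAllConsistencyLevelPolicy.
        ConfigurableConsistencyLevel policy = new ConfigurableConsistencyLevel();
        policy.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);
        policy.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);

        Keyspace ksp = HFactory.createKeyspace("Keyspace1", cluster, policy);
        // ksp is then passed to queries/mutators; each operation uses the policy.
    }
}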
  

On 17/01/2011, at 9:23 PM, raoyixuan (Shandy) wrote:

> Both hector and cassandra-cli .  Can you tell me respectively? Thanks a lot.
>  
> From: aaron morton [mailto:aa...@thelastpickle.com] 
> Sent: Monday, January 17, 2011 4:17 PM
> To: user@cassandra.apache.org
> Subject: Re: about the consistency level
>  
> The ConsistencyLevel is passed with each read and write command. 
>  
> How you set it will depend on the client you are using. Which one are you 
> using ? 
>  
> Aaron
>  
> On 17/01/2011, at 8:50 PM, raoyixuan (Shandy) wrote:
> 
> 
> How to set the consistency level in Cassandra 0.7? I mean what command?
>  
>  
> 华为技术有限公司 Huawei Technologies Co., Ltd.
> Phone: 28358610
> Mobile: 13425182943
> Email: raoyix...@huawei.com
> Huawei Technologies Co., Ltd.
> Bantian, Longgang District, Shenzhen 518129, P.R.China
> http://www.huawei.com
>  
>  



RE: about the consistency level

2011-01-17 Thread raoyixuan (Shandy)
Both hector and cassandra-cli .  Can you tell me respectively? Thanks a lot.

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Monday, January 17, 2011 4:17 PM
To: user@cassandra.apache.org
Subject: Re: about the consistency level

The ConsistencyLevel is passed with each read and write command.

How you set it will depend on the client you are using. Which one are you using 
?

Aaron

On 17/01/2011, at 8:50 PM, raoyixuan (Shandy) wrote:


How to set the consistency level in Cassandra 0.7? I mean what command?


华为技术有限公司 Huawei Technologies Co., Ltd.
Phone: 28358610
Mobile: 13425182943
Email: raoyix...@huawei.com
Huawei Technologies Co., Ltd.
Bantian, Longgang District, Shenzhen 518129, P.R.China
http://www.huawei.com




Re: about the consistency level

2011-01-17 Thread aaron morton
The ConsistencyLevel is passed with each read and write command. 

How you set it will depend on the client you are using. Which one are you using 
? 

Aaron

On 17/01/2011, at 8:50 PM, raoyixuan (Shandy) wrote:

> How to set the consistency level in Cassandra 0.7? I mean what command?
>  
>  
> 华为技术有限公司 Huawei Technologies Co., Ltd.
> Phone: 28358610
> Mobile: 13425182943
> Email: raoyix...@huawei.com
> Huawei Technologies Co., Ltd.
> Bantian, Longgang District, Shenzhen 518129, P.R.China
> http://www.huawei.com
>  



Re: Cassandra freezes under load when using libc6 2.11.1-0ubuntu7.5

2011-01-17 Thread Erik Onnen
Unfortunately, the previous AMI we used to provision the 7.5 version is no
longer available. More unfortunately, the two test nodes we spun up in each
AZ did not get Nehalem architectures, so the only things I can say for
certain after running Mike's test 10x on each test node are:

1) I could not repro using Mike's test in either AZ. We had previously only
seen failures in one AZ.
2) I could not repro using Mike's test against libc7.7, the only version of
libc we could provision with a nominal amount of effort.
3) 1&2 above were only tested on older Harpertown cores, an architecture
where we've never seen the issue.

I think the only way I'll be able to further test is to wait until we
shuffle around our production configuration and I can abuse the existing
guests without taking a node out of my ring.

The only silver lining so far is that we've realized we need to better lock
down how we source AMIs for our EC2 nodes. That will give us a more reliable
system, but apparently we have no control over the actual architecture that
AMZN lets us have.

-erik
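
(As an aside, for anyone who wants to try the same thing without the gist: a
rough approximation of the thread-spawn loop Mike describes further down in the
quoted thread -- the actual gist may well differ:

public class ThreadChurn {
    // Tight loop that creates, starts, and joins short-lived threads; on the
    // affected libc6 builds this reportedly stalled the whole guest.
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    try { Thread.sleep(1); } catch (InterruptedException e) { }
                }
            });
            t.start();
            t.join();
        }
    }
}
)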

On Fri, Jan 14, 2011 at 10:09 AM, Mike Malone  wrote:

> That's interesting. For us, the 7.5 version of libc was causing problems.
> Either way, I'm looking forward to hearing about anything you find.
>
> Mike
>
>
> On Thu, Jan 13, 2011 at 11:47 PM, Erik Onnen  wrote:
>
>> Too similar to be a coincidence I'd say:
>>
>> Good node (old AZ):  2.11.1-0ubuntu7.5
>> Bad node (new AZ): 2.11.1-0ubuntu7.6
>>
>> You beat me to the punch with the test program. I was working on something
>> similar to test it out and got side tracked.
>>
>> I'll try the test app tomorrow and verify the versions of the AMIs used
>> for provisioning.
>>
>> On Thu, Jan 13, 2011 at 11:31 PM, Mike Malone  wrote:
>>
>>> Erik, the scenario you're describing is almost identical to what we've
>>> been experiencing. Sounds like you've been pulling your hair out too! You're
>>> also running the same distro and kernel as us. And we also run without swap.
>>> Which begs the question... what version of libc6 are you running!? Here's
>>> the output from one of our upgraded boxes:
>>>
>>> $ dpkg --list | grep libc6
>>> ii  libc62.11.1-0ubuntu7.7
>>>   Embedded GNU C Library: Shared libraries
>>> ii  libc6-dev2.11.1-0ubuntu7.7
>>>   Embedded GNU C Library: Development Librarie
>>>
>>> Before upgrading the version field showed 2.11.1-0ubuntu7.5. Wondering
>>> what yours is.
>>>
>>> We also found ourselves in a similar situation with different regions.
>>> We're using the canonical ubuntu ami as the base for our systems. But there
>>> appear to be small differences between the packages included in the amis
>>> from different regions. Seems libc6 is one of the things that changed. I
>>> discovered this by diff'ing `dpkg --list` on a node that was good, and one that
>>> was bad.
>>>
>>> The architecture hypothesis is also very interesting. If we could
>>> reproduce the bug with the latest libc6 build I'd escalate it back up to
>>> Amazon. But I can't repro it, so nothing to escalate.
>>>
>>> For what it's worth, we were able to reproduce the lockup behavior that
>>> you're describing by running a tight loop that spawns threads. Here's a gist
>>> of the app I used: https://gist.github.com/a4123705e67e9446f1cc -- I'd
>>> be interested to know whether that locks things up on your system with a new
>>> libc6.
>>>
>>> Mike
>>>
>>> On Thu, Jan 13, 2011 at 10:39 PM, Erik Onnen  wrote:
>>>
 May or may not be related but I thought I'd recount a similar experience
 we had in EC2 in hopes it helps someone else.

 As background, we had been running several servers in a 0.6.8 ring with
 no Cassandra issues (some EC2 issues, but none related to Cassandra) on
 multiple EC2 XL instances in a single availability zone. We decided to add
 several other nodes to a second AZ for reasons beyond the scope of this
 email. As we reached steady operational state in the new AZ, we noticed 
 that
 the new nodes in the new AZ were repeatedly getting dropped from the ring.
 At first we attributed the drops to phi and expected cross-AZ latency. As 
 we
 tried to pinpoint the issue, we found something very similar to what you
 describe - the EC2 VMs in the new AZ would become completely unresponsive.
 Not just the Java process hosting Cassandra, but the entire host. Shell
 commands would not execute for existing sessions, we could not establish 
 new
 SSH sessions and tails we had on active files wouldn't show any progress. 
 It
 appeared as if the machines in the new AZ would seize for several minutes,
 then come back to life with little rhyme or reason as to why. Tickets 
 opened
 with AMZN resulted in responses of "the physical server looks normal".

 After digging deeper, here's what we found. To confirm all nodes in both
 AZs were identical at the following lev