Re: Storing bi-temporal data in Cassandra

2015-02-20 Thread Raj N
Thanks for the response Peter. I used the temperature table because it's the
most common example of CQL time series and I thought I would reuse it. From
some of the responses, it looks like I was wrong.

event_time is the time the event happened, so yes, it is valid time. I was
trying to see if I can get away with not having valid_from and valid_to in
Cassandra.
transaction_time is the time the database record was written.

Let's take an example -

INSERT INTO
temperatures(weatherstation_id,event_time,transaction_time,temperature)
VALUES ('1234ABCD','2015-02-18 07:01:00','2015-02-18 07:01:00','72F');

INSERT INTO
temperatures(weatherstation_id,event_time,transaction_time,temperature)
VALUES ('1234ABCD','2015-02-18 08:01:00','2015-02-18 08:01:00','72F');

And if I get an update for the first record tomorrow, I want to keep both
versions. So I would have -

INSERT INTO
temperatures(weatherstation_id,event_time,transaction_time,temperature)
VALUES ('1234ABCD','2015-02-18 07:01:00','2015-02-*19* 07:01:00','72F');

I fundamentally need to execute 2 types of queries (both sketched below) -

1. select the latest values for the weatherstation for a given event time
period, which should ideally just return the first and third records.
2. select the values as of a particular transaction time (say '2015-02-18
08:01:00'), in which case I would expect it to return the first and second
records.
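
(For reference, a minimal sketch of the two queries against the temperatures
schema from the original post further down, assuming the key
(weatherstation_id, event_time, transaction_time). Plain CQL has no
server-side "latest transaction_time per event_time" grouping here, so the
last step of query 1 is client-side filtering, an assumption of this sketch
rather than something the thread settled:)

select * from temperatures
where weatherstation_id = '1234ABCD'
  and event_time >= '2015-02-18 00:00:00'
  and event_time < '2015-02-19 00:00:00';
-- query 1: fetch all versions in the event-time window, then keep only the
-- row with the highest transaction_time per event_time on the client

select * from temperatures
where weatherstation_id = '1234ABCD'
  and event_time >= '2015-02-18 00:00:00'
  and event_time < '2015-02-19 00:00:00'
  and transaction_time <= '2015-02-18 08:01:00';
-- query 2: state as of a transaction time; this is exactly the multi-range
-- restriction that CQL rejects, per the rest of the thread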

About your comment on having valid_time in the keys, do I have a choice in
Cassandra, unless you are suggesting using secondary indexes?

I am new to bi-temporal data modeling. So please advise if you think
building this on top of Cassandra is a stupid idea.

-Rajesh


On Sun, Feb 15, 2015 at 10:03 AM, Peter Lin  wrote:

>
> I've built several different bi-temporal databases over the years for a
> variety of applications, so I have to ask "why are you modeling it this
> way?"
>
> Having a temperatures table doesn't make sense to me. Normally a
> bi-temporal database has transaction time and valid time. The transaction
> time is the timestamp of when the data is saved and valid time is usually
> expressed as effective & expiration time.
>
> In the example, is event time supposed to be valid time, and what is the
> granularity (seconds, hours, or days)?
> Which kind of queries do you need to query most of the time?
> Why is event_time and transaction_time part of the key?
>
> If you take time to study temporal databases, having valid time in the
> keys will cause a lot of headaches. Basically, it's an anti-pattern for
> temporal databases. The key should only be the unique identifier and it
> shouldn't have transaction or valid time. The use case you describe looks
> more like regular time series data and not really bi-temporal.
>
> I would suggest taking the time to understand the kinds of queries you
> need to run and then changing the table.
>
> peter
>
>
> On Sun, Feb 15, 2015 at 9:00 AM, Jack Krupansky 
> wrote:
>
>> I had forgotten, but there is a new tuple notation to iterate over more
>> than one clustering column in C* 2.0.6:
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-4851
>>
>> For example,
>>
>> SELECT ... WHERE (c1, c2) > (1, 0)
>>
>> There's an example in the CQL spec:
>> https://cassandra.apache.org/doc/cql3/CQL.html
>>
>>
>> -- Jack Krupansky
>>
>> On Sat, Feb 14, 2015 at 6:29 PM, Dave Brosius 
>> wrote:
>>
>>>  As you point out, there's not really a node-based problem with your
>>> query from a performance point of view. This is a limitation of CQL in
>>> that CQL wants to slice one section of a partition's row (no matter how
>>> big the section is). In your case, you are asking to slice multiple
>>> sections of a partition's row, which currently isn't supported.
>>>
>>> It seems silly perhaps that this is the case, as certainly in your
>>> example it would seem useful, and not too difficult, but the problem is
>>> that you could wind up with n-depth slicing of that partitioned row
>>> given arbitrary query syntax if range queries on clustering keys were
>>> allowed anywhere.
>>>
>>> At present, you can either duplicate the data using the other clustering
>>> key (transaction_time) as the primary clusterer for this use case, or
>>> omit the 3rd criterion (transaction_time = '') in the query, get all the
>>> range query results, and filter on the client.
>>>
>>> hth,
>>> dave
>>>
>>>
>>>
>>> On 02/14/2015 06:05 PM, Raj N wrote:
>>>
>>> I don't think that solves my problem. The question really is why can't
>>> we use ranges for both time columns whe

Re: Storing bi-temporal data in Cassandra

2015-02-14 Thread Raj N
I don't think that solves my problem. The question really is why we can't
use ranges for both time columns when they are part of the primary key.
They are on 1 row after all. Is this just a CQL limitation?

-Raj
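
(For reference: DuyHai's reordered-key suggestion quoted below, which is
also Dave's "duplicate the data" option from the replies above, written out
in full; the table name and clustering order here are illustrative
assumptions:)

CREATE TABLE temperatures_by_txn (
    weatherstation_id text,
    transaction_time timestamp,
    event_time timestamp,
    temperature text,
    PRIMARY KEY (weatherstation_id, transaction_time, event_time)
) WITH CLUSTERING ORDER BY (transaction_time DESC, event_time DESC);

-- every write goes to both tables; the as-of query then pins
-- transaction_time (an equality on the first clustering column) and ranges
-- over event_time, which CQL accepts
SELECT * FROM temperatures_by_txn
WHERE weatherstation_id = 'foo'
  AND transaction_time = '2015-01-01 12:00:00'
  AND event_time >= '2015-01-01 00:00:00'
  AND event_time < '2015-01-02 00:00:00';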

On Sat, Feb 14, 2015 at 3:35 AM, DuyHai Doan  wrote:

> "I am trying to get the state as of a particular transaction_time"
>
>  --> In that case you should probably define your primary key in another
> order for clustering columns
>
> PRIMARY KEY (weatherstation_id,transaction_time,event_time)
>
> Then, select * from temperatures where weatherstation_id = 'foo' and
> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
> 00:00:00' and transaction_time = ''
>
>
>
> On Sat, Feb 14, 2015 at 3:06 AM, Raj N  wrote:
>
>> Has anyone designed a bi-temporal table in Cassandra? Doesn't look like I
>> can do this using CQL for now. Taking the time series example from well
>> known modeling tutorials in Cassandra -
>>
>> CREATE TABLE temperatures (
>> weatherstation_id text,
>> event_time timestamp,
>> temperature text,
>> PRIMARY KEY (weatherstation_id, event_time)
>> ) WITH CLUSTERING ORDER BY (event_time DESC);
>>
>> If I add another column transaction_time
>>
>> CREATE TABLE temperatures (
>> weatherstation_id text,
>> event_time timestamp,
>> transaction_time timestamp,
>> temperature text,
>> PRIMARY KEY (weatherstation_id, event_time, transaction_time)
>> ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);
>>
>> If I try to run a query using the following CQL, it throws an error -
>>
>> select * from temperatures
>> where weatherstation_id = 'foo'
>>   and event_time >= '2015-01-01 00:00:00'
>>   and event_time < '2015-01-02 00:00:00'
>>   and transaction_time < '2015-01-02 00:00:00';
>>
>> It works if I use an equals clause for the event_time. I am trying to get
>> the state as of a particular transaction_time
>>
>> -Raj
>>
>
>


Storing bi-temporal data in Cassandra

2015-02-13 Thread Raj N
Has anyone designed a bi-temporal table in Cassandra? Doesn't look like I
can do this using CQL for now. Taking the time series example from well
known modeling tutorials in Cassandra -

CREATE TABLE temperatures (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

If I add another column transaction_time

CREATE TABLE temperatures (
weatherstation_id text,
event_time timestamp,
transaction_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id, event_time, transaction_time)
) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);

If I try to run a query using the following CQL, it throws an error -

select * from temperatures
where weatherstation_id = 'foo'
  and event_time >= '2015-01-01 00:00:00'
  and event_time < '2015-01-02 00:00:00'
  and transaction_time < '2015-01-02 00:00:00';

It works if I use an equals clause for the event_time. I am trying to get
the state as of a particular transaction_time.
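
(For reference: the equality form that does work, plus the tuple relation
Jack Krupansky points to in the replies above. The tuple form needs C*
2.0.6+ per CASSANDRA-4851, and a tuple inequality is a single lexicographic
slice over the clustering order, not two independent ranges, so it only
approximates the bi-temporal query; treat this as a sketch:)

-- works: equality on event_time, range on transaction_time
select * from temperatures
where weatherstation_id = 'foo'
  and event_time = '2015-01-01 00:00:00'
  and transaction_time < '2015-01-02 00:00:00';

-- tuple relation (2.0.6+): one contiguous slice over both clustering columns
select * from temperatures
where weatherstation_id = 'foo'
  and (event_time, transaction_time) >= ('2015-01-01 00:00:00', '2015-01-01 00:00:00')
  and (event_time, transaction_time) < ('2015-01-02 00:00:00', '2015-01-02 00:00:00');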

-Raj


Re: Keyspace and table/cf limits

2014-12-03 Thread Raj N
The question is more from a multi-tenancy point of view. We wanted to see
if we can have a keyspace per client. Each keyspace may have 50 column
families, but if we have 200 clients, that would be 10,000 column families.
Do you think that's reasonable to support? I know that key cache capacity
is reserved in heap still. Any plans to move it off-heap?

-Raj

On Tue, Nov 25, 2014 at 3:10 PM, Robert Coli  wrote:

> On Tue, Nov 25, 2014 at 9:07 AM, Raj N  wrote:
>
>> What's the latest on the maximum number of keyspaces and/or tables that
>> one can have in Cassandra 2.1.x?
>>
>
> Most relevant changes lately would be :
>
> https://issues.apache.org/jira/browse/CASSANDRA-6689
> and
> https://issues.apache.org/jira/browse/CASSANDRA-6694
>
> Which should meaningfully reduce the amount of heap memtables consume.
> That heap can then be used to support more heap-persistent structures
> associated with many CFs. I have no idea how to estimate the scale of the
> improvement.
>
> As a general/meta statement, Cassandra is very multi-threaded, and
> consumes file handles like crazy. How many different query cases do you
> really want to put on one cluster/node? ;D
>
> =Rob
>
>


Keyspace and table/cf limits

2014-11-25 Thread Raj N
What's the latest on the maximum number of keyspaces and/or tables that one
can have in Cassandra 2.1.x?

-Raj


Re: Cassandra heap pre-1.1

2014-11-05 Thread Raj N
We are planning to upgrade soon. But in the meantime, I wanted to see if we
can tweak certain things.

-Rajesh

On Wed, Nov 5, 2014 at 3:10 PM, Robert Coli  wrote:

> On Tue, Nov 4, 2014 at 8:51 PM, Raj N  wrote:
>
>> Is there a good formula to calculate heap utilization in Cassandra
>> pre-1.1, specifically 1.0.10. We are seeing gc pressure on our nodes. And I
>> am trying to estimate what could be causing this? Using node tool info my
>> steady state heap is at about 10GB. XMX is 12G.
>>
>
> Basically, no. If you really want to know, take a heap dump and load it
> into Eclipse Memory Analyzer.
>
>
>> I have 4.5 GB of bloom filters which I can derive looking at cfstats
>>
>
> This is a *very* large percentage of your total heap, and is probably the
> lever you have most influence on pulling.
>
>
>> I have negligible row caching.
>>
>
> Row caching is generally not advised in that era, especially with heap
> pressure.
>
>
>> I have key caching enabled on my cfs. I couldn't find an easy way to
>> estimate how much this is using, but I tried to invalidate the key cache
>> and I got 1.3 GB back.
>>
>
> Key caching is generally advisable, but 1.3 GB is a lot of key cache.
>
>
>> That still only adds up to 5.8 GB. I know there is index sampling going
>> on as well. I have around 800 million rows. Is there a way to estimate how
>> much space this would add up to?
>>
>
> Plenty. You should reduce your bloom filter size, or upgrade to a version
> of Cassandra that moves stuff off the heap.
>
> =Rob
> http://twitter.com/rcolidba
>
>
>


Cassandra heap pre-1.1

2014-11-04 Thread Raj N
Is there a good formula to calculate heap utilization in Cassandra pre-1.1,
specifically 1.0.10? We are seeing GC pressure on our nodes, and I am trying
to estimate what could be causing it. Using nodetool info, my steady-state
heap is at about 10 GB. Xmx is 12G.

I have 4.5 GB of bloom filters, which I can derive by looking at cfstats.
I have negligible row caching.
I have key caching enabled on my cfs. I couldn't find an easy way to
estimate how much this is using, but I tried to invalidate the key cache
and I got 1.3 GB back.

That still only adds up to 5.8 GB. I know there is index sampling going on
as well. I have around 800 million rows. Is there a way to estimate how
much space this would add up to?

What else?
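
(A rough back-of-envelope for the index samples, assuming the 1.0-era
default index_interval of 128: 800 million rows / 128 is about 6.25 million
sampled keys. At a guessed 100-200 bytes per entry (key bytes, an 8-byte
position, plus Java object overhead) that is very roughly 0.6-1.2 GB of
heap, which together with the 4.5 GB of bloom filters and 1.3 GB of key
cache would account for a good chunk of the 10 GB steady state.)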

-Raj


Cassandra 0.8.4 node keeps crashing with OOM errors

2012-09-18 Thread Raj N
One of our nodes keeps crashing continuously with out of memory errors. I
see the following error in the logs -

 INFO 21:03:54,007 Creating new commitlog segment
/local3/logs/cassandra/commitlog/CommitLog-1348016634007.log
Java HotSpot(TM) 64-Bit Server VM warning: Attempt to allocate stack guard
pages failed.
Java HotSpot(TM) 64-Bit Server VM warning: Attempt to allocate stack guard
pages failed.
 INFO 21:03:54,017 Replaying
/local3/logs/cassandra/commitlog/CommitLog-1347950471775.log,
/local3/logs/cassandra/commitlog/CommitLog-1347952280335.log,
/local3/logs/cassandra/commitlog/CommitLog-1347954861373.log,
/local3/logs/cassandra/commitlog/CommitLog-1347957791957.log,
/local3/logs/cassandra/commitlog/CommitLog-1347959168686.log,
/local3/logs/cassandra/commitlog/CommitLog-1347961014948.log,
/local3/logs/cassandra/commitlog/CommitLog-1347968405068.log,
/local3/logs/cassandra/commitlog/CommitLog-1347972420459.log,
/local3/logs/cassandra/commitlog/CommitLog-1347975084979.log,
/local3/logs/cassandra/commitlog/CommitLog-1347975538081.log,
/local3/logs/cassandra/commitlog/CommitLog-1347976033450.log,
/local3/logs/cassandra/commitlog/CommitLog-1347976685447.log,
/local3/logs/cassandra/commitlog/CommitLog-1347977204225.log,
/local3/logs/cassandra/commitlog/CommitLog-1347977904344.log,
/local3/logs/cassandra/commitlog/CommitLog-1347978791835.log,
/local3/logs/cassandra/commitlog/CommitLog-1347979595214.log,
/local3/logs/cassandra/commitlog/CommitLog-1347980280043.log,
/local3/logs/cassandra/commitlog/CommitLog-1347980822272.log,
/local3/logs/cassandra/commitlog/CommitLog-1347981376135.log,
/local3/logs/cassandra/commitlog/CommitLog-1347982023403.log,
/local3/logs/cassandra/commitlog/CommitLog-1347982906942.log,
/local3/logs/cassandra/commitlog/CommitLog-1347983999163.log,
/local3/logs/cassandra/commitlog/CommitLog-1347985475186.log,
/local3/logs/cassandra/commitlog/CommitLog-1347987026500.log,
/local3/logs/cassandra/commitlog/CommitLog-1347988520038.log,
/local3/logs/cassandra/commitlog/CommitLog-1347990191955.log,
/local3/logs/cassandra/commitlog/CommitLog-1347991899744.log,
/local3/logs/cassandra/commitlog/CommitLog-1347993488753.log,
/local3/logs/cassandra/commitlog/CommitLog-1347994801952.log,
/local3/logs/cassandra/commitlog/CommitLog-1347995807921.log,
/local3/logs/cassandra/commitlog/CommitLog-1347996959596.log,
/local3/logs/cassandra/commitlog/CommitLog-1347997961248.log,
/local3/logs/cassandra/commitlog/CommitLog-1348005812088.log,
/local3/logs/cassandra/commitlog/CommitLog-1348015418757.log
ERROR 21:03:54,030 Exception encountered during startup.
java.lang.OutOfMemoryError
at java.io.RandomAccessFile.readBytes(Native Method)
at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:254)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:124)
at
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:196)
at
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:158)
at
org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:195)
at
org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:336)
at
org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:91)
Exception encountered during startup.
java.lang.OutOfMemoryError
at java.io.RandomAccessFile.readBytes(Native Method)
at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:254)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:124)
at
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:196)
at
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:158)
at
org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:195)
at
org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:336)
at
org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:91)
Java HotSpot(TM) 64-Bit Server VM warning: Attempt to allocate stack guard
pages failed.


Has anybody seen this before?

-Rajesh


Question on Read Repair

2012-09-16 Thread Raj N
Hi,
   I have a 2 DC setup (DC1:3, DC2:3). All reads and writes are at
LOCAL_QUORUM. The question is: if I do reads at LOCAL_QUORUM in DC1, will
read repair happen on the replicas in DC2?

Thanks
-Raj


Upgrade for Cassandra 0.8.4 to 1.+

2012-07-05 Thread Raj N
Hi experts,
 I am planning to upgrade from 0.8.4 to 1.+. What's the latest stable
version?

Thanks
-Rajesh


Re: Ball is rolling on High Performance Cassandra Cookbook second edition

2012-06-27 Thread Raj N
Great stuff!!!

On Tue, Jun 26, 2012 at 5:25 PM, Edward Capriolo wrote:

> Hello all,
>
> It has not been very long since the first book was published but
> several things have been added to Cassandra and a few things have
> changed. I am putting together a list of changed content, for example
> features like the old per Column family memtable flush settings versus
> the new system with the global variable.
>
> My editors have given me the green light to grow the second edition
> from ~200 pages currently up to 300 pages! This gives us the ability
> to add more items/sections to the text.
>
> Some things were missing from the first edition, such as Hector
> support. Nate has offered to help me in this area. Please feel free to
> contact me with any ideas and suggestions of recipes you would like to see in
> the book. Also get in touch if you want to write a recipe. Several
> people added content to the first edition and it would be great to see
> that type of participation again.
>
> Thank you,
> Edward
>


Re: Unbalanced ring in Cassandra 0.8.4

2012-06-20 Thread Raj N
Nick, thanks for the response. Does cleanup only clean up keys that no
longer belong to that node? Just to add more color: when I bulk loaded all
my data into these 6 nodes, all of them had the same amount of data. After
the first nodetool repair, the first node started having more data than the
rest of the cluster, and since then it has never come back down. When I run
cfstats on the node, the amount of data for every column family is almost 2
times the amount of data on the others. This is true for the number of keys
estimate as well. For 1 CF I see more than double the number of keys, and
that's the largest CF as well, with 34 GB of data.

Thanks
-Rajesh

On Wed, Jun 20, 2012 at 12:32 AM, Nick Bailey  wrote:

> No. Cleanup will scan each sstable to remove data that is no longer
> owned by that specific node. It won't compact the sstables together
> however.
>
> On Tue, Jun 19, 2012 at 11:11 PM, Raj N  wrote:
> > But wont that also run a major compaction which is not recommended
> anymore.
> >
> > -Raj
> >
> >
> > On Sun, Jun 17, 2012 at 11:58 PM, aaron morton 
> > wrote:
> >>
> >> Assuming you have been running repair, it' can't hurt.
> >>
> >> Cheers
> >>
> >> -
> >> Aaron Morton
> >> Freelance Developer
> >> @aaronmorton
> >> http://www.thelastpickle.com
> >>
> >> On 17/06/2012, at 4:06 AM, Raj N wrote:
> >>
> >> Nick, do you think I should still run cleanup on the first node.
> >>
> >> -Rajesh
> >>
> >> On Fri, Jun 15, 2012 at 3:47 PM, Raj N  wrote:
> >>>
> >>> I did run nodetool move. But that was when I was setting up the cluster
> >>> which means I didn't have any data at that time.
> >>>
> >>> -Raj
> >>>
> >>>
> >>> On Fri, Jun 15, 2012 at 1:29 PM, Nick Bailey 
> wrote:
> >>>>
> >>>> Did you start all your nodes at the correct tokens or did you balance
> >>>> by moving them? Moving nodes around won't delete unneeded data after
> >>>> the move is done.
> >>>>
> >>>> Try running 'nodetool cleanup' on all of your nodes.
> >>>>
> >>>> On Fri, Jun 15, 2012 at 12:24 PM, Raj N 
> wrote:
> >>>> > Actually I am not worried about the percentage. Its the data I am
> >>>> > concerned
> >>>> > about. Look at the first node. It has 102.07GB data. And the other
> >>>> > nodes
> >>>> > have around 60 GB(one has 69, but lets ignore that one). I am not
> >>>> > understanding why the first node has almost double the data.
> >>>> >
> >>>> > Thanks
> >>>> > -Raj
> >>>> >
> >>>> >
> >>>> > On Fri, Jun 15, 2012 at 11:06 AM, Nick Bailey 
> >>>> > wrote:
> >>>> >>
> >>>> >> This is just a known problem with the nodetool output and multiple
> >>>> >> DCs. Your configuration is correct. The problem with nodetool is
> >>>> >> fixed
> >>>> >> in 1.1.1
> >>>> >>
> >>>> >> https://issues.apache.org/jira/browse/CASSANDRA-3412
> >>>> >>
> >>>> >> On Fri, Jun 15, 2012 at 9:59 AM, Raj N 
> >>>> >> wrote:
> >>>> >> > Hi experts,
> >>>> >> > I have a 6 node cluster across 2 DCs(DC1:3, DC2:3). I have
> >>>> >> > assigned
> >>>> >> > tokens using the first strategy(adding 1) mentioned here -
> >>>> >> >
> >>>> >> > http://wiki.apache.org/cassandra/Operations?#Token_selection
> >>>> >> >
> >>>> >> > But when I run nodetool ring on my cluster, this is the result I
> >>>> >> > get -
> >>>> >> >
> >>>> >> > Address         DC   Rack   Status  State   Load        Owns     Token
> >>>> >> >                                                                  113427455640312814857969558651062452225
> >>>> >> > 172.17.72.91    DC1  RAC13  Up      Normal  102.07 GB   33.33%   0
> >>>> >> > 45.10.80.144    DC2  RAC5   Up      Normal  59.1 GB     0.00%    1
> >>>> >> > 172.17.72.93    DC1  RAC18  Up      Normal  59.57 GB    33.33%   56713727820156407428984779325531226112
> >>>> >> > 45.10.80.146    DC2  RAC7   Up      Normal  59.64 GB    0.00%    56713727820156407428984779325531226113
> >>>> >> > 172.17.72.95    DC1  RAC19  Up      Normal  69.58 GB    33.33%   113427455640312814857969558651062452224
> >>>> >> > 45.10.80.148    DC2  RAC9   Up      Normal  59.31 GB    0.00%    113427455640312814857969558651062452225
> >>>> >> >
> >>>> >> >
> >>>> >> > As you can see the first node has considerably more load than the
> >>>> >> > others(almost double) which is surprising since all these are
> >>>> >> > replicas
> >>>> >> > of
> >>>> >> > each other. I am running Cassandra 0.8.4. Is there an explanation
> >>>> >> > for
> >>>> >> > this
> >>>> >> > behaviour?
> >>>> >> > Could https://issues.apache.org/jira/browse/CASSANDRA-2433 be
> >>>> >> > the
> >>>> >> > cause for this?
> >>>> >> >
> >>>> >> > Thanks
> >>>> >> > -Raj
> >>>> >
> >>>> >
> >>>
> >>>
> >>
> >>
> >
>


Re: performance problems on new cluster

2012-06-20 Thread Raj N
How did you solve your problem eventually? I am experiencing something
similar. Did you run cleanup on the node that has 80 GB of data?

-Raj

On Mon, Aug 15, 2011 at 10:12 PM, aaron morton wrote:

> Just checking do you have read_repair_chance set to something ? The second
> request is going to all replicas which should only happen with CL ONE if
> read repair is running for the request.
>
> The exceptions are happening during read repair, which is running async to
> the main read request. It's occurring after we have detected a digest
> mismatch, when the process is trying to reconcile the full data responses
> from the replicas. The AssertionError is happening because the replica sent
> a digest response. The NPE is probably happening because the response did
> not include a row; how / why the response is not marked as digest is a
> mystery.
>
> This may be related to the main problem. If not, don't forget to come back
> to it.
>
> In your first log with the timeout, something is not right…
> > DEBUG [pool-2-thread-14] 2011-08-15 05:26:15,187 StorageProxy.java (line
> 546) reading data from /dc1host3
> > DEBUG [pool-2-thread-14] 2011-08-15 05:26:35,191 StorageProxy.java (line
> 593) Read timeout: java.util.concurrent.TimeoutException: Operation timed
> out - received only 1 responses from /dc1host3,  .
> The reading… log messages are written before the inter node messages are
> sent. For this CL ONE read only node dc 1 host 3 is involved and it has
> been asked for the data response. Makes sense if Read Repair is not running
> for the request.
>
> *But* the timeout error says we got a response from dc 1 host 3. One way I
> can see that happening is dc 1 host 3 returning a digest instead of a data
> response (see o.a.c.service.ReadCallback.response(Message)). Which kind of
> matches what we saw above.
>
> We need some more extensive logging and probably a trip to
> https://issues.apache.org/jira/browse/CASSANDRA
>
> Would be good to see full DEBUG logs from both dc1 host 1 and dc1 host 3
> if you can that reproduce the fault like the first one. Turn off read
> repair to make things a little simpler. If that's too much, we need
> StorageProxy, ReadCallback, ReadVerbHandler.
>
> Can you update the email thread with the ticket.
>
> Thanks
> A
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 15/08/2011, at 7:34 PM, Anton Winter wrote:
>
> >> OK, node latency is fine and you are using some pretty low
> >> consistency. You said NTS with RF 2, is that RF 2 for each DC ?
> >
> > Correct, I'm using RF 2 for each DC.
> >
> >
> >
> > I was able to reproduce the cli timeouts on the non replica nodes.
> >
> > The debug log output from dc1host1 (non replica node):
> >
> > DEBUG [pool-2-thread-14] 2011-08-15 05:26:15,183 StorageProxy.java (line
> 518) Command/ConsistencyLevel is SliceFromReadCommand(table='ks1',
> key='userid1', column_parent='QueryPath(columnFamilyName='cf1',
> superColumnName='java.nio.HeapByteBuffer[pos=64 lim=67 cap=109]',
> columnName='null')', start='', finish='', reversed=false, count=100)/ONE
> > DEBUG [pool-2-thread-14] 2011-08-15 05:26:15,187 StorageProxy.java (line
> 546) reading data from /dc1host3
> > DEBUG [pool-2-thread-14] 2011-08-15 05:26:35,191 StorageProxy.java (line
> 593) Read timeout: java.util.concurrent.TimeoutException: Operation timed
> out - received only 1 responses from /dc1host3,  .
> >
> >
> > If the query is run again on the same node (dc1host1) 0 rows are
> returned and the following DEBUG messages are logged:
> >
> >
> > DEBUG [pool-2-thread-14] 2011-08-15 05:32:05,513 StorageProxy.java (line
> 518) Command/ConsistencyLevel is SliceFromReadCommand(table='ks1',
> key='userid1', column_parent='QueryPath(columnFamilyName='cf1',
> superColumnName='java.nio.HeapByteBuffer[pos=64 lim=67 cap=109]',
> columnName='null')', start='', finish='', reversed=false, count=100)/ONE
> > DEBUG [pool-2-thread-14] 2011-08-15 05:32:05,513 StorageProxy.java (line
> 546) reading data from /dc1host3
> > DEBUG [pool-2-thread-14] 2011-08-15 05:32:05,513 StorageProxy.java (line
> 562) reading digest from /dc1host2
> > DEBUG [pool-2-thread-14] 2011-08-15 05:32:05,514 StorageProxy.java (line
> 562) reading digest from /dc2host3
> > DEBUG [pool-2-thread-14] 2011-08-15 05:32:05,514 StorageProxy.java (line
> 562) reading digest from /dc2host2
> > DEBUG [pool-2-thread-14] 2011-08-15 05:32:05,514 StorageProxy.java (line
> 562) reading digest from /dc3host2
> > DEBUG [pool-2-thread-14] 2011-08-15 05:32:05,514 StorageProxy.java (line
> 562) reading digest from /dc3host3
> > DEBUG [pool-2-thread-14] 2011-08-15 05:32:05,514 StorageProxy.java (line
> 562) reading digest from /dc4host3
> > DEBUG [pool-2-thread-14] 2011-08-15 05:32:05,514 StorageProxy.java (line
> 562) reading digest from /dc4host2
> > DEBUG [pool-2-thread-14] 2011-08-15 05:32:06,022 StorageProxy.java (line
> 588) Read: 508 ms.
> > ERROR [ReadRepairStage:2112] 2011-08-15 05:32

Re: Unbalanced ring in Cassandra 0.8.4

2012-06-19 Thread Raj N
But won't that also run a major compaction, which is not recommended anymore?

-Raj

On Sun, Jun 17, 2012 at 11:58 PM, aaron morton wrote:

> Assuming you have been running repair, it' can't hurt.
>
> Cheers
>
>   -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/06/2012, at 4:06 AM, Raj N wrote:
>
> Nick, do you think I should still run cleanup on the first node.
>
> -Rajesh
>
> On Fri, Jun 15, 2012 at 3:47 PM, Raj N  wrote:
>
>> I did run nodetool move. But that was when I was setting up the cluster
>> which means I didn't have any data at that time.
>>
>> -Raj
>>
>>
>> On Fri, Jun 15, 2012 at 1:29 PM, Nick Bailey  wrote:
>>
>>> Did you start all your nodes at the correct tokens or did you balance
>>> by moving them? Moving nodes around won't delete unneeded data after
>>> the move is done.
>>>
>>> Try running 'nodetool cleanup' on all of your nodes.
>>>
>>> On Fri, Jun 15, 2012 at 12:24 PM, Raj N  wrote:
>>> > Actually I am not worried about the percentage. Its the data I am
>>> concerned
>>> > about. Look at the first node. It has 102.07GB data. And the other
>>> nodes
>>> > have around 60 GB(one has 69, but lets ignore that one). I am not
>>> > understanding why the first node has almost double the data.
>>> >
>>> > Thanks
>>> > -Raj
>>> >
>>> >
>>> > On Fri, Jun 15, 2012 at 11:06 AM, Nick Bailey 
>>> wrote:
>>> >>
>>> >> This is just a known problem with the nodetool output and multiple
>>> >> DCs. Your configuration is correct. The problem with nodetool is fixed
>>> >> in 1.1.1
>>> >>
>>> >> https://issues.apache.org/jira/browse/CASSANDRA-3412
>>> >>
>>> >> On Fri, Jun 15, 2012 at 9:59 AM, Raj N 
>>> wrote:
>>> >> > Hi experts,
>>> >> > I have a 6 node cluster across 2 DCs(DC1:3, DC2:3). I have
>>> assigned
>>> >> > tokens using the first strategy(adding 1) mentioned here -
>>> >> >
>>> >> > http://wiki.apache.org/cassandra/Operations?#Token_selection
>>> >> >
>>> >> > But when I run nodetool ring on my cluster, this is the result I
>>> get -
>>> >> >
>>> >> > Address         DC   Rack   Status  State   Load        Owns     Token
>>> >> >                                                                  113427455640312814857969558651062452225
>>> >> > 172.17.72.91    DC1  RAC13  Up      Normal  102.07 GB   33.33%   0
>>> >> > 45.10.80.144    DC2  RAC5   Up      Normal  59.1 GB     0.00%    1
>>> >> > 172.17.72.93    DC1  RAC18  Up      Normal  59.57 GB    33.33%   56713727820156407428984779325531226112
>>> >> > 45.10.80.146    DC2  RAC7   Up      Normal  59.64 GB    0.00%    56713727820156407428984779325531226113
>>> >> > 172.17.72.95    DC1  RAC19  Up      Normal  69.58 GB    33.33%   113427455640312814857969558651062452224
>>> >> > 45.10.80.148    DC2  RAC9   Up      Normal  59.31 GB    0.00%    113427455640312814857969558651062452225
>>> >> >
>>> >> >
>>> >> > As you can see the first node has considerably more load than the
>>> >> > others(almost double) which is surprising since all these are
>>> replicas
>>> >> > of
>>> >> > each other. I am running Cassandra 0.8.4. Is there an explanation
>>> for
>>> >> > this
>>> >> > behaviour? Could
>>> https://issues.apache.org/jira/browse/CASSANDRA-2433 be
>>> >> > the
>>> >> > cause for this?
>>> >> >
>>> >> > Thanks
>>> >> > -Raj
>>> >
>>> >
>>>
>>
>>
>
>


Re: Rules for Major Compaction

2012-06-19 Thread Raj N
Thanks Ed. I am on 0.8.4, so I don't have the Leveled option, only SizeTiered.
I have a strange problem. I have a 6 node cluster (DC1=3, DC2=3). One of the
nodes has 105 GB of data whereas every other node has 60 GB, in spite of each
one being a replica of the other. And I am contemplating whether I should
be running compact/cleanup on the node with 105 GB. Btw, side question: does
it make sense to run it just for 1 node, or is it advisable to run it for
all? This node has also been giving me some issues lately. Last night during
some heavy load, I got a lot of TimedOutExceptions from this node. The node
was also flapping; I could see in the logs that it could see its peers dying
and coming back up, ultimately throwing UnavailableException (and sometimes
TimedOutException) on my requests. I use JNA mlockAll, so the JVM is
definitely not swapping. I see a full GC running (according to GCInspector)
for 15 seconds around the same time, but even after the GC, requests were
timing out. Cassandra runs with Xmx8G, Xmn800M. Total RAM on the machine is
62 GB. I don't use any meaningful key cache or row cache and rely on the OS
file cache. Top shows VIRT as 116G (which makes sense since I have 105 GB of
data). Have you seen any issues with data this size on a node?

-Raj

On Tue, Jun 19, 2012 at 3:30 PM, Edward Capriolo wrote:

> Hey my favorite question! It is a loaded question and it depends on
> your workload. The answer has evolved over time.
>
> In the old days <0.6.5 the only way to remove tombstones was major
> compaction. This is not true in any modern version.
>
> (Also in the old days you had to run cleanup to clear hints)
>
> Cassandra now has two compaction strategies SizeTiered and Leveled.
> Leveled DB can not be manually compacted.
>
>
> You final two sentences are good ground rules. In our case we have
> some column families that have high churn, for example a gc_grace
> period of 4 days but the data is re-written completely every day.
> Write activity over time will eventually cause tombstone removal but
> we can expedite the process by forcing a major at night. Because the
> tables are not really growing the **warning** below does not apply.
>
> **Warning** this creates one large sstable. Which is not always
> desirable, because it fiddles with the heuristics of SizeTiered
> (having one big table and other smaller ones).
>
> The updated answer is "You probably do not want to run major
> compactions, but some use cases could see some benefits"
>
> On Tue, Jun 19, 2012 at 10:51 AM, Raj N  wrote:
> > DataStax recommends not to run major compactions. Edward Capriolo's
> > Cassandra High Performance book suggests that major compaction is a good
> > thing. And should be run on a regular basis. Are there any ground rules
> > about running major compactions? For example, if you have write-once
> kind of
> > data that is never updated  then it probably makes sense to not run major
> > compaction. But if you have data which can be deleted or overwritten
> does it
> > make sense to run major compaction on a regular basis?
> >
> > Thanks
> > -Raj
>
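
(For reference, forcing a major compaction on one column family as Edward
describes, e.g. from a nightly cron; keyspace and column family names here
are hypothetical:)

nodetool -h localhost compact MyKeyspace MyColumnFamily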


Rules for Major Compaction

2012-06-19 Thread Raj N
DataStax recommends not to run major compactions. Edward Capriolo's
Cassandra High Performance book suggests that major compaction is a good
thing that should be run on a regular basis. Are there any ground rules
about running major compactions? For example, if you have write-once kind
of data that is never updated, then it probably makes sense to not run
major compaction. But if you have data which can be deleted or overwritten,
does it make sense to run major compaction on a regular basis?

Thanks
-Raj


Re: Unbalanced ring in Cassandra 0.8.4

2012-06-16 Thread Raj N
Nick, do you think I should still run cleanup on the first node?

-Rajesh

On Fri, Jun 15, 2012 at 3:47 PM, Raj N  wrote:

> I did run nodetool move. But that was when I was setting up the cluster
> which means I didn't have any data at that time.
>
> -Raj
>
>
> On Fri, Jun 15, 2012 at 1:29 PM, Nick Bailey  wrote:
>
>> Did you start all your nodes at the correct tokens or did you balance
>> by moving them? Moving nodes around won't delete unneeded data after
>> the move is done.
>>
>> Try running 'nodetool cleanup' on all of your nodes.
>>
>> On Fri, Jun 15, 2012 at 12:24 PM, Raj N  wrote:
>> > Actually I am not worried about the percentage. Its the data I am
>> concerned
>> > about. Look at the first node. It has 102.07GB data. And the other nodes
>> > have around 60 GB(one has 69, but lets ignore that one). I am not
>> > understanding why the first node has almost double the data.
>> >
>> > Thanks
>> > -Raj
>> >
>> >
>> > On Fri, Jun 15, 2012 at 11:06 AM, Nick Bailey 
>> wrote:
>> >>
>> >> This is just a known problem with the nodetool output and multiple
>> >> DCs. Your configuration is correct. The problem with nodetool is fixed
>> >> in 1.1.1
>> >>
>> >> https://issues.apache.org/jira/browse/CASSANDRA-3412
>> >>
>> >> On Fri, Jun 15, 2012 at 9:59 AM, Raj N 
>> wrote:
>> >> > Hi experts,
>> >> > I have a 6 node cluster across 2 DCs(DC1:3, DC2:3). I have
>> assigned
>> >> > tokens using the first strategy(adding 1) mentioned here -
>> >> >
>> >> > http://wiki.apache.org/cassandra/Operations?#Token_selection
>> >> >
>> >> > But when I run nodetool ring on my cluster, this is the result I get
>> -
>> >> >
>> >> > Address         DC   Rack   Status  State   Load        Owns     Token
>> >> >                                                                  113427455640312814857969558651062452225
>> >> > 172.17.72.91    DC1  RAC13  Up      Normal  102.07 GB   33.33%   0
>> >> > 45.10.80.144    DC2  RAC5   Up      Normal  59.1 GB     0.00%    1
>> >> > 172.17.72.93    DC1  RAC18  Up      Normal  59.57 GB    33.33%   56713727820156407428984779325531226112
>> >> > 45.10.80.146    DC2  RAC7   Up      Normal  59.64 GB    0.00%    56713727820156407428984779325531226113
>> >> > 172.17.72.95    DC1  RAC19  Up      Normal  69.58 GB    33.33%   113427455640312814857969558651062452224
>> >> > 45.10.80.148    DC2  RAC9   Up      Normal  59.31 GB    0.00%    113427455640312814857969558651062452225
>> >> >
>> >> >
>> >> > As you can see the first node has considerably more load than the
>> >> > others(almost double) which is surprising since all these are
>> replicas
>> >> > of
>> >> > each other. I am running Cassandra 0.8.4. Is there an explanation for
>> >> > this
>> >> > behaviour? Could
>> https://issues.apache.org/jira/browse/CASSANDRA-2433 be
>> >> > the
>> >> > cause for this?
>> >> >
>> >> > Thanks
>> >> > -Raj
>> >
>> >
>>
>
>


Re: Unbalanced ring in Cassandra 0.8.4

2012-06-15 Thread Raj N
I did run nodetool move. But that was when I was setting up the cluster
which means I didn't have any data at that time.

-Raj

On Fri, Jun 15, 2012 at 1:29 PM, Nick Bailey  wrote:

> Did you start all your nodes at the correct tokens or did you balance
> by moving them? Moving nodes around won't delete unneeded data after
> the move is done.
>
> Try running 'nodetool cleanup' on all of your nodes.
>
> On Fri, Jun 15, 2012 at 12:24 PM, Raj N  wrote:
> > Actually I am not worried about the percentage. Its the data I am
> concerned
> > about. Look at the first node. It has 102.07GB data. And the other nodes
> > have around 60 GB(one has 69, but lets ignore that one). I am not
> > understanding why the first node has almost double the data.
> >
> > Thanks
> > -Raj
> >
> >
> > On Fri, Jun 15, 2012 at 11:06 AM, Nick Bailey  wrote:
> >>
> >> This is just a known problem with the nodetool output and multiple
> >> DCs. Your configuration is correct. The problem with nodetool is fixed
> >> in 1.1.1
> >>
> >> https://issues.apache.org/jira/browse/CASSANDRA-3412
> >>
> >> On Fri, Jun 15, 2012 at 9:59 AM, Raj N  wrote:
> >> > Hi experts,
> >> > I have a 6 node cluster across 2 DCs(DC1:3, DC2:3). I have
> assigned
> >> > tokens using the first strategy(adding 1) mentioned here -
> >> >
> >> > http://wiki.apache.org/cassandra/Operations?#Token_selection
> >> >
> >> > But when I run nodetool ring on my cluster, this is the result I get -
> >> >
> >> > Address         DC   Rack   Status  State   Load        Owns     Token
> >> >                                                                  113427455640312814857969558651062452225
> >> > 172.17.72.91    DC1  RAC13  Up      Normal  102.07 GB   33.33%   0
> >> > 45.10.80.144    DC2  RAC5   Up      Normal  59.1 GB     0.00%    1
> >> > 172.17.72.93    DC1  RAC18  Up      Normal  59.57 GB    33.33%   56713727820156407428984779325531226112
> >> > 45.10.80.146    DC2  RAC7   Up      Normal  59.64 GB    0.00%    56713727820156407428984779325531226113
> >> > 172.17.72.95    DC1  RAC19  Up      Normal  69.58 GB    33.33%   113427455640312814857969558651062452224
> >> > 45.10.80.148    DC2  RAC9   Up      Normal  59.31 GB    0.00%    113427455640312814857969558651062452225
> >> >
> >> >
> >> > As you can see the first node has considerably more load than the
> >> > others(almost double) which is surprising since all these are replicas
> >> > of
> >> > each other. I am running Cassandra 0.8.4. Is there an explanation for
> >> > this
> >> > behaviour? Could https://issues.apache.org/jira/browse/CASSANDRA-2433
>  be
> >> > the
> >> > cause for this?
> >> >
> >> > Thanks
> >> > -Raj
> >
> >
>


Re: Unbalanced ring in Cassandra 0.8.4

2012-06-15 Thread Raj N
Actually I am not worried about the percentage. It's the data I am concerned
about. Look at the first node: it has 102.07 GB of data, and the other nodes
have around 60 GB (one has 69, but let's ignore that one). I am not
understanding why the first node has almost double the data.

Thanks
-Raj

On Fri, Jun 15, 2012 at 11:06 AM, Nick Bailey  wrote:

> This is just a known problem with the nodetool output and multiple
> DCs. Your configuration is correct. The problem with nodetool is fixed
> in 1.1.1
>
> https://issues.apache.org/jira/browse/CASSANDRA-3412
>
> On Fri, Jun 15, 2012 at 9:59 AM, Raj N  wrote:
> > Hi experts,
> > I have a 6 node cluster across 2 DCs(DC1:3, DC2:3). I have assigned
> > tokens using the first strategy(adding 1) mentioned here -
> >
> > http://wiki.apache.org/cassandra/Operations?#Token_selection
> >
> > But when I run nodetool ring on my cluster, this is the result I get -
> >
> > Address         DC   Rack   Status  State   Load        Owns     Token
> >                                                                  113427455640312814857969558651062452225
> > 172.17.72.91    DC1  RAC13  Up      Normal  102.07 GB   33.33%   0
> > 45.10.80.144    DC2  RAC5   Up      Normal  59.1 GB     0.00%    1
> > 172.17.72.93    DC1  RAC18  Up      Normal  59.57 GB    33.33%   56713727820156407428984779325531226112
> > 45.10.80.146    DC2  RAC7   Up      Normal  59.64 GB    0.00%    56713727820156407428984779325531226113
> > 172.17.72.95    DC1  RAC19  Up      Normal  69.58 GB    33.33%   113427455640312814857969558651062452224
> > 45.10.80.148    DC2  RAC9   Up      Normal  59.31 GB    0.00%    113427455640312814857969558651062452225
> >
> >
> > As you can see the first node has considerably more load than the
> > others(almost double) which is surprising since all these are replicas of
> > each other. I am running Cassandra 0.8.4. Is there an explanation for
> this
> > behaviour? Could https://issues.apache.org/jira/browse/CASSANDRA-2433 be
> the
> > cause for this?
> >
> > Thanks
> > -Raj
>


Unbalanced ring in Cassandra 0.8.4

2012-06-15 Thread Raj N
Hi experts,
I have a 6 node cluster across 2 DCs (DC1:3, DC2:3). I have assigned
tokens using the first strategy(adding 1) mentioned here -

http://wiki.apache.org/cassandra/Operations?#Token_selection

But when I run nodetool ring on my cluster, this is the result I get -

Address         DC   Rack   Status  State   Load        Owns     Token
                                                                 113427455640312814857969558651062452225
172.17.72.91    DC1  RAC13  Up      Normal  102.07 GB   33.33%   0
45.10.80.144    DC2  RAC5   Up      Normal  59.1 GB     0.00%    1
172.17.72.93    DC1  RAC18  Up      Normal  59.57 GB    33.33%   56713727820156407428984779325531226112
45.10.80.146    DC2  RAC7   Up      Normal  59.64 GB    0.00%    56713727820156407428984779325531226113
172.17.72.95    DC1  RAC19  Up      Normal  69.58 GB    33.33%   113427455640312814857969558651062452224
45.10.80.148    DC2  RAC9   Up      Normal  59.31 GB    0.00%    113427455640312814857969558651062452225


As you can see, the first node has considerably more load than the
others (almost double), which is surprising since all these are replicas of
each other. I am running Cassandra 0.8.4. Is there an explanation for this
behaviour? Could https://issues.apache.org/jira/browse/CASSANDRA-2433 be
the cause for this?

Thanks
-Raj


Re: nodetool repair taking forever

2012-05-25 Thread Raj N
Thanks for the reply Aaron. By compaction being on, if you mean whether I run
nodetool compact, then the answer is no. I haven't set any explicit
compaction thresholds, which means it should be using the defaults, min 4 and
max 32. Having said that, to solve the problem I just did a full cluster
restart and ran nodetool repair again. The entire cluster of 6 nodes was
repaired in 10 hours. I am also contemplating, since all 6 nodes are
replicas of each other, whether I even need to run repair on all the nodes.
Wouldn't running it on the first node suffice, since it will repair all the
ranges it's responsible for (which is everything)? So unless I upgrade to
1.0.+, where I can use the -pr option, is it advisable to just run repair on
the first node?

-Raj

On Tue, May 22, 2012 at 5:05 AM, aaron morton wrote:

> I also don't understand, if all these nodes are replicas of each other, why
> it is that the first node has almost double the data.
>
> Have you performed any token moves ? Old data is not deleted unless you
> run nodetool cleanup.
> Another possibility is things like a lot of hints. Admittedly it would
> have to be a *lot* of hints.
> The third is that compaction has fallen behind.
>
> This week its even worse, the nodetool repair has been running for the
> last 15 hours just on the first node and when I run nodetool
> compactionstats I constantly see this -
>
> pending tasks: 3
>
> First check the logs for errors.
>
> Repair will first calculate the differences, you can see this as a
> validation compaction in nodetool compactionstats.
> Then it will stream the data, you can watch that with nodetool netstats.
>
> Try to work out which part is taking the most time.   15 hours for 50Gb
> sounds like a long time (btw do you have compaction on ?)
>
> Cheers
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 20/05/2012, at 3:14 AM, Raj N wrote:
>
> Hi experts,
>
> I have a 6 node cluster spread across 2 DCs.
>
> DC   Rack   Status  State   Load      Owns     Token
>                                                113427455640312814857969558651062452225
> DC1  RAC13  Up      Normal  95.98 GB  33.33%   0
> DC2  RAC5   Up      Normal  50.79 GB  0.00%    1
> DC1  RAC18  Up      Normal  50.83 GB  33.33%   56713727820156407428984779325531226112
> DC2  RAC7   Up      Normal  50.74 GB  0.00%    56713727820156407428984779325531226113
> DC1  RAC19  Up      Normal  61.72 GB  33.33%   113427455640312814857969558651062452224
> DC2  RAC9   Up      Normal  50.83 GB  0.00%    113427455640312814857969558651062452225
>
> They are all replicas of each other. All reads and writes are done at
> LOCAL_QUORUM. We are on Cassandra 0.8.4. I see that our weekend nodetool
> repair runs for more than 12 hours. Especially on the first one which has
> 96 GB data. Is this usual? We are using 500 GB SAS drives with ext4 file
> system. This gets worse every week. This week its even worse, the nodetool
> repair has been running for the last 15 hours just on the first node and
> when I run nodetool compactionstats I constantly see this -
>
> pending tasks: 3
>
> and nothing else. Looks like it's just stuck. There's nothing substantial
> in the logs as well. I also don't understand, if all these nodes are
> replicas of each other, why it is that the first node has almost double the
> data. Any
> help will be really appreciated.
>
> Thanks
> -Raj
>
>
>


Re: Repair Process Taking too long

2012-05-19 Thread Raj N
Can I infer from this that if I have 3 replicas, then running repair
without -pr on 1 node will repair the other 2 replicas as well?

-Raj
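
(For reference, the periodic case Sylvain describes in the quoted thread
below, as a command; a sketch, noting that -pr needs a Cassandra version
that has it, 1.0+ per other threads in this digest:)

# run on *every* node of the cluster, repairing only its primary range
nodetool -h localhost repair -pr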

On Sat, Apr 14, 2012 at 2:54 AM, Zhu Han  wrote:

>
> On Sat, Apr 14, 2012 at 1:57 PM, Igor  wrote:
>
>> Hi!
>>
>> What is the difference between 'repair' and '-pr repair'? Simple repair
>> touch all token ranges (for all nodes) and -pr touch only range for which
>> given node responsible?
>>
>>
> -pr only touches the primary range of the node.  If you execute -pr
> against all nodes in replica groups, then all ranges are repaired.
>
>>
>>
>> On 04/12/2012 05:59 PM, Sylvain Lebresne wrote:
>>
>>> On Thu, Apr 12, 2012 at 4:06 PM, Frank Ng  wrote:
>>>
 I also noticed that if I use the -pr option, the repair process went
 down
 from 30 hours to 9 hours.  Is the -pr option safe to use if I want to
 run
 repair processes in parallel on nodes that are not replication peers?

>>> There are pretty much two use cases for repair:
>>> 1) to rebuild a node: if say a node has lost some data due to a hard
>>> drive corruption or the like and you want to rebuild what's missing
>>> 2) the periodic repairs to avoid problems with deleted data coming back
>>> from the dead (basically:
>>> http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
>>> )
>>>
>>> In case 1) you want to run 'nodetool repair' (without -pr) against the
>>> node to rebuild.
>>> In case 2) (which I suspect is the case your talking now), you *want*
>>> to use 'nodetool repair -pr' on *every* node of the cluster. I.e.
>>> that's the most efficient way to do it. The only reason not to use -pr
>>> in this case would be that it's not available because you're using an
>>> old version of Cassandra. And yes, it's is safe to run with -pr in
>>> parallel on nodes that are not replication peers.
>>>
>>> --
>>> Sylvain
>>>
>>>
>>>  thanks


 On Thu, Apr 12, 2012 at 12:06 AM, Frank Ng  wrote:

> Thank you for confirming that the per node data size is most likely
> causing the long repair process.  I have tried a repair on smaller
> column
> families and it was significantly faster.
>
> On Wed, Apr 11, 2012 at 9:55 PM, aaron morton
> wrote:
>
>> If you have 1TB of data it will take a long time to repair. Every bit
>> of
>> data has to be read and a hash generated. This is one of the reasons
>> we
>> often suggest that around 300 to 400Gb per node is a good load in the
>> general case.
>>
>> Look at nodetool compactionstats .Is there a validation compaction
>> running ? If so it is still building the merkle  hash tree.
>>
>> Look at nodetool netstats . Is it streaming data ? If so all hash
>> trees
>> have been calculated.
>>
>> Cheers
>>
>>
>> -
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 12/04/2012, at 2:16 AM, Frank Ng wrote:
>>
>> Can you expand further on your issue? Were you using Random
>> Patitioner?
>>
>> thanks
>>
>> On Tue, Apr 10, 2012 at 5:35 PM, David Leimbach
>> wrote:
>>
>>> I had this happen when I had really poorly generated tokens for the
>>> ring.  Cassandra seems to accept numbers that are too big.  You get
>>> hot
>>> spots when you think you should be balanced and repair never ends (I
>>> think
>>> there is a 48 hour timeout).
>>>
>>>
>>> On Tuesday, April 10, 2012, Frank Ng wrote:
>>>
 I am not using size-tiered compaction.


 On Tue, Apr 10, 2012 at 12:56 PM, Jonathan Rhone
 wrote:

> Data size, number of nodes, RF?
>
> Are you using size-tiered compaction on any of the column families
> that hold a lot of your data?
>
> Do your cassandra logs say you are streaming a lot of ranges?
> zgrep -E "(Performing streaming repair|out of sync)"
>
>
> On Tue, Apr 10, 2012 at 9:45 AM, Igor  wrote:
>
>> On 04/10/2012 07:16 PM, Frank Ng wrote:
>>
>> Short answer - yes.
>> But you are asking wrong question.
>>
>>
>> I think both processes are taking a while.  When it starts up,
>> netstats and compactionstats show nothing.  Anyone out there
>> successfully
>> using ext3 and their repair processes are faster than this?
>>
>> On Tue, Apr 10, 2012 at 10:42 AM, Igor
>>  wrote:
>>
>>> Hi
>>>
>>> You can check with nodetool  which part of repair process is
>>> slow -
>>> network streams or verify compactions. use nodetool netstats or
>>> compactionstats.
>>>
>>>
>>> On 04/10/2012 05:16 PM, Frank Ng

nodetool repair taking forever

2012-05-19 Thread Raj N
Hi experts,

I have a 6 node cluster spread across 2 DCs.

DC   Rack   Status  State   Load      Owns     Token
                                               113427455640312814857969558651062452225
DC1  RAC13  Up      Normal  95.98 GB  33.33%   0
DC2  RAC5   Up      Normal  50.79 GB  0.00%    1
DC1  RAC18  Up      Normal  50.83 GB  33.33%   56713727820156407428984779325531226112
DC2  RAC7   Up      Normal  50.74 GB  0.00%    56713727820156407428984779325531226113
DC1  RAC19  Up      Normal  61.72 GB  33.33%   113427455640312814857969558651062452224
DC2  RAC9   Up      Normal  50.83 GB  0.00%    113427455640312814857969558651062452225

They are all replicas of each other. All reads and writes are done at
LOCAL_QUORUM. We are on Cassandra 0.8.4. I see that our weekend nodetool
repair runs for more than 12 hours, especially on the first node, which has
96 GB of data. Is this usual? We are using 500 GB SAS drives with the ext4
file system. This gets worse every week. This week it's even worse: the
nodetool repair has been running for the last 15 hours just on the first
node, and when I run nodetool compactionstats I constantly see this -

pending tasks: 3

and nothing else. Looks like it's just stuck. There's nothing substantial in
the logs as well. I also don't understand, if all these nodes are replicas of
each other, why it is that the first node has almost double the data. Any
help will be really appreciated.

Thanks
-Raj


Re: nodetool repair cassandra 0.8.4 HELP!!!

2012-04-29 Thread Raj N
I tried it on 1 column family. I believe there is a bug in 0.8* where
repair ignores the cf. I tried this multiple times on different nodes.
Every time, the disk util was going up to 80% on a 500 GB disk, and I would
eventually kill the repair. I only have 60 GB worth of data. I see this JIRA -

https://issues.apache.org/jira/browse/CASSANDRA-2324

But that says it was fixed in 0.8 beta. Is this still broken in 0.8.4?

I also don't understand why the data was inconsistent in the first place. I
read and write at LOCAL_QUORUM.

Thanks
-Raj

On Sun, Apr 29, 2012 at 2:06 AM, Watanabe Maki wrote:

> You should run repair. If the disk space is the problem, try to cleanup
> and major compact before repair.
> You can limit the streaming data by running repair for each column family
> separately.
>
> maki
>
> On 2012/04/28, at 23:47, Raj N  wrote:
>
> > I have a 6 node cassandra cluster (DC1=3, DC2=3) with 60 GB of data on
> > each node. I was bulk loading data over the weekend, but we forgot to
> > turn off the weekly nodetool repair job. As a result, repair was
> > interfering while we were bulk loading data. I canceled repair by
> > restarting the nodes, but unfortunately after the restart it looks like I
> > don't have any data on those nodes when I use list in cassandra-cli. I
> > ran repair on one of the affected nodes, but repair seems to be taking
> > forever. Disk space has almost tripled. I stopped the repair again for
> > fear of running out of disk space. After the restart, the disk space is
> > at 50% whereas the good nodes are at 25%. How should I proceed from here?
> > When I run list in cassandra-cli I do see data on the affected node, but
> > how can I be sure I have all the data? Should I run repair again? Should
> > I clean up the disk by clearing snapshots? Or should I just drop the
> > column families and bulk load the data again?
> >
> > Thanks
> > -Raj
>
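
(For reference, per-column-family repair as maki suggests; the keyspace and
column family names here are hypothetical:)

nodetool -h localhost repair MyKeyspace MyColumnFamily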


nodetool repair cassandra 0.8.4 HELP!!!

2012-04-28 Thread Raj N
I have a 6 node cassandra cluster (DC1=3, DC2=3) with 60 GB of data on each
node. I was bulk loading data over the weekend, but we forgot to turn off
the weekly nodetool repair job. As a result, repair was interfering while we
were bulk loading data. I canceled repair by restarting the nodes, but
unfortunately after the restart it looks like I don't have any data on those
nodes when I use list in cassandra-cli. I ran repair on one of the affected
nodes, but repair seems to be taking forever. Disk space has almost
tripled. I stopped the repair again for fear of running out of disk space.
After the restart, the disk space is at 50% whereas the good nodes are at
25%. How should I proceed from here? When I run list in cassandra-cli I do
see data on the affected node, but how can I be sure I have all the data?
Should I run repair again? Should I clean up the disk by clearing snapshots?
Or should I just drop the column families and bulk load the data again?

Thanks
-Raj


Repair in Cassandra 0.8.4 taking too long

2011-10-01 Thread Raj N
I had 3 nodes with strategy_options (DC1=3) in 1 DC. I added 1 more DC and 3
more nodes. I didn't set the initial token, but I ran nodetool move on the
new nodes (adding 1 to the tokens of the nodes in DC1). I updated the
keyspace to strategy_options (DC1=3, DC2=3). Then I started running nodetool
repair on each of the nodes. Before I started repair, each node had around 5
GB of data. I started on the new nodes. 2 of the nodes completed the repair
in 2 hours each. During the repair I saw the data grow to almost 25 GB,
but eventually, when the repair was done, the data settled at around 9 GB. Is
this normal? The 3rd node has been running repair for a long time. It
eventually stopped, throwing an exception -
Exception in thread "main" java.rmi.UnmarshalException: Error unmarshaling
return header; nested exception is:
java.io.EOFException
at
sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:209)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:142)
at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source)
at javax.management.remote.rmi.RMIConnectionImpl_Stub.invoke(Unknown
Source)
at
javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.invoke(RMIConnector.java:993)
at
javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:288)
at $Proxy0.forceTableRepair(Unknown Source)
at
org.apache.cassandra.tools.NodeProbe.forceTableRepair(NodeProbe.java:192)
at
org.apache.cassandra.tools.NodeCmd.optionalKSandCFs(NodeCmd.java:773)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:669)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at
sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:195)

I started repair again since it's safe to do so. Now the GCInspector
complains of not enough heap -
WARN [ScheduledTasks:1] 2011-10-01 13:08:16,227 GCInspector.java (line 149)
Heap is 0.7598414264960864 full.  You may need to reduce memtable and/or
cache sizes.  Cassandra will now flush up to the two largest memtables to
free up memory.  Adjust flush_largest_memtables_at threshold in
cassandra.yaml if you don't want Cassandra to do this automatically
 INFO [ScheduledTasks:1] 2011-10-01 13:08:16,227 StorageService.java (line
2398) Unable to reduce heap usage since there are no dirty column families

nodetool ring shows 48 GB of data on the node.

My Xmx is 2G. I rely on OS caching more than Row caching or key caching.
Hence the column families are created with default settings.

Any help would be appreciated.

Thanks
-Raj


Re: Off-heap Cache

2011-07-13 Thread Raj N
How do I ensure it is indeed using the SerializingCacheProvider?

Thanks
-Rajesh

On Tue, Jul 12, 2011 at 1:46 PM, Jonathan Ellis  wrote:

> You need to set row_cache_provider=SerializingCacheProvider on the
> columnfamily definition (via the cli)
>
> On Tue, Jul 12, 2011 at 9:57 AM, Raj N  wrote:
> > Do we need to do anything special to turn off-heap cache on?
> > https://issues.apache.org/jira/browse/CASSANDRA-1969
> > -Raj
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
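
(For reference, the cli change Jonathan describes plus one way to check; a
sketch with hypothetical keyspace/CF names, and note the row cache also
needs a non-zero rows_cached to do anything. In the 0.8-era cassandra-cli,
'describe keyspace' should list the row cache provider per column family:)

update column family MyCF with row_cache_provider = 'SerializingCacheProvider';
describe keyspace MyKeyspace;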


Off-heap Cache

2011-07-12 Thread Raj N
Do we need to do anything special to turn off-heap cache on?

https://issues.apache.org/jira/browse/CASSANDRA-1969

-Raj


Re: nodetool repair question

2011-07-05 Thread Raj N
I know it doesn't. But is this a valid enhancement request?

On Tue, Jul 5, 2011 at 1:32 PM, Edward Capriolo wrote:

>
>
> On Tue, Jul 5, 2011 at 1:27 PM, Raj N  wrote:
>
>> Hi experts,
>>  Are there any benchmarks that quantify how long nodetool repair
>> takes? Something which says on this kind of hardware, with this much of
>> data, nodetool repair takes this long. The other question that I have is
>> since Cassandra recommends running nodetool repair within
>> GCGracePeriodSeconds, is it possible to introduce a setting in
>> cassandra.yaml, that allows you to specify the frequency of nodetool repair
>> so that Cassandra can itself determine when to run nodetool repair instead
>> of setting up a cron job. Since Cassandra knows about all its peers, it can
>> be smart enough to also decide which nodes can run repair concurrently. For
>> example, if RF =3, and I have 6 nodes, then 2 replicas which are responsible
>> for different ranges in the ring can run repair concurrently.
>>
>> Thanks
>> -Raj
>>
>
> Currently Cassandra does not run repair automatically. You can use cron and
> 'nodetool repair' as a simple approach.
>
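
(For reference, the cron approach Edward mentions; an illustrative sketch
with hypothetical paths and schedule, staggered per node so that replicas
don't all repair at once:)

# weekly repair, e.g. Saturdays at 01:00 on this node
0 1 * * 6 /opt/cassandra/bin/nodetool -h localhost repair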


nodetool repair question

2011-07-05 Thread Raj N
Hi experts,
 Are there any benchmarks that quantify how long nodetool repair takes?
Something which says: on this kind of hardware, with this much data,
nodetool repair takes this long. The other question that I have is: since
Cassandra recommends running nodetool repair within GCGraceSeconds, is it
possible to introduce a setting in cassandra.yaml that allows you to specify
the frequency of nodetool repair, so that Cassandra can itself determine
when to run nodetool repair instead of setting up a cron job? Since
Cassandra knows about all its peers, it can be smart enough to also decide
which nodes can run repair concurrently. For example, if RF=3 and I have 6
nodes, then 2 replicas which are responsible for different ranges in the
ring can run repair concurrently.

Thanks
-Raj


Cassandra 2 DC deployment

2011-04-12 Thread Raj N
Hi experts,
     We are planning to deploy Cassandra in 2 datacenters. Let's assume there
are 3 nodes, RF=3, 2 nodes in 1 DC and 1 node in the 2nd DC. Under normal
operations, we would read and write at QUORUM. What we want to do, though, is
if we lose the datacenter which has 2 nodes, DC1 in this case, we want to
downgrade our consistency to ONE. Basically I am saying that whenever there
is a partition, prefer availability over consistency. In order to do this we
plan to catch UnavailableException and take corrective action: try QUORUM
under normal circumstances, and if unavailable, try ONE (sketched below). My
questions -
Do you guys see any flaws with this approach?
What happens when DC1 comes back up and we start reading/writing at QUORUM
again? Will we read stale data in this case?

Thanks
-Raj


RE: data deleted came back after 9 days.

2010-08-18 Thread Raj N
Guys,
Correct me if I am wrong. The whole problem is because a node missed an
update when it was down. Shouldn’t HintedHandoff take care of this case?

Thanks
-Raj

-Original Message-
From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: Wednesday, August 18, 2010 9:22 AM
To: user@cassandra.apache.org
Subject: Re: data deleted came back after 9 days.

Actually, tombstones are read repaired too -- as long as they are not
expired.  But nodetool repair is much less error-prone than relying on RR
and your memory of what deletes you issued.

Either way, you'd need to increase GCGraceSeconds first to make the
tombstones un-expired.

On Wed, Aug 18, 2010 at 12:43 AM, Benjamin Black  wrote:
> On Tue, Aug 17, 2010 at 7:49 PM, Zhong Li  wrote:
>> Those data were inserted on one node, then deleted on a remote node in
>> less than 2 seconds. So it is very possible some node lost the tombstone
>> when the connection was lost.
>> My question: can a ConsistencyLevel.ALL read retrieve the lost tombstone
>> back instead of repair?
>>
>
> No.  Read repair does not replay operations.  You must run nodetool
> repair.
>
>
> b
>



--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com