Re: Rebooted cassandra node timing out all requests but recovers after a while

2015-01-08 Thread Anand Somani
I will keep an eye out for that if it happens again. Times at this point are
synchronized.

On Wed, Jan 7, 2015 at 10:37 PM, Duncan Sands 
wrote:

> Hi Anand,
>
>
> On 08/01/15 02:02, Anand Somani wrote:
>
>> Hi,
>>
>> We have a 3-node cluster (on VMs), e.g. host1, host2, host3. One of the VMs
>> (host1) rebooted, and when host1 came up it would see the others as down and
>> the others (host2 and host3) would see it as down. So we restarted host2 and
>> now the ring seems fine (everybody sees everybody as up).
>>
>> But now the clients time out talking to host1. We have not figured out what
>> is causing it; there is nothing in the logs that indicates a problem. Looking
>> for indicators/help on what debug/tracing to turn on to find out what could
>> be causing it.
>>
>> This happens only when a VM reboots (not otherwise); it also seems to have
>> recovered by itself after some hours (or restarts), not sure which.
>>
>> This is 1.2.15; we are using SSL and Cassandra authorizers.
>>
>
> Perhaps time is not synchronized between the nodes to begin with, and
> eventually becomes synchronized.
>
> Ciao, Duncan.
>


Rebooted cassandra node timing out all requests but recovers after a while

2015-01-07 Thread Anand Somani
Hi,

We have a 3-node cluster (on VMs), e.g. host1, host2, host3. One of the VMs
(host1) rebooted, and when host1 came up it would see the others as down and
the others (host2 and host3) would see it as down. So we restarted host2 and
now the ring seems fine (everybody sees everybody as up).

But now the clients time out talking to host1. We have not figured out what is
causing it; there is nothing in the logs that indicates a problem. Looking for
indicators/help on what debug/tracing to turn on to find out what could be
causing it.

This happens only when a VM reboots (not otherwise); it also seems to have
recovered by itself after some hours (or restarts), not sure which.

This is 1.2.15; we are using SSL and Cassandra authorizers.

Thanks
Anand


Multi-dc cassandra keyspace

2014-05-16 Thread Anand Somani
Hi,

It seems like it should be possible to have a keyspace replicated to only a
subset of DCs on a cluster spanning multiple DCs. Is there anything bad about
this approach?

Scenario
Cluster spanning 4 DC's => CA, TX, NY, UT
Has multiple keyspaces such that
* "keyspace_CA_TX" - replication_strategy = {CA = 3, TX = 3}
* "keyspace_UT_NY" - replication_strategy = {UT = 3, NY = 3}
* "keyspace_CA_UT" - replication_strategy = {UT = 3, CA = 3}

I am going to try this out, but was curious if anybody out there has tried
it.

Thanks
Anand
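
For anyone trying the same thing, here is a minimal sketch of creating such a
keyspace through the 1.x thrift API, assuming NetworkTopologyStrategy and DC
names that match what the snitch reports (host, port, and the keyspace name
are illustrative):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.CfDef;
    import org.apache.cassandra.thrift.KsDef;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class CreateSubsetKeyspace {
        public static void main(String[] args) throws Exception {
            TFramedTransport transport =
                    new TFramedTransport(new TSocket("localhost", 9160));
            transport.open();
            Cassandra.Client client =
                    new Cassandra.Client(new TBinaryProtocol(transport));

            // Replicate only to CA and TX; UT and NY hold no replicas.
            Map<String, String> options = new HashMap<String, String>();
            options.put("CA", "3");
            options.put("TX", "3");

            KsDef ks = new KsDef("keyspace_CA_TX",
                    "org.apache.cassandra.locator.NetworkTopologyStrategy",
                    Collections.<CfDef>emptyList());
            ks.setStrategy_options(options);
            client.system_add_keyspace(ks);
            transport.close();
        }
    }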


Re: Cassandra Client authentication and system table replication question

2014-04-29 Thread Anand Somani
Correction: the credentials are stored in the system_auth keyspace, so is it
OK/recommended to change the replication factor of that keyspace?


On Tue, Apr 29, 2014 at 10:41 PM, Anand Somani  wrote:

> Hi
>
> We have enabled Cassandra client authentication and have set a new user/pass
> per keyspace. As I understand it, the user/pass is stored in a system table;
> do we need to change the replication factor of that table so this data is
> replicated? The cluster is going to be multi-DC.
>
> Thanks
> Anand
>
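
On 1.2 the usual route for this is ALTER KEYSPACE from cqlsh; for completeness,
a hedged thrift-side sketch of the same change is below, assuming the server
permits updating system_auth this way (the DC names are illustrative, and a
repair of system_auth on each node afterwards lets existing rows propagate):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.CfDef;
    import org.apache.cassandra.thrift.KsDef;

    public class RaiseAuthReplication {
        static void raise(Cassandra.Client client) throws Exception {
            KsDef ks = client.describe_keyspace("system_auth");
            ks.setStrategy_class(
                    "org.apache.cassandra.locator.NetworkTopologyStrategy");
            Map<String, String> options = new HashMap<String, String>();
            options.put("DC1", "3");   // illustrative DC names
            options.put("DC2", "3");
            ks.setStrategy_options(options);
            // Some versions reject keyspace updates that carry CF definitions.
            ks.setCf_defs(new ArrayList<CfDef>());
            client.system_update_keyspace(ks);
        }
    }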


Cassandra Client authentication and system table replication question

2014-04-29 Thread Anand Somani
Hi

We have enabled Cassandra client authentication and have set a new user/pass
per keyspace. As I understand it, the user/pass is stored in a system table;
do we need to change the replication factor of that table so this data is
replicated? The cluster is going to be multi-DC.

Thanks
Anand


Re: Drop in node replacements.

2014-04-05 Thread Anand Somani
Have you tried nodetool rebuild for that node? I have seen that work when
repair failed.


On Wed, Apr 2, 2014 at 11:44 AM, Redmumba  wrote:

> Cassandra 1.2.15, using commodity hardware.
>
>
> On Tue, Apr 1, 2014 at 6:37 PM, Robert Coli  wrote:
>
>> On Tue, Apr 1, 2014 at 3:24 PM, Redmumba  wrote:
>>
>>> Is it possible to have true "drop in" node replacements?  For example, I
>>> have a cluster of 51 Cassandra nodes, 17 in each data center.  I had one
>>> host go down on DC3, and when it came back up, it joined the ring, etc.,
>>> but was not receiving any data.  Even after multiple restarts and forcing a
>>> repair on the entire fleet, it still holds maybe ~30MB on a cluster that is
>>> absorbing ~1.2TB a day.
>>>
>>
>> What version of Cassandra? Real hardware/network or virtual?
>>
>> =Rob
>>
>>
>


Best way to track backups/delays for cross DC replication

2013-09-04 Thread Anand Somani
Hi,

The scenario is a cluster spanning datacenters; we use LOCAL_QUORUM and want
to know when things are not getting replicated across data centers. What is
the best way to track/alert on that?

I was planning on using the HintedHandoffManager (JMX)
=> org.apache.cassandra.db:type=HintedHandoffManager countPendingHints. Are
there other metrics (maybe exposed via nodetool) I should be looking at? At
this point we are on Cassandra 1.1.6.

Thanks
Anand
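
A hedged sketch of reading that bean over JMX from Java (the 1.x JMX port
defaults to 7199; countPendingHints is invoked generically here since its
exact signature can vary by build, so verify it in jconsole first):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class PendingHintsCheck {
        public static void main(String[] args) throws Exception {
            JMXConnector jmxc = JMXConnectorFactory.connect(new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi"));
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            ObjectName hh = new ObjectName(
                    "org.apache.cassandra.db:type=HintedHandoffManager");
            // Treat the result generically; alert if it keeps growing.
            Object pending = mbs.invoke(hh, "countPendingHints",
                    new Object[0], new String[0]);
            System.out.println("Pending hints: " + pending);
            jmxc.close();
        }
    }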


Re: Linear scalability problems

2013-04-04 Thread Anand Somani
RF=3.

On Thu, Apr 4, 2013 at 7:08 AM, Cem Cayiroglu  wrote:

> What was the RF before adding nodes?
>
> Sent from my iPhone
>
> On 04 Apr 2013, at 15:12, Anand Somani  wrote:
>
> We are using a single process with multiple threads; we will look at
> client-side delays.
>
> Thanks
>
> On Wed, Apr 3, 2013 at 9:30 AM, Tyler Hobbs  wrote:
>
>> If I had to guess, I would say that your client is the bottleneck, not
>> the cluster.  Are you inserting data with multiple threads or processes?
>>
>>
>> On Wed, Apr 3, 2013 at 8:49 AM, Anand Somani wrote:
>>
>>> Hi,
>>>
>>> I am running some tests trying to scale out our application from a 3-node
>>> cluster to a 6-node cluster. The thing I observed is that with the 3-node
>>> cluster I was able to handle about 41 req/second, so I added 3 more nodes
>>> thinking it should come close to doubling, but instead it only goes up to
>>> about 47 req/second! I am probably doing something wrong that is not
>>> obvious, so I wanted some help on what stats I could/should monitor to
>>> tell me things like whether a node is getting more requests or whether the
>>> load distribution is not random enough.
>>>
>>> Note I am using direct thrift (old code base) and Cassandra 1.1.6. The
>>> data model is for storing blobs (split across columns) and has around 6
>>> CFs, RF=3, and all operations are at quorum. Also, at the end of the run
>>> nodetool ring reports the same data size.
>>>
>>> Thanks
>>> Anand
>>>
>>
>>
>>
>> --
>> Tyler Hobbs
>> DataStax <http://datastax.com/>
>>
>
>


Re: Linear scalability problems

2013-04-04 Thread Anand Somani
We are using a single process with multiple threads; we will look at
client-side delays.

Thanks

On Wed, Apr 3, 2013 at 9:30 AM, Tyler Hobbs  wrote:

> If I had to guess, I would say that your client is the bottleneck, not the
> cluster.  Are you inserting data with multiple threads or processes?
>
>
> On Wed, Apr 3, 2013 at 8:49 AM, Anand Somani  wrote:
>
>> Hi,
>>
>> I am running some tests trying to scale out our application from a 3-node
>> cluster to a 6-node cluster. The thing I observed is that with the 3-node
>> cluster I was able to handle about 41 req/second, so I added 3 more nodes
>> thinking it should come close to doubling, but instead it only goes up to
>> about 47 req/second! I am probably doing something wrong that is not
>> obvious, so I wanted some help on what stats I could/should monitor to tell
>> me things like whether a node is getting more requests or whether the load
>> distribution is not random enough.
>>
>> Note I am using direct thrift (old code base) and Cassandra 1.1.6. The
>> data model is for storing blobs (split across columns) and has around 6
>> CFs, RF=3, and all operations are at quorum. Also, at the end of the run
>> nodetool ring reports the same data size.
>>
>> Thanks
>> Anand
>>
>
>
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>


Linear scalability problems

2013-04-03 Thread Anand Somani
Hi,

I am running some tests trying to scale out our application from a 3-node
cluster to a 6-node cluster. The thing I observed is that with the 3-node
cluster I was able to handle about 41 req/second, so I added 3 more nodes
thinking it should come close to doubling, but instead it only goes up to
about 47 req/second! I am probably doing something wrong that is not obvious,
so I wanted some help on what stats I could/should monitor to tell me things
like whether a node is getting more requests or whether the load distribution
is not random enough.

Note I am using direct thrift (old code base) and Cassandra 1.1.6. The data
model is for storing blobs (split across columns) and has around 6 CFs, RF=3,
and all operations are at quorum. Also, at the end of the run nodetool ring
reports the same data size.

Thanks
Anand
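
On the client-bottleneck point raised in the replies: thrift connections are
not thread-safe, so each writer thread needs its own socket and client. A
minimal load-generator sketch along those lines (keyspace and CF names are
illustrative; the Column setters shown are the 1.x thrift style):

    import java.nio.ByteBuffer;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.Column;
    import org.apache.cassandra.thrift.ColumnParent;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class MultiThreadedInsert {
        public static void main(String[] args) throws Exception {
            final int threads = 16, perThread = 1000;
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int t = 0; t < threads; t++) {
                final int id = t;
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            // One connection per thread: thrift clients must not be shared.
                            TFramedTransport tr = new TFramedTransport(
                                    new TSocket("localhost", 9160));
                            tr.open();
                            Cassandra.Client client =
                                    new Cassandra.Client(new TBinaryProtocol(tr));
                            client.set_keyspace("ks1");
                            ColumnParent parent = new ColumnParent("blobs");
                            for (int i = 0; i < perThread; i++) {
                                Column col = new Column(
                                        ByteBuffer.wrap("chunk-00000".getBytes()))
                                        .setValue(new byte[8192])
                                        .setTimestamp(System.currentTimeMillis() * 1000);
                                client.insert(
                                        ByteBuffer.wrap(("row-" + id + "-" + i).getBytes()),
                                        parent, col, ConsistencyLevel.QUORUM);
                            }
                            tr.close();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }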


Re: upgrade from 0.8.5 to 1.1.6, now it cannot find schema

2012-12-16 Thread Anand Somani
Thanks for the hint. But we gave up on trying to upgrade with no data loss,
since it was not required at this point.




On Wed, Dec 12, 2012 at 5:03 PM, aaron morton wrote:

> *in-vm cassandra*
>
> Embedded ?
>
> The location of the SSTables has changed in 1.1; they are now in
> /var/lib/cassandra/data/KS_NAME/CF_NAME/SSTable.data. Is the data in the
> right place?
>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 13/12/2012, at 6:54 AM, Anand Somani  wrote:
>
> Hi,
>
> We have a service which uses *in-vm Cassandra* and creates the schema
> programmatically if it does not exist; this has worked for us for some time
> (including upgrades to the service), and we have been using 0.8.5.
>
> Now we are testing the upgrade to 1.1.6 and noticed that on upgrade
> Cassandra fails to find the old schema and wants to create a new one! Even
> using the cli it does not show the old schema.
>
> Has anybody come across this? FYI, we have another cluster where we run
> Cassandra as a separate process and the schema is also created outside
> using the cli; the upgrade there went fine!
>
> Has anybody seen this behavior? Any clues?
>
> I am going to look at creating the schema from outside, to see if that is
> the culprit, but wanted to see if anybody had any suggestions/thoughts.
>
> Thanks
> Anand
>
>
>


upgrade from 0.8.5 to 1.1.6, now it cannot find schema

2012-12-12 Thread Anand Somani
Hi,

We have a service which uses *in-vm Cassandra* and creates the schema
programmatically if it does not exist; this has worked for us for some time
(including upgrades to the service), and we have been using 0.8.5.

Now we are testing the upgrade to 1.1.6 and noticed that on upgrade Cassandra
fails to find the old schema and wants to create a new one! Even using the cli
it does not show the old schema.

Has anybody come across this? FYI, we have another cluster where we run
Cassandra as a separate process and the schema is also created outside using
the cli; the upgrade there went fine!

Has anybody seen this behavior? Any clues?

I am going to look at creating the schema from outside, to see if that is the
culprit, but wanted to see if anybody had any suggestions/thoughts.

Thanks
Anand


Monitoring question for a multi DC (active/standby) configuration

2011-10-27 Thread Anand Somani
Hi,

I have a requirement for a multi-DC, low-latency application. This will be in
an active/standby setup, so I am planning on using LOCAL_QUORUM for writes.
Now, given a hard requirement limiting the maximum loss of data (on a DC
destruction) to some minutes:

   - In Cassandra, what is the recommended approach to monitor the cluster
   so that I can find out when I am going to miss my SLA (of data loss)? Is
   this even possible? Can this be monitored in Cassandra (messages dropped,
   or something) or does it have to be done outside?
   - I am also open to other suggestions, if there is another way of doing
   active/standby with durability guarantees. Or is this not possible and
   should I use EACH_QUORUM?

I found a few emails on active/standby, but nothing about monitoring, so if
this problem has already been discussed and a solution suggested, please
forward the link.

Thanks
Anand


Re: cassandra crashed while repairing, leave node size X3

2011-09-18 Thread Anand Somani
In my tests I have seen repair sometimes take a lot of space (2-3x); cleanup
did not reclaim it, and the only way I could clean it up was a major
compaction.

On Sun, Sep 18, 2011 at 6:51 PM, Yan Chunlu  wrote:

> While doing repair on node3 the "Load" kept increasing; suddenly Cassandra
> encountered an OOM and the "Load" stopped at 140GB. After Cassandra came
> back, I tried nodetool cleanup but it does not seem to be working.
>
> Does node repair generate many temp sstables? How do I get rid of them?
> Thanks!
>
> Address  Status  State    Load       Owns    Token
>                                              113427455640312821154458202477256070484
> node1    Up      Normal   43 GB      33.33%  0
> node2    Up      Normal   59.52 GB   33.33%  56713727820156410577229101238628035242
> node3    Down    Normal   142.57 GB  33.33%  113427455640312821154458202477256070484
>


Re: [BETA RELEASE] Apache Cassandra 1.0.0-beta1 released

2011-09-15 Thread Anand Somani
So I should be able to do a rolling upgrade from 0.7 to 1.0 (not mentioned in
the release notes, but I assume that is work in progress)?

Thanks

On Thu, Sep 15, 2011 at 1:36 PM, amulya rattan wrote:

> Isn't this "leveldb" the implementation of Google's LevelDB?
> http://code.google.com/p/leveldb/
> From what I know, it's quite fast.
>
>
> On Thu, Sep 15, 2011 at 4:04 PM, mcasandra  wrote:
>
>> This is great news! Is it possible to do a write-up of the main changes,
>> like "leveldb", and explain them a little bit? I get lost reading JIRA and
>> sometimes it is difficult to follow the thread. It looks like there are
>> some major changes in this release.
>>
>>
>
>


Re: Configuring multi DC cluster

2011-09-15 Thread Anand Somani
You are right, good catch, thanks!

On Thu, Sep 15, 2011 at 8:28 AM, Konstantin Naryshkin
wrote:

> Wait, his nodes are going SC, SC, AT, AT. Shouldn't they go SC, AT, SC, AT?
> By which I mean that if he adds another node to the ring (or lowers the
> replication factor), he will have a node that is under-utilized. The rings
> in his data centers have the tokens:
> SC: 0, 1
> AT: 85070591730234615865843651857942052864,
> 85070591730234615865843651857942052865
>
> They should be:
> SC: 0, 85070591730234615865843651857942052864
> AT: 1, 85070591730234615865843651857942052865
>
> Or did I forget/misread something?
>
> - Original Message -
> From: "aaron morton" 
> To: user@cassandra.apache.org
> Sent: Tuesday, September 13, 2011 6:19:16 PM
> Subject: Re: Configuring multi DC cluster
>
> Looks good to me. Last time I checked the Partitioner did not take the DC
> into consideration https://issues.apache.org/jira/browse/CASSANDRA-3047
>
>
> Good luck.
>
>
>
>
>
>
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
>
> On 14/09/2011, at 8:41 AM, Anand Somani wrote:
>
>
> Hi,
>
> Just trying to setup a cluster of 4 nodes for multiDC scenario - with 2
> nodes in each DC. This is all on the same box just for testing the
> configuration aspect. I have configured things as
>
>    - PropertyFile
>      127.0.0.4=SC:rack1
>      127.0.0.5=SC:rack2
>      127.0.0.6=AT:rack1
>      127.0.0.7=AT:rack2
>      # default for unknown nodes
>      default=SC:rack1
>    - Setup initial tokens as advised
>    - configured keyspace with SC:2, AT:2
>    - ring looks like:
>
>      Address     Status  State   Load       Owns    Token
>                                                     85070591730234615865843651857942052865
>      127.0.0.4   Up      Normal  464.98 KB  50.00%  0
>      127.0.0.5   Up      Normal  464.98 KB  0.00%   1
>      127.0.0.6   Up      Normal  464.99 KB  50.00%  85070591730234615865843651857942052864
>      127.0.0.7   Up      Normal  464.99 KB  0.00%   85070591730234615865843651857942052865
>
> Is that what I should expect the ring to look like? Is there anything else
> I should be testing/validating to make sure that things are configured
> correctly for NTS?
>
> Thanks
> Anand
>
>
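
The interleaving is easy to mechanize: compute evenly spaced tokens over the
RandomPartitioner range (0 to 2^127) for one DC, then offset each additional
DC by a small constant. A sketch that reproduces the corrected layout above:

    import java.math.BigInteger;

    public class MultiDcTokens {
        public static void main(String[] args) {
            // RandomPartitioner token space is [0, 2^127).
            BigInteger range = BigInteger.valueOf(2).pow(127);
            String[] dcs = {"SC", "AT"};
            int nodesPerDc = 2;
            for (int d = 0; d < dcs.length; d++) {
                for (int i = 0; i < nodesPerDc; i++) {
                    // Evenly spaced within the DC, offset by d to keep tokens unique.
                    BigInteger token = range.multiply(BigInteger.valueOf(i))
                            .divide(BigInteger.valueOf(nodesPerDc))
                            .add(BigInteger.valueOf(d));
                    System.out.println(dcs[d] + ": " + token);
                }
            }
        }
    }

This prints 0 and 85070591730234615865843651857942052864 for SC, and 1 and
85070591730234615865843651857942052865 for AT, i.e. the SC, AT, SC, AT
interleaving described above.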


Re: what's the difference between repair CF separately and repair the entire node?

2011-09-14 Thread Anand Somani
On Tue, Sep 13, 2011 at 3:57 PM, Peter Schuller  wrote:

> > I think it is a serious problem since I cannot "repair". I am
> > using Cassandra on production servers. Is there some way to fix it
> > without upgrading? I have heard that 0.8.x is still not quite ready for
> > a production environment.
>
> It is a serious issue if you really need to repair one CF at a time.
>
Why is it serious to do repair one CF at a time? If I cannot do it at a CF
level, does that mean I cannot use more than 50% of the disk space? Is this
specific to this problem or is that a general statement? I ask because I am
planning on doing this so I can limit the max disk overhead to one CF's worth
(+ some factor). I am going to be testing this in the next couple of weeks or
so.

> However, looking at your original post it seems this is not
> necessarily your issue. Do you need to, or was your concern rather the
> overall time repair took?
>
> There are other things that are improved in 0.8 compared to 0.7. In
> particular, (1) in 0.7 compaction, including the validating compactions
> that are part of repair, is non-concurrent, so if your repair starts
> while there is a long-running compaction going it will have to wait,
> and (2) semi-related is that the merkle tree calculation that is part
> of repair/anti-entropy may happen "out of sync" if one of the nodes
> participating happens to be busy with compaction. This in turn causes
> additional data to be sent as part of repair.
>
> That might be why your immediately following repair took a long time,
> but it's difficult to tell.
>
> If you're having issues with repair and large data sets, I would
> generally say that upgrading to 0.8 is recommended. However, if you're
> on 0.7.4, beware of
> https://issues.apache.org/jira/browse/CASSANDRA-3166
>
> --
> / Peter Schuller (@scode on twitter)
>


Configuring multi DC cluster

2011-09-13 Thread Anand Somani
Hi,

Just trying to setup a cluster of 4 nodes for multiDC scenario - with 2
nodes in each DC. This is all on the same box just for testing the
configuration aspect. I have configured things as

   - PropertyFile
     127.0.0.4=SC:rack1
     127.0.0.5=SC:rack2
     127.0.0.6=AT:rack1
     127.0.0.7=AT:rack2
     # default for unknown nodes
     default=SC:rack1
   - Setup initial tokens as advised
   - configured keyspace with SC:2, AT:2
   - ring looks like:

     Address     Status  State   Load       Owns    Token
                                                    85070591730234615865843651857942052865
     127.0.0.4   Up      Normal  464.98 KB  50.00%  0
     127.0.0.5   Up      Normal  464.98 KB  0.00%   1
     127.0.0.6   Up      Normal  464.99 KB  50.00%  85070591730234615865843651857942052864
     127.0.0.7   Up      Normal  464.99 KB  0.00%   85070591730234615865843651857942052865

Is that what I should expect the ring to look like? Is there anything else I
should be testing/validating to make sure that things are configured
correctly for NTS?

Thanks
Anand


Re: StorageProxy Mbean not exposed in 0.7.8 anymore

2011-09-13 Thread Anand Somani
yes, I see it now.

Thx

On Tue, Sep 13, 2011 at 10:11 AM, Nick Bailey  wrote:

> The StorageProxyMBean should still be exposed. It won't exist until some
> reads or writes are performed on the cluster though. You may need to
> actually do some reads/writes to see it show up.
>
>
> On Tue, Sep 13, 2011 at 11:53 AM, Anand Somani wrote:
>
>> Hi,
>>
>> Upgraded from 0.7.4 to 0.7.8 and noticed that StorageProxy (under
>> cassandra.db) is no longer exposed; is that intentional? So the question
>> is: are these covered somewhere else?
>>
>> Thanks
>> Anand
>>
>>
>>
>


StorageProxy Mbean not exposed in 0.7.8 anymore

2011-09-13 Thread Anand Somani
Hi,

Upgraded from 0.7.4 to 0.7.8 and noticed that StorageProxy (under
cassandra.db) is no longer exposed; is that intentional? So the question is:
are these covered somewhere else?

Thanks
Anand


Re: Question on using consistency level with NetworkTopologyStrategy

2011-09-09 Thread Anand Somani
Oh yes, that is cool. I see it from the code now (I was reading it
incorrectly).

So QUORUM with NTS would give me 3 copies across the cluster, not necessarily
2 local and 1 remote, but for the most part that would be true since the WAN
adds latency.
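
(Concretely: with RF = 2 + 2 = 4 replicas in total, QUORUM = (4 / 2) + 1 = 3
acknowledgements, from whichever replicas respond first.)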

Thanks


On Thu, Sep 8, 2011 at 3:40 PM, Jonathan Ellis  wrote:

> CL.QUORUM is supported with any replication strategy, not just simple.
>
> Also, Cassandra's optimizing of cross-DC writes only requires that it
> know (via a correctly configured Snitch) where each node is located.
> It is not affected by replication strategy choice.
>
> On Thu, Sep 8, 2011 at 3:14 PM, Anand Somani  wrote:
> > Hi,
> >
> > Have a requirement where data is spread across multiple DCs for disaster
> > recovery. So I would use NTS, that is clear, but I have some questions
> > with this scenario:
> >
> > I have 2 Data Centers.
> > RF - 2 (active DC), 2 (passive DC).
> > With NTS, the consistency level options are LOCAL_QUORUM and EACH_QUORUM.
> > I want LOCAL_QUORUM plus 1 remote copy (not 2) for a write to succeed; if
> > I used EACH_QUORUM it would mean that I need both the remote nodes up (as
> > I understand from http://www.datastax.com/docs/0.8/operations/datacenter).
> >
> > So if that is my requirement, what consistency level should I be using
> > for my writes? Is that even possible with NTS or another strategy? I
> > could use SimpleStrategy with QUORUM, but that would mean sending 2
> > copies to the remote DC (instead of the 1 per DC that NTS uses to
> > optimize WAN traffic), since it does not understand DCs.
> >
> > Thanks
> > Anand
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>


Anybody out there using 0.8 in production

2011-09-08 Thread Anand Somani
Hi

Currently we are using 0.7.4, and I was wondering if I should upgrade to
0.7.8/9 or move to 0.8. Is anybody using 0.8 in production, and what is their
experience?

Thanks


Question on using consistency level with NetworkTopologyStrategy

2011-09-08 Thread Anand Somani
Hi,

Have a requirement where data is spread across multiple DCs for disaster
recovery. So I would use NTS, that is clear, but I have some questions with
this scenario:

   - I have 2 Data Centers
   - RF - 2 (active DC), 2 (passive DC)
   - With NTS, the consistency level options are LOCAL_QUORUM and EACH_QUORUM
   - I want LOCAL_QUORUM plus 1 remote copy (not 2) for a write to succeed;
   if I used EACH_QUORUM it would mean that I need both the remote nodes up
   (as I understand from
   http://www.datastax.com/docs/0.8/operations/datacenter).

So if that is my requirement, what consistency level should I be using for my
writes? Is that even possible with NTS or another strategy? I could use
SimpleStrategy with QUORUM, but that would mean sending 2 copies to the remote
DC (instead of the 1 per DC that NTS uses to optimize WAN traffic), since it
does not understand DCs.

Thanks
Anand


What are the things to watch out for with big nodes

2011-08-28 Thread Anand Somani
Hi,

If I have a cluster with 15-20TB nodes, some things that I know will be
potential problems are:

   - Compactions taking longer
   - Higher read latencies
   - Long time for adding/removing nodes

What are other things that can be problematic with big nodes?

Regards
Anand


Re: Commit log fills up in less than a minute

2011-08-26 Thread Anand Somani
Sure, I can fill in the ticket. Here is what I have noticed so far: the count
of HH is not going up, which is good. I think what must have happened is that
after I restarted the cluster no new hints were added; just the old ones are
still around and not cleaned up. Is that possible? I cannot say for sure,
since I only looked at this JMX bean about 36 hours after the restart.

Can I just clean this up using the JMX call? I do not want to turn off HH,
since it handles intermittent network hiccups well, right?
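
For reference, a hedged sketch of poking that bean from Java: it first lists
the operations the build actually exposes, since names vary by version, and
then invokes deleteHintsForEndpoint, which 0.7-era builds expose (0.7 defaults
to JMX on port 8080; the host and the IP to clean are illustrative):

    import javax.management.MBeanOperationInfo;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class HintCleanup {
        public static void main(String[] args) throws Exception {
            JMXConnector jmxc = JMXConnectorFactory.connect(new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi"));
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            ObjectName hh = new ObjectName(
                    "org.apache.cassandra.db:type=HintedHandoffManager");

            // List the operations this build actually exposes first.
            for (MBeanOperationInfo op : mbs.getMBeanInfo(hh).getOperations())
                System.out.println(op.getName());

            // Drop the hints stored for a phantom IP (illustrative address).
            mbs.invoke(hh, "deleteHintsForEndpoint",
                    new Object[]{"10.0.0.99"}, new String[]{"java.lang.String"});
            jmxc.close();
        }
    }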

On Thu, Aug 25, 2011 at 2:47 PM, aaron morton wrote:

> Could you put together some information on this in a ticket and references
> this one https://issues.apache.org/jira/browse/CASSANDRA-3071
>
> The short term fix is to disable HH. You will still get consistent reads.
>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 25/08/2011, at 3:22 AM, Anand Somani wrote:
>
> So I have looked at the cluster from:
>
>    - Cassandra-client - describe cluster => shows correctly - 3 nodes
>    - the StorageService JMX bean => UnreachableNodes shows 0
>
> If all of these show the correct ring state, why are hints being maintained?
> It looks like hints are the only way to find out about "phantom" nodes.
>
> On Wed, Aug 24, 2011 at 8:01 AM, Anand Somani wrote:
>
>> So, I restarted the cluster (not rolling), but it is still maintaining
>> hints for IPs that are no longer part of the ring. nodetool ring shows
>> things correctly (as only 3 nodes). When I check through the JMX hinted
>> handoff manager, it shows it is maintaining hints for those non-existent
>> IPs. So the questions are:
>>  - How can I remove these IPs permanently, so hints do not get saved?
>>  - Also, not all nodes see the same list of IPs.
>>
>>
>>
>>
>> On Sun, Aug 21, 2011 at 3:10 PM, aaron morton wrote:
>>
>>> Yup, you can check what HH is doing via JMX.
>>>
>>> There is a bug in 0.7 that can result in log files not being deleted:
>>> https://issues.apache.org/jira/browse/CASSANDRA-2829
>>>
>>> Cheers
>>>
>>>  -
>>> Aaron Morton
>>> Freelance Cassandra Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 22/08/2011, at 4:56 AM, Anand Somani wrote:
>>>
>>> We have a lot of space on /data, and from file timestamps it looks like
>>> it was flushing data fine.
>>>
>>> We did have a bit of a goof-up with IPs when bringing up a down node (and
>>> the commit files have been around since then). I wonder if that is what
>>> triggered it and we have a bunch of hinted handoffs being backed up.
>>>
>>> For hinted handoff, how do I check if the nodes are collecting hints (I
>>> do have it turned on)? I noticed the console bean HintedHandoffManager;
>>> is that the only way to find out?
>>>
>>> On Sun, Aug 21, 2011 at 9:20 AM, Peter Schuller <
>>> peter.schul...@infidyne.com> wrote:
>>>
>>>> > When does the actual commit-data file get deleted.
>>>> >
>>>> > The flush interval on all my memtables is 60 minutes
>>>>
>>>> They *should* be getting deleted when they no longer contain any data
>>>> that has not been flushed to disk. Are flushes definitely still
>>>> happening? Is it possible flushing has started failing (e.g. out of
>>>> disk)?
>>>>
>>>> The only way I can think of other nodes directly affecting the commit
>>>> log size on your node would be e.g. hinted handoff resulting in a burst
>>>> of writes.
>>>>
>>>> --
>>>> / Peter Schuller (@scode on twitter)
>>>>
>>>
>>>
>>>
>>
>
>


Re: Commit log fills up in less than a minute

2011-08-24 Thread Anand Somani
So I have looked at the cluster from:

   - Cassandra-client - describe cluster => shows correctly - 3 nodes
   - the StorageService JMX bean => UnreachableNodes shows 0

If all of these show the correct ring state, why are hints being maintained?
It looks like hints are the only way to find out about "phantom" nodes.

On Wed, Aug 24, 2011 at 8:01 AM, Anand Somani  wrote:

> So, I restarted the cluster (not rolling), but it is still maintaining
> hints for IPs that are no longer part of the ring. nodetool ring shows
> things correctly (as only 3 nodes). When I check through the JMX hinted
> handoff manager, it shows it is maintaining hints for those non-existent
> IPs. So the questions are:
>  - How can I remove these IPs permanently, so hints do not get saved?
>  - Also, not all nodes see the same list of IPs.
>
>
>
>
> On Sun, Aug 21, 2011 at 3:10 PM, aaron morton wrote:
>
>> Yup, you can check what HH is doing via JMX.
>>
>> There is a bug in 0.7 that can result in log files not being deleted:
>> https://issues.apache.org/jira/browse/CASSANDRA-2829
>>
>> Cheers
>>
>>  -
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 22/08/2011, at 4:56 AM, Anand Somani wrote:
>>
>> We have a lot of space on /data, and from file timestamps it looks like it
>> was flushing data fine.
>>
>> We did have a bit of a goof-up with IPs when bringing up a down node (and
>> the commit files have been around since then). I wonder if that is what
>> triggered it and we have a bunch of hinted handoffs being backed up.
>>
>> For hinted handoff, how do I check if the nodes are collecting hints (I do
>> have it turned on)? I noticed the console bean HintedHandoffManager; is
>> that the only way to find out?
>>
>> On Sun, Aug 21, 2011 at 9:20 AM, Peter Schuller <
>> peter.schul...@infidyne.com> wrote:
>>
>>> > When does the actual commit-data file get deleted.
>>> >
>>> > The flush interval on all my memtables is 60 minutes
>>>
>>> They *should* be getting deleted when they no longer contain any data
>>> that has not been flushed to disk. Are flushes definitely still
>>> happening? Is it possible flushing has started failing (e.g. out of
>>> disk)?
>>>
>>> The only way I can think of other nodes directly affecting the commit
>>> log size on your node would be e.g. hinted handoff resulting in a burst
>>> of writes.
>>>
>>> --
>>> / Peter Schuller (@scode on twitter)
>>>
>>
>>
>>
>


Re: Commit log fills up in less than a minute

2011-08-24 Thread Anand Somani
So, I restarted the cluster (not rolling), but it is still maintaining hints
for IPs that are no longer part of the ring. nodetool ring shows things
correctly (as only 3 nodes). When I check through the JMX hinted handoff
manager, it shows it is maintaining hints for those non-existent IPs. So the
questions are:
 - How can I remove these IPs permanently, so hints do not get saved?
 - Also, not all nodes see the same list of IPs.



On Sun, Aug 21, 2011 at 3:10 PM, aaron morton wrote:

> Yup, you can check what HH is doing via JMX.
>
> There is a bug in 0.7 that can result in log files not being deleted:
> https://issues.apache.org/jira/browse/CASSANDRA-2829
>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 22/08/2011, at 4:56 AM, Anand Somani wrote:
>
> We have a lot of space on /data, and from file timestamps it looks like it
> was flushing data fine.
>
> We did have a bit of a goof-up with IPs when bringing up a down node (and
> the commit files have been around since then). I wonder if that is what
> triggered it and we have a bunch of hinted handoffs being backed up.
>
> For hinted handoff, how do I check if the nodes are collecting hints (I do
> have it turned on)? I noticed the console bean HintedHandoffManager; is
> that the only way to find out?
>
> On Sun, Aug 21, 2011 at 9:20 AM, Peter Schuller <
> peter.schul...@infidyne.com> wrote:
>
>> > When does the actual commit-data file get deleted.
>> >
>> > The flush interval on all my memtables is 60 minutes
>>
>> They *should* be getting deleted when they no longer contain any data
>> that has not been flushed to disk. Are flushes definitely still
>> happening? Is it possible flushing has started failing (e.g. out of
>> disk)?
>>
>> The only way I can think of other nodes directly affecting the commit
>> log size on your node would be e.g. hinted handoff resulting in a burst
>> of writes.
>>
>> --
>> / Peter Schuller (@scode on twitter)
>>
>
>
>


Re: Commit log fills up in less than a minute

2011-08-21 Thread Anand Somani
We have a lot of space on /data, and from file timestamps it looks like it
was flushing data fine.

We did have a bit of a goof-up with IPs when bringing up a down node (and the
commit files have been around since then). I wonder if that is what triggered
it and we have a bunch of hinted handoffs being backed up.

For hinted handoff, how do I check if the nodes are collecting hints (I do
have it turned on)? I noticed the console bean HintedHandoffManager; is that
the only way to find out?

On Sun, Aug 21, 2011 at 9:20 AM, Peter Schuller  wrote:

> > When does the actual commit-data file get deleted.
> >
> > The flush interval on all my memtables is 60 minutes
>
> They *should* be getting deleted when they no longer contain any data
> that has not been flushed to disk. Are flushes definitely still
> happening? Is it possible flushing has started failing (e.g. out of
> disk)?
>
> The only way I can think of other nodes directly affecting the commit
> log size on your node would be e.g. hinted handoff resulting in a burst
> of writes.
>
> --
> / Peter Schuller (@scode on twitter)
>


Re: Commit log fills up in less than a minute

2011-08-21 Thread Anand Somani
So no, it did not fill in a minute, but tons of header files were written in
a minute (is that normal? I assume these are marker files which get written
when memtables are flushed). The actual data files have been around for the
last 24 hours.
Somehow this all seems connected to the "reintroduce node" exercise I went
through on Saturday. Can this somehow get worse with cleanup, hinted handoff,
etc.?

When does the actual commit-data file get deleted?

The flush interval on all my memtables is 60 minutes.

Thanks

On Sun, Aug 21, 2011 at 8:43 AM, Anand Somani  wrote:

> Hi,
>
> 0.7.4, 3-node cluster, RF=3
>
> Load has not changed much; on 2 of the 3 nodes the commit log filled up in
> less than a minute (did not give it a chance to recover). We have been
> running this cluster for about 2-3 months without any problem. At this
> point I do not see any unusual load (I continue to investigate). The commit
> log has never taken more than 20% before this!
>
> The only change was the re-introduction of a node a couple of days ago,
> with a bunch of cleanup on each node. Has anybody seen anything like this?
>
> I am starting the entire cluster again; are there any hints on things I
> should watch for (besides the commit log)? I have backed up the commit logs
> and the logs, so any tips on how to do an analysis of this would help.
>
> Thanks
> Anand
>


Commit log fills up in less than a minute

2011-08-21 Thread Anand Somani
Hi,

0.7.4, 3-node cluster, RF=3

Load has not changed much; on 2 of the 3 nodes the commit log filled up in
less than a minute (did not give it a chance to recover). We have been running
this cluster for about 2-3 months without any problem. At this point I do not
see any unusual load (I continue to investigate). The commit log has never
taken more than 20% before this!

The only change was the re-introduction of a node a couple of days ago, with a
bunch of cleanup on each node. Has anybody seen anything like this?

I am starting the entire cluster again; are there any hints on things I should
watch for (besides the commit log)? I have backed up the commit logs and the
logs, so any tips on how to do an analysis of this would help.

Thanks
Anand


Re: Re: Urgent:!! Re: Need to maintenance on a cassandra node, are there problems with this process

2011-08-20 Thread Anand Somani
Thanks for the help; this seems to have worked. Except that while adding the
new node we assigned the same token to a different IP (an operational script
goof-up) and brought the node up, so the other nodes just had the message
that a new IP had taken over the token.


   - So we brought it down and fixed it, and it all came up fine.
   - Ran removetoken, which did not finish.
   - So we ran removetoken force, which seemed to work.
   - Cleaned up the nodes.
   - Everything from the ring perspective appeared OK on all nodes,
   except for this error message (which, based on some thread, it seemed
   would go away), reported in this thread =>
   http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/0-7-4-Replication-assertion-error-after-removetoken-removetoken-force-and-a-restart-td6311082.html
   - So I restarted the one node that was complaining (this was not the
   node that was replaced).
   - But once this node was restarted, the ring command on it showed the old
   IP for the single token (the one we removed).
   - So I am running the removetoken again; it has been running for about
   2-3 hours now.

The ring shows:

                                              113427455640312821154458202477256070485
10.xxx.0.184   Up     Normal   829.73 GB   33.33%  0
10.xxx.0.185   Up     Normal   576.09 GB   33.33%  56713727820156410577229101238628035241
10.xxx.0.189   Down   Leaving  139.73 KB   0.00%   56713727820156410577229101238628035242
10.xxx.0.188   Up     Normal   697.41 GB   33.33%  113427455640312821154458202477256070485

What are my choices here, and how do I clean up the ring? The other 2 nodes
show the ring fine (they are not even aware of 189).

Thanks
Anand


On Fri, Aug 19, 2011 at 11:53 AM, Anand Somani  wrote:

> ok I will go with the IP change strategy and keep you posted. Not going to
> manually copy any data, just bring up the node and let it bootstrap.
>
> Thanks
>
>
> On Fri, Aug 19, 2011 at 11:46 AM, Peter Schuller <
> peter.schul...@infidyne.com> wrote:
>
>> > (Yes, this should definitely be easier. Maybe the most generally
>> > useful fix would be for Cassandra to support a node joining the ring
>> > in "write-only" mode. This would be useful in other cases, such as
>> > when you're trying to temporarily off-load a node by disabling
>> > gossip.)
>>
>> I knew I had read discussions before:
>>
>>   https://issues.apache.org/jira/browse/CASSANDRA-2568
>>
>> --
>> / Peter Schuller (@scode on twitter)
>>
>
>


Re: 0.7.4: Replication assertion error after removetoken, removetoken force and a restart

2011-08-20 Thread Anand Somani
0.7.4 / 3-node cluster / RF=3 / quorum reads and writes

After I re-introduced a corrupted node, I followed the process listed on the
operations wiki for handling failures (thanks to the folks on the mailing
list for helping me). I am still doing a cleanup on one node at this point,
but I noticed this same exception appearing 10-12 times a minute on an
existing node (not the new one). I think it started around the removetoken.

How do I solve this? Should I just restart this node? Are there any other
cleanups/resets I need to do?

Thanks


On Thu, Apr 28, 2011 at 2:26 AM, aaron morton wrote:

> I *think* that code is used when one node tells others via gossip that it
> is removing a token that is not its own. The node that receives the
> information in gossip does some work and then replies to the first node
> with a REPLICATION_FINISHED message, which is the node I assume the error
> is happening on.
>
> Have you been doing any moves/removes or additions of tokens/nodes?
>
> Thanks
> Aaron
>
> On 28 Apr 2011, at 08:39, Alexis Lê-Quôc wrote:
>
> > Hi,
> >
> > I've been getting the following lately, every few seconds.
> >
> > 2011-04-27T20:21:18.299885+00:00 10.202.61.193 [MiscStage: 97] Error
> > in ThreadPoolExecutor
> > 2011-04-27T20:21:18.299885+00:00 10.202.61.193 java.lang.AssertionError
> > 2011-04-27T20:21:18.300038+00:00 10.202.61.193 10.202.61.193   at
> >
> org.apache.cassandra.service.StorageService.confirmReplication(StorageService.java:1872)
> > 2011-04-27T20:21:18.300038+00:00 10.202.61.193 10.202.61.193   at
> >
> org.apache.cassandra.streaming.ReplicationFinishedVerbHandler.doVerb(ReplicationFinishedVerbHandler.java:38)
> > 2011-04-27T20:21:18.300047+00:00 10.202.61.193 10.202.61.193   at
> >
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
> > 2011-04-27T20:21:18.300047+00:00 10.202.61.193 10.202.61.193   at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> > 2011-04-27T20:21:18.300055+00:00 10.202.61.193 10.202.61.193   at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> > 2011-04-27T20:21:18.300055+00:00 10.202.61.193 10.202.61.193   at
> > java.lang.Thread.run(Thread.java:636)
> > 2011-04-27T20:21:18.300555+00:00 10.202.61.193 [MiscStage: 97] Fatal
> > exception in thread Thread[MiscStage:97,5,main]
> >
> > I see it coming from
> > 32 public class ReplicationFinishedVerbHandler implements IVerbHandler
> > 33 {
> > 34 private static Logger logger =
> > LoggerFactory.getLogger(ReplicationFinishedVerbHandler.class);
> > 35
> > 36 public void doVerb(Message msg, String id)
> > 37 {
> > 38 StorageService.instance.confirmReplication(msg.getFrom());
> > 39 Message response =
> > msg.getInternalReply(ArrayUtils.EMPTY_BYTE_ARRAY);
> > 40 if (logger.isDebugEnabled())
> > 41 logger.debug("Replying to " + id + "@" + msg.getFrom());
> > 42 MessagingService.instance().sendReply(response, id,
> msg.getFrom());
> > 43 }
> > 44 }
> >
> > Before I dig deeper in the code, has anybody dealt with this before?
> >
> > Thanks,
> >
> > --
> > Alexis Lê-Quôc
>
>


Re: Re: Urgent:!! Re: Need to maintenance on a cassandra node, are there problems with this process

2011-08-19 Thread Anand Somani
OK, I will go with the IP-change strategy and keep you posted. I am not going
to manually copy any data; I will just bring up the node and let it bootstrap.

Thanks

On Fri, Aug 19, 2011 at 11:46 AM, Peter Schuller <
peter.schul...@infidyne.com> wrote:

> > (Yes, this should definitely be easier. Maybe the most generally
> > useful fix would be for Cassandra to support a node joining the ring
> > in "write-only" mode. This would be useful in other cases, such as
> > when you're trying to temporarily off-load a node by disabling
> > gossip.)
>
> I knew I had read discussions before:
>
>   https://issues.apache.org/jira/browse/CASSANDRA-2568
>
> --
> / Peter Schuller (@scode on twitter)
>


Re: Urgent:!! Re: Need to maintenance on a cassandra node, are there problems with this process

2011-08-19 Thread Anand Somani
Let me be specific on "lost data" -> we lost a replica; the other 2 nodes
have replicas.

I am running reads/writes at quorum. At this point I have stopped my clients
from talking to this node. So if that is the case, I can potentially just run
nodetool repair (without changing the IP). But would it be better if I copied
over the data/mykeyspace from another replica and then ran repair?

On Fri, Aug 19, 2011 at 11:20 AM, Peter Schuller <
peter.schul...@infidyne.com> wrote:

> > OK, so we just lost the data on that node. We are rebuilding the RAID on
> > it, but once it is up what is the best way to bring it back into the
> > cluster?
>
> You're saying the RAID failed and the data is gone?
>
> > - just let it come up and run nodetool repair
> > - copy data from another node and then run nodetool repair
> >
> > Do I still need to run repair immediately if I copy the data? I want to
> > schedule repair for later, during non-peak hours.
>
> If data is gone, the safe way is to have it re-join the cluster:
>
>   http://wiki.apache.org/cassandra/Operations#Handling_failure
>
> But note that in your case, since you've lost data (if I understand
> you), it's effectively a completely new node. That means you either
> want to switch its IP address and go for the "recommended" approach,
> or do the other option, but that WILL mean the node is serving reads
> with incorrect data, violating consistency. Depending on your
> application, this may or may not be a problem.
>
> Unless it's a major problem for you, I suggest bringing it back in
> with a new IP address and having it treated like a completely fresh
> replacement node. That probably decreases the risk of mistakes happening.
>
> As for the other stuff about repair in the e-mail you pasted; periodic
> repairs are part of regular cluster maintenance. See:
>
>   http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
>
> --
> / Peter Schuller (@scode on twitter)
>


Urgent:!! Re: Need to maintenance on a cassandra node, are there problems with this process

2011-08-19 Thread Anand Somani
OK, so we just lost the data on that node. We are rebuilding the RAID on it,
but once it is up, what is the best way to bring it back into the cluster?

   - just let it come up and run nodetool repair
   - copy data from another node and then run nodetool repair
   - do I still need to run repair immediately if I copy the data? I want to
   schedule repair for later, during non-peak hours.

Like I said, we have 500G, and are on 0.7.4 with a 3-node cluster and RF=3.


On Thu, Aug 18, 2011 at 9:42 PM, aaron morton wrote:

> You should get onto 0.7.8 while you are doing this; this is a pretty good
> reason:
> https://github.com/apache/cassandra/blob/cassandra-0.7.8/CHANGES.txt#L58
>
>  Never done a repair on this cluster before, is that a problem?
>
> Potentially.
> Repair will ensure that your data is distributed, and that deletes don't
> mysteriously come back to life:
> http://wiki.apache.org/cassandra/Operations#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSeconds
>
> Personally I would get a repair to complete before I started this process.
>
> You may want to make sure everything is compacted as best it can be
> beforehand; see some of the other threads about repair using a lot of space.
>
> * use nodetool to change the compaction threshold down to 2 for the CF's
> * trigger a minor compaction using nodetool flush
> * wait and monitor using nodetool compactionstats
>
> Then do a repair, repairing one CF at a time, starting with the smallest CF.
> Monitor disk space and
> nodetool compactionstats
> then
> nodetool netstats
>
>
> If you have the network space I would just move the files and then put them
> back….
>
> * drain
> * copy the /var/lib/cassandra/data and saved_caches dirs
> * copy the yaml
> * blast away
> * put things back in place
> * start up and run repair
>
> I know you have RF 3 and 3 nodes; I'm being cautious. If you don't have
> the space, the current approach is fine.
>
> You may want to disable Hinted Handoff while you are doing this as you are
> going to run repair anyway when the node comes back.
>
> Cheers
>
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 19/08/2011, at 11:57 AM, Anand Somani wrote:
>
> Hi,
>
> version - 0.7.4
> cluster size = 3
> RF = 3.
> data size on a node ~500G
>
> I want to do some disk maintenance on a cassandra node, so the process that
> I came up with is
>
>- drain this node
>- back up the system data space
>- rebuild the disk partition
>- copy data from another node
>- copy data from the backed up system data
>- restart node
>- run nodetool repair
>
> Is this process sane? We have never done a repair on this cluster before;
> is that a problem? Should I run it per CF? Would it help if I did this
> before bringing the node down?
>
> Any pointers, things to worry about.
>
> Thanks
> Anand
>
>
>


Fatal exception in thread Thread[RequestResponseStage......

2011-08-18 Thread Anand Somani
Hi

I am using 0.7.4 and am seeing this exception in my logs a few times a day.
Should I be worried, or is this just an intermittent network disconnect?



ERROR [RequestResponseStage:257] 2011-08-19 03:05:30,706
AbstractCassandraDaemon.java (line 112) Fatal exception in thread
Thread[RequestResponseStage:257,5,main]
java.io.IOError: java.io.EOFException
at
org.apache.cassandra.io.util.ColumnIterator.deserializeNext(ColumnSortedMap.java:246)
at
org.apache.cassandra.io.util.ColumnIterator.next(ColumnSortedMap.java:262)
at
org.apache.cassandra.io.util.ColumnIterator.next(ColumnSortedMap.java:223)
at
java.util.concurrent.ConcurrentSkipListMap.buildFromSorted(ConcurrentSkipListMap.java:1493)
at
java.util.concurrent.ConcurrentSkipListMap.<init>(ConcurrentSkipListMap.java:1443)
at
org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:363)
at
org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:311)
at
org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
at
org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120)
at org.apache.cassandra.db.RowSerializer.deserialize(Row.java:73)
at
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:112)
at
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:82)
at
org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)
at
org.apache.cassandra.service.AsyncRepairCallback.response(AsyncRepairCallback.java:45)
at
org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:49)
at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at
org.apache.cassandra.utils.ByteBufferUtil.readShortLength(ByteBufferUtil.java:278)
at
org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:289)
at
org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:66)
at
org.apache.cassandra.io.util.ColumnIterator.deserializeNext(ColumnSortedMap.java:242)
... 18 more


Need to maintenance on a cassandra node, are there problems with this process

2011-08-18 Thread Anand Somani
Hi,

version - 0.7.4
cluster size = 3
RF = 3.
data size on a node ~500G

I want to do some disk maintenance on a cassandra node, so the process that
I came up with is

   - drain this node
   - back up the system data space
   - rebuild the disk partition
   - copy data from another node
   - copy data from the backed up system data
   - restart node
   - run nodetool repair

Is this process sane? We have never done a repair on this cluster before; is
that a problem? Should I run it per CF? Would it help if I did this before
bringing the node down?

Any pointers, things to worry about.

Thanks
Anand


Problems Iterating over tokens in > 0.7.5

2011-07-05 Thread Anand Somani
Hi,

Using thrift and the get_range_slices call with a token range, with the
RandomPartitioner. I have only tried this on > 0.7.5. It used to work for me
in 0.6.4 or an earlier version, but I notice that it does not work anymore.
The need is to iterate over a token range to do some bookkeeping.
The logic is:

   1. get the TokenRanges from describe_ring
   2. and then for each range:
      1. set the start and end token
      2. get a batch of rows using get_range_slices
      3. use the last token from the batch to set the start_token and repeat
      (get the next batch); iterate until there is no more to get (or the
      last token from the new batch is the same as the last from the previous
      batch)

Now this works when, in a test, I insert n records and then use a batch size
m for iterating such that m > n. As soon as I use m < n, I get an incorrect
count or an infinite loop where the range seems to repeat.

Has anybody seen this issue, or am I using it incorrectly on newer versions of
Cassandra? I will also look at how this is done in Hector, but in the
meantime, if somebody has seen this behavior, please do respond.

Thanks
Anand
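
For comparison, a hedged sketch of the iteration loop described above
(thrift-era Java, RandomPartitioner). The partitioner's token is the absolute
value of the key's MD5 digest, and start_token is treated as exclusive, which
is what makes last-token paging work; the CF name and batch size are
illustrative:

    import java.math.BigInteger;
    import java.nio.ByteBuffer;
    import java.security.MessageDigest;
    import java.util.List;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnParent;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.KeyRange;
    import org.apache.cassandra.thrift.KeySlice;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;

    public class TokenRangeScan {
        // RandomPartitioner token of a key: absolute value of its MD5 digest.
        static BigInteger token(byte[] key) throws Exception {
            return new BigInteger(MessageDigest.getInstance("MD5").digest(key)).abs();
        }

        // Walk one TokenRange (from describe_ring) in batches of 100.
        static void scan(Cassandra.Client client, String start, String end)
                throws Exception {
            SlicePredicate pred = new SlicePredicate().setSlice_range(new SliceRange(
                    ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 1));
            ColumnParent parent = new ColumnParent("MyCF");
            String cursor = start;
            while (true) {
                KeyRange range = new KeyRange(100);      // batch size m
                range.setStart_token(cursor);
                range.setEnd_token(end);
                List<KeySlice> batch = client.get_range_slices(
                        parent, pred, range, ConsistencyLevel.QUORUM);
                if (batch.isEmpty())
                    break;
                for (KeySlice row : batch) {
                    // ... bookkeeping on row.getKey() ...
                }
                String last = token(batch.get(batch.size() - 1).getKey()).toString();
                if (last.equals(cursor))
                    break;                               // guard against a stuck cursor
                cursor = last;
            }
        }
    }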


Re: Storing Accounting Data

2011-06-21 Thread Anand Somani
Not sure it is that simple: a quorum write can fail with the writes still
having happened on some nodes (there is no rollback). Also, there is no
concept of an atomic compare-and-swap.

On Tue, Jun 21, 2011 at 2:03 PM, AJ  wrote:

> **
> On 6/21/2011 2:50 PM, Stephen Connolly wrote:
>
> how important are things like transactional consistency for you?
>
> would you have issues if only one side of a transfer was recorded?
>
>
> Right.  Both of those questions are about consistency.  Isn't the simple
> solution is to use QUORUM read/writes?
>
> Cassandra, out of the box, on its own, would not be ideal if the above
> two things are important for you.
>
> You can add components to a system to help address these things, e.g.
> ZooKeeper, etc. A reason why you might do this is if you already use
> Cassandra in your app and are trying to limit the number of databases.
>
> - Stephen
>
> ---
> Sent from my Android phone, so random spelling mistakes, random nonsense
> words and other nonsense are a direct result of using swype to type on the
> screen
> On 21 Jun 2011 18:30, "AJ"  wrote:
>
>
>


Re: how to use indexed column for this case

2011-05-20 Thread Anand Somani
From what I know, you cannot create secondary indexes on an SCF. You should
have gotten this => https://issues.apache.org/jira/browse/CASSANDRA-1813 on
index creation.

On Fri, May 20, 2011 at 6:56 AM, Monkey me wrote:

> Hi,
>  I have an SCF; the key is a string, the super column is a TimeUUID, and
> there are several columns, with one column named "type" on which I created
> a secondary index. I want the following query to fetch all super columns
> along with all columns:
>  1. given a specific key
>  2. given a range of super columns (start time to end time)
>  3. given a specific "type" value.
>
>  Is such a query possible? I could not figure out how to use
> getIndexedSlice to achieve this. Any idea? Thanks.
>
>
> Hou
>


Re: Best way to detect/fix bitrot today?

2011-02-08 Thread Anand Somani
I should have clarified: we have 3 copies, so in that case as long as 2 match
we should be OK?

Even if there were checksumming at the SSTable level, I assume it would have
to check and report these errors on compaction (or node repair)?

I have seen some JIRAs open on these issues (47 and 1717), but if I need
something today, is a read repair (or a node repair) the only viable option?



On Mon, Feb 7, 2011 at 12:09 PM, Peter Schuller  wrote:

> > Our application space is such that there is data that might not be read
> > for a long time. The data is mostly immutable. How should I approach
> > detecting/solving the bitrot problem? One approach is to read the data
> > and let read repair do the detection, but given the size of the data,
> > that does not look very efficient.
>
> Note that read repair is not really intended to repair arbitrary
> corruption. Unless I'm mistaken, with arbitrary corruption, unless it
> triggers a serialization failure that causes row skipping, it's a
> toss-up which version of the data is retained (or both, if the
> corruption is in the key). Given the same key and column timestamp,
> the tie breaker is the column value. So depending on whether
> corruption results in a "lesser" or "greater" value, you might get the
> corrupt or non-corrupt data.
>
> > Has anybody solved/workaround this or has any other suggestions to detect
> > and fix bitrot?
>
> My feel/tentative opinion is that the clean fix is for Cassandra to
> support strong checksumming at the sstable level.
>
> Deploying on e.g. ZFS would help a lot with this, but that's a problem
> for deployment on Linux (which is the recommended platform for
> Cassandra).
>
> --
> / Peter Schuller
>


Best way to detect/fix bitrot today?

2011-02-07 Thread Anand Somani
Hi,

Our application space is such that there is data that might not be read for a
long time. The data is mostly immutable. How should I approach
detecting/solving the bitrot problem? One approach is to read the data and
let read repair do the detection, but given the size of the data, that does
not look very efficient.

Has anybody solved/workaround this or has any other suggestions to detect
and fix bitrot?


Thanks
Anand


Re: Using Cassandra for storing large objects

2011-01-27 Thread Anand Somani
At this point we are not in production, in the lab only. The longest test so
far has been about 2-3 days; the data size at this point is about 2-3 TB per
node, and we have 2 nodes. We do see spikes to high response times (and
timeouts), which seem to be around the time GC kicks in. We were pushing the
system as hard as we could. Also, given our application we can do major
compactions at night, but we have not tried that on this big a data set yet.
We do still have minor compactions turned on.

On Thu, Jan 27, 2011 at 12:56 PM, Narendra Sharma  wrote:

> Thanks Anand. A few questions:
> - What is the size of the nodes (in terms of data)?
> - How long have you been running?
> - How is compaction treating you?
>
> Thanks,
> Naren
>
>
> On Thu, Jan 27, 2011 at 12:13 PM, Anand Somani wrote:
>
>> Using it for storing large immutable objects; as Aaron was suggesting, we
>> are splitting the blob across multiple columns. Also, we are reading a few
>> columns at a time (for memory considerations). Currently we have only gone
>> up to about 300-400KB size objects.
>>
>> We do have machines with 32GB memory, with 8G for Java. The row cache is
>> disabled. There is some latency that needs to be sorted out, but overall I
>> am positive. This is with 0.6.6; I am in the process of moving it to 0.7.
>>
>> On Wed, Jan 26, 2011 at 11:37 PM, Narendra Sharma <
>> narendra.sha...@gmail.com> wrote:
>>
>>> Is anyone using Cassandra for storing a large number (millions) of large
>>> (mostly immutable) objects (200KB-5MB each)? I would like to understand
>>> the experience in general, considering that Cassandra is not considered a
>>> good fit for large objects.
>>> https://issues.apache.org/jira/browse/CASSANDRA-265
>>>
>>>
>>> Thanks,
>>> Naren
>>>
>>
>>
>


Re: Using Cassandra for storing large objects

2011-01-27 Thread Anand Somani
Using it for storing large immutable objects; like Aaron was suggesting, we
are splitting the blob across multiple columns. Also, we are reading it a
few columns at a time (for memory considerations). Currently we have only
gone up to objects of about 300-400 KB.

We do have machines with 32 GB memory, with 8 GB for Java. The row cache is
disabled. There is some latency that needs to be sorted out, but overall I
am positive. This is with 0.6.6; I am in the process of moving it to 0.7.
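
A minimal sketch of the chunking described above - the 256 KB chunk size and
column-name scheme are illustrative, not what we actually run:

    import java.util.ArrayList;
    import java.util.List;

    public class BlobChunker {

        static final int CHUNK_SIZE = 256 * 1024; // illustrative

        // Split a blob into fixed-size pieces. Each piece is stored as its
        // own column (e.g. "chunk-00000", "chunk-00001", ...) under one row
        // key, so a read can pull back a few columns at a time with a
        // small-count SliceRange to bound memory use.
        static List<byte[]> split(byte[] blob) {
            List<byte[]> chunks = new ArrayList<byte[]>();
            for (int off = 0; off < blob.length; off += CHUNK_SIZE) {
                int len = Math.min(CHUNK_SIZE, blob.length - off);
                byte[] chunk = new byte[len];
                System.arraycopy(blob, off, chunk, 0, len);
                chunks.add(chunk);
            }
            return chunks;
        }
    }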

On Wed, Jan 26, 2011 at 11:37 PM, Narendra Sharma  wrote:

> Anyone using Cassandra for storing a large number (millions) of large
> (mostly immutable) objects (200 KB-5 MB each)? I would like to understand
> the experience in general, considering that Cassandra is not considered a
> good fit for large objects. https://issues.apache.org/jira/browse/CASSANDRA-265
>
>
> Thanks,
> Naren
>


Re: Embedded Cassandra server startup question

2011-01-21 Thread Anand Somani
It is a little slow, but not to the point where it concerns me (we only have
a few tests for now), and it keeps things very clean, so there are no
surprise side effects.



On Thu, Jan 20, 2011 at 6:33 PM, Roshan Dawrani wrote:

> On Fri, Jan 21, 2011 at 5:14 AM, Anand Somani wrote:
>
>> Here is what worked for me. I use TestNG, and initialize and create the
>> schema in the @BeforeClass for each test:
>>
>>    - In the @AfterClass, I had to drop the schema, otherwise I was getting
>>    the same exception.
>>    - After this I started getting a port conflict with the second test, so
>>    I added my own version of the EmbeddedCass.. class, with a stop() that
>>    calls stop on the CassandraDaemon (which, from the code comments, seems
>>    to close the thrift port)
>>
> How was this clean-up experience, Anand? Shutting down the Cassandra
> daemon and dropping and creating the schema between tests? Sounds like
> something that could be time-consuming.
>
> I am currently firing all-deletes on all my CFs and am looking for more
> efficient ways to have data cleaned up between tests.
>
> Thanks.
>


Re: Embedded Cassandra server startup question

2011-01-20 Thread Anand Somani
Here is what worked for me. I use TestNG, and initialize and create the
schema in the @BeforeClass for each test (see the sketch after this list):

   - In the @AfterClass, I had to drop the schema, otherwise I was getting
   the same exception.
   - After this I started getting a port conflict with the second test, so I
   added my own version of the EmbeddedCass.. class, with a stop() that calls
   stop on the CassandraDaemon (which, from the code comments, seems to close
   the thrift port)
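
For concreteness, the lifecycle looks roughly like this. EmbeddedServerHelper,
createSchema(), and dropSchema() are hypothetical names for our own wrappers,
not stock classes:

    import org.testng.annotations.AfterClass;
    import org.testng.annotations.BeforeClass;

    public class CassandraBackedTest {

        // Hypothetical wrapper around CassandraDaemon with an added stop()
        // that closes the thrift port.
        private EmbeddedServerHelper embedded;

        @BeforeClass
        public void startCassandra() throws Exception {
            embedded = new EmbeddedServerHelper();
            embedded.start();   // boots the daemon in-process
            createSchema();     // defines this test's keyspace/CFs
        }

        @AfterClass
        public void stopCassandra() throws Exception {
            dropSchema();       // avoids "Attempt to assign id to existing
                                // column family" on the next start
            embedded.stop();    // frees the thrift port for the next test
        }

        private void createSchema() { /* keyspace/CF definitions elided */ }
        private void dropSchema()   { /* inverse of createSchema() */ }
    }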


On Thu, Jan 20, 2011 at 1:32 PM, Aaron Morton wrote:

> Do you have a full error stack?
>
> That error is raised when the schema is added to an internal static map.
> There is a lot of static state so it's probably going to make your life
> easier if you can avoid reusing the JVM.
>
> I'm guessing your error comes from AbstractCassandraDaemon.setup() calling
> DatabaseDescriptor.loadSchemas(). It may be possible to work around this
> issue, but I don't have time today. Let me know how you get on.
>
> Aaron
>
>
> On 21/01/2011, at 12:46 AM, Roshan Dawrani 
> wrote:
>
> Hi,
>
> I am using Cassandra for a Grails application and in that I start the
> embedded server when the Spring application context gets built.
>
> When I run my Grails app test suite, it first runs the integration and
> then the functional test suite, and it builds the application context
> individually for each phase.
>
> When it brings the up the embedded Cassandra server in 2nd phase (for
> functional tests), it fails saying "*Attempt to assign id to existing
> column family.*"
>
> Anyone familiar with this error? Is it because both the test phases are
> executed in the same JVM instance and there is some Cassandra meta-data from
> phase 1 server start that is affecting the server startup in 2nd phase?
>
> Any way I can cleanly start the server 2 times in my case? Any other
> suggestion? Thanks.
>
> --
> Roshan
> Blog: 
> http://roshandawrani.wordpress.com/
> Twitter: @roshandawrani 
> Skype: roshandawrani
>
>


Re: Secondary indexes for multi-value fields

2010-12-22 Thread Anand Somani
One approach is to ask yourself questions about how you would use this
information, for example:

   - how often do you go from user to tags
   - how often would you want to go from tag->users
   - what kind of reporting would you want to do on tags, and how often
   - can multiple people add the same tag to the same user, and are those
   tags maintained separately
   - given your business, how many users do you expect
   - etc.

Depending on that, one approach might work better than the other. I have not
used indexes/non-id-based searches in Cassandra yet (I do not have that use
case), so this is just based on the time I have spent reading about it.

One approach using indexes was given by Jool; the other approach is using
reverse indexes:

   - 2 CFs - one for users and one for tags (the reverse index)
   - User - might need to have a SC (super column), with the tags and some
   information like who tagged it
   - Tag - tag mapped to columns of users
   - Advantages:
      - 1 query to find user->tags on the user CF
      - 1 query for tag->users on the tag CF (I would think this would be
      more efficient than doing user->tags through secondary indexes, since
      those will potentially hit multiple rows/nodes, unless I have
      misunderstood secondary indexes)
   - Disadvantages:
      - need to write to a couple of CFs, but writes are relatively cheaper
      than reads in Cassandra
      - since you update 2 CFs and there are no transactions, one write
      might succeed and the other might fail

Even with the other suggestion of indexes, you can still add the tag->users
reverse index.
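
A minimal sketch of the two writes this takes, against the 0.6-style thrift
API - the keyspace, CF, and column names are made up for illustration:

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnPath;
    import org.apache.cassandra.thrift.ConsistencyLevel;

    public class TagWriter {

        // Both writes for one tagging event; error handling elided.
        static void tagUser(Cassandra.Client client, String userId,
                            String tag, String taggedBy) throws Exception {
            long ts = System.currentTimeMillis() * 1000;

            // Forward direction - Users CF: row = user id, column = tag,
            // value = who applied the tag.
            ColumnPath byUser = new ColumnPath("Users");
            byUser.setColumn(("tag:" + tag).getBytes("UTF-8"));
            client.insert("Keyspace1", userId, byUser,
                          taggedBy.getBytes("UTF-8"), ts,
                          ConsistencyLevel.QUORUM);

            // Reverse index - Tags CF: row = tag, column = user id.
            ColumnPath byTag = new ColumnPath("Tags");
            byTag.setColumn(userId.getBytes("UTF-8"));
            client.insert("Keyspace1", tag, byTag,
                          new byte[0], ts, ConsistencyLevel.QUORUM);

            // The two inserts are not atomic: if the second fails, an
            // idempotent retry or a periodic repair pass is needed.
        }
    }

user->tags is then one get_slice on the Users CF, and tag->users one
get_slice on the Tags CF.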



On Wed, Dec 22, 2010 at 4:54 AM, Prasad Sunkari  wrote:

>
> Hi all,
>
> I have a column family for users of my system and I need to have tags set
> on these users.  My current plan is to have a column that holds a string
> (comma-separated tags).
>
> I am not clear if this is the best way to do it.  Especially because this
> may lead to complications when more than one administrator is trying to tag
> the same user (lost updates), as well as for the secondary indexes (if I
> wanted to use the built-in secondary indexes).  I also am not sure if it is
> possible to have a secondary index on a multi-valued column!
>
> Another alternative is to have it in a super column with each tag being a
> column by itself and let my application take care of the secondary indexes.
>
> I am currently of the opinion that the second solution is the only thing
> that I could do.
> Any suggestions?  Since this is my first app on Cassandra I am trying to
> see if my opinion is correct.
>
> Thanks,
> Prasad
>


Re: Running multiple instances on a single server --micrandra ??

2010-12-08 Thread Anand Somani
Interesting idea.

If it is like dividing the entire load on the system by 6, then even with
the effective load staying the same, using an SSD for the commit volume
might let us get away with 1 commitlog SSD. Even if these 6 instances can
handle only 80% of the load (compared to 1 instance on this machine), that
might be acceptable. Could that help?

I mean, the benefits of smaller Cassandra nodes do sound very enticing.
Sure, we would probably have to throw more memory/CPU at it to get
throughput comparable to 1 instance on that box (or reduce the load), but it
still looks better than 6 boxes. A rough sketch of what a per-instance
config could look like follows below.
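
As a thought experiment only - this is a 0.7-style cassandra.yaml excerpt,
and the paths, addresses, and one-IP-alias-per-instance scheme are all
assumptions (each instance would also need its own JMX port via its own
cassandra-env.sh):

    # Instance 3 of 6, pinned to /disk3 (hypothetical layout). One IP
    # alias per instance, since the storage port is generally expected
    # to be the same across the cluster.
    data_file_directories:
        - /disk3/cassandra/data
    commitlog_directory: /disk3/cassandra/commitlog
    saved_caches_directory: /disk3/cassandra/saved_caches
    listen_address: 10.0.0.13   # alias dedicated to instance 3
    rpc_address: 10.0.0.13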

On Tue, Dec 7, 2010 at 10:00 PM, Jonathan Ellis  wrote:

> The major downside is you're going to want to let each instance have
> its own dedicated commitlog spindle too, unless you just don't have
> many updates.
>
> On Tue, Dec 7, 2010 at 8:25 PM, Edward Capriolo 
> wrote:
> > I am quite ready to be stoned for this thread, but I have been thinking
> > about this for a while and I just wanted to bounce these ideas off some
> > gurus.
> >
> > Cassandra does allow multiple data directories, but as far as I can
> > tell no one runs in this configuration. This is something that is very
> > different between the HBase architecture and the Cassandra
> > architecture. HBase borrows the concept of JBOD configurations from
> > Hadoop. HBase has many smallish (~256 MB) regions managed with
> > ZooKeeper. Cassandra has a few (1 per node) large, node-sized token
> > ranges managed by gossip consensus.
> >
> > Let's say a node has six 300 GB disks. You have the options of RAID5,
> > RAID6, RAID10, or RAID0. The problem I have found with these
> > configurations is that major compactions (or even large minor ones) can
> > take a long time. Even if your disks are not heavily utilized, this is a
> > lot of data to move through. Thus node joins take a long time, and node
> > moves take a long time.
> >
> > The idea behind "micrandra" is, for a 6-disk system, to run 6 instances
> > of Cassandra, one per disk. Use the RackAwareSnitch to make sure no
> > replicas live on the same physical machine.
> >
> > The downsides:
> > 1) we would have to manage 6x the instances of Cassandra
> > 2) we would have some overhead for each JVM.
> >
> > The upsides?
> > 1) A disk/instance failure only degrades the overall performance by
> > 1/6th (with RAID0 you lose the entire node; RAID5 still takes a hit
> > when down a disk)
> > 2) Moves and joins have less work to do
> > 3) You can scale up a single node by adding a single disk to an existing
> > system (assuming the RAM and CPU load is light)
> > 4) OPP would be "easier" to balance out hot spots with (maybe not
> > relevant if not on OPP)
> > What does everyone think? Does it ever make sense to run this way?
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>


Re: Getting Exception when doing a range query using token (worked in 6.5 not in 6.6)

2010-11-15 Thread Anand Somani
I only have 1 node (not a cluster), so I am not sure what other machine it
would be sending the request to. This is a very basic test that I am doing,
hence only 1 node.

I use describe_ring to get a list of TokenRanges (1 in this case) and pass
the start_token and end_token from it to get_range_slices. So what should I
be doing instead?

On Mon, Nov 15, 2010 at 5:34 PM, Jonathan Ellis  wrote:

> TimedOutException means the host that your client is talking to sent
> the request to another machine, which threw the logged exception and
> thus did not reply.
>
> You're doing an illegal query; token-based queries have to be on
> non-wrapping ranges (left token < right token), or a wrapping range of
> (mintoken, mintoken).  This was changed as part of the range scan
> fixes post-0.6.5.
>
> On Mon, Nov 15, 2010 at 6:32 PM, Anand Somani 
> wrote:
> > Hi
> >
> > Problem:
> >  Call: client.get_range_slices() using tokens (not keys) fails with
> > TimedOutException, which I think is misleading (read on)
> >  Server: works with 0.6.5, but not with 0.6.6 or 0.6.8
> >  Client: have tried both 0.6.5 and 0.6.6
> >
> > I am getting a TimedOutException when I do a get_range_slices() passing
> > in tokens (not keys). I only have 1 node at this time. This works with
> > 0.6.5, but is broken in 0.6.6 and 0.6.8. Basically I see the exception
> > below on the server side, so I am not sure how it translates to a
> > TimedOutException. I tried to play with setting the timeout, but I keep
> > getting the TimedOutException in exactly 10 seconds; the set value seems
> > to have no impact.
> >
> > The exception on the server side:
> >
> > ERROR [ROW-READ-STAGE:4] 2010-11-15 22:55:39,261 CassandraDaemon.java
> (line
> > 87) Uncaught exception in thread Thread[ROW-READ-STAGE:4,5,main]
> > java.lang.AssertionError:
> >
> (99318701746171979556028978387039718369,99318701746171979556028978387039718369]
> > at
> >
> org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1154)
> > at
> >
> org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:41)
> > at
> >
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:49)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> > at java.lang.Thread.run(Thread.java:619)
> >
> >
> > Am I doing something wrong, or what?
> >
> > Thanks
> > Anand
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>
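
(For the archive: a sketch of applying the advice above - detect the wrapped
range that describe_ring hands back and fall back to the whole-ring form.
The minimum-token value is an assumption here; check the partitioner's
minimum token in the version actually deployed.)

    import java.math.BigInteger;
    import org.apache.cassandra.thrift.KeyRange;
    import org.apache.cassandra.thrift.TokenRange;

    public class RangeFixer {

        // Assumed minimum token for the RandomPartitioner; verify against
        // the partitioner actually in use.
        static final String MIN_TOKEN = "-1";

        static KeyRange toLegalRange(TokenRange tr, int pageSize) {
            KeyRange range = new KeyRange(pageSize);
            BigInteger left = new BigInteger(tr.getStart_token());
            BigInteger right = new BigInteger(tr.getEnd_token());
            if (left.compareTo(right) < 0) {
                // Already non-wrapping: usable as-is.
                range.setStart_token(tr.getStart_token());
                range.setEnd_token(tr.getEnd_token());
            } else {
                // A single node's range from describe_ring has
                // start == end, i.e. it wraps; scan the whole ring instead.
                range.setStart_token(MIN_TOKEN);
                range.setEnd_token(MIN_TOKEN);
            }
            return range;
        }
    }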


Getting Exception when doing a range query using token (worked in 6.5 not in 6.6)

2010-11-15 Thread Anand Somani
Hi

Problem:
 Call: client.get_range_slices() using tokens (not keys) fails with
TimedOutException, which I think is misleading (read on)
 Server: works with 0.6.5, but not with 0.6.6 or 0.6.8
 Client: have tried both 0.6.5 and 0.6.6

I am getting a TimedOutException when I do a get_range_slices() passing in
tokens (not keys). I only have 1 node at this time. This works with 0.6.5,
but is broken in 0.6.6 and 0.6.8. Basically I see the exception below on
the server side, so I am not sure how it translates to a TimedOutException.
I tried to play with setting the timeout, but I keep getting the
TimedOutException in exactly 10 seconds; the set value seems to have no
impact.

The exception on the server side:

ERROR [ROW-READ-STAGE:4] 2010-11-15 22:55:39,261 CassandraDaemon.java (line
87) Uncaught exception in thread Thread[ROW-READ-STAGE:4,5,main]
java.lang.AssertionError:
(99318701746171979556028978387039718369,99318701746171979556028978387039718369]
at
org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1154)
at
org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:41)
at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:49)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)


Am I doing something wrong, or what?

Thanks
Anand


Range queries using token instead of key

2010-11-10 Thread Anand Somani
Hi,

I am trying to iterate over the entire dataset to calculate some
information. The way I am trying to do this is by going directly to the
node that holds a given data range, so here is the route I am following:

   - get the TokenRanges using describe_ring
   - then, for each TokenRange, pick a node and get all of that node's data
   (so talk directly to that node for local data) using get_range_slices()
   with a KeyRange carrying the start and end tokens. I want to fetch about
   N rows at a time.
   - I want to use a paging approach for this, but I cannot seem to find a
   way to get the token for my last KeySlice. The only thing I can find is
   the key; is there a way to get the token given a key? Per some
   suggestions, I could take the md5 of the last key and use that as the
   starting token for the next query - would that work? (A sketch of the
   loop I have in mind is below.)

Also, is there a better way of doing this? The data per row is very small.
This looks like a Hadoop kind of job, but I am trying to avoid Hadoop since
I have no other use for it and this operation will be infrequent.

I am using 0.6.6, RandomPartitioner.
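
Here is roughly the loop I have in mind - the CF name and page size are made
up, and the token computation just mirrors what the RandomPartitioner
appears to do (abs of the md5 digest as a BigInteger), so please correct me
if that part is wrong:

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.List;
    import org.apache.cassandra.thrift.*;

    public class RingScanner {

        static final int PAGE_SIZE = 500; // illustrative

        // Assumed to mirror RandomPartitioner: abs(BigInteger(md5(key))).
        static String tokenFor(String key) throws Exception {
            byte[] digest =
                MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
            return new BigInteger(digest).abs().toString();
        }

        // Page through one non-wrapping token range from describe_ring.
        static void scan(Cassandra.Client client, String keyspace,
                         String startToken, String endToken) throws Exception {
            ColumnParent parent = new ColumnParent("MyCF"); // hypothetical CF
            SlicePredicate predicate = new SlicePredicate();
            predicate.setSlice_range(
                new SliceRange(new byte[0], new byte[0], false, 10));

            String cursor = startToken;
            while (true) {
                KeyRange range = new KeyRange(PAGE_SIZE);
                range.setStart_token(cursor);
                range.setEnd_token(endToken);
                List<KeySlice> page = client.get_range_slices(
                    keyspace, parent, predicate, range, ConsistencyLevel.ONE);
                for (KeySlice slice : page) {
                    // ... fold slice into whatever is being calculated ...
                }
                if (page.size() < PAGE_SIZE)
                    break; // last page of this range
                // Token ranges are left-exclusive, so hashing the last key
                // back to a token restarts just past the rows already seen.
                cursor = tokenFor(page.get(page.size() - 1).getKey());
            }
        }
    }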

Thanks
Anand