Re: Cassandra 2 DC deployment

2011-04-15 Thread Peter Schuller
> You are right about the automatic fallback to ONE. It's quite possible that,
> if 2 nodes die for some reason, I will have the same problem. So probably the
> right thing to do would be to read/write at ONE only when we lose a DC, by
> changing some manual configuration. Since we shouldn't be losing DCs that
> often, this should be an acceptable change. So my follow-up questions would be -

Seems reasonable to have a human do it, since it sounds like you really
want QUORUM - presumably dropping to ONE has some kind of negative impact
and you don't want that happening sporadically every time there is a
network hiccup. But of course I don't know the context.

> When would be the right time to start reading/writing at QUORUM again?

I'd say usually as soon as possible, but it will depend on the details of
your situation. For example, suppose you have 2 DCs with 5 nodes in one and
1 node in the other, and there is a partition. Once you start asking for
QUORUM again, the DC with just one node will start seeing older data (from
the point of view of writes done in the 1-node DC during the partition),
since much of the time a quorum can be satisfied by 4 nodes in the other
DC. So if there is interest in preferring the local DC's copy of the data
after an emergency fallback to CL.ONE, it may be detrimental to go back to
QUORUM too early.
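
To make that arithmetic concrete, here is a rough Python sketch (my own
illustration, assuming the keyspace in the 5+1 example is replicated to all
six nodes, i.e. total RF = 6, and RF = 3 for the original 2+1 layout):

    def quorum(rf):
        # A quorum is a majority of all replicas, counted across every DC.
        return rf // 2 + 1

    print(quorum(6))  # 4 -> a quorum can be satisfied entirely inside the 5-node DC
    print(quorum(3))  # 2 -> in the 2+1 layout, the two DC1 replicas alone suffice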

But this will depend on what your application is actually doing and
what is important to you.

> Should we be marking the 2 nodes in the lost DC as down?
> Should we be doing some administrative work on Cassandra before we start
> reading/writing at QUORUM again?

Are you talking about permanently losing a DC then, rather than just a
transient partition? For non-permanent situations it seems
counter-productive to mark the other DC's nodes as down. Oh, and by the
way, keep in mind you can choose to use LOCAL_QUORUM to get intra-site
consistency (rather than ONE).
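
For illustration, here is a minimal sketch of a LOCAL_QUORUM read using the
current DataStax Python driver (anachronistic for this thread - in 2011 you
would go through a Thrift client such as Hector or pycassa - and the
hostnames, keyspace and table are made up):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # Hypothetical contact points and keyspace, purely for illustration.
    cluster = Cluster(['dc1-node1', 'dc2-node1'])
    session = cluster.connect('my_keyspace')

    # LOCAL_QUORUM only counts replicas in the coordinator's own DC, so it
    # keeps working when the remote DC is unreachable, as long as enough
    # local replicas are up (the keyspace needs NetworkTopologyStrategy).
    stmt = SimpleStatement(
        "SELECT * FROM users WHERE user_id = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    )
    row = session.execute(stmt, ['some-user-id']).one()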

As for administrative work: I can't answer in general since we're
talking about very special circumstances, but it is at least valid to
say that whenever you have had some kind of issue that caused
inconsistency, running 'nodetool repair' (perhaps earlier than your
standard weekly/whatever schedule) is the most effective way to get
back to consistency.

-- 
/ Peter Schuller


RE: Cassandra 2 DC deployment

2011-04-13 Thread Nair, Rajesh
Peter, all great questions. Let me try to answer them.

You are right about the automatic fallback to ONE. It's quite possible that,
if 2 nodes die for some reason, I will have the same problem. So probably the
right thing to do would be to read/write at ONE only when we lose a DC, by
changing some manual configuration. Since we shouldn't be losing DCs that
often, this should be an acceptable change. So my follow-up questions would be -
When would be the right time to start reading/writing at QUORUM again? 
Should we be marking the 2 nodes in the lost DC as down?
Should we be doing some administrative work on Cassandra before we start 
reading/writing at QUORUM again?

I am trying to define a process for when we lose a DC.

Thanks
-Raj 

-Original Message-
From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller
Sent: Tuesday, April 12, 2011 4:54 PM
To: user@cassandra.apache.org
Subject: Re: Cassandra 2 DC deployment

> When the down data center comes back up, the QUORUM reads will result in
> read repair, so you will get valid data. Besides that, hinted handoff will
> take care of getting data replicated to a previously down node.

*Eventually*, yes. I.e., there would be no expectation of instantly going back
to full consistency once it comes back up.

Also, I would argue that it's useful to consider this: if you're implementing
automatic fallback to ONE whenever QUORUM fails, consider all the cases where
that might happen for reasons *other* than a legitimate partition between the
DCs - for example, some random networking issue causing fewer nodes to be up.

A valid question is: if you simply do automatic fallback whenever QUORUM fails
anyway, are you really gaining much consistency over just using ONE? In some
cases yes, but be sure you know what you're doing... Keep in mind that when
all nodes are up and everything is working well, CL.ONE doesn't mean that
writes won't be replicated to all nodes. It just means that only one
acknowledgement is *required* - and the same goes for reads.

If you have a situation where you normally want the strict requirement that a
read subsequent to a write sees the written data, that doesn't sound very
compatible with automatically falling back to CL.ONE...

Anyway, those are my off-the-cuff thoughts - maybe they don't apply in the
situation in question.
--
/ Peter Schuller



Cassandra 2 DC deployment

2011-04-12 Thread Raj N
Hi experts,
 We are planning to deploy Cassandra in 2 datacenters. Let's assume there
are 3 nodes, RF=3, 2 nodes in one DC and 1 node in the second DC. Under
normal operations we would read and write at QUORUM. What we want, though,
is that if we lose the datacenter which has 2 nodes (DC1 in this case), we
downgrade our consistency to ONE. Basically I am saying that whenever there
is a partition, we prefer availability over consistency. In order to do this
we plan to catch UnavailableException and take corrective action: try QUORUM
under normal circumstances; if unavailable, try ONE. My questions -
Do you guys see any flaws with this approach?
What happens when DC1 comes back up and we start reading/writing at QUORUM
again? Will we read stale data in this case?
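
A minimal sketch of this try-QUORUM-then-fall-back-to-ONE idea, written
against the current DataStax Python driver (in 2011 this would go through a
Thrift client and its UnavailableException; the table, keyspace and hostnames
here are made up, and the exact exception surfaced can vary with driver
version and retry policy):

    from cassandra import ConsistencyLevel, Unavailable
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['dc1-node1', 'dc1-node2', 'dc2-node1'])  # made-up hosts
    session = cluster.connect('my_keyspace')                    # made-up keyspace

    def read_user(user_id):
        query = "SELECT * FROM users WHERE user_id = %s"
        try:
            # Normal case: require a quorum of replicas across both DCs.
            stmt = SimpleStatement(query,
                                   consistency_level=ConsistencyLevel.QUORUM)
            return session.execute(stmt, [user_id]).one()
        except Unavailable:
            # Not enough replicas alive for QUORUM (e.g. the 2-node DC is
            # gone): retry at ONE and knowingly accept possibly stale data.
            stmt = SimpleStatement(query,
                                   consistency_level=ConsistencyLevel.ONE)
            return session.execute(stmt, [user_id]).one()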

Thanks
-Raj


Re: Cassandra 2 DC deployment

2011-04-12 Thread Jonathan Colby
When the down data center comes back up, the QUORUM reads will result in
read repair, so you will get valid data. Besides that, hinted handoff will
take care of getting data replicated to a previously down node.

Your example is a little unrealistic because you could theoretically have a
DC with only one node, so CL.ONE would work every time. But if you have more
than one node, you have to decide whether your application can tolerate
getting NULL for a read if the write hasn't propagated from the responsible
node to the replica.

Disclaimer: I'm a Cassandra novice.

On Apr 12, 2011, at 5:12 PM, Raj N wrote:

> Hi experts,
>  We are planning to deploy Cassandra in 2 datacenters. Let's assume there
> are 3 nodes, RF=3, 2 nodes in one DC and 1 node in the second DC. Under
> normal operations we would read and write at QUORUM. What we want, though,
> is that if we lose the datacenter which has 2 nodes (DC1 in this case), we
> downgrade our consistency to ONE. Basically I am saying that whenever there
> is a partition, we prefer availability over consistency. In order to do this
> we plan to catch UnavailableException and take corrective action: try QUORUM
> under normal circumstances; if unavailable, try ONE. My questions -
> Do you guys see any flaws with this approach?
> What happens when DC1 comes back up and we start reading/writing at QUORUM
> again? Will we read stale data in this case?
>
> Thanks
> -Raj



Re: Cassandra 2 DC deployment

2011-04-12 Thread Narendra Sharma
I think this is reasonable, assuming you have enough backhaul to perform
reads across DCs when read requests hit DC2 (which holds only one copy of
the data) or when one replica in DC1 is down.

Moreover, since you clearly stated that you would prefer availability over
consistency, you should be prepared for stale reads :)


On Tue, Apr 12, 2011 at 8:12 AM, Raj N raj.cassan...@gmail.com wrote:

> Hi experts,
>  We are planning to deploy Cassandra in 2 datacenters. Let's assume there
> are 3 nodes, RF=3, 2 nodes in one DC and 1 node in the second DC. Under
> normal operations we would read and write at QUORUM. What we want, though,
> is that if we lose the datacenter which has 2 nodes (DC1 in this case), we
> downgrade our consistency to ONE. Basically I am saying that whenever there
> is a partition, we prefer availability over consistency. In order to do this
> we plan to catch UnavailableException and take corrective action: try QUORUM
> under normal circumstances; if unavailable, try ONE. My questions -
> Do you guys see any flaws with this approach?
> What happens when DC1 comes back up and we start reading/writing at QUORUM
> again? Will we read stale data in this case?
>
> Thanks
> -Raj




-- 
Narendra Sharma
Solution Architect
*http://www.persistentsys.com*
*http://narendrasharma.blogspot.com/*


Re: Cassandra 2 DC deployment

2011-04-12 Thread Peter Schuller
> When the down data center comes back up, the QUORUM reads will result in
> read repair, so you will get valid data. Besides that, hinted handoff will
> take care of getting data replicated to a previously down node.

*Eventually*, yes. I.e., there would be no expectation of instantly
going back to full consistency once it comes back up.

Also, I would argue that it's useful to consider this: if you're
implementing automatic fallback to ONE whenever QUORUM fails, consider
all the cases where that might happen for reasons *other* than a
legitimate partition between the DCs - for example, some random
networking issue causing fewer nodes to be up.

A valid question is: if you simply do automatic fallback whenever
QUORUM fails anyway, are you really gaining much consistency over just
using ONE? In some cases yes, but be sure you know what you're
doing... Keep in mind that when all nodes are up and everything is
working well, CL.ONE doesn't mean that writes won't be replicated to
all nodes. It just means that only one acknowledgement is *required* -
and the same goes for reads.
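
A toy way to state that distinction (purely illustrative, not Cassandra
internals): the coordinator sends every write to all live replicas regardless
of consistency level; the level only controls how many acknowledgements it
waits for before reporting success.

    # Toy model for RF = 3; only the number of *required* acks changes.
    REQUIRED_ACKS = {"ONE": 1, "QUORUM": 2, "ALL": 3}

    def write_reported_ok(consistency_level, acks_received):
        # The write was still sent to all three replicas either way.
        return acks_received >= REQUIRED_ACKS[consistency_level]

    print(write_reported_ok("ONE", 1))     # True  - succeeds even if 2 replicas lag
    print(write_reported_ok("QUORUM", 1))  # False - needs at least 2 acks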

If you have a situation where you normally want the strict requirement
that a read subsequent to a write sees the written data, that doesn't
sound very compatible with automatically falling back to CL.ONE...

Anyway, those are my off-the-cuff thoughts - maybe they don't apply
in the situation in question.
-- 
/ Peter Schuller