Thanks Jonathan,

A few more clarification questions below.

-eran
On Tue, Mar 2, 2010 at 15:44, Jonathan Ellis <jbel...@gmail.com> wrote:
> On Tue, Mar 2, 2010 at 6:43 AM, Eran Kutner <e...@gigya.com> wrote:
> > Is the procedure described in the description of ticket CASSANDRA-44
> > really the way to do schema changes in the latest release? I'm not sure
> > what your thoughts are about this, but our experience is that every
> > release of our software requires schema changes, because we add new
> > column families for indexes.
>
> Yes, that is how it is for 0.5 and 0.6. 0.7 will add online schema
> changes (i.e., fix -44), Gary is working on that now.

So just to be clear, that would require a complete cluster restart as well as
stopping the client app (to prevent new writes from coming in after doing the
flush), right? Do you know how others are handling it on a live system?

> > Any idea on the timeframe for 0.7?
>
> We are trying for 3-4 months, i.e. roughly the same as our last 4 releases.

> > Our application needs a lot of range scans. Is there anything being done
> > to improve the poor range scan performance as reflected here:
> > http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf ?
>
> https://issues.apache.org/jira/browse/CASSANDRA-821 is open, also for
> the 0.7 release. Johan is working on this.

> > What is the reason for the replication strategy with two DCs? As far as I
> > understand it means that only one replica will exist in the second DC. It
> > also means that quorum reads will fail when attempted on the second DC
> > while the first DC is down. Am I missing something?
>
> Yes:
> - That strategy is meant for doing reads w/ CL.ONE; it guarantees at
> least one replica in each DC, for low latency with that CL
> - Quorum is based on the whole cluster, not per-DC.
> DatacenterShardStrategy will put multiple replicas in each DC, for use
> with CL.DCQUORUM, that is, a majority of replicas in the same DC as
> the coordinator node for the current request.
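To make the point about cluster-wide quorum concrete, here is a minimal sketch of the arithmetic. The RF=3 / two-DC replica split is an assumed example, not something stated in the thread, but it shows why a quorum read from the second DC fails when the first DC is down under a strategy that places only one replica there:

```python
# Sketch of quorum counting, assuming RF=3 with two replicas in DC1
# and one replica in DC2 (illustrative numbers, not Cassandra internals).

def quorum(replication_factor):
    """A quorum is a majority of all replicas, cluster-wide (not per-DC)."""
    return replication_factor // 2 + 1

rf = 3
replicas_in_dc2 = 1

needed = quorum(rf)          # 2 replicas must respond
available = replicas_in_dc2  # only 1 is reachable if DC1 is down

print(needed, available)     # 2 1
assert available < needed    # so a CL.QUORUM read served from DC2 fails
```

This is exactly the failure mode the question describes: quorum counts replicas across the whole cluster, so losing the DC that holds the majority leaves the survivor unable to satisfy it.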
> DCQUORUM is not yet
> finished, though; currently it behaves the same as CL.ALL.

Is it planned for any specific release?

> > Are there any plans to have an inter-cluster replication option? I mean
> > having two clusters running in two DCs, each would be stand-alone but
> > they would replicate data between themselves.
>
> No. This is worse in every respect, since it means you get to
> reinvent the existing repair, hinted handoff, etc. code for when
> replication breaks, poorly.

I'm not sure I understand why you would need to redo all of that. As a
trivial design, assume that every write in DC1 was logged to a system table,
which would be just a standard Cassandra table; writes are cheap anyway, so
doing another write on every write is a reasonable tradeoff. Then, a
background service would read that system table with CL.ALL and write the
data to DC2, again with CL.ALL. With a single server/thread doing the
replication it's almost trivial, but even with more servers/threads I think
it can still be managed with very small changes to the existing Cassandra
system.

> > This can avoid the problem mentioned
> > above, as well as avoid the high cost of inter-DC traffic when doing
> > Read-Repairs for every read.
>
> Of course if you don't RR then you can read inconsistent data until
> your next full repair. Not a good trade. Remember RR is done in the
> background so the latency doesn't matter.

I am more concerned about the actual cost of the bandwidth: on a typical
application with 80-90% reads, doing RR means you need a very wide link
between the DCs. It's probably going to be even worse after CL.DCQUORUM is
available, because then more data will have to be read from the remote DC.

> > From everything I've read I didn't understand if load balancing is local
> > or global. In other words, what happens exactly when a new node is added?
> > Will it only balance its two neighbors on the ring, or will the
> > re-balance propagate through the ring so that all the nodes are
> > rebalanced evenly?
>
> The former. Cascading data moves around the ring is a Bad Idea.
> (Since you read the Yahoo hbase/cassandra paper -- if hbase does this,
> maybe that is why adding a new node basically kills their cluster for
> several minutes?)

I don't know exactly what they are doing there, but in general, since the
data layer (HDFS) is separate from the DB layer (HBase), they should be able
to reassign key ranges to other region servers quite easily. I can only
assume the slowdown happens because the region server has to flush all its
memory tables to disk before being able to split the ranges. Re-balancing
HDFS is definitely not done automatically; they have a "balancer" service
that has to be run manually to balance HDFS blocks after adding/removing
nodes, but it does its work slowly in the background.

> > I see that Hadoop support is coming in 0.6, but from following the
> > ticket on Jira (CASSANDRA-342) I didn't understand if it will support
> > the orderPreservingPartitioner or not.
>
> It supports all partitioners.

> > Do the clients have to be recompiled and deployed when a new version of
> > Cassandra is deployed, or are new releases backward compatible?
>
> The short answer is, we maintained backwards compatibility for 0.4 ->
> 0.5 -> 0.6, but we are going to break things in 0.7 moving from String
> keys to byte[] and possibly other changes.

Hmmm... My assumption was that although keys are strings, they are still
compared as bytes when using the OPP, right? That would be the difference
between the OPP and the COPP, right? Just confirming, because otherwise
creating composite keys with different data types may prove problematic.

>
> -Jonathan
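To illustrate the composite-key concern at the end: if string keys are compared byte-by-byte, any numeric component embedded in a key sorts lexicographically rather than numerically. The "user:<id>" key layout below is a made-up example (not from the thread), but it shows the pitfall and the usual zero-padding workaround:

```python
# Sketch of byte-wise (lexicographic) key ordering, as an order-preserving
# partitioner comparing raw string keys would see it. The key layout is a
# hypothetical example.

keys = ["user:9", "user:10", "user:100"]

# Lexicographic order puts "10" and "100" before "9", so a range scan
# over ids would return rows in the "wrong" numeric order.
print(sorted(keys))    # ['user:10', 'user:100', 'user:9']

# Zero-padding the numeric component to a fixed width makes byte order
# and numeric order agree again.
padded = ["user:%03d" % n for n in (9, 10, 100)]
print(sorted(padded))  # ['user:009', 'user:010', 'user:100']
```

This is why mixed-type composite keys are tricky under byte-wise comparison: each component needs an encoding whose byte order matches its logical order (fixed-width padding for integers, big-endian encodings, and so on).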