Sean,

Thanks to you and Ben for clarifying how that works. Since that was so helpful, I'll ask a follow-up question, and also a question on a mostly unrelated topic...
1) When I've removed a couple of nodes and the remaining nodes pick up the
slack, is there any way for me to look under the hood and see that? I'm
using wget to fetch the '.../stats' URL from one of the remaining live
nodes, and under ring_ownership it still lists the original 4 nodes, each
one owning 1/4 of the total partitions. That's part of the reason why I
didn't think the data ownership had been moved.

2) My test involves sending a large number of read/write requests to the
cluster from multiple client connections and timing how long each request
takes. I find that the vast majority of the requests are processed quickly
(a few milliseconds to tens of milliseconds). However, every once in a
while, the server seems to "hang" for a while; when that happens, the
response can take several hundred milliseconds or even several seconds.
Is this something that is known and/or expected? There doesn't seem to be
any pattern to how often it happens -- typically I'll see it a "few" times
during a 10-minute test run, and sometimes it will go for several minutes
without a problem. I haven't ruled out a problem with my test client, but
it's a fairly simple-minded C++ program using the protocol buffers
interface, so I don't think there is too much that can go wrong on that
end.

Thanks again for your help!

On Fri, May 13, 2011 at 12:06:06PM -0400, Sean Cribbs wrote:
> Peter,
>
> You've hit on a major feature of Riak: to be available in the face of network
> and hardware failure.
>
> When a node is down, other nodes (ones that do not "own" the replicas for a
> given key) will pick up the slack and serve read and write requests on behalf
> of the downed node. This means that while the node(s) is down, you could
> write a key to the cluster and read it back while still satisfying quorum.
> The standard quorum considers fallback nodes to be as valid as non-fallbacks
> (we're also in the process of implementing a way for you to be more strict
> about that, if you so desire).
> When the downed nodes return, writes that
> were sent to fallbacks are returned to their proper owners via hinted handoff.
>
> This feature lets your application that uses Riak stay available (even if in
> a degraded state), despite multiple failures. We consider this A Good Thing.
>
> Sean Cribbs <[email protected]>
> Developer Advocate
> Basho Technologies, Inc.
> http://basho.com/
>
> On May 13, 2011, at 11:13 AM, Peter Fales wrote:
>
> > I'm a Riak newbie, trying to get some familiarity with the system by
> > running some tests on Amazon EC2. I'm seeing some behavior that I don't
> > understand...
> >
> > I've set up a test where I create a 4-node cluster using 4 EC2 machines.
> > I've created a bucket with n_val=4, r=quorum, and w=quorum. For
> > n_val=4, the quorum should be 3, so I thought I would have to have at
> > least 3 nodes in service for my read and write operations to succeed.
> > During my test, I start sending read/write requests to two of the nodes
> > (and I see the CPU load go up on all four nodes, so I know they are
> > talking to each other). Then I reboot the other two nodes. At that
> > point, I was expecting the reads and writes to start failing, but in
> > fact I usually don't see any problems. (Sometimes the query that is
> > in progress at the time may fail or time out, but if I establish a new
> > connection to the server and start sending read/write requests again,
> > those requests will go through, even with only two of the 4 nodes in
> > service.)
> >
> > I suspect I'm just missing something obvious, but I don't understand how
> > I can run with just two nodes. What am I missing?
> >
> > --
> > Peter Fales
> > Alcatel-Lucent
> > Member of Technical Staff
> > 1960 Lucent Lane
> > Room: 9H-505
> > Naperville, IL 60566-7033
> > Email: [email protected]
> > Phone: 630 979 8031
> >
> > _______________________________________________
> > riak-users mailing list
> > [email protected]
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

--
Peter Fales
Alcatel-Lucent
Member of Technical Staff
1960 Lucent Lane
Room: 9H-505
Naperville, IL 60566-7033
Email: [email protected]
Phone: 630 979 8031

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
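[Editor's note: the under-the-hood check in question 1 can be sketched as follows. This is a minimal illustration, not Basho's documented API: the node names and the JSON shape of ring_ownership below are assumptions for the example, and a live node's /stats output may format that field differently.]

```python
import json

# Illustrative /stats payload. Against a live node this would come from
# something like urllib.request.urlopen("http://node1:8098/stats"); the
# simplified ring_ownership mapping here is an assumption for the sketch.
sample_stats = json.loads("""
{
  "ring_num_partitions": 64,
  "ring_ownership": {
    "riak@node1": 16,
    "riak@node2": 16,
    "riak@node3": 16,
    "riak@node4": 16
  }
}
""")

def ownership_fractions(stats):
    """Return each node's fraction of the total claimed partitions."""
    owned = stats["ring_ownership"]
    total = sum(owned.values())
    return {node: count / total for node, count in owned.items()}

for node, frac in sorted(ownership_fractions(sample_stats).items()):
    print(f"{node}: {frac:.2%}")
```

Note that claimed ownership staying at 1/4 per node while two nodes are down is consistent with Sean's explanation: fallbacks serve requests on behalf of the owners without taking over ownership of the partitions.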
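[Editor's note: for the intermittent stalls in question 2, reporting tail percentiles rather than averages makes rare multi-second outliers visible. A minimal sketch; the fake_request function is a stand-in for the author's real protocol buffers client, not part of any Riak API.]

```python
import random
import time

def timed_ms(request_fn):
    """Run one request and return its latency in milliseconds."""
    start = time.perf_counter()
    request_fn()
    return (time.perf_counter() - start) * 1000.0

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[rank]

def fake_request():
    """Stand-in for a real Riak get/put; occasionally stalls."""
    time.sleep(0.02 if random.random() < 0.01 else 0.001)

latencies = [timed_ms(fake_request) for _ in range(200)]
print(f"p50={percentile(latencies, 50):.1f}ms "
      f"p99={percentile(latencies, 99):.1f}ms "
      f"max={max(latencies):.1f}ms")
```

Comparing p50 against p99/max over a run quantifies the "hangs" (a large gap confirms rare outliers rather than uniformly slow requests), which is more diagnostic than eyeballing individual request times.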
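[Editor's note: the quorum arithmetic in the quoted thread (n_val=4 requires 3 replica responses, yet two live physical nodes suffice once fallbacks step in) can be sketched as:]

```python
def quorum(n_val):
    """Majority quorum: more than half of the n_val replicas must respond."""
    return n_val // 2 + 1

# With n_val=4 and r=w=quorum, 3 of the 4 vnodes must answer each request.
# When two physical nodes are down, fallback vnodes on the surviving nodes
# stand in for the missing owners, so the cluster can still assemble the
# 3 responses needed -- which is why the reads and writes kept succeeding.
print("quorum for n_val=4:", quorum(4))
```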
