Extremely relevant to this conversation: http://blog.basho.com/2011/09/09/Riak-Cluster-Membership-Overview/
Buy Joe a beer next time you see him :)

Mark

On Fri, Sep 9, 2011 at 2:37 PM, Jeff Pollard <[email protected]> wrote:
>> Data is first transferred to new partition owners before handing over
>> partition ownership. This change fixes numerous bugs, such as
>> 404s/not_founds during ownership changes. The Ring/Pending columns in
>> [riak-admin member_status] visualize this at a high level, and the full
>> transfer status in [riak-admin ring_status] provides additional insight.
>
> At present (the 0.14.x series), my understanding is that when a new node is
> added to the cluster, it claims a portion of the ring and services requests
> for that portion before all the data is actually present on the node. Is
> that correct? If so, as long as you're able to meet the R value of a read
> (e.g. R=2, N=3) by servicing reads from nodes with replicas of the same
> data, you shouldn't see any 404s. Is that also correct?
>
> I should add that we're planning on adding our first node to our production
> cluster soon and wanted to make sure we had our story straight :) That
> said, I'm very excited to see all the clustering improvements in 1.0 and am
> hoping we can upgrade before adding a new node.
>
> On Fri, Sep 9, 2011 at 3:10 AM, Jens Rantil <[email protected]> wrote:
>>
>> Thanks for a very well written answer. I appreciate it, mate.
>>
>> Jens
>>
>> -----Original message-----
>> From: Joseph Blomstedt [mailto:[email protected]]
>> Sent: 8 September 2011 17:42
>> To: Jens Rantil
>> Cc: [email protected]
>> Subject: Re: Riak Clustering Changes in 1.0
>>
>> > Out of curiosity, what was the reason for the 'join' command
>> > behaviour to change?
>>
>> 1. Existing bugs/limitations. For example, joining two entire clusters
>> together was not an entirely safe operation. In some cases, the newly
>> formed cluster would not correctly converge, leaving the ring/cluster in
>> flux.
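[Editor's note: Jeff's R=2/N=3 reasoning above can be sketched as a toy quorum model. This is a hypothetical illustration, not Riak's API; `quorum_read` and the replica/responsive sets are invented names. The idea: a read succeeds as long as at least R of the N nodes holding replicas can answer, so a new node that has claimed a partition but not yet received its data only causes not_founds when fewer than R data-bearing replicas respond.]

```python
# Toy model of an R-of-N quorum read (hypothetical names; not Riak's API).
# A read succeeds if at least R of the N replica-holding nodes respond.

def quorum_read(replicas, responsive, r):
    """replicas: nodes holding a copy; responsive: nodes able to answer."""
    answers = [n for n in replicas if n in responsive]
    if len(answers) >= r:
        return ("ok", answers[:r])
    return ("not_found", answers)  # fewer than R replies -> 404/not_found

# N=3 replicas; new node "d" claimed the partition but holds no data yet.
replicas = ["a", "b", "d"]
status, _ = quorum_read(replicas, responsive={"a", "b"}, r=2)
assert status == "ok"          # R=2 still met by the two old replica holders

status, _ = quorum_read(replicas, responsive={"b"}, r=2)
assert status == "not_found"   # only one data-bearing replica reachable
```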
>> Likewise, we realized that many users were often joining two clusters
>> together by accident and would prefer additional safety. In particular,
>> joining two clusters together with overlapping data but no common vector
>> clock relationship could result in data loss as unintended siblings were
>> reconciled.
>>
>> 2. It was a necessary consequence of how the new cluster code works. In
>> the new code, the cluster state / ring is only ever mutated by a single
>> node at a time. This is done by having a cluster-wide claimant, as
>> mentioned in my original email. Given the claimant approach, all cluster
>> state / ring changes are totally ordered. When a new node joins an
>> existing cluster, it throws away its existing ring and replaces it with a
>> copy of the ring from the target cluster, thus joining into the same
>> cluster history. If you were to join two clusters together, we would need
>> to deterministically merge two independent cluster histories and elect a
>> single new claimant for the new cluster. This is easy in cases where
>> there are no node failures or net-splits during joining, but less trivial
>> when there are errors. The entire new cluster code was heavily modeled
>> before implementation, and in the modeling work several corner cases
>> related to failures were found that were hard to address in a
>> cluster/cluster join but easy to fix in a node/cluster join. Thus, I went
>> with the simple and correct approach.
>>
>> -Joe
>>
>> --
>> Joseph Blomstedt <[email protected]>
>> Software Engineer
>> Basho Technologies, Inc.
>> http://www.basho.com/
>>
>> On Thu, Sep 8, 2011 at 5:19 AM, Jens Rantil <[email protected]>
>> wrote:
>> > Out of curiosity, what was the reason for the 'join' command
>> > behaviour to change?
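[Editor's note: Joe's claimant explanation above can be modeled as a small sketch. This is a hypothetical toy, not Riak's implementation; the `Cluster`/`Node` classes and `ring_version` counter are invented. It shows why node-to-cluster joins stay simple: one claimant makes all ring changes, so the history is totally ordered, and a joining singleton just discards its own ring and adopts the cluster's.]

```python
# Toy model of node->cluster join under a single claimant (not Riak's code).
# All ring mutations go through one claimant, giving one totally ordered
# ring history; a joining node abandons its own history and adopts it.

class Node:
    def __init__(self, name):
        self.name = name
        self.singleton = True      # a fresh node is a 1-node cluster
        self.ring_version = 0

class Cluster:
    def __init__(self, claimant):
        self.members = [claimant]
        self.claimant = claimant
        self.ring_version = 0      # single, totally ordered history

    def join(self, node):
        assert node.singleton, "joining node must be a singleton cluster"
        node.ring_version = self.ring_version  # discard own ring history
        node.singleton = False
        self.members.append(node)
        self.ring_version += 1     # only the claimant advances the ring

c = Cluster(Node("node1"))
c.join(Node("node2"))
assert c.ring_version == 1 and len(c.members) == 2
```

A cluster-to-cluster join, by contrast, would require deterministically merging two independent histories and electing one claimant, which is exactly the corner-case-laden path the new design avoids.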
>> >
>> > Regards,
>> >
>> > Jens
>> >
>> > -----------------------------------------------------------
>> >
>> > Date: Wed, 7 Sep 2011 18:12:40 -0600
>> > From: Joseph Blomstedt <[email protected]>
>> > To: riak-users Users <[email protected]>
>> > Subject: Riak Clustering Changes in 1.0
>> > Message-ID:
>> > <CANvk2KRPpath-ZJaDuhZo0b+pEByn0MhxyJ-9LEB+b_12d=v...@mail.gmail.com>
>> > Content-Type: text/plain; charset=ISO-8859-1
>> >
>> > Given that 1.0 prerelease packages are now available, I wanted to
>> > mention some changes to Riak's clustering capabilities in 1.0. In
>> > particular, there are some subtle semantic differences in the
>> > riak-admin commands. More complete docs will be updated in the near
>> > future, but I hope a quick email suffices for now.
>> >
>> > [nodeB/riak-admin join nodeA] is now strictly one-way. It joins nodeB
>> > to the cluster that nodeA is a member of. This is semantically
>> > different from pre-1.0 Riak, in which join essentially joined clusters
>> > together rather than joining a node to a cluster. As part of this
>> > change, the joining node (nodeB in this case) must be a singleton
>> > (1-node) cluster.
>> >
>> > In pre-1.0, leave and remove were essentially the same operation, with
>> > leave just being an alias for 'remove this-node'. This has changed.
>> > Leave and remove are now very different operations.
>> >
>> > [nodeB/riak-admin leave] is the only safe way to have a node leave the
>> > cluster, and it must be executed by the node that you want to remove.
>> > In this case, nodeB will start leaving the cluster, and will not leave
>> > the cluster until after it has handed off all its data. Even if nodeB
>> > is restarted (crashed/shutdown/whatever), it will remain in the leave
>> > state and continue handing off partitions until done.
>> > After handoff, it will leave the cluster and eventually shut down.
>> >
>> > [nodeA/riak-admin remove nodeB] immediately removes nodeB from the
>> > cluster, without handing off its data. All replicas held by nodeB are
>> > therefore lost and will need to be regenerated through read-repair.
>> > Use this command carefully. It's intended for nodes that are
>> > permanently unrecoverable, and for which handoff therefore doesn't
>> > make sense. By the final 1.0 release, this command may be renamed
>> > "force-remove" just to make the distinction clear.
>> >
>> > There are now two new commands that provide additional insight into
>> > the cluster: [riak-admin member_status] and [riak-admin ring_status].
>> >
>> > Underneath, the clustering protocol has been mostly re-written. The
>> > new approach has the following advantages:
>> >
>> > 1. It is no longer necessary to wait on [riak-admin ringready] in
>> > between adding/removing nodes from the cluster, and adding/removing is
>> > also much more sound/graceful. Starting up 16 nodes and issuing
>> > [nodeX: riak-admin join node1] for X=1:16 should just work.
>> >
>> > 2. Data is first transferred to new partition owners before handing
>> > over partition ownership. This change fixes numerous bugs, such as
>> > 404s/not_founds during ownership changes. The Ring/Pending columns in
>> > [riak-admin member_status] visualize this at a high level, and the
>> > full transfer status in [riak-admin ring_status] provides additional
>> > insight.
>> >
>> > 3. All partition ownership decisions are now made by a single node in
>> > the cluster (the claimant). Any node can be the claimant, and the duty
>> > is automatically taken over if the previous claimant is removed from
>> > the cluster. [riak-admin member_status] will list the current
>> > claimant.
>> >
>> > 4. Handoff related to ownership changes can now occur under load;
>> > hinted handoff still only occurs when a vnode is inactive. This change
>> > allows a cluster to scale up/down under load, although this needs to
>> > be further benchmarked and tuned before 1.0.
>> >
>> > To support all of the above, a new limitation has been introduced.
>> > Cluster changes (member addition/removal, ring rebalance, etc.) can
>> > only occur when all nodes are up and reachable. [riak-admin
>> > ring_status] will complain when this is not the case. If a node is
>> > down, you must issue [riak-admin down <node>] to mark the node as
>> > down, and the remaining nodes will then proceed to converge as usual.
>> > Once the down node comes back online, it will automatically
>> > re-integrate into the cluster. However, there is nothing preventing
>> > client requests from being served by a down node before it
>> > re-integrates. Before issuing [down <node>], make sure to update your
>> > load balancers / connection pools to not include this node. Future
>> > releases of Riak may make offlining a node an automatic operation, but
>> > it's a user-initiated action in 1.0.
>> >
>> > -Joe

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
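[Editor's note: the "transfer data first, then flip ownership" behavior described in point 2 above can be sketched as a toy model. The `Partition` class and its fields are hypothetical, not Riak's code; the point is that the current owner keeps serving reads until handoff to the pending owner completes, which is what eliminates the old 404s.]

```python
# Toy model of ownership transfer in two phases (not Riak's implementation).
# "owner" mirrors the Ring column of member_status; "pending" mirrors the
# Pending column. Ownership flips only after the data handoff finishes.

class Partition:
    def __init__(self, owner):
        self.owner = owner          # current owner; serves all requests
        self.pending = None         # scheduled next owner, if any

    def schedule_ownership_change(self, new_owner):
        self.pending = new_owner    # ownership does NOT change yet

    def handoff_complete(self):
        # data has been fully transferred; only now does ownership flip
        self.owner, self.pending = self.pending, None

p = Partition(owner="node1")
p.schedule_ownership_change("node2")
assert p.owner == "node1"           # reads still hit the data-bearing node,
assert p.pending == "node2"         # so no not_founds while data is in flight
p.handoff_complete()
assert p.owner == "node2" and p.pending is None
```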
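[Editor's note: the new all-nodes-reachable restriction and the [riak-admin down <node>] escape hatch can be sketched as a one-function toy. `can_change_cluster` and its arguments are invented for illustration; they are not a Riak API. The rule it models: membership changes proceed only when every member is either reachable or has been explicitly marked down by an operator.]

```python
# Toy model of the 1.0 cluster-change guard (hypothetical; not Riak's code).
# A membership change (join/leave/rebalance) is allowed only if every
# member is reachable or has been explicitly marked down by an operator.

def can_change_cluster(members, reachable, marked_down):
    """True iff each member is reachable or explicitly marked down."""
    return all(m in reachable or m in marked_down for m in members)

members = {"node1", "node2", "node3"}

# All nodes up: changes proceed normally.
assert can_change_cluster(members, reachable=members, marked_down=set())

# node3 crashes: the ring is stuck until an operator intervenes,
# which is what [riak-admin ring_status] would complain about.
assert not can_change_cluster(members, reachable={"node1", "node2"},
                              marked_down=set())

# Operator issues the equivalent of [riak-admin down node3]:
# the remaining nodes may now converge without it.
assert can_change_cluster(members, reachable={"node1", "node2"},
                          marked_down={"node3"})
```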
