Re: [Openais] howto distribute data accross all nodes?
On Fri, Apr 17, 2009 at 10:56:47PM -0700, Steven Dake wrote:
> On Sat, 2009-04-18 at 07:49 +0200, Dietmar Maurer wrote:
> > > > like a 'merge' function? Seems the algorithm for checkpoint
> > > > recovery always uses the state from the node with the lowest
> > > > processor id?
> > > Yes that is right.
> > So if I have the following cluster:
> >
> > Part1: node2 node3 node4
> > Part2: node1
> >
> > Let's assume Part1 has been running for some time and has gathered
> > some state in checkpoints. Part2 is just the newly started node1. So
> > when node1 starts up, does the whole cluster use the empty checkpoint
> > from node1? (I guess I am confused somehow.)
> The checkpoint service will merge checkpoints from both partitions into
> one view, because both node1 and node2 send out their checkpoint state
> on a merge operation.

That doesn't make any sense; I can't believe that's how it works. The
resulting content would be complete nonsense.

Dave

___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] howto distribute data accross all nodes?
On Sat, Apr 18, 2009 at 07:49:12AM +0200, Dietmar Maurer wrote:
> > > like a 'merge' function? Seems the algorithm for checkpoint recovery
> > > always uses the state from the node with the lowest processor id?
> > Yes that is right.
> So if I have the following cluster:
>
> Part1: node2 node3 node4
> Part2: node1
>
> Let's assume Part1 has been running for some time and has gathered some
> state in checkpoints. Part2 is just the newly started node1. So when
> node1 starts up, does the whole cluster use the empty checkpoint from
> node1? (I guess I am confused somehow.)

It is *not* as simple as the node with the low nodeid. It is the node with
the low nodeid *where the state exists*. When selecting the node to send
state to the others, you obviously need to select among nodes that have
the state :-)

In the dlm_controld example I mentioned earlier, the function
set_plock_ckpt_node() picks the node that will save state in the ckpt:

	list_for_each_entry(memb, &cg->members, list) {
		if (!(memb->start_flags & DLM_MFLG_HAVEPLOCK))
			continue;
		if (!low || memb->nodeid < low)
			low = memb->nodeid;
	}

Only nodes that have state will have the DLM_MFLG_HAVEPLOCK flag set; new
nodes just added by a confchg will not have that flag set.

Dave
Re: [Openais] howto distribute data accross all nodes?
On Sat, Apr 18, 2009 at 03:55:57AM -0700, Steven Dake wrote:
> On Sat, 2009-04-18 at 12:47 +0200, Dietmar Maurer wrote:
> > At least the SA Forum does not mention such strange behavior. Isn't
> > that a serious bug?

Yes, I'd consider it a serious bug.

> > Consider 2 partitions with one checkpoint:
> >
> > Part1: CkptSections ABC
> > Part2: CkptSections BCD
> >
> > After the merge, you have: CkptSections ABCD
> >
> > And even worse, a section can contain data from different partitions
> > (old data mixed with new)? And there is no notification that such a
> > thing happens?

That ckpt behavior is nonsensical for most real applications, I'd wager.
I'm going to have to go check whether my apps are protected from that.

> The SA Forum doesn't consider at all how to handle partitions in a
> network, or at least not very suitably (it's up to the designer of the
> SA Forum services). They assume that applications will be using the AMF,
> and rely on the AMF functionality to reboot partitioned nodes (fencing)
> so this condition doesn't occur.

They don't consider it presumably because *it doesn't make any sense*.

> The SA Forum services were not designed with partitioned networks in
> mind. It is unfortunate, but it is what it is. If an app needs true
> consistency without some form of fencing, the app designer has to take
> partitions into consideration when designing their applications. This is
> why I recommend using CPG for these types of environments, because it
> provides better design control over exactly how data remerges.

If the SAF services don't specify what should happen when clusters with
divergent state are combined, then it probably means it should not happen,
and you should probably reject the unspecified behavior instead of making
something up.

Dave
Re: [Openais] howto distribute data accross all nodes?
> > like a 'merge' function? Seems the algorithm for checkpoint recovery
> > always uses the state from the node with the lowest processor id?
> Yes that is right.

So if I have the following cluster:

Part1: node2 node3 node4
Part2: node1

Let's assume Part1 has been running for some time and has gathered some
state in checkpoints. Part2 is just the newly started node1. So when node1
starts up, does the whole cluster use the empty checkpoint from node1?
(I guess I am confused somehow.)

- Dietmar
Re: [Openais] howto distribute data accross all nodes?
> > So when node1 starts up, does the whole cluster use the empty
> > checkpoint from node1? (I guess I am confused somehow.)
> The checkpoint service will merge checkpoints from both partitions into
> one view because both node1 and node2 send out their checkpoint state on
> a merge operation.

So does it use a 'merge' function, or does it always use the state from
the node with the lowest processor id? If there is a merge function, what
algorithm is used to merge the two states?

- Dietmar
Re: [Openais] howto distribute data accross all nodes?
On Sat, 2009-04-18 at 09:58 +0200, Dietmar Maurer wrote:
> So does it use a 'merge' function, or does it always use the state from
> the node with the lowest processor id? If there is a merge function,
> what algorithm is used to merge the two states?
>
> - Dietmar

An older version of the algorithm is described here:
http://www.openais.org/doku.php?id=dev:partition_recovery_checkpoint:checkpoint

It has been updated to deal with some race conditions, but the document is
pretty close. As you can see, designing the recovery state machine is
complicated.
Re: [Openais] howto distribute data accross all nodes?
> An older version of the algorithm is described here:
> http://www.openais.org/doku.php?id=dev:partition_recovery_checkpoint:checkpoint
>
> It has been updated to deal with some race conditions, but the document
> is pretty close. As you can see, designing the recovery state machine is
> complicated.

I still don't get the point. That algorithm computes something, yes. But
can it be described in words, or better, in C code?

	char *merge_checkpoint_data(char *cp_data_1, char *cp_data_2);

- Dietmar
Re: [Openais] howto distribute data accross all nodes?
On Sat, 2009-04-18 at 10:24 +0200, Dietmar Maurer wrote:
> Is there any guarantee that all sections inside a recovered checkpoint
> are from the same cluster partition (I can't see such a restriction in
> the algorithm)?
>
> - Dietmar

No. The algorithm will merge sections created in both partitions into the
single checkpoint.
Re: [Openais] howto distribute data accross all nodes?
On Sat, 2009-04-18 at 12:47 +0200, Dietmar Maurer wrote:
> > > Is there any guarantee that all sections inside a recovered
> > > checkpoint are from the same cluster partition (I can't see such a
> > > restriction in the algorithm)?
> > No. The algorithm will merge sections created in both partitions into
> > the single checkpoint.
> At least the SA Forum does not mention such strange behavior. Isn't that
> a serious bug?
>
> Consider 2 partitions with one checkpoint:
>
> Part1: CkptSections ABC
> Part2: CkptSections BCD
>
> After the merge, you have: CkptSections ABCD
>
> And even worse, a section can contain data from different partitions
> (old data mixed with new)? And there is no notification that such a
> thing happens?

The SA Forum doesn't consider at all how to handle partitions in a
network, or at least not very suitably (it's up to the designer of the SA
Forum services). They assume that applications will be using the AMF, and
rely on the AMF functionality to reboot partitioned nodes (fencing) so
this condition doesn't occur. The SA Forum services were not designed with
partitioned networks in mind. It is unfortunate, but it is what it is. If
an app needs true consistency without some form of fencing, the app
designer has to take partitions into consideration when designing the
application. This is why I recommend using CPG for these types of
environments, because it provides better design control over exactly how
data remerges.

Regards
-steve
Re: [Openais] howto distribute data accross all nodes?
> The SA Forum doesn't consider at all how to handle partitions in a
> network, or at least not very suitably. They assume that applications
> will be using the AMF, and rely on the AMF functionality to reboot
> partitioned nodes (fencing) so this condition doesn't occur. [...] This
> is why I recommend using CPG for these types of environments, because it
> provides better design control over exactly how data remerges.

On the other hand, it should be easy to fix the checkpoint algorithm. Or
at least document that behavior.

- Dietmar
Re: [Openais] howto distribute data accross all nodes?
On Sat, 2009-04-18 at 07:49 +0200, Dietmar Maurer wrote:
> > > like a 'merge' function? Seems the algorithm for checkpoint recovery
> > > always uses the state from the node with the lowest processor id?
> > Yes that is right.
> So if I have the following cluster:
>
> Part1: node2 node3 node4
> Part2: node1
>
> Let's assume Part1 has been running for some time and has gathered some
> state in checkpoints. Part2 is just the newly started node1. So when
> node1 starts up, does the whole cluster use the empty checkpoint from
> node1? (I guess I am confused somehow.)
>
> - Dietmar

The checkpoint service will merge checkpoints from both partitions into
one view, because both node1 and node2 send out their checkpoint state on
a merge operation.
Re: [Openais] howto distribute data accross all nodes?
> Check out: http://www.openais.org/doku.php?id=dev:paritition_recovery
>
> Instead of using the token callback method, you could write your own
> methodology for executing the state machine.

Ah, OK - I think that is what I already do. What I miss is something like
a 'merge' function? Seems the algorithm for checkpoint recovery always
uses the state from the node with the lowest processor id?

- Dietmar
Re: [Openais] howto distribute data accross all nodes?
On Thu, 2009-04-16 at 08:20 +0200, Dietmar Maurer wrote:
> > You might try taking a look at exec/sync.c; it is a synchronization
> > engine. Basically it takes configuration changes into account to call
> > sync_init, sync_process, sync_abort, or sync_activate. These 4 states
> > then activate the new data model. This could easily be done as an
> > addon API that uses CPG and would be a welcome addition. If you want
> > more details on how the state machine works, let me know.
> That code seems to directly access the totem protocol (create a new
> token). So I can't use such a thing with CPG? Or what can I learn from
> that code?
>
> - Dietmar

Check out: http://www.openais.org/doku.php?id=dev:paritition_recovery

Instead of using the token callback method, you could write your own
methodology for executing the state machine.

regards
-steve
Re: [Openais] howto distribute data accross all nodes?
On Tue, Apr 14, 2009 at 02:05:10PM +0200, Dietmar Maurer wrote:
> So CPG provides a framework to implement distributed finite state
> machines (DFSM). But there is no standard way to get the initial state
> of the DFSM. Almost all applications need to get the initial state, so I
> wonder if it would make sense to provide a service which solves that
> problem (at least as an example).
>
> My current solution is: I introduce a CPG mode, which is either:
>
> DFSM_MODE_SYNC ... CPG is syncing state; only state synchronization
> messages are allowed. Other messages are delayed/queued.
>
> DFSM_MODE_WORK ... state is synced across all members - normal
> operation. Queued messages are delivered when we reach this state.
>
> When a new node joins, CPG immediately changes mode to DFSM_MODE_SYNC.
> Then all members send their state. When a node has received the states
> of all members, it computes the new state by merging all received states
> (dfsm_state_merge_fn), and finally switches mode to DFSM_MODE_WORK.
>
> Does that make sense?

Yes. I'm not sure if a generic service for this would be used much or
not... maybe.

Dave
Re: [Openais] howto distribute data accross all nodes?
> > When a new node joins, CPG immediately changes mode to DFSM_MODE_SYNC.
> > Then all members send their state. When a node has received the states
> > of all members, it computes the new state by merging all received
> > states (dfsm_state_merge_fn), and finally switches mode to
> > DFSM_MODE_WORK. Does that make sense?
> Yes. I'm not sure if a generic service for this would be used much or
> not... maybe.

Btw, how large is an average checkpoint in dlm? Just wondering how much
data needs to be transferred.

- Dietmar
Re: [Openais] howto distribute data accross all nodes?
> > Btw, how large is an average checkpoint in dlm? Just wondering how
> > much data needs to be transferred.
> I've never measured, but it's a trivially small amount of data. Around
> 32 bytes of data per posix lock used on the fs (often zero if apps
> aren't using posix locks.)

OK, then performance is not an issue. I just took a quick look at the dlm
code, and I think that code could be simplified with such a generic
service (which would be tested by a larger user base)? Anyway, I will try
to write a prototype to see if it is useful.

- Dietmar
Re: [Openais] howto distribute data accross all nodes?
On Tue, 2009-04-14 at 14:05 +0200, Dietmar Maurer wrote:
> So CPG provides a framework to implement distributed finite state
> machines (DFSM). But there is no standard way to get the initial state
> of the DFSM. Almost all applications need to get the initial state, so I
> wonder if it would make sense to provide a service which solves that
> problem (at least as an example).
>
> My current solution is: I introduce a CPG mode, which is either:
>
> DFSM_MODE_SYNC ... CPG is syncing state; only state synchronization
> messages are allowed. Other messages are delayed/queued.
>
> DFSM_MODE_WORK ... state is synced across all members - normal
> operation. Queued messages are delivered when we reach this state.
>
> When a new node joins, CPG immediately changes mode to DFSM_MODE_SYNC.
> Then all members send their state. When a node has received the states
> of all members, it computes the new state by merging all received states
> (dfsm_state_merge_fn), and finally switches mode to DFSM_MODE_WORK.
>
> Does that make sense?

Cool idea. If it wasn't totally integrated with CPG but instead some
external service which people could use in addition to CPG, it would make
a great addition as a service engine or C library using CPG.

Regards
-steve
Re: [Openais] howto distribute data accross all nodes?
> You might try taking a look at exec/sync.c; it is a synchronization
> engine. Basically it takes configuration changes into account to call
> sync_init, sync_process, sync_abort, or sync_activate. These 4 states
> then activate the new data model. This could easily be done as an addon
> API that uses CPG and would be a welcome addition. If you want more
> details on how the state machine works, let me know.

Yes, please tell me more.

- Dietmar
Re: [Openais] howto distribute data accross all nodes?
> Cool idea if it wasn't totally integrated with CPG but instead some
> external service which people could use in addition to CPG.

That's the plan. My current problem is the state merge function. There
should be a standard way to merge state (when the user does not want to
provide his own merge function). I thought the following is possible:

- always use the state from the lowest node id
- compare states and do voting?
- timestamp based selection?
- always use the state from the oldest member (lifetime is the time
  difference since the process started)
- use the totem ring_id/token_seq somehow?
- combinations of the above merge functions?

Any other suggestions?

- Dietmar
Re: [Openais] howto distribute data accross all nodes?
> What I recommend here is to place your local node id in the message
> contents (retrieved via cpg_local_get) and then compare that nodeid to
> incoming messages.

Why do you include the local node id in the message? I can compare the
local node id with the sending node id without that, for example:

	...
	cpg_local_get(cpg_handle, &local_nodeid);
	...

	static void cpg_deliver_callback(
		cpg_handle_t handle,
		struct cpg_name *groupName,
		uint32_t nodeid,
		uint32_t pid,
		void *msg,
		int msg_len)
	{
		...
		if (nodeid == local_nodeid)
			...
	}

Or do I miss something?

- Dietmar
Re: [Openais] howto distribute data accross all nodes?
On Thu, 2009-04-09 at 10:19 +0200, Dietmar Maurer wrote:
> > What I recommend here is to place your local node id in the message
> > contents (retrieved via cpg_local_get) and then compare that nodeid
> > to incoming messages.
> Why do you include the local node id in the message? I can compare the
> local node id with the sending node id without that. [...] Or do I miss
> something?

Sorry, I forgot that was in the callback parameters. You should be good to
go with the method proposed in your code. Lots of code, hard to remember
all the interfaces :)

Regards
-steve
Re: [Openais] howto distribute data accross all nodes?
> need for locks. An example of why not is creation of a resource called
> datasetA. 3 nodes:
>
> node A sends create datasetA
> node B sends create datasetA
> node C sends create datasetA
>
> Only one of those nodes' create messages will arrive first. The
> remainder will arrive second and third. Also, vs requires that each node
> sees deliveries in the same order, so it may be something like, on all
> nodes: B received, C received, A received. In this case, B creates the
> dataset, C says dataset exists, A says dataset exists. All nodes see
> this same ordering.

But how does a node get its initial state?? When a node joins, it does not
know the state of the other nodes, but it receives state change messages
from other nodes. A distributed state machine only works if a node knows
the state before joining the group. Is there a 'standard' solution to that
problem?

- Dietmar
Re: [Openais] howto distribute data accross all nodes?
> 1. Have an old cpg member (e.g. the one with the lowest nodeid) send
> messages containing the state to the new node after it's joined. These
> sync messages are separate from the messages used to read/write the
> replicated state during normal operation.

This is not bullet proof. State can change while the joining node is not
100% synced, resulting in endless sync messages?

> 2. Have an old cpg member write all the state to a checkpoint (see
> saCkpt) when a node joins. It sends a message to the new node when it's
> done writing, indicating that the checkpoint is ready, and the new node
> then reads the state from the checkpoint.

Same problem as above.

> There are probably other ways of doing this as well. If new, normal
> read/write messages to the replicated state continue while the new node
> is syncing the pre-existing state, the new node needs to save those
> operations to apply after it's synced.

Ah, that probably works. But it can lead to very high memory usage if
traffic is high. Is somebody really using that? If so, is there some code
available (for save/replay)?

- Dietmar
Re: [Openais] howto distribute data accross all nodes?
On Thu, Apr 09, 2009 at 09:00:08PM +0200, Dietmar Maurer wrote:
> > If new, normal read/write messages to the replicated state continue
> > while the new node is syncing the pre-existing state, the new node
> > needs to save those operations to apply after it's synced.
> Ah, that probably works. But it can lead to very high memory usage if
> traffic is high.

If that's a problem you could block normal activity during the sync
period.

> Is somebody really using that? If so, is there some code available (for
> save/replay)?

There is no general purpose code. dlm_controld is an example of a program
doing something like this: http://git.fedorahosted.org/git/dlm.git

It uses cpg to replicate the state of posix locks, uses checkpoints to
sync existing lock state to new nodes, and saves messages on a new node
until it has completed syncing (i.e. reading the pre-existing state from
the checkpoint.)

Dave
Re: [Openais] howto distribute data accross all nodes?
> > Ah, that probably works. But it can lead to very high memory usage if
> > traffic is high.
> If that's a problem you could block normal activity during the sync
> period.

wow. that 'virtual synchrony' sounds nice at first, but gets incredibly
complex soon ;-)

> > Is somebody really using that? If so, is there some code available
> > (for save/replay)?
> There is no general purpose code. dlm_controld is an example of a
> program doing something like this: http://git.fedorahosted.org/git/dlm.git
> It uses cpg to replicate the state of posix locks, uses checkpoints to
> sync existing lock state to new nodes, and saves messages on a new node
> until it has completed syncing (i.e. reading the pre-existing state from
> the checkpoint.)

Thanks for that link,

Dietmar
Re: [Openais] howto distribute data accross all nodes?
On Wed, 2009-04-08 at 11:02 +0200, Dietmar Maurer wrote:
> Hi all,
>
> each node in my cluster has some data which should be available to all
> nodes. I also want to update that data in a consistent/atomic way:
>
> 1.) acquire global lock
> 2.) read/update/write data
> 3.) release lock

Virtual synchrony is tailor made for this problem. Send messages via CPG.
On delivery of the message, take some action to read, update, or write the
data in your internal data structures. Do not do the updating/reading/
writing on the transmit of the message, but only on _delivery_. This
removes the need for a lock entirely, since the entire update can be one
atomic message and it will be ordered according to virtual synchrony
semantics. Since every node sees every message in the same order, there is
never a need for locks.

An example of why not: creation of a resource called datasetA. 3 nodes:

node A sends create datasetA
node B sends create datasetA
node C sends create datasetA

Only one of those nodes' create messages will arrive first. The remainder
will arrive second and third. Also, vs requires that each node sees
deliveries in the same order, so it may be something like, on all nodes:
B received, C received, A received. In this case, B creates the dataset,
C says dataset exists, A says dataset exists. All nodes see this same
ordering.

The only reason to use a lock service in practice is when there are no
ordering guarantees in the transmit medium of the system.

Hope that helps

Regards
-steve

> What is the best way to implement that? For locking I can use the lock
> service. But what's the best/simplest way to distribute the node data to
> all members? I tried using CPG, but there is no synchronous message
> delivery available - which makes the above read/update/write impossible?

Just do all updates/actions/etc. on _delivery_ instead of on transmit.

> Another option is using CKPT, but CKPT does not have any change
> notification? Do I need to combine that with CPG? Maybe there is some
> example code around where I can see how to implement such a system? Or
> some other documentation?
>
> Many thanks for your help,
>
> - Dietmar