See my comments inline, marked [AndersW3]. regards, Anders Widell
On 02/24/2016 07:32 AM, Nhat Pham wrote:
>
> Hi Mahesh and Anders,
>
> Please see my comments below.
>
> Best regards,
>
> Nhat Pham
>
> *From:* A V Mahesh [mailto:[email protected]]
> *Sent:* Wednesday, February 24, 2016 11:06 AM
> *To:* Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
> *Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi Nhat Pham,
>
> If component (CPND) restart is allowed while the controllers are absent,
> then before asking CLM to change its return value to SA_AIS_ERR_TRY_AGAIN
> we need clarification from the AMF team on a few things. If CPND is stuck
> on SA_AIS_ERR_TRY_AGAIN and the component restart times out, AMF will
> restart the component again (this becomes cyclic), and once the configured
> saAmfSGCompRestartMax value is reached the node goes for a reboot as the
> next level of escalation. In that case we may need changes in AMF as well,
> so that it does not act on a component restart timeout while the
> controllers are absent (I am not sure whether that would be a deviation
> from the AMF specification).
>
> [Nhat Pham] In headless state, I'm not sure about this either.
>
> @Anders: Would you have comments about this?
>
[AndersW3] Ok, first of all I would like to point out that normally, the OpenSAF checkpoint node director should not crash. So we are talking about a situation where multiple faults have occurred: first both the active and the standby system controllers have died, and then shortly afterwards, before we have a new active system controller, the checkpoint node director also crashes. Sure, these may not be totally independent events, but that is still a lot of faults within a short period of time. We should test the node director and make sure it doesn't crash in this type of scenario.
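To make the restart escalation that Mahesh describes above a bit more concrete, here is a minimal sketch in C. It is not OpenSAF code: every demo_* name is hypothetical, the numeric error values are assumed to match SaAisErrorT in saAis.h (31 is the value seen in the "clm init failed" log quoted further down), and the attempt budget merely stands in for the AMF component registration timer.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the SaAisErrorT codes discussed in this thread
 * (assumed values; the real definitions live in saAis.h). */
typedef enum {
    DEMO_OK = 1,
    DEMO_ERR_TRY_AGAIN = 6,
    DEMO_ERR_UNAVAILABLE = 31 /* what the quoted log reports today */
} demo_err_t;

/* Hypothetical model of a cluster that is headless for a while. */
typedef struct {
    unsigned calls_until_sc_active; /* calls that still hit the headless state */
    unsigned calls_made;            /* how many initialize attempts were made */
} demo_cluster_t;

/* Pretend "initialize" call: TRY_AGAIN while headless, OK afterwards. */
demo_err_t demo_clm_initialize(demo_cluster_t *c)
{
    assert(c != NULL);
    c->calls_made++;
    return (c->calls_made <= c->calls_until_sc_active) ? DEMO_ERR_TRY_AGAIN
                                                       : DEMO_OK;
}

/* Retry only on TRY_AGAIN, bounded by max_attempts. The bound models the
 * registration timer: if it runs out while the result is still TRY_AGAIN,
 * the caller has not registered in time and the restart cycle begins. */
demo_err_t demo_init_with_retry(demo_cluster_t *c, unsigned max_attempts)
{
    demo_err_t rc = DEMO_ERR_TRY_AGAIN;
    for (unsigned i = 0; i < max_attempts && rc == DEMO_ERR_TRY_AGAIN; i++) {
        rc = demo_clm_initialize(c);
        /* A real caller would sleep between attempts. */
    }
    return rc;
}
```

If an SC becomes active within the budget, the retry succeeds; otherwise the function still returns TRY_AGAIN when the budget is exhausted, which is the point at which a registration timeout would hand the problem over to the escalation policy.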
Now, let's consider the case where a fault in the node director causes it to crash during the headless state. The general philosophy of the headless feature is that when things work fine, i.e. in the absence of faults, we should be able to continue running while the system controllers are absent. However, if a fault happens during the headless state, we may not be able to recover from it until there is an active system controller. AMF does provide support for restarting components, but as you have pointed out, the node director will be stuck in a TRY_AGAIN loop immediately after it has been restarted. So if the node director crashes during the headless state, we have lost the checkpoint functionality on that node and will not get it back until there is an active system controller. Other services like IMM will still work for a while, but AMF will, as you say, eventually escalate the checkpoint node director failure to a node restart, and then the whole node is gone. The node will not come back until we have an active system controller. So to summarize: there is very limited support for recovering from faults that happen during the headless state. Full recovery will not happen until we have an active system controller.
>
> Please do incorporate the current comments (from a design perspective) and
> republish the patch. I will re-test the V3 patch and provide review
> comments on functional issues/bugs if I find any.
>
> One important note: in the new patch, let us not add the complexity of
> allowing non-collocated checkpoint creation and then documenting that
> non-collocated checkpoint replicas are recoverable in some scenarios,
> because a replica is USER private data (not OpenSAF state), and losing
> USER private data is not acceptable.
> So let us keep the scope of the CPSV service as: non-collocated checkpoint
> creation is NOT_SUPPORTED if the cluster is running with
> IMMSV_SC_ABSENCE_ALLOWED (the headless state configuration is enabled at
> cluster startup and is currently not configurable afterwards, so there is
> no chance of a run-time configuration change).
>
> We can provide support for non-collocated checkpoints in subsequent
> enhancements, with a solution such as also creating a replica on the PL
> with the lower node ID (max three replicas in the cluster regardless of
> where the non-collocated checkpoint is opened).
>
> So for now, regardless of whether the heads (SCs) are present or not,
> CPSV should return SA_AIS_ERR_NOT_SUPPORTED in an
> IMMSV_SC_ABSENCE_ALLOWED enabled cluster,
> and let us document it as well.
>
> [Nhat Pham] The purpose of the patch is to limit the loss of replicas and
> checkpoints in case of headless state.
>
> If both replicas are located on SCs and they reboot, losing the
> checkpoint after headless state is unpreventable with the current design.
>
> Even if we implement the proposal "max three replicas in the cluster
> regardless of where the non-collocated checkpoint is opened", there is
> still a case where the checkpoint is lost, e.g. the SCs and the PL which
> hosts the replica reboot at the same time.
>
> If IMMSV_SC_ABSENCE_ALLOWED is disabled and both SCs reboot, the whole
> cluster reboots. Then the checkpoint is lost.
>
> What I mean is that there are cases where the checkpoint is lost. The
> point is what we can do to limit the loss of data.
>
> As for the proposal to reject creation of non-collocated checkpoints when
> IMMSV_SC_ABSENCE_ALLOWED is enabled, I think this will lead to a
> compatibility problem.
>
> @Anders: What do you think about rejecting creation of non-collocated
> checkpoints when IMMSV_SC_ABSENCE_ALLOWED is enabled?
>
[AndersW3] No, I think we ought to support non-colocated checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set.
The fact that we have "system controllers" is an implementation detail of OpenSAF. I don't think the CKPT SAF specification implies that non-colocated checkpoints must be fully replicated on all the nodes in the cluster, and thus we must allow for the possibility that all replicas are lost. It is not clear exactly what to expect from the APIs when this happens, but you could handle it in a similar way to the case where all sections have been automatically deleted by the checkpoint service because they have expired.
>
> -AVM
>
> On 2/24/2016 6:51 AM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> Do you have any further comments?
>
> Best regards,
>
> Nhat Pham
>
> *From:* A V Mahesh [mailto:[email protected]]
> *Sent:* Monday, February 22, 2016 10:37 AM
> *To:* Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
> *Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi,
>
> >>BTW, have you finished the review and test?
>
> I will finish by today.
>
> -AVM
>
> On 2/22/2016 7:48 AM, Nhat Pham wrote:
>
> Hi Mahesh and Anders,
>
> Please see my comment below.
>
> BTW, have you finished the review and test?
> > Best regards, > > Nhat Pham > > *From:* A V Mahesh [mailto:[email protected]] > *Sent:* Friday, February 19, 2016 2:28 PM > *To:* Nhat Pham <[email protected]> > <mailto:[email protected]>; 'Anders Widell' > <[email protected]> > <mailto:[email protected]>; 'Minh Chau H' > <[email protected]> <mailto:[email protected]> > *Cc:* [email protected] > <mailto:[email protected]>; 'Beatriz > Brandao' <[email protected]> > <mailto:[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support > preserving and recovering checkpoint replicas during headless > state V2 [#1621] > > Hi Nhat Pham, > > On 2/19/2016 12:28 PM, Nhat Pham wrote: > > Could you please give more detailed information about > steps to reproduce the problem below? Thanks. > > > Don't see this as specific bug , we need to see the issue as > CLM integrated service point of view , > by considering Anders Widell explication about CLM > application behavior during headless state > we need to reintegrate CPND with CLM ( before this headless > state feature no case of CPND existence in the obscene of > CLMD , but now it is ). > > And this will be the consistent across the all services who > integrated with CLM ( you may need some changes in CLM also ) > > */[Nhat Pham] I think CLM should return /*SA_AIS_ERR_TRY_AGAIN > in this case. > > @Anders. How would you think? > > To start with let us consider case CPND on payload restarted > on PL during headless state > and an application is in running on PL. > > */[Nhat Pham] Regarding the CPND as CLM application, I’m not > sure what it can do in this case. In case it restarts, it is > monitored by AMF./* > > */If it blocks for too long, AMF will also trigger a node > reboot./* > > */In my test case, the CPND get blocked by CLM. It doesn’t get > out of the saClmInitialize. 
How do you get the “/ER cpnd clm > init failed with return value:31/”?/* > > */Following is the cpnd trace./* > > Feb 22 8:56:41.188122 osafckptnd [736:cpnd_init.c:0183] >> > cpnd_lib_init > > Feb 22 8:56:41.188332 osafckptnd [736:cpnd_init.c:0412] >> > cpnd_cb_db_init > > Feb 22 8:56:41.188600 osafckptnd [736:cpnd_init.c:0437] << > cpnd_cb_db_init > > Feb 22 8:56:41.188778 osafckptnd [736:clma_api.c:0503] >> > saClmInitialize > > Feb 22 8:56:41.188945 osafckptnd [736:clma_api.c:0593] >> > clmainitialize > > Feb 22 8:56:41.190052 osafckptnd [736:clma_util.c:0100] >> > clma_startup: clma_use_count: 0 > > Feb 22 8:56:41.190273 osafckptnd [736:clma_mds.c:1124] >> > clma_mds_init > > Feb 22 8:56:41.190825 osafckptnd [736:clma_mds.c:1170] << > clma_mds_init > > -AVM > > On 2/19/2016 12:28 PM, Nhat Pham wrote: > > Hi Mahesh, > > Could you please give more detailed information about > steps to reproduce the problem below? Thanks. > > Best regards, > > Nhat Pham > > *From:* A V Mahesh [mailto:[email protected]] > *Sent:* Friday, February 19, 2016 1:06 PM > *To:* Anders Widell <[email protected]> > <mailto:[email protected]>; Nhat Pham > <[email protected]> > <mailto:[email protected]>; 'Minh Chau H' > <[email protected]> <mailto:[email protected]> > *Cc:* [email protected] > <mailto:[email protected]>; 'Beatriz > Brandao' <[email protected]> > <mailto:[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: > Support preserving and recovering checkpoint replicas > during headless state V2 [#1621] > > Hi Anders Widell, > Thanks for the detailed explanation about CLM during > headless state. 
> > HI Nhat Pham , > > Comment : 3 > Please see below the problem I was interpreted now I > seeing it during CLMD obscene ( during headless state ), > so now CPND/CLMA need to to address below case , > currently cpnd clm init failed with return value: > SA_AIS_ERR_UNAVAILABLE > but should be SA_AIS_ERR_TRY_AGAIN > > ================================================== > Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO NODE STATE-> > IMM_NODE_FULLY_AVAILABLE 17418 > Feb 19 11:18:28 PL-4 osafimmloadd: NO Sync ending normally > Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Epoch set to 9 in > ImmModel > Feb 19 11:18:28 PL-4 cpsv_app: IN Received PROC_STALE_CLIENTS > Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer > connected: 42 (MsgQueueService132111) <108, 2040f> > Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer > connected: 43 (MsgQueueService131855) <0, 2030f> > Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer > connected: 44 (safLogService) <0, 2010f> > Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO SERVER STATE: > IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY > Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer > connected: 45 (safClmService) <0, 2010f> > *Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd clm init > failed with return value:31 > Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd init failed > Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd_lib_req FAILED > Feb 19 11:18:28 PL-4 osafckptnd[7718]: __init_cpnd() failed* > Feb 19 11:18:28 PL-4 osafclmna[5432]: NO > safNode=PL-4,safCluster=myClmCluster Joined cluster, > nodeid=2040f > Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO AVD NEW_ACTIVE, > adest:1 > Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO Sending node up > due to NCSMDS_NEW_ACTIVE > Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SISU states sent > Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SU states sent > Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 CSICOMP states > synced > Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 SU states sent > Feb 19 11:18:28 PL-4 
osafimmnd[5422]: NO Implementer > connected: 46 (safAmfService) <0, 2010f> > Feb 19 11:18:30 PL-4 osafamfnd[5441]: NO > 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Component or SU > restart probation timer expired > Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Instantiation of > 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' failed > Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Reason: component > registration timer expired > Feb 19 11:18:35 PL-4 osafamfnd[5441]: WA > 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' > Presence State RESTARTING => INSTANTIATION_FAILED > Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Component > Failover trigerred for > 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF': Failed component: > 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' > Feb 19 11:18:35 PL-4 osafamfnd[5441]: ER > 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'got > Inst failed > Feb 19 11:18:35 PL-4 osafamfnd[5441]: Rebooting OpenSAF > NodeId = 132111 EE Name = , Reason: NCS component > Instantiation failed, OwnNodeId = 132111, SupervisionTime = 60 > Feb 19 11:18:36 PL-4 opensaf_reboot: Rebooting local node; > timeout=60 > Feb 19 11:18:39 PL-4 kernel: [ 4877.338518] md: stopping > all md devices. > ================================================== > > -AVM > > On 2/15/2016 5:11 PM, Anders Widell wrote: > > Hi! > > Please find my answer inline, marked [AndersW]. > > regards, > Anders Widell > > On 02/15/2016 10:38 AM, Nhat Pham wrote: > > Hi Mahesh, > > It's good. Thank you. :) > > [AVM] Up on rejoining of the SC`s The replica > should be re-created regardless > of another application opens it on PL4. > ( Note : this comment is based on > your explanation have not yet > reviewed/tested , > currently i am struggling with > SC`s not rejoining > after headless state , i can provide you more on > this once i complte my > review/testing) > > [Nhat] To make cloud resilience works, you need > the patches from other > services (log, amf, clm, ntf). 
> @Minh: I heard that you created tar file which > includes all patches. Could you > please send it to Mahesh? Thanks > > [AVM] I understand that , before I comment more on > this please allow me to > understand > I am not still not very clear of the > headless design in detail. > For example cluster membership of > PL`s during headless state , > In the absence of SC`s (CLMD) > dose the PLs is considered as > cluster nodes or not (cluster membership) ? > > [Nhat] I don't know much about this. > @ Anders: Could you please have comment about > this? Thanks > > [AndersW] First of all, keep in mind that the > "headless" state should ideally not last a very long > time. Once we have the spare SC feature in place > (ticket [#79]), a new SC should become active within a > matter of a few seconds after we have lost both the > active and the standby SC. > > I think you should view the state of the cluster in > the headless state in the same way as you view the > state of the cluster during a failover between the > active and the standby SC. Imagine that the active SC > dies. It takes the standby SC 1.5 seconds to detect > the failure of the active SC (this is due to the TIPC > timeout). If you have configured the > PROMOTE_ACTIVE_TIMER, there is an additional delay > before the standby takes over as active. What is the > state of the cluster during the time after the active > SC failed and before the standby takes over? > > The state of the cluster while it is headless is very > similar. The difference is that this state may last a > little bit longer (though not more than a few seconds, > until one of the spare SCs becomes active). Another > difference is that we may have lost some state. With a > "perfect" implementation of the headless feature we > should not lose any state at all, but with the current > set of patches we do lose state. 
>
> So specifically if we talk about cluster membership
> and ask the question: is a particular PL a member of
> the cluster or not during the headless state? Well, if
> you ask CLM about this during the headless state, then
> you will not know - because CLM doesn't provide any
> service during the headless state. If you keep
> retrying your query to CLM, you will eventually get an
> answer - but you will not get this answer until there
> is an active SC again and we have exited the headless
> state. When viewed in this way, the answer to the
> question about a node's membership is undefined during
> the headless state, since CLM will not provide you
> with any answer until there is an active SC.
>
> However, if you asked CLM about the node's cluster
> membership status before the cluster went headless,
> you probably saved a cached copy of the cluster
> membership state. Maybe you also installed a CLM track
> callback and intend to update this cached copy every
> time the cluster membership status changes. The
> question then is: can you continue using this cached
> copy of the cluster membership state during the
> headless state? The answer is YES: since CLM doesn't
> provide any service during the headless state, it also
> means that the cluster membership view cannot change
> during this time. Nodes can of course reboot or die,
> but CLM will not notice and hence the cluster view
> will not be updated. You can argue that this is bad
> because the cluster view doesn't reflect reality, but
> notice that this will always be the case. We can never
> propagate information instantaneously, and detection
> of node failures will take 1.5 seconds due to the TIPC
> timeout. You can never be sure that a node is alive at
> this very moment just because CLM tells you that it is
> a member of the cluster.
If we are unfortunate enough > to lose both system controller nodes simultaneously, > updates to the cluster membership view will be delayed > a few seconds longer than usual. > > > Best regards, > Nhat Pham > > -----Original Message----- > From: A V Mahesh [mailto:[email protected]] > Sent: Monday, February 15, 2016 11:19 AM > To: Nhat Pham <[email protected]> > <mailto:[email protected]>; > [email protected] > <mailto:[email protected]> > Cc: [email protected] > <mailto:[email protected]>; > 'Beatriz Brandao' > <[email protected]> > <mailto:[email protected]> > Subject: Re: [PATCH 0 of 1] Review Request for > cpsv: Support preserving and > recovering checkpoint replicas during headless > state V2 [#1621] > > Hi Nhat Pham, > > How is your holiday went > > Please find my comments below > > On 2/15/2016 8:43 AM, Nhat Pham wrote: > > Hi Mahesh, > > For the comment 1, the patch will be updated > accordingly. > > [AVM] Please hold , I will provide more comments > in this week , so we can > have consolidated V3 > > For the comment 2, I think the CKPT service > will not be backward > compatible if the scAbsenceAllowed is true. > The client can't create non-collocated > checkpoint on SCs. > > Furthermore, this solution only protects the > CKPT service from the > case "The non-collocated checkpoint is > created on a SC" > there are still the cases where the replicas > are completely lost. Ex: > > - The non-collocated checkpoint created on a > PL. The PL reboots. Both > replicas now locate on SCs. Then, headless > state happens. All replicas are > lost. > - The non-collocated checkpoint has active > replica locating on a PL > and this PL restarts during headless state > - The non-collocated checkpoint is created on > PL3. This checkpoint is > also opened on PL4. Then SCs and PL3 reboot. > > [AVM] Up on rejoining of the SC`s The replica > should be re-created regardless > of another application opens it on PL4. 
> ( Note : this comment is based on > your explanation have not yet > reviewed/tested , > currently i am struggling with > SC`s not rejoining > after headless state , i can provide you more on > this once i complte my > review/testing) > > In this case, all replicas are lost and the > client has to create it again. > > In case multiple nodes (which including SCs) > reboot, losing replicas > is unpreventable. The patch is to recover the > checkpoints in possible cases. > How do you think? > > [AVM] I understand that , before I comment more on > this please allow > me to understand > I am not still not very clear of the > headless design in detail. > > For example cluster membership of > PL`s during headless > state , > In the absence of SC`s (CLMD) > dose the PLs is considered as > cluster nodes or not (cluster membership) ? > > - if not consider as NON > cluster nodes Checkpoint Service > API should leverage the SA Forum Cluster > Membership Service and > API's can fail with > SA_AIS_ERR_UNAVAILABLE > > - if considers as cluster > nodes we need to follow all the > defined rules which are defined in > SAI-AIS-CKPT-B.02.02 specification > > so give me some more time to review > it completely , so that we > can have consolidated patch V3 > > -AVM > > Best regards, > Nhat Pham > > -----Original Message----- > From: A V Mahesh [mailto:[email protected]] > Sent: Friday, February 12, 2016 11:10 AM > To: Nhat Pham <[email protected]> > <mailto:[email protected]>; > [email protected] > <mailto:[email protected]> > Cc: [email protected] > <mailto:[email protected]>; > Beatriz Brandao > <[email protected]> > <mailto:[email protected]> > Subject: Re: [PATCH 0 of 1] Review Request for > cpsv: Support > preserving and recovering checkpoint replicas > during headless state V2 > [#1621] > > > Comment 2 : > > After incorporating the comment one all the > Limitations should be > prevented based on Hydra configuration is > enabled in IMM status. 
> > Foe example : if some application is trying > to create > > non-collocated checkpoint active replica > getting generated/locating on > SC then ,regardless of the heads (SC`s) status > exist not exist should > return SA_AIS_ERR_NOT_SUPPORTED > > In other words, rather that allowing to > created non-collocated > checkpoint when > heads(SC`s) are exit , and non-collocated > checkpoint getting > unrecoverable after heads(SC`s) rejoins. > > > ====================================================================== > > ======================= > > Limitation: The CKPT service doesn't > support recovering checkpoints in > following cases: > . The checkpoint which is unlinked > before headless. > . The non-collocated checkpoint has > active replica locating on SC. > . The non-collocated checkpoint has > active replica locating on a PL > and this PL > restarts during headless state. In > this cases, the checkpoint replica is > destroyed. The fault code > SA_AIS_ERR_BAD_HANDLE is returned when the > client > accesses the checkpoint in these > cases. The client must re-open the > checkpoint. > > > ====================================================================== > > ======================= > > -AVM > > > On 2/11/2016 12:52 PM, A V Mahesh wrote: > > Hi, > > I jut starred reviewing patch , I will be > giving comments as soon as > I crossover any , to save some time. > > Comment 1 : > This functionality should be under checks > if Hydra configuration is > enabled in IMM attrName = > const_cast<SaImmAttrNameT>("scAbsenceAllowed") > > > Please see example how LOG/AMF services > implemented it. > > -AVM > > > On 1/29/2016 1:02 PM, Nhat Pham wrote: > > Hi Mahesh, > > As described in the README, the CKPT > service returns > SA_AIS_ERR_TRY_AGAIN fault code in > this case. > I guess it's same for other services. > > @Anders: Could you please confirm this? 
> > Best regards, > Nhat Pham > > -----Original Message----- > From: A V Mahesh > [mailto:[email protected]] > Sent: Friday, January 29, 2016 2:11 PM > To: Nhat Pham > <[email protected]> > <mailto:[email protected]>; > [email protected] > <mailto:[email protected]> > Cc: > [email protected] > <mailto:[email protected]> > > Subject: Re: [PATCH 0 of 1] Review > Request for cpsv: Support > preserving and recovering checkpoint > replicas during headless state > V2 [#1621] > > Hi, > > On 1/29/2016 11:45 AM, Nhat Pham wrote: > > - The behavior of > application will be consistent > with other > saf services like imm/amf behavior > during headless state. > [Nhat] I'm not clear what you mean > about "consistent"? > > In the obscene of Director (SC's) , > what is expected return values > of SAF API should ( all services ) , > which are not in aposition to > provide service at that moment. > > I think all services should return > same SAF ERRS., I thinks > currently we don't have it , may be > Anders Widel will help us. > > -AVM > > > On 1/29/2016 11:45 AM, Nhat Pham wrote: > > Hi Mahesh, > > Please see the attachment for the > README. Let me know if there is > any more information required. > > Regarding your comments: > - during headless state > applications may behave like during > CPND restart case [Nhat] Headless > state and CPND restart are > different events. Thus, the > behavior is different. > Headless state is a case where > both SCs go down. > > - The behavior of > application will be consistent > with other > saf services like imm/amf behavior > during headless state. > [Nhat] I'm not clear what you mean > about "consistent"? 
> > Best regards, > Nhat Pham > > -----Original Message----- > From: A V Mahesh > [mailto:[email protected]] > Sent: Friday, January 29, 2016 > 11:12 AM > To: Nhat Pham > <[email protected]> > <mailto:[email protected]>; > [email protected] > <mailto:[email protected]> > Cc: > [email protected] > <mailto:[email protected]> > > Subject: Re: [PATCH 0 of 1] Review > Request for cpsv: Support > preserving and recovering > checkpoint replicas during > headless state > V2 [#1621] > > Hi Nhat Pham, > > I stared reviewing this patch , so > can please provide README file > with scope and limitations , that > will help to define > testing/reviewing scope . > > Following are minimum things we > can keep in mind while > reviewing/accepting patch , > > - Not effecting existing > functionality > - during headless state > applications may behave like during > CPND restart case > - The minimum functionally > of application works > - The behavior of > application will be consistent with > other saf services like > imm/amf behavior during headless > state. > > So please do provide any > additional detailed in README if > any of > the above is deviated , that allow > users to know about the > limitations/deviation. 
> > -AVM > > On 1/4/2016 3:15 PM, Nhat Pham wrote: > > Summary: cpsv: Support > preserving and recovering > checkpoint > replicas during headless state > [#1621] Review request for Trac > Ticket(s): > #1621 Peer Reviewer(s): > [email protected] > <mailto:[email protected]>; > [email protected] > <mailto:[email protected]> > Pull request to: > [email protected] > <mailto:[email protected]> > Affected branch(es): default > Development > branch: default > > -------------------------------- > Impacted area Impact y/n > -------------------------------- > Docs n > Build system n > RPM/packaging n > Configuration files n > Startup scripts n > SAF services y > OpenSAF services n > Core libraries n > Samples n > Tests n > Other n > > > Comments (indicate scope for > each "y" above): > > --------------------------------------------- > > > changeset > > faec4a4445a4c23e8f630857b19aabb43b5af18d > > Author: Nhat Pham > <[email protected]> > <mailto:[email protected]> > Date: Mon, 04 Jan 2016 > 16:34:33 +0700 > > cpsv: Support preserving > and recovering checkpoint > replicas > during headless state [#1621] > > Background: > ---------- This > enhancement supports to > preserve checkpoint > replicas > > in case > > both SCs down (headless > state) and recover replicas in > case > one of > > SCs up > > again. If both SCs goes > down, checkpoint replicas on > surviving nodes > > still > > remain. When a SC is > available again, surviving > replicas are > > automatically > > registered to the SC > checkpoint database. Content in > surviving > > replicas are > > intacted and > synchronized to new replicas. > > When no SC is available, > client API calls changing > checkpoint > > configuration > > which requires SC > communication, are rejected. > Client API > calls > > reading and > > writing existing > checkpoint replicas still work. 
> > Limitation: The CKPT > service does not support > recovering > checkpoints > > in > > following cases: > - The checkpoint which > is unlinked before headless. > - The non-collocated > checkpoint has active replica > locating > on SC. > - The non-collocated > checkpoint has active replica > locating > on a PL > > and this > > PL restarts during > headless state. In this cases, > the > checkpoint > > replica is > > destroyed. The fault > code SA_AIS_ERR_BAD_HANDLE is > returned > when the > > client > > accesses the checkpoint > in these cases. The client must > re-open the > checkpoint. > > While in headless state, > accessing checkpoint replicas > does > not work > > if the > > node which hosts the > active replica goes down. It > will back > working > > when a > > SC available again. > > Solution: > --------- The solution > for this enhancement includes > 2 parts: > > 1. To destroy > un-recoverable checkpoint > described above when > both > > SCs are > > down: When both SCs are > down, the CPND deletes > un-recoverable > > checkpoint > > nodes and replicas on > PLs. Then it requests CPA to > destroy > > corresponding > > checkpoint node by using > new message > CPA_EVT_ND2A_CKPT_DESTROY > > 2. To update CPD with > checkpoint information When an > active > SC is up > > after > > headless, CPND will > update CPD with checkpoint > information by > using > > new > > message > CPD_EVT_ND2D_CKPT_INFO_UPDATE > instead of using > > CPD_EVT_ND2D_CKPT_CREATE. This > is because the CPND will > create new > > ckpt_id > > for the checkpoint which > might be different with the > current > ckpt id > > if the > > CPD_EVT_ND2D_CKPT_CREATE is > used. The CPD collects checkpoint > > information > > within 6s. 
> > During this update period, the following requests are rejected with
> > fault code SA_AIS_ERR_TRY_AGAIN:
> > - CPD_EVT_ND2D_CKPT_CREATE
> > - CPD_EVT_ND2D_CKPT_UNLINK
> > - CPD_EVT_ND2D_ACTIVE_SET
> > - CPD_EVT_ND2D_CKPT_RDSET
> >
> > Complete diffstat:
> > ------------------
> >  osaf/libs/agents/saf/cpa/cpa_proc.c        |   52
> >  osaf/libs/common/cpsv/cpsv_edu.c           |   43
> >  osaf/libs/common/cpsv/include/cpd_cb.h     |    3
> >  osaf/libs/common/cpsv/include/cpd_imm.h    |    1
> >  osaf/libs/common/cpsv/include/cpd_proc.h   |    7
> >  osaf/libs/common/cpsv/include/cpd_tmr.h    |    3
> >  osaf/libs/common/cpsv/include/cpnd_cb.h    |    1
> >  osaf/libs/common/cpsv/include/cpnd_init.h  |    2
> >  osaf/libs/common/cpsv/include/cpsv_evt.h   |   20
> >  osaf/services/saf/cpsv/cpd/Makefile.am     |    3
> >  osaf/services/saf/cpsv/cpd/cpd_evt.c       |  229
> >  osaf/services/saf/cpsv/cpd/cpd_imm.c       |  112
> >  osaf/services/saf/cpsv/cpd/cpd_init.c      |   20
> >  osaf/services/saf/cpsv/cpd/cpd_proc.c      |  309
> >  osaf/services/saf/cpsv/cpd/cpd_tmr.c       |    7
> >  osaf/services/saf/cpsv/cpnd/cpnd_db.c      |   16
> >  osaf/services/saf/cpsv/cpnd/cpnd_evt.c     |   22
> >  osaf/services/saf/cpsv/cpnd/cpnd_init.c    |   23
> >  osaf/services/saf/cpsv/cpnd/cpnd_mds.c     |   13
> >  osaf/services/saf/cpsv/cpnd/cpnd_proc.c    |  314
> >  20 files changed, 1189 insertions(+), 11 deletions(-)
> >
> > Testing Commands:
> > -----------------
> > -
> >
> > Testing, Expected Results:
> > --------------------------
> > -
> >
> > Conditions of Submission:
> > -------------------------
> > <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
> >
> > Arch      Built     Started    Linux distro
> > -------------------------------------------
> > mips        n         n
> > mips64      n         n
> > x86         n         n
> > x86_64      n         n
> > powerpc     n         n
> > powerpc64   n         n
> >
> > Reviewer Checklist:
> > -------------------
> > [Submitters: make sure that your review doesn't trigger any checkmarks!]
> >
> > Your checkin has not passed review because (see checked entries):
> >
> > ___ Your RR template is generally incomplete; it has too many blank
> >     entries that need proper data filled in.
> >
> > ___ You have failed to nominate the proper persons for review and push.
> >
> > ___ Your patches do not have proper short+long header
> >
> > ___ You have grammar/spelling in your header that is unacceptable.
> >
> > ___ You have exceeded a sensible line length in your
> >     headers/comments/text.
> >
> > ___ You have failed to put in a proper Trac Ticket # into your commits.
> >
> > ___ You have incorrectly put/left internal data in your comments/files
> >     (i.e. internal bug tracking tool IDs, product names etc)
> >
> > ___ You have not given any evidence of testing beyond basic build
> >     tests. Demonstrate some level of runtime or other sanity testing.
> >
> > ___ You have ^M present in some of your files. These have to be removed.
> >
> > ___ You have needlessly changed whitespace or added whitespace crimes
> >     like trailing spaces, or spaces before tabs.
> >
> > ___ You have mixed real technical changes with whitespace and other
> >     cosmetic code cleanup changes. These have to be separate commits.
> >
> > ___ You need to refactor your submission into logical chunks; there is
> >     too much content in a single commit.
> >
> > ___ You have extraneous garbage in your review (merge commits etc)
> >
> > ___ You have giant attachments which should never have been sent;
> >     Instead you should place your content in a public tree to be pulled.
> >
> > ___ You have too many commits attached to an e-mail; resend as threaded
> >     commits, or place in a public tree for a pull.
> >
> > ___ You have resent this content multiple times without a clear
> >     indication of what has changed between each re-send.
> >
> > ___ You have failed to adequately and individually address all of the
> >     comments and change requests that were proposed in the initial
> >     review.
> >
> > ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
> >
> > ___ Your computer has a badly configured date and time; confusing the
> >     threaded patch review.
> >
> > ___ Your changes affect IPC mechanism, and you don't present any
> >     results for in-service upgradability test.
> >
> > ___ Your changes affect user manual and documentation, your patch
> >     series does not contain the patch that updates the Doxygen manual.

_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
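
For reference, the request gating described in the quoted Solution section (four request types rejected with SA_AIS_ERR_TRY_AGAIN while the CPD collects checkpoint information for 6s) could be sketched roughly as below. This is only an illustration and not the code from the patch: cpd_gate_t, gate_start, gate_check and the SKETCH_*/EVT_* enumerators are hypothetical stand-ins, and comparing timestamps is an assumption made to keep the example self-contained. The patch touches cpd_tmr.c, so the real implementation presumably uses a CPD timer instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for SaAisErrorT results; not the real AIS definitions. */
typedef enum { SKETCH_OK, SKETCH_ERR_TRY_AGAIN } sketch_rc_t;

/* Local stand-ins for the CPND -> CPD events named in the description. */
typedef enum {
    EVT_CKPT_CREATE,      /* CPD_EVT_ND2D_CKPT_CREATE */
    EVT_CKPT_UNLINK,      /* CPD_EVT_ND2D_CKPT_UNLINK */
    EVT_ACTIVE_SET,       /* CPD_EVT_ND2D_ACTIVE_SET */
    EVT_CKPT_RDSET,       /* CPD_EVT_ND2D_CKPT_RDSET */
    EVT_CKPT_INFO_UPDATE, /* CPD_EVT_ND2D_CKPT_INFO_UPDATE */
    EVT_OTHER
} sketch_evt_t;

/* Collection window taken from the description ("within 6s"). */
#define INFO_COLLECT_WINDOW_SEC 6L

/* Hypothetical per-CPD state tracking the collection window. */
typedef struct {
    bool collecting;   /* true while checkpoint info is being collected */
    long window_start; /* time (seconds) at which collection started */
} cpd_gate_t;

/* Called when an active SC comes up after the headless state. */
void gate_start(cpd_gate_t *g, long now_sec)
{
    assert(g != NULL);
    g->collecting = true;
    g->window_start = now_sec;
}

/* Decide whether a request may be processed at time now_sec. */
sketch_rc_t gate_check(cpd_gate_t *g, sketch_evt_t evt, long now_sec)
{
    assert(g != NULL);

    /* Close the window once the collection period has elapsed. */
    if (g->collecting &&
        now_sec - g->window_start >= INFO_COLLECT_WINDOW_SEC)
        g->collecting = false;

    if (!g->collecting)
        return SKETCH_OK;

    switch (evt) {
    case EVT_CKPT_CREATE:
    case EVT_CKPT_UNLINK:
    case EVT_ACTIVE_SET:
    case EVT_CKPT_RDSET:
        /* Rejected during the update period; the client retries later. */
        return SKETCH_ERR_TRY_AGAIN;
    default:
        /* Info updates and unrelated events pass through. */
        return SKETCH_OK;
    }
}
```

With this shape, a CPND that sends CPD_EVT_ND2D_CKPT_CREATE during the window simply gets TRY_AGAIN back and can retry once the window has closed, while CPD_EVT_ND2D_CKPT_INFO_UPDATE is always accepted.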
