Hi,

 >> According to the log, PL-4 joined the cluster, which means the cluster
is not in a headless state, doesn't it?

Not exactly. To the best of my knowledge, the CPSV application was running on PL-4
(the cluster was up and running), and then both controllers were restarted.
It seems that because of some other problem CPND restarted, and then I saw this issue.

Currently I don't have those logs; I will try to reproduce the issue.

But broadly, we need to re-integrate CPSV with CLM again, based on the new
behavior of CLMD.

-AVM

On 2/22/2016 9:38 AM, Nhat Pham wrote:
> RE: [devel] [PATCH 0 of 1] Review Request for cpsv: Support preserving 
> and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi Mahesh,
>
> Could you please clarify which case the error below happened?
>
> *Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO SERVER STATE: 
> IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY*
>
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 45 
> (safClmService) <0, 2010f>
>
> Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd clm init failed with 
> return value:31
>
> Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd init failed
>
> Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd_lib_req FAILED
>
> Feb 19 11:18:28 PL-4 osafckptnd[7718]: __init_cpnd() failed
>
> *Feb 19 11:18:28 PL-4 osafclmna[5432]: NO 
> safNode=PL-4,safCluster=myClmCluster Joined cluster, nodeid=2040f*
>
> According to the log, PL-4 joined the cluster, which means the cluster is 
> not in a headless state, doesn't it?
>
> Best regards,
>
> Nhat Pham
>
> -----Original Message-----
> From: Nhat Pham [mailto:[email protected]]
> Sent: Monday, February 22, 2016 9:19 AM
> To: 'A V Mahesh' <[email protected]>; 'Anders Widell' 
> <[email protected]>
> Cc: 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' 
> <[email protected]>; [email protected]
> Subject: Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support 
> preserving and recovering checkpoint replicas during headless state V2 
> [#1621]
>
> Hi Mahesh and Anders,
>
> Please see my comment below.
>
> BTW, have you finished the review and test?
>
> Best regards,
>
> Nhat Pham
>
> From: A V Mahesh [mailto:[email protected]]
>
> Sent: Friday, February 19, 2016 2:28 PM
>
> To: Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>; 'Minh Chau H' <[email protected]>
>
> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>
>
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support 
> preserving and recovering checkpoint replicas during headless state V2 
> [#1621]
>
> Hi Nhat Pham,
>
> On 2/19/2016 12:28 PM, Nhat Pham wrote:
>
> Could you please give more detailed information about steps to 
> reproduce the problem below? Thanks.
>
>
> Don't see this as a specific bug; we need to look at the issue from the 
> point of view of a CLM-integrated service. Considering Anders Widell's 
> explanation of CLM application behavior during the headless state, we 
> need to re-integrate CPND with CLM (before this headless state feature 
> there was no case of CPND existing in the absence of CLMD, but now there 
> is).
>
> And this will then be consistent across all the services that integrate 
> with CLM (you may need some changes in CLM as well).
>
> [Nhat Pham] I think CLM should return SA_AIS_ERR_TRY_AGAIN in this case.
>
> @Anders: What do you think?
>
> To start with, let us consider the case where CPND restarts on a payload 
> (PL) during the headless state while an application is running on that PL.
>
> [Nhat Pham] Regarding the CPND as a CLM application, I'm not sure what 
> it can do in this case. If it restarts, it is monitored by AMF.
>
> If it blocks for too long, AMF will also trigger a node reboot.
>
> In my test case, the CPND gets blocked by CLM. It doesn't get out of 
> saClmInitialize. How did you get the "ER cpnd clm init failed with 
> return value:31"?
>
> Following is the cpnd trace.
>
> Feb 22  8:56:41.188122 osafckptnd [736:cpnd_init.c:0183] >> cpnd_lib_init
>
> Feb 22  8:56:41.188332 osafckptnd [736:cpnd_init.c:0412] >> 
> cpnd_cb_db_init
>
> Feb 22  8:56:41.188600 osafckptnd [736:cpnd_init.c:0437] << 
> cpnd_cb_db_init
>
> Feb 22  8:56:41.188778 osafckptnd [736:clma_api.c:0503] >> saClmInitialize
>
> Feb 22  8:56:41.188945 osafckptnd [736:clma_api.c:0593] >> clmainitialize
>
> Feb 22  8:56:41.190052 osafckptnd [736:clma_util.c:0100] >> clma_startup:
>
> clma_use_count: 0
>
> Feb 22  8:56:41.190273 osafckptnd [736:clma_mds.c:1124] >> clma_mds_init
>
> Feb 22  8:56:41.190825 osafckptnd [736:clma_mds.c:1170] << clma_mds_init
>
> -AVM
>
> On 2/19/2016 12:28 PM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> Could you please give more detailed information about steps to 
> reproduce the problem below? Thanks.
>
> Best regards,
>
> Nhat Pham
>
> From: A V Mahesh [mailto:[email protected]]
>
> Sent: Friday, February 19, 2016 1:06 PM
>
> To: Anders Widell <[email protected]>; Nhat Pham <[email protected]>; 'Minh Chau H' <[email protected]>
>
> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>
>
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support 
> preserving and recovering checkpoint replicas during headless state V2 
> [#1621]
>
> Hi Anders Widell,
>
> Thanks for the detailed explanation  about CLM during headless state.
>
> Hi Nhat Pham,
>
> Comment 3:
>
> Please see below the problem as I interpret it: I am now seeing it in 
> the absence of CLMD (during the headless state), so CPND/CLMA now need 
> to address the case below. Currently the cpnd CLM init
>
> fails with return value SA_AIS_ERR_UNAVAILABLE,
>
> but it should be SA_AIS_ERR_TRY_AGAIN.
>
> ==================================================
>
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 17418
> Feb 19 11:18:28 PL-4 osafimmloadd: NO Sync ending normally
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Epoch set to 9 in ImmModel
> Feb 19 11:18:28 PL-4 cpsv_app: IN Received PROC_STALE_CLIENTS
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 42 (MsgQueueService132111) <108, 2040f>
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 43 (MsgQueueService131855) <0, 2030f>
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 44 (safLogService) <0, 2010f>
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 45 (safClmService) <0, 2010f>
> Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd clm init failed with return value:31
> Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd init failed
> Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd_lib_req FAILED
> Feb 19 11:18:28 PL-4 osafckptnd[7718]: __init_cpnd() failed
> Feb 19 11:18:28 PL-4 osafclmna[5432]: NO safNode=PL-4,safCluster=myClmCluster Joined cluster, nodeid=2040f
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO AVD NEW_ACTIVE, adest:1
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO Sending node up due to NCSMDS_NEW_ACTIVE
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SISU states sent
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SU states sent
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 CSICOMP states synced
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 SU states sent
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 46 (safAmfService) <0, 2010f>
> Feb 19 11:18:30 PL-4 osafamfnd[5441]: NO 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Component or SU restart probation timer expired
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Instantiation of 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' failed
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Reason: component registration timer expired
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: WA 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Presence State RESTARTING => INSTANTIATION_FAILED
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Component Failover trigerred for 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF': Failed component: 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: ER 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' got Inst failed
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: Rebooting OpenSAF NodeId = 132111 EE Name = , Reason: NCS component Instantiation failed, OwnNodeId = 132111, SupervisionTime = 60
> Feb 19 11:18:36 PL-4 opensaf_reboot: Rebooting local node; timeout=60
> Feb 19 11:18:39 PL-4 kernel: [ 4877.338518] md: stopping all md devices.
>
> ==================================================
>
> -AVM
>
> On 2/15/2016 5:11 PM, Anders Widell wrote:
>
> Hi!
>
> Please find my answer inline, marked [AndersW].
>
> regards,
>
> Anders Widell
>
> On 02/15/2016 10:38 AM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> It's good. Thank you. :)
>
> [AVM] Upon rejoining of the SCs, the replica should be re-created 
> regardless of whether another application opens it on PL4.
>
>                (Note: this comment is based on your explanation; I have 
> not yet reviewed/tested.
>
>                   Currently I am struggling with the SCs not rejoining
>
> after the headless state; I can provide you more on this once I complete my
>
> review/testing.)
>
> [Nhat] To make cloud resilience work, you need the patches from other 
> services (log, amf, clm, ntf).
>
> @Minh: I heard that you created a tar file which includes all the patches. 
> Could you please send it to Mahesh? Thanks.
>
> [AVM] I understand that. Before I comment more on this, please allow me to
>
> understand:
>
>               I am still not very clear about the headless design in 
> detail.
>
>               For example, the cluster membership of the PLs during the 
> headless state:
>
>                in the absence of the SCs (CLMD), are the PLs considered
>
> cluster nodes or not (cluster membership)?
>
> [Nhat] I don't know much about this.
>
> @Anders: Could you please comment on this? Thanks.
>
> [AndersW] First of all, keep in mind that the "headless" state should 
> ideally not last a very long time. Once we have the spare SC feature 
> in place (ticket [#79]), a new SC should become active within a matter 
> of a few seconds after we have lost both the active and the standby SC.
>
> I think you should view the state of the cluster in the headless state 
> in the same way as you view the state of the cluster during a failover 
> between the active and the standby SC. Imagine that the active SC 
> dies. It takes the standby SC 1.5 seconds to detect the failure of the 
> active SC (this is due to the TIPC timeout). If you have configured 
> the PROMOTE_ACTIVE_TIMER, there is an additional delay before the 
> standby takes over as active. What is the state of the cluster during 
> the time after the active SC failed and before the standby takes over?
>
> The state of the cluster while it is headless is very similar. The 
> difference is that this state may last a little bit longer (though not 
> more than a few seconds, until one of the spare SCs becomes active). 
> Another difference is that we may have lost some state. With a "perfect"
>
> implementation of the headless feature we should not lose any state at 
> all, but with the current set of patches we do lose state.
>
> So specifically if we talk about cluster membership and ask the 
> question: is a particular PL a member of the cluster or not during the 
> headless state?
>
> Well, if you ask CLM about this during the headless state, then you 
> will not know - because CLM doesn't provide any service during the 
> headless state. If you keep retrying your query to CLM, you will 
> eventually get an answer - but you will not get this answer until 
> there is an active SC again and we have exited the headless state. 
> When viewed in this way, the answer to the question about a node's 
> membership is undefined during the headless state, since CLM will not 
> provide you with any answer until there is an active SC.
>
> However, if you asked CLM about the node's cluster membership status 
> before the cluster went headless, you probably saved a cached copy of 
> the cluster membership state. Maybe you also installed a CLM track 
> callback and intend to update this cached copy every time the cluster 
> membership status changes.
>
> The question then is: can you continue using this cached copy of the 
> cluster membership state during the headless state? The answer is YES: 
> since CLM doesn't provide any service during the headless state, it 
> also means that the cluster membership view cannot change during this 
> time. Nodes can of course reboot or die, but CLM will not notice and 
> hence the cluster view will not be updated. You can argue that this is 
> bad because the cluster view doesn't reflect reality, but notice that 
> this will always be the case. We can never propagate information 
> instantaneously, and detection of node failures will take 1.5 seconds 
> due to the TIPC timeout. You can never be sure that a node is alive at 
> this very moment just because CLM tells you that it is a member of the 
> cluster. If we are unfortunate enough to lose both system controller 
> nodes simultaneously, updates to the cluster membership view will be 
> delayed a few seconds longer than usual.
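Anders' point about continuing to use a cached membership view can be sketched as follows. The class and callback names below are hypothetical stand-ins: the real mechanism is saClmClusterTrack and its track callback in the CLM C API, which is not modeled here.

```python
# Minimal sketch of a cached CLM membership view. A client saves the
# last view delivered by a track callback and keeps answering local
# queries from it while CLM is silent (e.g. during the headless state).
class ClmMembershipCache:
    def __init__(self):
        self.members = {}  # node_id -> True if a cluster member

    def track_callback(self, changes):
        # Called on CLM track notifications. During the headless state
        # CLM delivers no notifications, so the cache simply keeps the
        # last view it was given.
        for node_id, is_member in changes:
            self.members[node_id] = is_member

    def is_member(self, node_id):
        # Answered from the cached view, even while the cluster is
        # headless and CLM itself cannot be queried.
        return self.members.get(node_id, False)


cache = ClmMembershipCache()
# Last notification delivered before both SCs were lost:
cache.track_callback([(0x2040f, True), (0x2030f, True)])
# Headless now: no further callbacks arrive, but reads still work.
```

As the text above notes, this cached view can be stale with respect to reality, but no more fundamentally so than any membership view delivered with detection latency.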
>
>
>
>
> Best regards,
>
> Nhat Pham
>
> -----Original Message-----
>
> From: A V Mahesh [mailto:[email protected]]
>
> Sent: Monday, February 15, 2016 11:19 AM
>
> To: Nhat Pham <[email protected]>; [email protected]
>
> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>
>
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support 
> preserving and recovering checkpoint replicas during headless state V2 
> [#1621]
>
> Hi Nhat Pham,
>
> How did your holiday go?
>
> Please find my comments below
>
> On 2/15/2016 8:43 AM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> For the comment 1, the patch will be updated accordingly.
>
> [AVM] Please hold; I will provide more comments this week, so we can 
> have a consolidated V3.
>
> For comment 2, I think the CKPT service will not be backward 
> compatible if scAbsenceAllowed is true.
>
> The client can't create a non-collocated checkpoint on the SCs.
>
> Furthermore, this solution only protects the CKPT service from the 
> case "the non-collocated checkpoint is created on an SC";
>
> there are still cases where the replicas are completely lost. Ex:
>
> - The non-collocated checkpoint is created on a PL. The PL reboots. Both 
> replicas are now located on the SCs. Then the headless state happens. 
> All replicas are lost.
>
> - The non-collocated checkpoint has its active replica located on a PL 
> and this PL restarts during the headless state.
>
> - The non-collocated checkpoint is created on PL3. This checkpoint is 
> also opened on PL4. Then the SCs and PL3 reboot.
>
> [AVM] Upon rejoining of the SCs, the replica should be re-created 
> regardless of whether another application opens it on PL4.
>
>                (Note: this comment is based on your explanation; I have 
> not yet reviewed/tested.
>
>                   Currently I am struggling with the SCs not rejoining
>
> after the headless state; I can provide you more on this once I complete my
>
> review/testing.)
>
> In this case, all replicas are lost and the client has to create the 
> checkpoint again.
>
> In case multiple nodes (including the SCs) reboot, losing replicas 
> is unpreventable. The patch is to recover the checkpoints in the 
> possible cases.
>
> What do you think?
>
> [AVM] I understand that. Before I comment more on this, please allow
>
> me to understand:
>
>               I am still not very clear about the headless design in 
> detail.
>
>               For example, the cluster membership of the PLs during the 
> headless
>
> state:
>
>                in the absence of the SCs (CLMD), are the PLs considered
>
> cluster nodes or not (cluster membership)?
>
>                      - If they are NOT considered cluster nodes, the 
> Checkpoint Service API should leverage the SA Forum Cluster
>
>                        Membership Service, and the APIs can fail with 
> SA_AIS_ERR_UNAVAILABLE.
>
>                      - If they are considered cluster nodes, we need to 
> follow all the rules defined in the SAI-AIS-CKPT-B.02.02 
> specification.
>
>               So give me some more time to review it completely, so 
> that we
>
> can have a consolidated patch V3.
>
> -AVM
>
> Best regards,
>
> Nhat Pham
>
> -----Original Message-----
>
> From: A V Mahesh [mailto:[email protected]]
>
> Sent: Friday, February 12, 2016 11:10 AM
>
> To: Nhat Pham <[email protected]>; [email protected]
>
> Cc: [email protected]; Beatriz Brandao <[email protected]>
>
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support 
> preserving and recovering checkpoint replicas during headless state V2 
> [#1621]
>
>
> Comment 2:
>
> After incorporating comment 1, all the limitations should be prevented 
> based on whether the Hydra configuration is enabled in IMM.
>
> For example: if some application tries to create a non-collocated 
> checkpoint whose active replica would be generated/located on an SC, 
> then regardless of whether the heads (SCs) exist or not, the call 
> should return SA_AIS_ERR_NOT_SUPPORTED.
>
> In other words, rather than allowing a non-collocated checkpoint to be 
> created while
>
> the heads (SCs) exist, only for that checkpoint to become 
> unrecoverable after the heads (SCs) rejoin.
>
> ======================================================================
>
> =======================
>
>     Limitation: The CKPT service doesn't support recovering 
> checkpoints in
>
>     the following cases:
>
>     . The checkpoint which is unlinked before headless.
>
>     . The non-collocated checkpoint has its active replica located on an SC.
>
>     . The non-collocated checkpoint has its active replica located on a 
> PL and this PL
>
>     restarts during the headless state. In these cases, the checkpoint 
> replica is
>
>     destroyed. The fault code SA_AIS_ERR_BAD_HANDLE is returned when 
> the client
>
>     accesses the checkpoint in these cases. The client must re-open the
>
>     checkpoint.
>
> ======================================================================
>
> =======================
>
> -AVM
>
>
> On 2/11/2016 12:52 PM, A V Mahesh wrote:
>
> Hi,
>
> I just started reviewing the patch; I will give comments as soon as 
> I come across any, to save some time.
>
> Comment 1:
>
> This functionality should be guarded by a check of whether the Hydra 
> configuration is enabled in IMM (attrName =
>
> const_cast<SaImmAttrNameT>("scAbsenceAllowed")).
>
> Please see how the LOG/AMF services implemented it for an example.
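The gating described above could look roughly like the sketch below. The accessor is a hypothetical stand-in: the real code reads the scAbsenceAllowed attribute through the IMM OM C API (as the LOG/AMF services do), which is not modeled here.

```python
def sc_absence_allowed(imm_get_attr):
    """imm_get_attr is a hypothetical accessor returning the value of
    the 'scAbsenceAllowed' IMM attribute, or None if it is not set.
    A missing or zero value means the headless feature is disabled."""
    value = imm_get_attr("scAbsenceAllowed")
    return bool(value)


def headless_recovery_enabled(imm_get_attr):
    # The CPSV replica-preservation logic should run only when the
    # Hydra configuration (scAbsenceAllowed) is enabled in IMM.
    return sc_absence_allowed(imm_get_attr)
```

The point of the guard is that clusters which never allow SC absence keep the existing behavior unchanged.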
>
> -AVM
>
>
> On 1/29/2016 1:02 PM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> As described in the README, the CKPT service returns the 
> SA_AIS_ERR_TRY_AGAIN fault code in this case.
>
> I guess it's the same for other services.
>
> @Anders: Could you please confirm this?
>
> Best regards,
>
> Nhat Pham
>
> -----Original Message-----
>
> From: A V Mahesh [mailto:[email protected]]
>
> Sent: Friday, January 29, 2016 2:11 PM
>
> To: Nhat Pham <[email protected]>; [email protected]
>
> Cc: [email protected]
>
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support 
> preserving and recovering checkpoint replicas during headless state
>
> V2 [#1621]
>
> Hi,
>
> On 1/29/2016 11:45 AM, Nhat Pham wrote:
>
>       -  The behavior of the application will be consistent with other 
> SAF services, like the IMM/AMF behavior during the headless state.
>
> [Nhat] I'm not clear what you mean by "consistent"?
>
> In the absence of the Directors (SCs), what are the expected return 
> values of the SAF APIs (for all services)
>
>      that are not in a position to provide service at that moment?
>
> I think all services should return the same SAF errors. I think 
> currently we don't have that; maybe Anders Widell will help us.
>
> -AVM
>
>
> On 1/29/2016 11:45 AM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> Please see the attachment for the README. Let me know if there is any 
> more information required.
>
> Regarding your comments:
>
>       -  During the headless state, applications may behave as they do 
> during the CPND restart case. [Nhat] Headless state and CPND restart 
> are different events. Thus, the behavior is different.
>
> Headless state is a case where both SCs go down.
>
>       -  The behavior of the application will be consistent with other 
> SAF services, like the IMM/AMF behavior during the headless state.
>
> [Nhat] I'm not clear what you mean by "consistent"?
>
> Best regards,
>
> Nhat Pham
>
> -----Original Message-----
>
> From: A V Mahesh [mailto:[email protected]]
>
> Sent: Friday, January 29, 2016 11:12 AM
>
> To: Nhat Pham <[email protected]>;
>
> [email protected]
>
> Cc: [email protected]
>
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support 
> preserving and recovering checkpoint replicas during headless state
>
> V2 [#1621]
>
> Hi Nhat Pham,
>
> I started reviewing this patch, so could you please provide a README 
> file with the scope and limitations? That will help to define the 
> testing/reviewing scope.
>
> The following are the minimum things we can keep in mind while 
> reviewing/accepting the patch:
>
> - Not affecting existing functionality
>
>       -  During the headless state, applications may behave as they do 
> during the CPND restart case
>
>       -  The minimum functionality of the application works
>
>       -  The behavior of the application will be consistent with
>
>          other SAF services, like the IMM/AMF behavior during the headless state.
>
> So please provide any additional details in the README if any of the 
> above is deviated from, so that users know about the 
> limitations/deviations.
>
> -AVM
>
> On 1/4/2016 3:15 PM, Nhat Pham wrote:
>
> Summary: cpsv: Support preserving and recovering checkpoint replicas 
> during headless state [#1621]
>
> Review request for Trac Ticket(s): #1621
>
> Peer Reviewer(s): [email protected]; [email protected]
>
> Pull request to: [email protected]
>
> Affected branch(es): default
>
> Development branch: default
>
> --------------------------------
>
> Impacted area       Impact y/n
>
> --------------------------------
>
>       Docs                    n
>
>       Build system            n
>
>       RPM/packaging           n
>
>       Configuration files     n
>
>       Startup scripts         n
>
>       SAF services            y
>
>       OpenSAF services        n
>
>       Core libraries          n
>
>       Samples                 n
>
>       Tests                   n
>
>       Other                   n
>
>
> Comments (indicate scope for each "y" above):
>
> ---------------------------------------------
>
> changeset faec4a4445a4c23e8f630857b19aabb43b5af18d
>
> Author: Nhat Pham <[email protected]>
>
> Date: Mon, 04 Jan 2016 16:34:33 +0700
>
>       cpsv: Support preserving and recovering checkpoint replicas 
> during headless state [#1621]
>
>       Background:
>       -----------
>       This enhancement preserves checkpoint replicas in case both SCs 
> go down (headless state) and recovers the replicas when one of the SCs 
> comes up again. If both SCs go down, checkpoint replicas on surviving 
> nodes still remain. When an SC is available again, surviving replicas 
> are automatically registered in the SC checkpoint database. The content 
> of surviving replicas is kept intact and synchronized to new replicas.
>
>       When no SC is available, client API calls that change the 
> checkpoint configuration, which require SC communication, are rejected. 
> Client API calls reading and writing existing checkpoint replicas still 
> work.
>
>       Limitation: The CKPT service does not support recovering 
> checkpoints in the following cases:
>
>        - The checkpoint which is unlinked before headless.
>
>        - The non-collocated checkpoint has its active replica located 
> on an SC.
>
>        - The non-collocated checkpoint has its active replica located 
> on a PL and this PL restarts during the headless state.
>
>       In these cases, the checkpoint replica is destroyed. The fault 
> code SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the 
> checkpoint in these cases. The client must re-open the checkpoint.
>
>       While in the headless state, accessing checkpoint replicas does 
> not work if the node which hosts the active replica goes down. It will 
> be back working when an SC is available again.
>
>       Solution:
>       ---------
>       The solution for this enhancement includes 2 parts:
>
>       1. To destroy the un-recoverable checkpoints described above when 
> both SCs are down: When both SCs are down, the CPND deletes 
> un-recoverable checkpoint nodes and replicas on the PLs. Then it 
> requests CPA to destroy the corresponding checkpoint node by using the 
> new message CPA_EVT_ND2A_CKPT_DESTROY.
>
>       2. To update the CPD with checkpoint information: When an active 
> SC comes up after headless, the CPND updates the CPD with checkpoint 
> information by using the new message CPD_EVT_ND2D_CKPT_INFO_UPDATE 
> instead of CPD_EVT_ND2D_CKPT_CREATE. This is because the CPD would 
> create a new ckpt_id for the checkpoint, which might be different from 
> the current ckpt id, if CPD_EVT_ND2D_CKPT_CREATE were used. The CPD 
> collects checkpoint information within 6 s. During this updating time, 
> the following requests are rejected with fault code 
> SA_AIS_ERR_TRY_AGAIN:
>
>       - CPD_EVT_ND2D_CKPT_CREATE
>
>       - CPD_EVT_ND2D_CKPT_UNLINK
>
>       - CPD_EVT_ND2D_ACTIVE_SET
>
>       - CPD_EVT_ND2D_CKPT_RDSET
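The 6-second update window described above can be sketched as a simple gate in the CPD. The class and method names are hypothetical (the real implementation uses a CPD timer in C); only the gating logic is illustrated.

```python
import time

SA_AIS_OK = 1
SA_AIS_ERR_TRY_AGAIN = 6

UPDATE_WINDOW_S = 6.0  # CPD collects checkpoint info for ~6 s after SC up

# Requests rejected while the CPD is rebuilding its checkpoint database.
REJECTED_DURING_UPDATE = {
    "CPD_EVT_ND2D_CKPT_CREATE",
    "CPD_EVT_ND2D_CKPT_UNLINK",
    "CPD_EVT_ND2D_ACTIVE_SET",
    "CPD_EVT_ND2D_CKPT_RDSET",
}


class CpdRecoveryGate:
    def __init__(self, now=time.monotonic):
        self.now = now            # injectable clock, eases testing
        self.update_until = None  # end of the info-update window

    def on_sc_active(self):
        # An active SC came up after headless: open the update window.
        self.update_until = self.now() + UPDATE_WINDOW_S

    def check(self, evt):
        # TRY_AGAIN inside the window for the listed requests; otherwise
        # the request is processed normally.
        in_window = (self.update_until is not None
                     and self.now() < self.update_until)
        if in_window and evt in REJECTED_DURING_UPDATE:
            return SA_AIS_ERR_TRY_AGAIN
        return SA_AIS_OK
```

A caller receiving SA_AIS_ERR_TRY_AGAIN would simply retry until the window closes.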
>
>
> Complete diffstat:
>
> ------------------
>
> osaf/libs/agents/saf/cpa/cpa_proc.c       |   52 +
>
> osaf/libs/common/cpsv/cpsv_edu.c          |   43 +
>
> osaf/libs/common/cpsv/include/cpd_cb.h    |    3 +
>
> osaf/libs/common/cpsv/include/cpd_imm.h   |    1 +
>
> osaf/libs/common/cpsv/include/cpd_proc.h  |    7 +
>
> osaf/libs/common/cpsv/include/cpd_tmr.h   |    3 +-
>
> osaf/libs/common/cpsv/include/cpnd_cb.h   |    1 +
>
> osaf/libs/common/cpsv/include/cpnd_init.h |    2 +
>
> osaf/libs/common/cpsv/include/cpsv_evt.h  |   20 +
>
> osaf/services/saf/cpsv/cpd/Makefile.am    |    3 +-
>
> osaf/services/saf/cpsv/cpd/cpd_evt.c      |  229 +
>
> osaf/services/saf/cpsv/cpd/cpd_imm.c      |  112 +
>
> osaf/services/saf/cpsv/cpd/cpd_init.c     |   20 +-
>
> osaf/services/saf/cpsv/cpd/cpd_proc.c     |  309 +
>
> osaf/services/saf/cpsv/cpd/cpd_tmr.c      |    7 +
>
> osaf/services/saf/cpsv/cpnd/cpnd_db.c     |   16 +
>
> osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |   22 +
>
> osaf/services/saf/cpsv/cpnd/cpnd_init.c   |   23 +-
>
> osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |   13 +
>
> osaf/services/saf/cpsv/cpnd/cpnd_proc.c   |  314 +-
>
>       20 files changed, 1189 insertions(+), 11 deletions(-)
>
>
> Testing Commands:
>
> -----------------
>
> -
>
> Testing, Expected Results:
>
> --------------------------
>
> -
>
>
> Conditions of Submission:
>
> -------------------------
>
>       <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>
>
> Arch      Built     Started Linux distro
>
> -------------------------------------------
>
> mips        n          n
>
> mips64      n          n
>
> x86         n          n
>
> x86_64      n          n
>
> powerpc     n          n
>
> powerpc64   n          n
>
>
> Reviewer Checklist:
>
> -------------------
>
> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>
> Your checkin has not passed review because (see checked entries):
>
> ___ Your RR template is generally incomplete; it has too many blank
>     entries that need proper data filled in.
>
> ___ You have failed to nominate the proper persons for review and push.
>
> ___ Your patches do not have proper short+long header.
>
> ___ You have grammar/spelling in your header that is unacceptable.
>
> ___ You have exceeded a sensible line length in your
>     headers/comments/text.
>
> ___ You have failed to put a proper Trac Ticket # into your commits.
>
> ___ You have incorrectly put/left internal data in your comments/files
>     (i.e. internal bug tracking tool IDs, product names etc).
>
> ___ You have not given any evidence of testing beyond basic build tests.
>     Demonstrate some level of runtime or other sanity testing.
>
> ___ You have ^M present in some of your files. These have to be removed.
>
> ___ You have needlessly changed whitespace or added whitespace crimes
>     like trailing spaces, or spaces before tabs.
>
> ___ You have mixed real technical changes with whitespace and other
>     cosmetic code cleanup changes. These have to be separate commits.
>
> ___ You need to refactor your submission into logical chunks; there is
>     too much content in a single commit.
>
> ___ You have extraneous garbage in your review (merge commits etc).
>
> ___ You have giant attachments which should never have been sent;
>     instead you should place your content in a public tree to be pulled.
>
> ___ You have too many commits attached to an e-mail; resend as threaded
>     commits, or place in a public tree for a pull.
>
> ___ You have resent this content multiple times without a clear
>     indication of what has changed between each re-send.
>
> ___ You have failed to adequately and individually address all of the
>     comments and change requests that were proposed in the initial
>     review.
>
> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc).
>
> ___ Your computer has a badly configured date and time; confusing the
>     threaded patch review.
>
> ___ Your changes affect the IPC mechanism, and you don't present any
>     results for the in-service upgradability test.
>
> ___ Your changes affect the user manual and documentation; your patch
>     series does not contain the patch that updates the Doxygen manual.
>
>
> _______________________________________________
>
> Opensaf-devel mailing list
>
> [email protected]<mailto:[email protected]>
>
> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>
