Hi Mahesh, It's good. Thank you. :)
[AVM] Upon rejoining of the SCs, the replica should be re-created regardless of whether another application has it open on PL4. (Note: this comment is based on your explanation; I have not yet reviewed/tested it. Currently I am struggling with the SCs not rejoining after headless state. I can provide more on this once I complete my review/testing.)

[Nhat] To make cloud resilience work, you need the patches from the other services (log, amf, clm, ntf).
@Minh: I heard that you created a tar file which includes all the patches. Could you please send it to Mahesh? Thanks.

[AVM] I understand that. Before I comment more on this, please allow me to understand it; I am still not very clear on the headless design in detail. For example, regarding cluster membership of the PLs during headless state: in the absence of the SCs (CLMD), are the PLs considered cluster nodes or not (cluster membership)?

[Nhat] I don't know much about this.
@Anders: Could you please comment on this? Thanks.

Best regards,
Nhat Pham

-----Original Message-----
From: A V Mahesh [mailto:mahesh.va...@oracle.com]
Sent: Monday, February 15, 2016 11:19 AM
To: Nhat Pham <nhat.p...@dektech.com.au>; anders.wid...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net; 'Beatriz Brandao' <beatriz.bran...@ericsson.com>
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi Nhat Pham,

How did your holiday go?

Please find my comments below.

On 2/15/2016 8:43 AM, Nhat Pham wrote:
> Hi Mahesh,
>
> For the comment 1, the patch will be updated accordingly.
[AVM] Please hold; I will provide more comments this week, so we can have a consolidated V3.
>
> For the comment 2, I think the CKPT service will not be backward
> compatible if the scAbsenceAllowed is true.
> The client can't create non-collocated checkpoint on SCs.
>
> Furthermore, this solution only protects the CKPT service from the
> case "The non-collocated checkpoint is created on a SC";
> there are still cases where the replicas are completely lost. Ex:
>
> - The non-collocated checkpoint is created on a PL. The PL reboots. Both
> replicas are now located on SCs. Then, headless state happens. All replicas are
> lost.
> - The non-collocated checkpoint has its active replica located on a PL
> and this PL restarts during headless state.
> - The non-collocated checkpoint is created on PL3. This checkpoint is
> also opened on PL4. Then SCs and PL3 reboot.
[AVM] Upon rejoining of the SCs, the replica should be re-created regardless of whether another application has it open on PL4. (Note: this comment is based on your explanation; I have not yet reviewed/tested it. Currently I am struggling with the SCs not rejoining after headless state. I can provide more on this once I complete my review/testing.)
> In this case, all replicas are lost and the client has to create it again.
>
> In case multiple nodes (including SCs) reboot, losing replicas
> is unpreventable. The patch is to recover the checkpoints in possible cases.
> What do you think?
[AVM] I understand that. Before I comment more on this, please allow me to understand it; I am still not very clear on the headless design in detail. For example, regarding cluster membership of the PLs during headless state: in the absence of the SCs (CLMD), are the PLs considered cluster nodes or not (cluster membership)?
 - If they are not considered cluster nodes, the Checkpoint Service API should leverage the SA Forum Cluster Membership Service, and the APIs can fail with SA_AIS_ERR_UNAVAILABLE.
 - If they are considered cluster nodes, we need to follow all the rules defined in the SAI-AIS-CKPT-B.02.02 specification.

So give me some more time to review it completely, so that we can have a consolidated patch V3.

-AVM

>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
> Sent: Friday, February 12, 2016 11:10 AM
> To: Nhat Pham <nhat.p...@dektech.com.au>; anders.wid...@ericsson.com
> Cc: opensaf-devel@lists.sourceforge.net; Beatriz Brandao
> <beatriz.bran...@ericsson.com>
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support
> preserving and recovering checkpoint replicas during headless state V2
> [#1621]
>
>
> Comment 2 :
>
> After incorporating comment 1, all the limitations should be
> prevented whenever the Hydra configuration is enabled in IMM.
>
> For example: if an application tries to create a
> non-collocated checkpoint whose active replica would be located on an
> SC, then, regardless of whether the heads (SCs) exist or not, the call should
> return SA_AIS_ERR_NOT_SUPPORTED.
>
> In other words, this is preferable to allowing a non-collocated
> checkpoint to be created while the heads (SCs) exist, only for that
> checkpoint to become unrecoverable after the heads (SCs) rejoin.
>
> ======================================================================
>> Limitation: The CKPT service doesn't support recovering checkpoints in
>> the following cases:
>> . The checkpoint which is unlinked before headless.
>> . The non-collocated checkpoint has its active replica located on an SC.
>> . The non-collocated checkpoint has its active replica located on a PL
>> and this PL restarts during headless state.
>> In these cases, the checkpoint replica is destroyed.
>> The fault code SA_AIS_ERR_BAD_HANDLE is returned when the client
>> accesses the checkpoint in these cases. The client must re-open the
>> checkpoint.
> ======================================================================
>
> -AVM
>
>
> On 2/11/2016 12:52 PM, A V Mahesh wrote:
>> Hi,
>>
>> I just started reviewing the patch. I will give comments as soon as
>> I come across any, to save some time.
>>
>> Comment 1 :
>> This functionality should be guarded by a check that the Hydra
>> configuration is enabled in IMM:
>> attrName = const_cast<SaImmAttrNameT>("scAbsenceAllowed")
>>
>> Please see how the LOG/AMF services implemented it, for example.
>>
>> -AVM
>>
>>
>> On 1/29/2016 1:02 PM, Nhat Pham wrote:
>>> Hi Mahesh,
>>>
>>> As described in the README, the CKPT service returns the
>>> SA_AIS_ERR_TRY_AGAIN fault code in this case.
>>> I guess it's the same for other services.
>>>
>>> @Anders: Could you please confirm this?
>>>
>>> Best regards,
>>> Nhat Pham
>>>
>>> -----Original Message-----
>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>>> Sent: Friday, January 29, 2016 2:11 PM
>>> To: Nhat Pham <nhat.p...@dektech.com.au>; anders.wid...@ericsson.com
>>> Cc: opensaf-devel@lists.sourceforge.net
>>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support
>>> preserving and recovering checkpoint replicas during headless state
>>> V2 [#1621]
>>>
>>> Hi,
>>>
>>> On 1/29/2016 11:45 AM, Nhat Pham wrote:
>>>> - The behavior of application will be consistent with other
>>>> saf services like imm/amf behavior during headless state.
>>>> [Nhat] I'm not clear what you mean about "consistent"?
>>> In the absence of the Director (SCs), what return values are expected
>>> from the SAF APIs (all services) that are not in a position to
>>> provide service at that moment?
>>>
>>> I think all services should return the same SAF errors. I think
>>> currently we don't have that; maybe Anders Widel will help us.
>>>
>>> -AVM
>>>
>>>
>>> On 1/29/2016 11:45 AM, Nhat Pham wrote:
>>>> Hi Mahesh,
>>>>
>>>> Please see the attachment for the README. Let me know if there is
>>>> any more information required.
>>>>
>>>> Regarding your comments:
>>>> - during headless state applications may behave like during
>>>> CPND restart case
>>>> [Nhat] Headless state and CPND restart are
>>>> different events. Thus, the behavior is different.
>>>> Headless state is a case where both SCs go down.
>>>>
>>>> - The behavior of application will be consistent with other
>>>> saf services like imm/amf behavior during headless state.
>>>> [Nhat] I'm not clear what you mean about "consistent"?
>>>>
>>>> Best regards,
>>>> Nhat Pham
>>>>
>>>> -----Original Message-----
>>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>>>> Sent: Friday, January 29, 2016 11:12 AM
>>>> To: Nhat Pham <nhat.p...@dektech.com.au>;
>>>> anders.wid...@ericsson.com
>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support
>>>> preserving and recovering checkpoint replicas during headless state
>>>> V2 [#1621]
>>>>
>>>> Hi Nhat Pham,
>>>>
>>>> I started reviewing this patch, so can you please provide a README file
>>>> with the scope and limitations? That will help to define the
>>>> testing/reviewing scope.
>>>>
>>>> Following are the minimum things we can keep in mind while
>>>> reviewing/accepting the patch:
>>>>
>>>> - Not affecting existing functionality
>>>> - during headless state applications may behave like during
>>>> CPND restart case
>>>> - The minimum functionality of application works
>>>> - The behavior of application will be consistent with
>>>> other saf services like imm/amf behavior during headless state.
>>>>
>>>> So please provide additional details in the README if any of
>>>> the above is deviated from; that allows users to know about the
>>>> limitations/deviations.
>>>>
>>>> -AVM
>>>>
>>>> On 1/4/2016 3:15 PM, Nhat Pham wrote:
>>>>> Summary: cpsv: Support preserving and recovering checkpoint replicas during headless state [#1621]
>>>>> Review request for Trac Ticket(s): #1621
>>>>> Peer Reviewer(s): mahesh.va...@oracle.com; anders.wid...@ericsson.com
>>>>> Pull request to: mahesh.va...@oracle.com
>>>>> Affected branch(es): default
>>>>> Development branch: default
>>>>>
>>>>> --------------------------------
>>>>> Impacted area       Impact y/n
>>>>> --------------------------------
>>>>>  Docs                    n
>>>>>  Build system            n
>>>>>  RPM/packaging           n
>>>>>  Configuration files     n
>>>>>  Startup scripts         n
>>>>>  SAF services            y
>>>>>  OpenSAF services        n
>>>>>  Core libraries          n
>>>>>  Samples                 n
>>>>>  Tests                   n
>>>>>  Other                   n
>>>>>
>>>>>
>>>>> Comments (indicate scope for each "y" above):
>>>>> ---------------------------------------------
>>>>>
>>>>> changeset faec4a4445a4c23e8f630857b19aabb43b5af18d
>>>>> Author: Nhat Pham <nhat.p...@dektech.com.au>
>>>>> Date: Mon, 04 Jan 2016 16:34:33 +0700
>>>>>
>>>>> cpsv: Support preserving and recovering checkpoint replicas
>>>>> during headless state [#1621]
>>>>>
>>>>> Background:
>>>>> ----------
>>>>> This enhancement preserves checkpoint replicas when both SCs are
>>>>> down (headless state) and recovers replicas when one of the SCs
>>>>> comes up again. If both SCs go down, checkpoint replicas on
>>>>> surviving nodes remain. When an SC is available again, surviving
>>>>> replicas are automatically registered in the SC checkpoint
>>>>> database. The content of surviving replicas is kept intact and
>>>>> synchronized to new replicas.
>>>>>
>>>>> When no SC is available, client API calls that change checkpoint
>>>>> configuration, and therefore require SC communication, are
>>>>> rejected. Client API calls reading and writing existing
>>>>> checkpoint replicas still work.
>>>>>
>>>>> Limitation: The CKPT service does not support recovering
>>>>> checkpoints in the following cases:
>>>>> - The checkpoint which is unlinked before headless.
>>>>> - The non-collocated checkpoint has its active replica located
>>>>> on an SC.
>>>>> - The non-collocated checkpoint has its active replica located
>>>>> on a PL and this PL restarts during headless state.
>>>>> In these cases, the checkpoint replica is destroyed. The fault
>>>>> code SA_AIS_ERR_BAD_HANDLE is returned when the client accesses
>>>>> the checkpoint in these cases. The client must re-open the
>>>>> checkpoint.
>>>>>
>>>>> While in headless state, accessing checkpoint replicas does not
>>>>> work if the node which hosts the active replica goes down. It
>>>>> works again when an SC becomes available.
>>>>>
>>>>> Solution:
>>>>> ---------
>>>>> The solution for this enhancement includes 2 parts:
>>>>>
>>>>> 1. To destroy the un-recoverable checkpoints described above when
>>>>> both SCs are down: When both SCs are down, the CPND deletes
>>>>> un-recoverable checkpoint nodes and replicas on PLs. Then it
>>>>> requests CPA to destroy the corresponding checkpoint node by
>>>>> using the new message CPA_EVT_ND2A_CKPT_DESTROY.
>>>>>
>>>>> 2. To update CPD with checkpoint information: When an active SC
>>>>> is up after headless, CPND will update CPD with checkpoint
>>>>> information by using the new message
>>>>> CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of using
>>>>> CPD_EVT_ND2D_CKPT_CREATE. This is because the CPND will create a
>>>>> new ckpt_id for the checkpoint, which might be different from the
>>>>> current ckpt id, if CPD_EVT_ND2D_CKPT_CREATE is used. The CPD
>>>>> collects checkpoint information within 6s.
>>>>> During this updating time, the following requests are rejected
>>>>> with fault code SA_AIS_ERR_TRY_AGAIN:
>>>>> - CPD_EVT_ND2D_CKPT_CREATE
>>>>> - CPD_EVT_ND2D_CKPT_UNLINK
>>>>> - CPD_EVT_ND2D_ACTIVE_SET
>>>>> - CPD_EVT_ND2D_CKPT_RDSET
>>>>>
>>>>>
>>>>> Complete diffstat:
>>>>> ------------------
>>>>>  osaf/libs/agents/saf/cpa/cpa_proc.c       |  52 +
>>>>>  osaf/libs/common/cpsv/cpsv_edu.c          |  43 +
>>>>>  osaf/libs/common/cpsv/include/cpd_cb.h    |   3 +
>>>>>  osaf/libs/common/cpsv/include/cpd_imm.h   |   1 +
>>>>>  osaf/libs/common/cpsv/include/cpd_proc.h  |   7 +
>>>>>  osaf/libs/common/cpsv/include/cpd_tmr.h   |   3 +-
>>>>>  osaf/libs/common/cpsv/include/cpnd_cb.h   |   1 +
>>>>>  osaf/libs/common/cpsv/include/cpnd_init.h |   2 +
>>>>>  osaf/libs/common/cpsv/include/cpsv_evt.h  |  20 +
>>>>>  osaf/services/saf/cpsv/cpd/Makefile.am    |   3 +-
>>>>>  osaf/services/saf/cpsv/cpd/cpd_evt.c      | 229 +
>>>>>  osaf/services/saf/cpsv/cpd/cpd_imm.c      | 112 +
>>>>>  osaf/services/saf/cpsv/cpd/cpd_init.c     |  20 +-
>>>>>  osaf/services/saf/cpsv/cpd/cpd_proc.c     | 309 +
>>>>>  osaf/services/saf/cpsv/cpd/cpd_tmr.c      |   7 +
>>>>>  osaf/services/saf/cpsv/cpnd/cpnd_db.c     |  16 +
>>>>>  osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |  22 +
>>>>>  osaf/services/saf/cpsv/cpnd/cpnd_init.c   |  23 +-
>>>>>  osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |  13 +
>>>>>  osaf/services/saf/cpsv/cpnd/cpnd_proc.c   | 314 +-
>>>>>  20 files changed, 1189 insertions(+), 11 deletions(-)
>>>>>
>>>>>
>>>>> Testing Commands:
>>>>> -----------------
>>>>> -
>>>>>
>>>>> Testing, Expected Results:
>>>>> --------------------------
>>>>> -
>>>>>
>>>>>
>>>>> Conditions of Submission:
>>>>> -------------------------
>>>>> <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>>>>>
>>>>>
>>>>> Arch        Built   Started   Linux distro
>>>>> -------------------------------------------
>>>>> mips          n        n
>>>>> mips64        n        n
>>>>> x86           n        n
>>>>> x86_64        n        n
>>>>> powerpc       n        n
>>>>> powerpc64     n        n
>>>>>
>>>>>
>>>>> Reviewer Checklist:
>>>>> -------------------
>>>>> [Submitters: make sure that your review doesn't trigger any
>>>>> checkmarks!]
>>>>>
>>>>>
>>>>> Your checkin has not passed review because (see checked entries):
>>>>>
>>>>> ___ Your RR template is generally incomplete; it has too many
>>>>>     blank entries that need proper data filled in.
>>>>>
>>>>> ___ You have failed to nominate the proper persons for review and
>>>>>     push.
>>>>>
>>>>> ___ Your patches do not have proper short+long header
>>>>>
>>>>> ___ You have grammar/spelling in your header that is unacceptable.
>>>>>
>>>>> ___ You have exceeded a sensible line length in your
>>>>>     headers/comments/text.
>>>>>
>>>>> ___ You have failed to put in a proper Trac Ticket # into your
>>>>>     commits.
>>>>>
>>>>> ___ You have incorrectly put/left internal data in your comments/files
>>>>>     (i.e. internal bug tracking tool IDs, product names etc)
>>>>>
>>>>> ___ You have not given any evidence of testing beyond basic build
>>>>>     tests. Demonstrate some level of runtime or other sanity testing.
>>>>>
>>>>> ___ You have ^M present in some of your files. These have to be
>>>>>     removed.
>>>>>
>>>>> ___ You have needlessly changed whitespace or added whitespace crimes
>>>>>     like trailing spaces, or spaces before tabs.
>>>>>
>>>>> ___ You have mixed real technical changes with whitespace and other
>>>>>     cosmetic code cleanup changes. These have to be separate
>>>>>     commits.
>>>>>
>>>>> ___ You need to refactor your submission into logical chunks; there is
>>>>>     too much content in a single commit.
>>>>>
>>>>> ___ You have extraneous garbage in your review (merge commits etc)
>>>>>
>>>>> ___ You have giant attachments which should never have been sent;
>>>>>     instead you should place your content in a public tree to
>>>>>     be pulled.
>>>>>
>>>>> ___ You have too many commits attached to an e-mail; resend as
>>>>>     threaded commits, or place in a public tree for a pull.
>>>>>
>>>>> ___ You have resent this content multiple times without a clear
>>>>>     indication of what has changed between each re-send.
>>>>>
>>>>> ___ You have failed to adequately and individually address all of the
>>>>>     comments and change requests that were proposed in the
>>>>>     initial review.
>>>>>
>>>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email
>>>>>     etc)
>>>>>
>>>>> ___ Your computer has a badly configured date and time, confusing
>>>>>     the threaded patch review.
>>>>>
>>>>> ___ Your changes affect IPC mechanism, and you don't present any
>>>>>     results for in-service upgradability test.
>>>>>
>>>>> ___ Your changes affect user manual and documentation, and your patch
>>>>>     series does not contain the patch that updates the Doxygen manual.
>>>>>
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
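
As an illustration of the client-side handling discussed in this thread (retry while the service answers SA_AIS_ERR_TRY_AGAIN, re-open the checkpoint on SA_AIS_ERR_BAD_HANDLE), here is a minimal C sketch. It is not taken from the patch. The DEMO_* codes are local stand-ins for the SaAisErrorT values named above, and call_with_retry, fake_ckpt_op, fake_ckpt_t and the retry budget are hypothetical names chosen only for this example. A real client would include the SA Forum headers, call the saCkpt* API, and sleep between attempts.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the SaAisErrorT codes mentioned in the thread
 * (assumption: a real client takes these from the SA Forum headers). */
typedef enum {
    DEMO_OK,             /* stands in for SA_AIS_OK */
    DEMO_ERR_TRY_AGAIN,  /* stands in for SA_AIS_ERR_TRY_AGAIN */
    DEMO_ERR_BAD_HANDLE  /* stands in for SA_AIS_ERR_BAD_HANDLE */
} demo_err_t;

/* What the caller should do once the retry loop has finished. */
typedef enum {
    ACTION_DONE,    /* operation succeeded */
    ACTION_REOPEN,  /* replica was destroyed: re-open the checkpoint */
    ACTION_GIVE_UP  /* retry budget exhausted or unexpected error */
} demo_action_t;

/* Hypothetical wrapper type for one checkpoint API call. */
typedef demo_err_t (*ckpt_op_fn)(void *ctx);

/* Call op until it stops returning TRY_AGAIN or max_attempts is reached. */
demo_action_t call_with_retry(ckpt_op_fn op, void *ctx,
                              unsigned max_attempts, unsigned *attempts_used)
{
    unsigned n = 0;
    demo_err_t rc = DEMO_ERR_TRY_AGAIN;

    assert(op != NULL && max_attempts > 0);

    while (n < max_attempts) {
        rc = op(ctx);
        n++;
        if (rc != DEMO_ERR_TRY_AGAIN)
            break;
        /* A real client would sleep here before the next attempt. */
    }
    if (attempts_used != NULL)
        *attempts_used = n;

    if (rc == DEMO_OK)
        return ACTION_DONE;
    if (rc == DEMO_ERR_BAD_HANDLE)
        return ACTION_REOPEN;
    return ACTION_GIVE_UP;
}

/* Fake operation for the demo: answers TRY_AGAIN busy_calls times,
 * then returns final_rc. */
typedef struct {
    unsigned busy_calls;
    demo_err_t final_rc;
} fake_ckpt_t;

demo_err_t fake_ckpt_op(void *ctx)
{
    fake_ckpt_t *f = ctx;

    if (f->busy_calls > 0) {
        f->busy_calls--;
        return DEMO_ERR_TRY_AGAIN;
    }
    return f->final_rc;
}
```

With a fake_ckpt_t of {2, DEMO_OK} and a budget of 5 attempts, call_with_retry returns ACTION_DONE after three calls. The retry budget is bounded on purpose: during headless state no SC may come back for a long time, so an unbounded loop on TRY_AGAIN could block the application indefinitely.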